Abstract
The multi-scale object detection algorithm based on deep learning is a common method for safety monitoring in current power operation scenarios, which is of great significance to ensuring the safety of power operations. For certain power applications with high real-time requirements, the computational complexity of deep learning models may become a bottleneck. Deploying deep learning models requires high-performance hardware support, such as GPUs, which might not be easily achievable in some power field environments. According to the characteristics of common safety monitoring tasks in power operation scenarios, this paper proposes an automatic structured pruning method for multi-scale object detection algorithms. This method is designed for the safety monitoring of single and fixed-scale targets, effectively reducing the complexity of inference calculations and improving the frames per second (FPS) of real-time object detection without compromising accuracy. Furthermore, the proposed method can adaptively perform automatic structured pruning for targets of different scales. Experiments conducted using the YOLOv5 model demonstrate that the proposed method improves inference speed by approximately 20% for safety monitoring tasks without reducing detection accuracy.
Introduction
Electrical power operation scenarios involve numerous safety hazards, and artificial intelligence technology can reduce risk by detecting abnormal events. Improving a model’s detection accuracy and inference speed further enhances safety. In safety monitoring tasks in these scenarios, the distance between a surveillance device and its monitoring targets is mostly fixed, so fixed-scale object detection tasks are prevalent. However, directly using a pre-trained multi-scale model for inference can waste computation and storage space. This waste stems primarily from predicting detection results on multi-scale feature maps. For targets of a single, fixed scale, the predictions are generated mainly by the one feature map that corresponds to that scale. Predicting from the additional feature maps therefore does little to improve detection accuracy, while it increases the network parameters and computational load in the output section and decreases inference speed.
To address these issues, this paper proposes a two-stage method that automatically applies structured pruning to object detection models for single, fixed-scale targets. In the first stage, after cameras are deployed on-site, an unpruned model is used for trial detection. The coordinates and confidence levels of the predicted bounding boxes for the detected targets are collected. When the number of collected targets reaches a set quantity, the process proceeds to the second stage. In the second stage, the stored detection targets are analyzed. Based on the ratios of the detection boxes to the image width and height, targets are categorized into three classes: large-scale, medium-scale, and small-scale targets. If all stored targets belong to one class, the network layers in the output section of the multi-scale object detection model that match the target scale are retained, and the layers that do not match it are pruned. This completes the model pruning and yields a new object detection model, which is then used for formal detection.
Experiments in this paper validate the performance of this method on YOLOv5s and YOLOv5n. The results demonstrate that the proposed method improves inference speed by over 20% while maintaining model detection accuracy.
Major Contributions:
Development of an Automatic Structured Pruning Method: This method targets multi-scale object detection algorithms, optimizing them for safety monitoring in power operation scenarios.
Improvement in Inference Speed: The proposed method enhances the inference speed by about 20%, crucial for real-time applications, without compromising detection accuracy.
Adaptability to Different Scales: The pruning method can adaptively adjust for targets of various scales, making it versatile for different monitoring tasks.
Empirical Validation: Extensive experiments with YOLOv5 models confirm the method’s effectiveness in improving performance metrics.
Related work
This article focuses on the safety monitoring task of single and fixed-scale targets in electrical power operation scenarios. It introduces the pruning optimization of the output part of a single-scale object detection model, with the YOLO [1] series algorithm being a typical representative of single-stage object detection algorithms.
Object detection algorithms
Mainstream object detection algorithms based on deep learning are categorized into two types based on the generation of candidate boxes: two-stage object detection methods and one-stage object detection methods [2].
The core idea of two-stage object detection methods is to generate a large number of candidate regions in the image and then predict categories and target boxes for each candidate region individually. This approach requires running a CNN on each candidate region, resulting in significant computational redundancy and slow inference speed. Classic two-stage object detection methods include R-CNN [3], Fast R-CNN [4], and Faster R-CNN [5].
One-stage object detection methods directly classify regions of interest as background or target objects [6]. These methods, including YOLO, SSD [7], and RetinaNet [8], integrate considerations of detection accuracy and inference speed, making them widely used in real-time object detection.
YOLO is an advanced one-stage object detection framework that has evolved from version 1 to version 4. YOLOv1 [1] divides the entire image into a certain number of grids, with each grid responsible for a target category. It generates two prediction boxes per grid as target prediction boxes and optimizes both target category loss and target position loss in the same loss function, balancing speed and accuracy. Among the subsequent versions, YOLOv2 [9] adopted the Darknet-19 network, and YOLOv3 [10] used Darknet-53 as its base network, incorporating residual structures and the design of the Feature Pyramid Network (FPN) [11]. Compared to YOLOv3, YOLOv4 [12] introduced the CSPDarknet-53 structure and drew inspiration from PANet [13]. PANet further strengthened feature fusion by adding a bottom-to-top feature fusion route, enhancing the network’s prediction capabilities in terms of position and shape. While PANet retained FPN’s design of using fused feature maps at all scales for target result prediction, YOLOv4 improved upon it.
Shortly after the release of YOLOv4, YOLOv5 [14] was introduced and implemented entirely in PyTorch. Unlike versions 1 through 4, YOLOv5 has no accompanying paper, but its code reveals that the network structure is fundamentally similar to YOLOv4 [15]. However, YOLOv5 has weight files nearly 90% smaller than those of YOLOv4, making it more convenient for deployment on embedded devices. Moreover, thanks to the PyTorch ecosystem, it can be easily exported to ONNX and CoreML, simplifying the deployment process on mobile devices. YOLOv5 is user-friendly, easy to deploy, and has gained popularity among researchers. Subsequently, researchers have introduced more models in the YOLO series, and many have made improvements to YOLO series models, such as multimodal fusion, NMS optimization, feature map enhancement, detection box modification, resolution adjustment, target function replacement, etc. [16].
YOLOv6 [17] was released by Meituan in 2022 and introduced features such as a RepVGG-style backbone for better feature extraction and a more efficient post-processing step. YOLOv7 [18], released by the same team that developed YOLOv4, brought further advancements in efficiency and accuracy and introduced Extended Efficient Layer Aggregation Networks (E-ELAN) for better model scaling. YOLOv8 [19], developed by Ultralytics, emphasized both large-scale and small-scale object detection, with improvements in the detection of tiny objects.
The YOLO series has seen substantial advancements post-YOLOv5, with each new version building upon the strengths of its predecessors and introducing novel innovations, and has taken a dominant position in the field of object detection.
Model pruning and lightweight processing
In the field of object detection, there is a significant amount of research dedicated to improving the inference speed of models while maintaining their accuracy. These efforts primarily take two approaches: 1) After model training, optimize the model through techniques such as model pruning, distillation, and quantization. 2) Training models with lightweight structures to obtain deployable lightweight models directly. The former involves challenges like repetitive training, special hardware requirements, and potential accuracy reduction. The proposed automatic structured pruning method in this paper avoids the need for retraining after model completion, simplifying the structured pruning process for specific scenarios. The latter is mainly used to optimize the backbone network in the model architecture. It can be combined with the proposed optimization method in this article for the output section to further improve inference speed.
Common model pruning optimization methods are categorized into non-structured pruning and structured pruning [20]. Non-structured pruning sets weight parameters meeting certain conditions to zero, enhancing the model’s sparsity to accelerate inference, especially when hardware supports sparse computation. However, non-structured pruning does not fundamentally reduce the model’s parameter count; it merely sparsifies the model. If the hardware does not support acceleration for this sparsity, the model’s inference time may not decrease after non-structured pruning. Structured pruning effectively reduces the model’s parameter count, but the detection accuracy of the pruned model may decline, so the pruned model must be retrained to approach the detection accuracy of the pre-pruning model. Many researchers have conducted extensive studies on this topic [21, 22, 23]. Moreover, retraining after structured pruning does not guarantee that the model can fully recover its pre-pruning accuracy. Therefore, multiple attempts may be needed to find suitable pruning positions during structured pruning.
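The difference between the two pruning families can be illustrated with a toy example. The following is a minimal sketch, not taken from the paper: a layer is modeled as a plain list of rows (one row per output channel), and the helper names `count_params`, `unstructured_prune`, and `structured_prune` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # rows = output channels, columns = input weights


def count_params(w: Matrix) -> int:
    """Number of stored weights, whether they are zero or not."""
    return sum(len(row) for row in w)


def unstructured_prune(w: Matrix, threshold: float) -> Matrix:
    """Zero every weight whose magnitude is below the threshold."""
    return [[0.0 if abs(v) < threshold else v for v in row] for row in w]


def structured_prune(w: Matrix, keep_rows: List[int]) -> Matrix:
    """Keep only the listed output channels (whole rows are removed)."""
    return [list(w[i]) for i in keep_rows]


weights = [
    [0.90, -0.02, 0.40],
    [0.01, 0.03, -0.02],
    [-0.70, 0.60, 0.05],
]

sparse = unstructured_prune(weights, threshold=0.1)   # same shape, more zeros
compact = structured_prune(weights, keep_rows=[0, 2])  # one channel removed

print(count_params(weights), count_params(sparse), count_params(compact))
```

With these toy values the script prints `9 9 6`: zeroing weights leaves the stored parameter count unchanged, whereas removing a whole channel shrinks the layer itself.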
To address inference challenges on low-computing platforms, studies [24] based on depth separable convolution and 1
Adopting a lightweight model structure effectively reduces the model’s parameter count and avoids the need for secondary training. However, lightweight model structures mainly focus on designing a lightweight structure for the model’s backbone network. The optimization method proposed in this paper targets the output part of object detection models, which can be combined with existing lightweight model structures to further enhance the inference speed of lightweight models.
Method details
This paper proposes an automatic structured pruning method to enhance the inference speed of single and fixed-scale object detection in the safety monitoring of power operation scenes. The method automatically prunes the output of a multi-scale object detection model based on the target object’s identified scale in images from fixed cameras at power operation sites. This aims to reduce the inference time while maintaining detection accuracy.
The method consists of four modules and two operational stages. The four modules are the Initialization Module, Object Detection Module, Target Scale Analysis Module, and Network Model Replacement Module. The two operational stages are the “Warm-up” stage and the “Detection” stage. After the system starts, it initially enters the “Warm-up” stage. Once specific conditions are met, the system switches to the “Detection” stage, which represents the formal operating state of the system.
Initialization Module: This module reads and loads a multi-scale object detection network model from local or online files for subsequent object detection.
Object Detection Module: Utilizing the loaded network model and input video data stream, this module performs object detection, generates detection results, and determines whether the system is in the “Warm-up” stage. If in the “Warm-up” stage, the Target Scale Analysis Module is executed; if not, the system continues with object detection.
Target Scale Analysis Module: This module calculates the size distribution of detected targets during the “Warm-up” stage and facilitates the transition between the “Warm-up” and “Detection” stages. Based on the computed target scale distribution, it determines the scale category of the targets that the fixed camera needs to recognize. Subsequently, it automatically prunes the output part of the original multi-scale object detection model.
Network Model Replacement Module: This module replaces the multi-scale object detection model with the pruned single-scale object detection model and performs formal detection.
Upon system startup, the default operating state is the “Warm-up” stage. In this stage, the system completes the collection of target information. The Object Detection Module utilizes the multi-scale object detection network model to perform object detection on real-time video stream information from the fixed camera. After each object detection, the system assesses its current state. If in the “Warm-up” stage, the module passes the coordinates of the detected target bounding boxes to the Target Scale Analysis Module.
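The collection step of the “Warm-up” stage can be sketched as a small buffer that signals when enough targets have been stored. This is an illustrative sketch under stated assumptions, not the authors’ implementation: the class name `WarmupCollector`, the `quota` parameter (standing in for the "set quantity"), and the `min_conf` confidence filter are all hypothetical, and the stage label is switched as soon as the quota is reached, which simplifies the analysis and model replacement that follow.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float, float]  # x1, y1, x2, y2, confidence


@dataclass
class WarmupCollector:
    """Buffers detections until a quota is reached, then signals the switch."""
    quota: int                      # hypothetical "set quantity" of targets
    min_conf: float = 0.5           # hypothetical confidence filter
    boxes: List[Box] = field(default_factory=list)
    stage: str = "warmup"

    def update(self, detections: List[Box]) -> bool:
        """Store one frame's detections; return True when analysis may start."""
        if self.stage != "warmup":
            return False  # already in the formal detection stage
        self.boxes.extend(d for d in detections if d[4] >= self.min_conf)
        if len(self.boxes) >= self.quota:
            self.stage = "detection"
            return True
        return False


collector = WarmupCollector(quota=3)
frames = [
    [(10, 10, 110, 210, 0.9)],
    [(12, 11, 108, 205, 0.3)],  # below min_conf, ignored
    [(15, 12, 112, 215, 0.8), (40, 30, 150, 240, 0.7)],
]
ready = [collector.update(frame) for frame in frames]
print(ready, collector.stage, len(collector.boxes))
```

In this toy run the third frame brings the buffer to three boxes, so only the last call returns `True`.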
The Target Scale Analysis Module is responsible for storing the coordinate information of detected targets’ bounding boxes. When the number of stored targets reaches the set quantity, it analyzes and determines the scale category of targets during the “Warm-up” stage. By extrapolating from the target category during the “Warm-up” stage, it can approximate the scale category of targets in the future formal detection stage. The main analysis process includes the following steps: First, obtain the pixel values range for the width and height of all predicted boxes during the “Warm-up” stage. Assuming that
The criteria for categorizing targets into small, medium, and large classes are derived from the width and height ranges of predicted targets’ outputs from three feature layers in YOLOv5. In the network model, different-scale feature maps, due to the allocation of predefined anchor box sizes, have distinct width and height ranges for the predicted boxes. YOLOv5 utilizes three feature maps for the final output: shallow feature map, middle feature map, and deep feature map. Typically, the shallow feature map handles predictions for small targets, the middle feature map handles predictions for medium targets, and the deep feature map handles predictions for large targets. The predicted target’s width and height ranges differ for each feature map. Derived from the decoding process of nine preset anchor boxes and prediction boxes in YOLOv5, the shallow feature map, responsible for predicting small targets, outputs target prediction boxes with a width (w) less than or equal to 176 and height (h) less than or equal to 376. The intermediate feature map, responsible for predicting medium targets, outputs target prediction boxes with a width (w) less than or equal to 720 and height (h) less than or equal to 608. Beyond these ranges, other predicted targets are handled by the deep feature map, responsible for predicting large targets.
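The categorization rule described above can be sketched as follows. The pixel thresholds are the ones quoted in the text; the function names `scale_of` and `dominant_scale`, and the choice to return `None` for mixed scales, are assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple

# Thresholds quoted in the text, in pixels.
SMALL_MAX_W, SMALL_MAX_H = 176, 376
MEDIUM_MAX_W, MEDIUM_MAX_H = 720, 608


def scale_of(width: float, height: float) -> str:
    """Classify one predicted box by its width and height."""
    if width <= SMALL_MAX_W and height <= SMALL_MAX_H:
        return "small"
    if width <= MEDIUM_MAX_W and height <= MEDIUM_MAX_H:
        return "medium"
    return "large"


def dominant_scale(sizes: Iterable[Tuple[float, float]]) -> Optional[str]:
    """Return the scale shared by all boxes, or None if scales are mixed."""
    classes = {scale_of(w, h) for w, h in sizes}
    return classes.pop() if len(classes) == 1 else None


print(dominant_scale([(300, 400), (500, 600)]))   # all medium
print(dominant_scale([(100, 100), (800, 800)]))   # mixed scales
```

Returning `None` for mixed scales reflects the condition that pruning proceeds only when all stored targets fall into one class; how the mixed case is handled is not specified in the text.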
The YOLOv5 model, based on clustering analysis of the MS COCO dataset, uses nine preset anchor boxes grouped into three sets. These sets are assigned to three different feature maps, each dedicated to predicting small, medium, and large targets. Specifically, three anchor boxes with relatively smaller dimensions handle decoding output predictions for the shallow feature map, three with moderate dimensions for the intermediate feature map, and three with larger dimensions for the deep feature map. The coordinate decoding process during the generation of target prediction boxes in YOLOv5 allows the estimation of the length and width size ranges of predicted boxes for different hierarchical feature map outputs.
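The link between anchor sizes and the size range of predicted boxes can be made explicit. As a sketch, assuming the width and height decoding used in the publicly released YOLOv5 code, and writing $t_w, t_h$ for the raw network outputs, $p_w, p_h$ for the anchor width and height, and $\sigma$ for the sigmoid function (this notation is introduced here for illustration):

```latex
\begin{align}
b_w &= p_w \bigl(2\,\sigma(t_w)\bigr)^2, \qquad b_h = p_h \bigl(2\,\sigma(t_h)\bigr)^2, \\
0 &< b_w < 4\,p_w, \qquad 0 < b_h < 4\,p_h \quad \text{since } 0 < \sigma(\cdot) < 1 .
\end{align}
```

Under this assumption, every box predicted by a feature map is less than four times as wide and as tall as the largest anchor assigned to that map, so the numerical limits for each feature map depend on the anchor set in use.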
Following the analysis by the Target Scale Analysis Module to determine the target size category in the current scene, the method automatically prunes the model based on that category. Structured pruning entails retaining the part of the network structure’s output corresponding to the current scene’s size category and removing the parts that do not match it. Specifically, if the analysis indicates that the scene’s targets are small, the output path corresponding to the shallow feature map is retained, while the output paths corresponding to the middle and deep feature maps are removed. A similar process is applied if the scene’s targets are medium or large, as illustrated in Fig. 1. From top to bottom, the figure shows the models that retain the output branch for the deep, middle, and shallow feature maps, respectively; one of these branches is retained depending on the actual scene.
Fig. 1. Network model after structured pruning. From top to bottom: models retaining the output channels for the deep, middle, and shallow features; one is retained based on actual conditions.
After the structured pruning is completed, replace the original model with the pruned model and start the formal detection. At this point, the entire optimization process of the model is completed.
Based on the proposed automatic structured pruning method in this paper, the YOLOv5 network structure may have three different outcomes after pruning. These three different structured pruning results operate within the output section of the network and simulate scenarios in actual detection tasks where the target to be detected, of a single, fixed scale, may be large, medium, or small. The output section of the YOLOv5 network structure receives three feature maps of different scales calculated from the backbone network and feature fusion section. In specific pruning operations, the backbone network and feature fusion section remain unchanged, and the output section is divided into three sub-models. These sub-models are then reconnected with the backbone network and feature fusion section, resulting in three differently pruned models, each used for detecting large, medium, and small targets.
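The pruning of the output section can be pictured structurally. The sketch below is a toy model, not the YOLOv5 code: `Detector`, `prune_output`, `SCALE_TO_BRANCH`, and `toy_trunk` are hypothetical names, the "trunk" stands for the unchanged backbone and feature fusion section, and plain Python callables stand in for network layers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

FeatureMaps = Dict[str, List[float]]  # stand-in for per-scale feature tensors


@dataclass
class Detector:
    trunk: Callable[[object], FeatureMaps]           # backbone + feature fusion
    heads: Dict[str, Callable[[List[float]], list]]  # one output branch per scale

    def __call__(self, image: object) -> Dict[str, list]:
        features = self.trunk(image)
        # Only the output branches still present are evaluated.
        return {name: head(features[name]) for name, head in self.heads.items()}


SCALE_TO_BRANCH = {"small": "shallow", "medium": "middle", "large": "deep"}


def prune_output(model: Detector, scale: str) -> Detector:
    """Return a new detector that keeps only the branch matching the scale."""
    keep = SCALE_TO_BRANCH[scale]
    return Detector(trunk=model.trunk, heads={keep: model.heads[keep]})


def toy_trunk(image: object) -> FeatureMaps:
    """Toy replacement for the backbone: fixed features for every image."""
    return {"shallow": [1.0], "middle": [2.0], "deep": [3.0]}


full = Detector(
    trunk=toy_trunk,
    heads={
        name: (lambda feats, n=name: [(n, v) for v in feats])
        for name in ("shallow", "middle", "deep")
    },
)
pruned = prune_output(full, "medium")
print(sorted(full(None)), sorted(pruned(None)))
```

The pruned detector reuses the same trunk object and evaluates a single branch, mirroring the description that the backbone and feature fusion section remain unchanged while only the output section is reduced.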
The inference speed of the models was evaluated based on the same device. The detailed specifications of the experimental equipment are presented in Table 1.
Detailed specifications of device
To validate the impact of the proposed automatic structured pruning method on model parameters and inference speed, YOLOv5 system models of three different sizes, YOLOv5 nano, YOLOv5 small, and YOLOv5 large, were selected for experimentation. The output section of each original model underwent structured pruning, resulting in three different models for each original model.
Comparisons between the 12 models before and after pruning are presented in Table 2. Taking the models that retain the output channels of the middle feature as an example, the table shows that after pruning, the parameter count of the YOLOv5 nano model decreased from 1.9M to 1.6M, a reduction of about 15%. The inference speed increased from 42 FPS to 50 FPS, an improvement of approximately 19% over the original model. The pruned YOLOv5 small model had a reduced weight file size of only 6.3M, a 13% reduction compared to the original weight file. Its inference speed increased from 20 FPS to 25 FPS, a 25% improvement. The pruned YOLOv5 large model experienced a weight reduction of 6.4M, with a 20% increase in inference speed. Models that retained the output channels of the shallow or the deep features also exhibited significant improvements in inference speed, along with a noticeable reduction in parameters.
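The relative changes can be recomputed from the rounded figures quoted above. This is a minimal arithmetic sketch; `relative_change` is a hypothetical helper, and because the inputs are rounded, the results may differ slightly from percentages derived from unrounded measurements.

```python
def relative_change(before: float, after: float) -> float:
    """Signed change of `after` relative to `before`, in percent."""
    return (after - before) / before * 100.0


params_nano = relative_change(1.9, 1.6)  # parameter count, in millions
fps_nano = relative_change(42, 50)       # YOLOv5 nano, frames per second
fps_small = relative_change(20, 25)      # YOLOv5 small, frames per second

print(f"{params_nano:.1f}% {fps_nano:.1f}% {fps_small:.1f}%")
```

For these inputs the script prints `-15.8% 19.0% 25.0%`.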
The experimental results demonstrate that, regardless of the model’s size, the proposed method effectively reduces the model’s parameter count and enhances its inference speed. Furthermore, pruning large models yields a more pronounced reduction in parameter count.
Comparison of the speed among different sizes of YOLOv5 models after pruning
To validate the performance of the pruned model in detecting single and fixed-scale targets, this study collected a dataset for detecting human bodies of construction workers at work sites. The entire dataset was divided into three subsets based on the varying proportions of workers in the scene: Large Target Detection Dataset, Medium Target Detection Dataset, and Small Target Detection Dataset. Each subset contained an equal number of samples, with 3000 images in the training set and 500 images in both the validation and test sets. Samples and annotations from the dataset are depicted in Fig. 2. From top to bottom, the figure shows typical data samples and annotation examples from the Small Target, Large Target, and Medium Target subsets, respectively.
Experimental results on sub-datasets at different scales
Examples of labels for different scale targets.
The original model is first trained on three different sub-datasets separately. After training, the automatic structured pruning is applied to obtain three distinct sub-models. Both the original and pruned models are then evaluated on the test sets of the three sub-datasets to obtain experimental results.
The experimental results, as shown in Table 3, indicate that the pruned models achieve mAP comparable to that of the non-pruned model on the same sub-datasets. This implies that the pruning method proposed in this paper has minimal impact on detection accuracy.
Results from Section 3.1 indicate that the proposed method effectively reduces the model’s parameter count, leading to a significant enhancement in inference speed. Specifically, the parameter count was reduced by 13%, resulting in a 20% increase in frames per second (FPS) during real-time object detection. This demonstrates the method’s capability to streamline the model without compromising performance.
Results from Section 3.2 demonstrate that the proposed method performs comparably to the original model in terms of accuracy. The detection accuracy remained within a margin of 0.1% compared to the unpruned model, highlighting the method’s ability to maintain high accuracy despite the reduction in model complexity.
In summary, the method proposed in this paper not only improves the model’s inference speed by approximately 20% but also preserves detection accuracy. This balance between speed and accuracy makes it highly suitable for safety monitoring tasks in power operation scenarios, where real-time performance and precision are crucial. The adaptive nature of the pruning ensures that the model can handle targets of different scales efficiently, making it a robust solution for various safety monitoring applications.
This paper introduces an automatic structured pruning method specifically designed for single-scale object detection in power operation scenarios. When applied to the YOLOv5 model, the proposed method demonstrates a notable improvement in inference speed, achieving an enhancement of over 20% without compromising detection accuracy. The factors underlying these improvements are discussed below.
The improvements observed in the pruned YOLOv5 model can be attributed to several factors. Firstly, the structured pruning method effectively removes redundant computations, thus streamlining the inference process. Certain layers or filters within the model, which contribute less to the overall detection performance, are pruned, leading to a more efficient architecture. Secondly, the pruning process focuses on maintaining the critical features necessary for accurate object detection, ensuring that the model’s performance remains robust despite the reduced complexity.
The potential application scenarios for this pruning method are vast, especially in environments where real-time processing is crucial, such as power operation monitoring, autonomous vehicles, and surveillance systems. The ability to enhance inference speed without sacrificing accuracy makes this approach highly valuable for applications where quick and reliable detection is paramount.
However, there are limitations to consider. The current method is tailored for single-scale object detection and may require modifications to handle multi-scale detection tasks effectively. Additionally, the performance gains might vary depending on the initial architecture and the specific characteristics of the dataset used for training and evaluation.
Furthermore, future research could explore the integration of this pruning technique with other model optimization strategies, such as quantization and knowledge distillation, to achieve even greater efficiency. Comparative studies with existing state-of-the-art techniques across different object detection algorithms would provide a more comprehensive validation of the proposed method’s effectiveness. Additionally, investigating the impact of pruning on other performance metrics, such as model robustness and adaptability to varying input conditions, could offer valuable insights.
Compared to traditional structured pruning methods, the key advantage of the proposed approach lies in its ability to boost inference speed without necessitating retraining to regain detection accuracy post-pruning. Moreover, the method facilitates targeted and automated pruning based on the scale of targets present in safety monitoring applications. This feature eliminates the need for iterative adjustments during the model deployment optimization phase, streamlining the implementation process.
