Abstract
Ensuring the stable and safe operation of the power system is an important task for national power grid companies. Grid companies have established dedicated power inspection departments to troubleshoot transmission line components and replace faulty components in a timely manner. At present, using drones to assist manual inspection has become a trend in power line inspection. Automatically identifying component failures from UAV aerial images of transmission lines is a cutting-edge, interdisciplinary problem. To address it, this article studies component identification and defect detection for transmission lines based on deep learning. This paper adjusts the size of the convolution kernel of the CNN model and expands the dataset by rotation transformation of the images. The experimental results show that both methods can effectively improve the effectiveness and reliability of component identification and defect detection in transmission line inspection. The recognition and classification experiments were performed using images collected by drones. The results show that the deep learning method is highly effective and reliable for the identification and defect detection of high-voltage transmission line components. With Faster R-CNN performing component identification and defect detection, the recognition time is nearly 0.17 s per image, the recognition rate of the pressure-equalizing ring reaches 96.8%, and the mAP reaches 93.72%.
Introduction
With the development of the second industrial revolution in the last century, power technology came into wide use and brought great progress to modern society [1]. In today’s fast-developing Internet era, the demand for electricity is growing constantly and rapidly. Computers, mobile phones, smart appliances, and other electrical equipment make us increasingly dependent on electricity, which has become a necessity of daily life. Because the country covers a wide area and the economically developed areas do not coincide with the power-intensive areas, grid companies need to expand the lines to solve the problem of unbalanced power consumption. During the “Thirteenth Five-Year Plan” period, China will make every effort to promote industrial innovation, closely integrate smart grid construction with “Made in China 2025” and “Internet+”, raise the industrial competitiveness of smart grid equipment, and, guided by “One Belt One Road”, expand smart grid equipment into overseas markets, promoting the country’s leading position in smart grid technology [2]. At present, China has invested 100 billion yuan to build three major transmission channels connecting the west and the east and has significantly expanded its ultra-high-voltage, large-capacity power lines. Large-scale power systems continuously supply power to users through high-voltage transmission lines.
Transmission lines can be divided into two basic categories: cable transmission lines and overhead transmission lines [3]. The former are mainly buried underground, are relatively expensive, and their failures are not easy to detect [4]. The latter, because of their lower cost and relatively simple construction, are often used in China’s current power transmission [7, 8]. The components of an overhead transmission line include wires, overhead ground wires, insulator strings, pole towers, and grounding devices. The grid company transmits electrical energy through the transmission wires, and the wires are fixed to the top of the tower by insulators [5, 6]. Because transmission lines span different regions and climates, their natural environment is very harsh [9]. Long-term exposure to harsh outdoor conditions means the components may suffer from flashover, aging, and mechanical damage. If these problems are not dealt with in time, they can easily cause major transmission accidents [10, 11]. Such accidents lead to very serious consequences such as large-scale power outages and line damage, which adversely affect the stable operation of the power system and cause serious economic losses to society [12, 13].
Therefore, transmission lines need to be checked regularly. However, the terrain across China varies greatly, and many sections of the transmission lines pass through mountains, rivers, and communication blind spots, which makes inspection and maintenance very inconvenient [14]. At present, traditional manual maintenance methods can hardly meet the routine maintenance requirements of high-voltage transmission lines [15]. This outdated approach also brings great hidden dangers to the safe operation of the power grid [16, 17]. Grid companies in developed regions at home and abroad use helicopters or drones to check transmission lines. This method relies on advanced equipment and has the advantages of high efficiency, speed, stability, and safety. It has gradually become an important supplementary method for the inspection of transmission lines in China [18].
In recent years, deep learning has achieved very good results in the field of image recognition and detection [19]. Priyadarshini proposed an image recognition algorithm for the semantic segmentation of cracks and leaks in subway shield tunnels based on the feature layers of a fully convolutional network (FCN). A self-developed mobile tunnel inspection image acquisition device (MTI-200a) was used to collect the defect images for the training and test data sets. After the image data set was established, separate FCN models for cracks and leaks were trained through multiple iterations of forward inference and backward learning [20]. With the corresponding FCN models, a two-stream algorithm performs the semantic segmentation of the defect images [21]. Cracks are identified in one stream through a sliding-window assembly operation, and leaks are identified in the other stream through a resizing interpolation operation. Compared with the commonly used region growth algorithm (RGA) and adaptive threshold algorithm (ATA), this method has obvious advantages in recognition results, inference time, and error rate. The comparison covers images with two non-overlapping defects (TDN) and images with two overlapping defects (TDO). This method can quickly and accurately identify defects in the structural health monitoring and maintenance of subway shield tunnels [22, 23].
Over the past few years, interest in motion and gesture recognition has increased dramatically. Tao reviewed current deep learning methods for motion and gesture recognition in image sequences. The review introduced a taxonomy that summarizes the important aspects of deep learning for these two tasks and covered the details of the proposed architectures, fusion strategies, main data sets, and competitions. It summarized and discussed the main work proposed so far, paying special attention to how the time dimension of the data is handled, discussing the main characteristics of each approach, and identifying opportunities and challenges for future research [24, 25].
Due to the unrestricted marine environment, underwater target recognition is a challenging task [26]. As the amount of data increases, deep learning methods have been successfully applied in aerial target image recognition. However, Tim’s research shows that deep neural networks (DNNs) are susceptible to overfitting on small samples. Underwater image acquisition often requires a lot of manpower and material resources, and it is difficult to obtain enough sample images for DNN training. In addition, images taken by underwater cameras are often degraded by noise. Taking live fish recognition as an example, Tim proposed an underwater image recognition framework for small samples. First, an improved median filtering method is used to suppress the noise in the fish images. Then, a convolutional neural network is pre-trained on images from ImageNet, the world’s largest image recognition database. Finally, the preprocessed fish images are used to fine-tune the pre-trained network and test the classification performance. The experimental results show that this method can effectively identify fish species and provides an effective solution to recognition with small samples [27, 28].
This article mainly studies identification and detection methods for power components in transmission lines. Starting from sample preprocessing and improvements to the deep learning network framework, target detection is performed on key equipment in aerial images from power inspection to improve the detection rate and detection accuracy. The main work is as follows. First, current identification technology, the identification of key transmission line equipment, and the state of deep learning research are introduced, and the advantages and disadvantages of some existing methods are analyzed. Then, image sample preprocessing methods are discussed and, in combination with the actual project requirements, an innovative self-cropping algorithm is proposed to expand the sample set more effectively. Next, existing deep learning algorithms are improved to raise the recognition accuracy for transmission line components. Finally, the improved algorithm proposed in this paper is compared with other deep learning detection methods, and the experimental results are presented and analyzed.
Proposed method
Traditional target detection method
Traditional target detection methods usually use a sliding window to obtain a large number of potential candidate regions [29]. The candidate regions generated by the sliding window have a fixed size, and their large number places a heavy burden on subsequent calculations [30]. In addition, for feature extraction, traditional object detection algorithms rely on many complicated hand-designed features, which are often not robust on complex and changeable data sets. Deep learning is more general, so this chapter mainly discusses related research on object detection based on deep learning [31].
Object recognition and localization is an important research direction in machine vision and pattern recognition. The subject involves several research topics, such as data processing, feature extraction, and classifiers. Data must be preprocessed before it enters the network; preprocessing covers the operations applied to the image before recognition, such as denoising, cropping, brightness transformation, and normalization [32]. After preprocessing, feature extraction is performed. There are many traditional feature extraction methods, such as SIFT and SURF for local feature extraction, Haar and LBP for face feature extraction [33], and HOG for pedestrian detection. The role of the classifier is to classify the extracted features. Common classifiers include SVM and Bayes classifiers [34].
However, in practical applications, image backgrounds are diverse, object shapes vary, and the camera viewing angle changes a lot, all of which make object detection more difficult. In recent years, deep learning has made rapid progress in the field of security, especially in object recognition, detection, and tracking. This chapter mainly studies how to use the results of deep learning in computer vision to detect targets on transmission lines. Generally speaking, object recognition and positioning consists of two parts. First, the positions of all foreground targets in the scene are found to obtain the candidate frames of the targets; second, the category of the object in each candidate frame is determined. As shown in Fig. 1, this process is target detection based on deep learning algorithms [35].

Fig. 1 Deep learning-based object detection steps.
Deep learning-based object detection algorithms usually have four steps.
The first step is to perform image preprocessing on the input image or video frame, including: normalizing the data to reduce the difference between different pictures; and performing noise reduction, rotation, and scaling operations on the image.
The second step is to generate candidate frames that potentially contain targets by sliding frames or RPN networks.
The third step is feature extraction to obtain the feature vector, and the fourth step is to use the feature vector to classify the candidate frames.
The order of the second and third steps directly affects the computational cost of the whole model. An image contains many objects of different sizes and shapes, so if the candidate frames are generated first, a general frame selection method has to traverse the entire image and produces many redundant frames. Performing feature extraction on each of these redundant frames separately greatly increases the computation. The idea of the RPN network is therefore to first extract features with a convolutional network, then share those features and perform sliding frame selection on the extracted feature layer. Common candidate region methods include selective search, edge boxes, and the RPN based on convolutional neural networks.
Convolutional neural networks were first used for feature extraction in target detection in the R-CNN network. Ross Girshick proposed R-CNN, which combines selective search (SS) with a CNN and opened up a new way of solving target detection with deep learning. In this method, R-CNN uses the selective search algorithm to extract 2000 candidate frames from the original image. Each candidate frame is then scaled to a fixed size and input to a CNN to extract features, and finally the extracted features are input to an SVM classifier for classification. The later SPP-Net, Fast R-CNN, Faster R-CNN, and other networks follow a similar basic process; they differ in the region proposal extraction method and in the order of region selection and feature extraction.
At present, deep learning-based object detection methods are mainly divided into two categories. The first category is one-stage methods, such as SSD and YOLO. The second category is two-stage methods, mainly represented by Faster R-CNN.
(1) Framework introduction of Faster R-CNN
The framework flow of Faster R-CNN is shown in Fig. 2. It is mainly composed of two parts. The first part is the RPN network, which is responsible for generating the coordinates of the candidate boxes and their foreground scores. The second part is the Fast R-CNN detection module, which is responsible for detection. In the overall structure, a front convolutional network extracts features, and the RPN network and the Fast R-CNN detection network that follow share these features. The RPN network generates candidate boxes of different sizes at the anchor points of each feature map. The detection network examines these candidate regions and identifies the target category in each of them.

Fig. 2 Introduction to Faster R-CNN framework.
(2) Regional generation network of Faster R-CNN
The innovation of the Faster R-CNN network lies in the RPN network. Detection networks before it first used the SS algorithm to select candidate frames. For example, the SPP-Net and Fast R-CNN networks both improve on the traditional sliding-window detection network and raise detection efficiency, but both take a long time to generate candidate region boxes. The region generation network (RPN), in contrast, shares the features generated by the convolutional network, so generating candidate regions consumes very little time. The basic idea of the region generation network is shown in Fig. 3.

Fig. 3 Faster R-CNN generated anchor schematic.
The network generates anchors at each anchor point on the feature map and performs bounding-box regression and foreground/background classification on them. Anchors are candidate regions defined in advance: each anchor point on the feature map generates 9 candidate regions from three areas {128 × 128, 256 × 256, 512 × 512} and three length-to-width ratios {1:1, 1:2, 2:1}.
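The nine anchors per anchor point can be illustrated with a minimal sketch. The function name `generate_anchors`, the `(x1, y1, x2, y2)` box layout, and the interpretation of each ratio as height divided by width are assumptions made for this example; they are not taken from the paper's implementation.

```python
import math

def generate_anchors(cx, cy,
                     areas=(128 ** 2, 256 ** 2, 512 ** 2),
                     ratios=(1.0, 0.5, 2.0)):
    """Return 9 anchor boxes (x1, y1, x2, y2) centred on (cx, cy).

    Each ratio is treated as height / width (an assumption), and every
    anchor keeps the requested area.
    """
    anchors = []
    for area in areas:
        for ratio in ratios:
            w = math.sqrt(area / ratio)  # width chosen so that w * h == area
            h = w * ratio                # height follows from the ratio
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

# Example: the 9 anchors around the image position (300, 200).
anchors = generate_anchors(cx=300, cy=200)
```

In a full RPN this would be repeated for every position of the feature map, with each position mapped back to image coordinates through the network stride.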
The region generation network has a fully convolutional structure, which allows the entire network to be trained end to end. It operates as follows. First, anchors are generated and bounding-box regression is applied to all of them; the anchors generated here are exactly the same as during training. Second, the anchors are sorted by their foreground scores and the top 6000 are kept, that is, the foreground candidate frames after position correction. Third, the candidate anchors are mapped back to the original image, and candidate frames that extend far beyond the image boundary are removed. Finally, non-maximum suppression is performed: the remaining target boxes are sorted by score, and the top 300 are output by the RPN network.
The RPN network outputs the coordinates of the final foreground boxes and their probability scores.
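The sorting and suppression steps described above can be sketched in a few lines. This is an illustrative greedy non-maximum suppression, not the paper's code: the names `iou` and `select_proposals` are hypothetical, the IoU threshold of 0.7 is an assumed value that the text does not state, and the boundary check is omitted for brevity.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_proposals(boxes, scores, pre_nms_top_n=6000,
                     post_nms_top_n=300, iou_threshold=0.7):
    """Keep the best-scoring boxes, suppress overlaps, return kept indices."""
    # Sort by score and keep at most pre_nms_top_n candidates.
    order = sorted(range(len(boxes)), key=lambda i: scores[i],
                   reverse=True)[:pre_nms_top_n]
    keep = []
    for i in order:
        # A box survives only if it does not overlap a better box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if len(keep) == post_nms_top_n:
                break
    return keep

# Toy example: the second box overlaps the first heavily and is suppressed.
boxes = [(0, 0, 100, 100), (5, 5, 105, 105), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
kept = select_proposals(boxes, scores)
```

The quadratic loop is adequate for illustration; production detectors use vectorised or GPU implementations of the same idea.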
(3) Loss function design
The loss during network training consists of two parts: the loss of the regression position and the classification loss. The total loss function can be expressed as:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right) \tag{1}$$

where N is the number of matches between the default boxes and the real boxes, α is the weight factor and is generally set to 1, c is the confidence of each class, and l and g are the parameters of the default box and the real box, including the center position coordinates and the width and height. $L_{conf}(x, c)$ is the classification confidence loss, which uses the multi-class Softmax loss. $L_{loc}$ regresses the center position and the width and height of the bounding boxes and uses the SmoothL1 loss, which is calculated as follows:

$$\mathrm{SmoothL1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \tag{2}$$
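As an illustration of how the two loss terms combine, the sketch below assumes the standard SmoothL1 definition (quadratic for |x| < 1, linear otherwise) and a total loss normalised by the number of matched boxes. The function names are hypothetical, and the classification loss is passed in as a precomputed number rather than derived from Softmax outputs.

```python
def smooth_l1(x):
    """SmoothL1(x) = 0.5 * x**2 if |x| < 1, else |x| - 0.5."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5

def localization_loss(pred, target):
    """Sum SmoothL1 over the (cx, cy, w, h) offsets of one matched box."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))

def total_loss(conf_loss, loc_loss, num_matched, alpha=1.0):
    """Combine both terms as (L_conf + alpha * L_loc) / N.

    Returning 0 when nothing is matched is a convention assumed here.
    """
    if num_matched == 0:
        return 0.0
    return (conf_loss + alpha * loc_loss) / num_matched

# Example: one box whose predicted offsets differ by 0.5 and 2.0 from the target.
loc = localization_loss(pred=(0.5, 0.0, 0.0, 2.0), target=(0.0, 0.0, 0.0, 0.0))
loss = total_loss(conf_loss=2.0, loc_loss=loc, num_matched=2)
```

The quadratic region keeps gradients small near zero error, while the linear region limits the influence of badly mislocalised boxes.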
Aerial images are very large, so deep learning object detection frameworks usually compress them [36], and an image is generally compressed before it is input into the framework.
A typical SSD training sample size is 300×300. The insulator aerial images, however, are often ultra-high-definition pictures with very many pixels, so the pictures must be compressed to improve the training speed. This article compresses the original image to 1024×1024. Here, image compression means shrinking the original image into a new image at a specified ratio.
If the images are scaled directly, the coordinates of many points are calculated as decimals. Therefore, this paper uses an interpolation algorithm to approximate the pixel values at those coordinates. Interpolation algorithms usually include nearest neighbor interpolation, bilinear interpolation, and higher-order interpolation. Nearest neighbor interpolation is the simplest: the output pixel value is the pixel value of the nearest sampling point in the input image. Higher-order interpolation preserves more detail of the original image than nearest neighbor interpolation but requires a longer calculation time. Bilinear interpolation, the algorithm used in this paper, sets the output pixel value to the weighted average of the gray values of the nearest neighboring sample points in the input image [37]. Because processing large batches of samples takes a lot of time, this paper adopts the bilinear interpolation compression method.
The steps for bilinear interpolation are as follows:
(1) Calculate the scaling factor
First, calculate the horizontal compression factor $H_f$ and the vertical compression factor $V_f$. Assuming that the pixel point before scaling is P0(x0, y0), the corresponding pixel point of the scaled image is P1(i, j). The scaling formula is as in formula (3):

$$H_f = \frac{\mathrm{srcWidth}}{\mathrm{dstWidth}}, \qquad V_f = \frac{\mathrm{srcHeight}}{\mathrm{dstHeight}} \tag{3}$$

where srcWidth and srcHeight represent the width and height of the original image, and dstWidth and dstHeight represent the width and height of the new scaled image.
(2) Calculate the target point in the original image
Since the coordinates of the scaled points are not necessarily integers after they are mapped back to the original image, the original pixel coordinates P0(x0, y0) corresponding to each scaled pixel are obtained first.
(3) Calculate the coordinates of the 4 points closest to P0
The third step needs to calculate the coordinates (x1, y1), (x1, y2), (x2, y1), (x2, y2) of the four pixels closest to P0(x0, y0) in the original image.
(4) Calculate the weights
(5) Calculate the pixel value at point P0
Applying the above formulas to every sample picture in turn yields the compressed images.
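The five steps can be sketched for a single-channel image stored as a list of rows. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `bilinear_resize` is hypothetical, the mapping x0 = i · H_f and y0 = j · V_f aligns the top-left pixels of both images, and the weights are taken as the fractional distances from P0 to its upper-left neighbour.

```python
import math

def bilinear_resize(src, dst_width, dst_height):
    """Resize a grayscale image (list of rows) with bilinear interpolation."""
    src_height, src_width = len(src), len(src[0])
    # Step 1: scaling factors.
    h_f = src_width / dst_width
    v_f = src_height / dst_height
    dst = []
    for j in range(dst_height):
        row = []
        for i in range(dst_width):
            # Step 2: map the output pixel back into the source image.
            x0 = min(i * h_f, src_width - 1)
            y0 = min(j * v_f, src_height - 1)
            # Step 3: the four nearest source pixels, clamped at the border.
            x1, y1 = int(math.floor(x0)), int(math.floor(y0))
            x2 = min(x1 + 1, src_width - 1)
            y2 = min(y1 + 1, src_height - 1)
            # Step 4: weights are the fractional offsets from (x1, y1).
            wx, wy = x0 - x1, y0 - y1
            # Step 5: weighted average of the four neighbours.
            top = (1 - wx) * src[y1][x1] + wx * src[y1][x2]
            bottom = (1 - wx) * src[y2][x1] + wx * src[y2][x2]
            row.append((1 - wy) * top + wy * bottom)
        dst.append(row)
    return dst

# Example: shrink a 4x4 gradient image to 2x2.
small = bilinear_resize([[0, 10, 20, 30],
                         [40, 50, 60, 70],
                         [80, 90, 100, 110],
                         [120, 130, 140, 150]], dst_width=2, dst_height=2)
```

For real images an optimised library routine would normally be used; the pure-Python loops here only make the steps explicit.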
This article does not crop all the sample pictures when processing the samples, so the proportion of the picture occupied by the target must be judged first. Targets are divided into small, medium, and large targets according to whether the target occupies less than 5%, between 5% and 12%, or more than 12% of the picture. Only pictures classified as small-target or medium-target are cropped; pictures containing large targets are not cropped.
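The size rule can be written as a small helper. The sketch assumes that the proportion is the bounding-box area divided by the image area, which the text does not state explicitly, and the names `target_scale` and `needs_cropping` are hypothetical.

```python
def target_scale(box, image_width, image_height,
                 small_max=0.05, large_min=0.12):
    """Classify a target as 'small', 'medium' or 'large'.

    The proportion is taken as box area / image area (an assumption).
    `box` is (x1, y1, x2, y2) in pixels.
    """
    x1, y1, x2, y2 = box
    ratio = ((x2 - x1) * (y2 - y1)) / (image_width * image_height)
    if ratio < small_max:
        return "small"
    if ratio <= large_min:
        return "medium"
    return "large"

def needs_cropping(box, image_width, image_height):
    """Only pictures with small or medium targets are cropped."""
    return target_scale(box, image_width, image_height) != "large"

# Example: a 300 x 300 pixel target in a 1000 x 1000 image covers 9%.
scale = target_scale((0, 0, 300, 300), 1000, 1000)
```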
The details of the sample pretreatment process used in the actual experiments in this paper are shown in Fig. 4 below:

Fig. 4 Sample expansion and preprocessing process.
First, the size of the target frame is calculated, and the pictures with small and medium targets are cropped. The cropped images and the original images then undergo conventional operations such as image compression, brightness conversion, and horizontal flipping. The expanded pictures and the original picture set together form the training sample library.
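Two of the conventional operations mentioned above, horizontal flipping and brightness conversion, can be sketched as follows. The functions are illustrative only: images are grayscale lists of rows, boxes use inclusive pixel coordinates `(x1, y1, x2, y2)`, and the brightness factor is an assumed parameter, since the paper does not give the values it used.

```python
def horizontal_flip(image, boxes):
    """Mirror an image left to right and move its boxes accordingly."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    # Pixel column x moves to width - 1 - x, so the box edges swap sides.
    new_boxes = [(width - 1 - x2, y1, width - 1 - x1, y2)
                 for x1, y1, x2, y2 in boxes]
    return flipped, new_boxes

def adjust_brightness(image, factor):
    """Scale pixel values by `factor` and clip to the 8-bit range [0, 255]."""
    return [[min(255, max(0, round(p * factor))) for p in row]
            for row in image]

# Example: flip a 2 x 4 image whose target covers the two left-most columns.
flipped, flipped_boxes = horizontal_flip([[1, 2, 3, 4], [5, 6, 7, 8]],
                                         [(0, 0, 1, 1)])
brighter = adjust_brightness([[100, 200]], factor=1.5)
```

Keeping the annotation boxes in step with the pixels is the essential point: an augmented image with stale boxes would teach the detector wrong locations.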
Experimental data preparation
The data set in this article is mainly based on photos taken during sampling inspections. The photos cover the four seasons of spring, summer, autumn, and winter; the shooting locations are diverse and the resolution is high. This article focuses on the identification and defect diagnosis of transmission line components and chooses five categories: pressure-equalizing ring 1, pressure-equalizing ring 2, complete vibration hammer, bad vibration hammer, and bird’s nest. Model training and sampling use the default values. There are 1000 training samples in each category, for a total of 5000 training samples, and the image size of each sample is 6000×4000.
Experimental data set processing
First, the training samples are uniformly scaled to 1200×800. Then each complete power component shown in each training image is annotated with a bounding frame; the frame coordinates are recorded and a classification label is provided.
For the test set, 500 images of each category were selected, and the test set does not contain any training samples. When identifying and classifying the test set, an identification counts as successful only if every electrical component of a category included in the training samples (including incomplete components) is marked, classified, and scored.
Experimental model training
This paper is based on the MXNet framework and uses Faster R-CNN to implement network model training. The VGG16 network and the ResNet-101 network, both pre-trained on ImageNet, were used for initialization to obtain two models with different numbers of network layers. The data set of Section 2.2 is used as the training sample. For each model, the number of training rounds is 20, the batch size is 128, the learning rate is 0.001, the weight decay rate is 0.0005, and the numbers of candidate regions before and after non-maximum suppression are 6000 and 300, respectively. In this experiment, the correct rate, recall rate, and missed recognition rate were used as the criteria for evaluating the quality of the model. The correct rate is the ratio of the number of correctly identified targets to the number of all identified targets. The recall rate is the ratio of the number of correctly identified targets to the number of all sample targets. The missed recognition rate is the ratio of the number of unrecognized targets to the number of all sample targets.
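The three evaluation criteria can be computed from three counts, as in the sketch below. It assumes that every sample target is either correctly identified or counted as unrecognized, so that the missed recognition rate equals one minus the recall rate; the function name and the numbers in the example are made up for illustration and are not results from the paper.

```python
def detection_metrics(num_correct, num_detected, num_targets):
    """Return (correct rate, recall rate, missed recognition rate).

    num_correct:  targets that were identified correctly
    num_detected: all targets the model reported
    num_targets:  all annotated targets in the samples
    """
    correct_rate = num_correct / num_detected if num_detected else 0.0
    recall_rate = num_correct / num_targets if num_targets else 0.0
    # Assumption: every target that is not correctly identified is a miss.
    missed_rate = ((num_targets - num_correct) / num_targets
                   if num_targets else 0.0)
    return correct_rate, recall_rate, missed_rate

# Hypothetical counts: 480 detections, 450 of them correct, 500 real targets.
correct_rate, recall_rate, missed_rate = detection_metrics(450, 480, 500)
```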
Discussion
Comparison of model recognition effect
In the experiment, the test set in Section 3.3 was used to test the two models obtained in Section 3.3. The accuracy rate, recall rate, and missed recognition rate were used as the criteria for evaluating the quality of the models. The results of applying the two models to the identification of transmission line components are shown in Tables 1 and 2.
Table 1 VGG16 model verification results
Table 2 ResNet-101 model verification results
The accuracy of the five types of targets detected under two different models is shown in Fig. 5.

Fig. 5 Accuracy of 5 types of targets detected under two different models.
According to the experimental results shown in Tables 1 and 2 and Fig. 5, the recall rate, correct rate, and missed recognition rate of the 5 types of targets detected under the 2 models can be obtained. The ResNet-101 model has the highest recall rate and the highest correct rate, followed by the VGG16 model, and it also has the lowest missed recognition rate. It can be seen from Tables 1 and 2 that the recognition effect for pressure-equalizing ring 1, pressure-equalizing ring 2, and the bird’s nest is far superior to that for the vibration hammer. The reasons may be as follows: (1) the structural characteristics of the vibration hammer are not distinctive enough; (2) the training sample data for the vibration hammer are insufficient, so the model’s generalization ability is not good enough.
According to the experimental results in Tables 1 and 2, to address the unsatisfactory recognition of the vibration hammer and to optimize the recognition time of a single picture, the best model, ResNet-101, is selected for further experiments. The experiments modify the convolution kernel of its convolution structure, mainly the size of the first-layer convolution kernel. According to the content in Section 1, different convolution kernel sizes affect recognition accuracy and recognition time.
The size of the convolution kernel in the first convolution layer of the ResNet-101 network is 7. Following this experimental idea, the size of the convolution kernel is gradually reduced for model training, with parameter settings consistent with those in Section 3.3. The trained models are then tested on the samples, with the recall rate and recognition time as experimental indicators. The number of all samples was 500. The experimental results are shown in Table 3.
Table 3 Recall rates for different convolution kernel sizes
The recall rate of the detected 5 types of targets at different convolution kernel sizes is shown in Fig. 6.

Fig. 6 Recall rates for different convolution kernel sizes.
According to Fig. 6, different convolution kernel sizes affect the detection accuracy: as the size of the convolution kernel decreases, the recall rate continues to decrease. Large kernels have large receptive fields, and convolution kernels with large receptive fields achieve high recognition accuracy. As the size of the convolution kernel decreases, the recognition time of a single picture also decreases. This is because the number of parameters differs between kernel sizes: a small convolution kernel has few parameters, and a large convolution kernel has many. Therefore, the model can be optimized by adjusting the size of the convolution kernel to meet actual needs, depending on whether a higher recall rate or a lower recognition time is required.
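The relation between kernel size and parameter count can be made concrete with a short calculation. The sketch assumes a first convolution layer with 3 input channels (RGB) and 64 output channels, as in the standard ResNet design, and counts weights only; the kernel sizes 5 and 3 are illustrative choices, since the sizes actually tested are listed in the paper's Table 3.

```python
def conv_weight_count(kernel_size, in_channels=3, out_channels=64):
    """Number of weights in one square convolution layer (bias ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels

# Weight counts for a 7x7 first layer and two smaller alternatives.
counts = {k: conv_weight_count(k) for k in (7, 5, 3)}
for k, n in counts.items():
    print(f"{k}x{k} kernel: {n} weights")
```

Because the number of multiply-accumulate operations per output position grows with the same k × k factor, a smaller kernel reduces both the parameter count and the per-image computation, at the cost of a smaller receptive field.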
In this experiment, the images of the five different components in the training set in Section 3.2 were rotated by 90°, 180°, and 270°, respectively; together with the original images, a total of 20,000 training samples were obtained. This set is used to train the ResNet-101 model, with training parameter settings consistent with those in Section 3.3. The trained model is then used to test the test set in Section 3.2. The recall rate, correct rate, and missed recognition rate for the images of the 5 different components are shown in Fig. 7.
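A rotation-based expansion has to rotate the annotation boxes together with the pixels. The sketch below shows one way to do this for quarter turns; it is illustrative rather than the authors' code, uses grayscale lists of rows with inclusive pixel-coordinate boxes `(x1, y1, x2, y2)`, and rotates clockwise, a direction the paper does not specify.

```python
def rotate90(image, boxes):
    """Rotate an image 90 degrees clockwise together with its boxes."""
    height = len(image)
    # Reversing the rows and transposing gives a clockwise quarter turn.
    rotated = [list(col) for col in zip(*image[::-1])]
    # Pixel (x, y) moves to (height - 1 - y, x).
    new_boxes = [(height - 1 - y2, x1, height - 1 - y1, x2)
                 for x1, y1, x2, y2 in boxes]
    return rotated, new_boxes

def rotation_variants(image, boxes):
    """Return the 90, 180 and 270 degree copies of one annotated sample."""
    variants = []
    for _ in range(3):
        image, boxes = rotate90(image, boxes)
        variants.append((image, boxes))
    return variants

# Example: a 2 x 3 image whose target is the single pixel with value 6.
variants = rotation_variants([[1, 2, 3], [4, 5, 6]], [(2, 1, 2, 1)])
```

Each original sample yields three extra samples, which is how 5000 images grow to 20,000.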

Fig. 7 ResNet-101 model verification results for 5 types of targets after data expansion.
According to the experimental results in Table 4 and Fig. 7, the model trained on the expanded data set achieves a better recall rate for every category than the model trained before expansion, and its correct rate and missed recognition rate are also better. Expanding the data set by rotation transformation can therefore improve the recognition accuracy.
Table 4 ResNet-101 model verification results after data expansion
In the experimental part, this paper experimentally verifies the self-cropping algorithm proposed earlier in this article. The self-cropping algorithm crops the sample picture into one or more small pictures according to the position of the target frame annotated in the sample. In this algorithm, the crop coefficient K is calculated from the annotation information in the image’s XML file. The coefficient K is compared with the threshold Q to determine whether the original image is cut into one or two independent small images, so the value of the threshold Q directly affects the experimental results. In this set of comparative experiments, the basic network is ResNet-101 + FPN-SSD, and the number of iterations is set to 100,000. The experimental results are shown in Fig. 8.

Fig. 8 Experimental results at different thresholds Q.
It can be seen from Fig. 8 that the experimental results first increase and then decrease as the threshold changes. The reason is that when the threshold is increased, more small images of small-scale targets are cropped. However, because the detection samples are spread over different scales, improving the recognition accuracy for small-scale targets does not necessarily improve it for targets at other scales. The final experimental results show that a threshold Q of 0.24 works best [38].
Damage to transmission lines by external forces seriously threatens the safe and stable operation of the power system. Using the existing power grid video surveillance equipment to develop a transmission line system that can automatically identify targets and raise alarms is an urgent task at present, and it is also an important part of the country’s strategy for implementing smart grids. For the specific scenario of transmission line component identification and defect detection, this paper uses the Faster R-CNN network structure and mainly studies the recall rate and detection time of different network models.
In this paper, the two mainstream target detection algorithms Faster R-CNN and SSD are introduced, their advantages and disadvantages are analyzed, and directions for improvement and optimization are proposed. The semantic segmentation method Mask R-CNN is also introduced and applied to the detection of string drop of insulators. Based on the previous research, an object detection framework based on FPN-SSD is proposed. The framework is built on the SSD algorithm and adds an FPN feature pyramid structure. This structure effectively improves the fusion of context and semantic information, and its multi-scale design improves the accuracy of small target recognition. In addition, the feature extraction network was replaced with the ResNet-101 network. Finally, the average accuracy of the algorithm on this sample set reached 89.3%.
This article targets only 5 common types of transmission line components, but deep learning-based algorithms are general, and more categories can be added in later research. The sample pictures used in this paper come from aerial inspection photos taken by power grid drones, but the environment of transmission lines varies greatly, so sample pictures with different backgrounds and different lines need to be added to verify the effectiveness of the algorithm. Moreover, this article mainly focuses on detecting transmission line components in aerial photographs and does not study component failure detection in depth. In later work, fault detection can be carried out on the basis of target detection.
Declarations
Ethical Approval and Consent to participate: Approved.
Consent for publication: Approved.
Availability of supporting data: We can provide the data.
Competing interests
There are no potential competing interests in our paper. All authors have seen the manuscript and approved its submission to your journal. We confirm that the content of the manuscript has not been published or submitted for publication elsewhere.
Funding
This work was supported by Key Project of Natural Science Basic Research Plan in Shaanxi Province of China (Grant No. 2018ZDXM-GY-169); Key project of Natural Science Basic Research Plan in Shaanxi Province of China (Grant No. 2019ZDLGY18-03).
Authors’ contributions
All authors took part in the discussion of the work described in this paper.
Acknowledgments
The authors thank the editor and anonymous reviewers for their helpful comments and valuable suggestions.
