Abstract
Traditional methods for detecting damage to electric power line insulators are often limited in accuracy by insufficient data. To improve insulator detection, this paper proposes a semi-supervised object detection method that combines a two-stage proposal-connection detection net (TPD-Net) with an enhanced-structure generative adversarial network (En-GAN); the combination is termed the noise self-training insulator-defect detection network (NS-IDNet). First, En-GAN synthesizes a large number of class-balanced samples, controlled by a coefficient, to serve as unlabeled data. Second, a teacher model based on TPD-Net is trained on the labeled data and then used to predict labels for the unlabeled data. Finally, noise self-training is conducted: the student and teacher models are trained repeatedly until convergence, with noise introduced into both the models and the samples. Results on the test set of the original dataset show that NS-IDNet outperforms the traditional supervised model, and comparative experiments show that its diagnostic accuracy is also higher than that of traditional semi-supervised models.
Introduction
In the power system, electricity transmission is the critical link between power generation on the grid side and electricity consumption on the user side. The safety of power transmission directly affects the stability of the power system and the quality of the electricity supplied. Therefore, to ensure that electrical energy reaches end users safely and reliably, transmission lines require a series of inspection and protective measures, including fault monitoring [1]. Traditional fault detection relies mainly on manual field surveys, in which faults and anomalies are judged from experience and recorded and reported in real time. However, manual methods are time-consuming, labor-intensive, and inefficient. In recent years, with the development and popularization of drone technology [2], traditional inspection and monitoring tasks have gradually been taken over by drones. Power transmission lines are often installed in complex terrain in the wild, where the surrounding environment is uncontrollable, which makes manual inspection risky; small, flexible drones avoid this risk. A drone only needs to be flown along a designated route and photograph predetermined targets to collect the required information, which greatly improves inspection efficiency and reduces costs.
Drone inspection tasks include defect detection of insulators on transmission lines [3]. Insulators are essential equipment in the power system: they support and fix the transmission lines and prevent current from flowing to earth, ensuring safe and stable operation. Because of their material properties and the unstable environments in which they operate, insulators often develop cracks, damage, and contamination, which can reduce performance or cause failure and thereby introduce fault risks into the power system. It is therefore necessary to detect and locate insulator defects so that the affected insulators can be replaced during subsequent maintenance. Insulator fault detection is technically demanding and is influenced by several factors. First, environmental conditions can significantly affect detection accuracy; for instance, humidity, temperature, and pollutants may either mask or mimic insulator defects. Second, the power system operates continuously, so inspections must be conducted without power outages, which adds to the complexity of detection. Lastly, the quality of the detection algorithm affects the outcome; a superior insulator detection algorithm is key to effective insulator inspection.
Related work
Before the rise of drone technology, insulator detection was mainly carried out by measuring electrical parameters of the insulators, such as resistance, current, and dielectric constant [4], to judge insulation performance and loss. However, the accuracy of this method is generally low. As drone-borne target detection and infrared imaging technologies mature, they are increasingly applied to insulator detection, and detection algorithms based on machine learning and deep learning have begun to replace traditional methods. Extensive research has been carried out in this field both domestically and internationally. Reference [5] proposes an insulator detection algorithm based on a GAN model that uses a generator and multiple discriminators to adapt to complex detection tasks and scenarios; the model outperforms other algorithms in the resolution and quality of generated images and in detection position. Reference [6] considers complex environmental backgrounds, small insulator sizes, and inconspicuous faults, and proposes a fault detection network for railway insulators based on convolutional neural networks; by cascading a detection network and a fault classification network, it improves the accuracy of fault classification. Reference [7] proposes an auto-burst insulator detection algorithm based on an improved YOLO v4, which uses hybrid data augmentation to increase the number and diversity of defective auto-burst insulator samples and combines a channel attention mechanism with YOLO v4 to improve feature extraction and detection accuracy for auto-burst insulators.
Reference [8] uses the Zeiler Fergus (ZF) network to extract features from insulator images and optimizes anchor selection with k-means clustering. A nonlinear penalty factor is also introduced to handle multi-scale, overlapping, and occluded insulators. Detection is then performed with a region convolutional neural network (RCNN) model; experimental results show that the method accurately obtains the bounding-box coordinates of insulator objects and the corresponding probability values, and that the average accuracy is improved by 10.43%. Reference [9] adopts a two-stage data augmentation strategy: the first stage is based on image combination, and the second stage includes random affine transformation, Gaussian blur, and brightness and contrast adjustments. A densely connected convolutional network (DenseNet) is introduced for detection, which ensures robustness and accuracy in the insulator detection process.
Building on existing research, this paper proposes a semi-supervised automatic detection algorithm for insulator defects based on an improved GAN. It targets the insufficient detection accuracy caused by the scarcity of image features of defective insulators and by interference from background environmental noise. The algorithm improves feature extraction capability and achieves precise localization and identification of defects. Experiments show that the proposed algorithm offers better model parameter settings and accuracy and has practical application value (Figure 1).
Figure 1. Insulator damage dataset.
Methods
Dataset
One of the main factors currently limiting insulator fault detection in the electrical power sector is the lack of data. The raw data used in this study come from the CPLI dataset on GitHub, available at https://github.com/InsulatorData/InsulatorDataSet. The full name of this dataset is “Insulator DataSet Chinese Power Line Insulator Dataset”; it includes 600 images of insulators under normal working conditions and 248 images of insulators with drop-out faults. The resolution of the images is 1152 × 862. Classifying and detecting insulators requires a large number of relevant images for analysis. As shown in Figure 1, the images contain different backgrounds such as towers, rivers, skies, grasslands, and farmlands, covering most scenarios encountered during image collection. Analysis of the 848-image dataset shows that the shape and curvature of faulty insulators differ from those of normal insulators, and that these faults are often caused by self-explosion or skirt detachment. The location of an insulator fault is variable, and a fault typically appears as the absence of a single shed. The categories of the dataset are shown in Table 1.
Figure 2. Network architecture of CNN1.
Table 1. Dataset categories.
TPD-Net
The two-stage detection net with a proposal connection module [10] is an object detection algorithm based on CNNs and proposal connection. It achieves efficient and accurate target detection through its model structure and optimization strategies. The detection net used in this paper is this two-stage detection net with a proposal connection module (TPD-Net), which offers high detection accuracy with a small model size. The algorithm is divided into two stages: extraction [11] and proposal. In the first stage, the input image is passed through a CNN to extract feature maps, and candidate regions are extracted through selective search or a region proposal network. During candidate-region feature extraction, the proposal connection module improves on the proposal selection of the traditional two-stage detection net: excluding background proposals allows the second-stage network to focus on classifying the foreground, which increases detection accuracy. Because the insulators in the dataset are tightly arranged and non-overlapping, a frameless encoding is chosen as the first-stage network encoding to lighten the network structure. In the second stage, the feature maps are further processed to connect the proposals of the predicted targets. Specifically, for feature maps of different scales, the proposal connection module aggregates prior detection results to determine their connectivity, thereby constructing the complete target. Finally, the optimized model is applied to test images to obtain the predicted detection results.
The structure of CNN1 is shown in Figure 2. The purpose of CNN1 is to locate the damaged area of the insulator in panoramic images, similar to a single-stage detector. Network depth, width, and structure affect the model’s receptive field, the number of features, and ultimately the learning outcome, so they should be chosen to suit the task. CNN1 uses a structure with downsampling and upsampling [12]. The downsampling part obtains the smallest feature maps, with a width of 512, through multi-level downsampling to ensure a sufficient receptive field. To avoid information loss, a residual structure is used to broaden the model’s perception range. The upsampling part combines upsampling with feature map fusion to implement multi-scale learning. This design reflects the requirements of the detection task and can effectively locate the insulator area. In practice, network parameters can be adjusted and the detection effects of different settings analyzed experimentally to obtain the best network structure. During the training of CNN1, heat maps are decoded to obtain foreground candidate regions, and only accurately localized regions are retained for the next stage. Additionally, small random shifts are applied to the foreground positions to increase sample diversity. In this study, the shifts are kept within a specific range, which expands the sample distribution and further improves the model’s generalization ability.
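The heat-map decoding and random-shift step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `decode_heatmap` and `jitter_proposal`, the 3 × 3 local-maximum rule, the score threshold, and the shift range are all assumptions chosen for illustration.

```python
import random


def decode_heatmap(heatmap, threshold=0.5):
    """Return (row, col, score) for cells that are 3x3 local maxima
    with a score of at least `threshold` (assumed decoding rule)."""
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(score >= n for n in neighbours):
                peaks.append((r, c, score))
    return peaks


def jitter_proposal(peak, max_shift, rng, shape):
    """Shift a foreground position by at most `max_shift` cells in each
    direction, clamped to the heat-map bounds (hypothetical shift range)."""
    r, c, score = peak
    rows, cols = shape
    new_r = min(max(r + rng.randint(-max_shift, max_shift), 0), rows - 1)
    new_c = min(max(c + rng.randint(-max_shift, max_shift), 0), cols - 1)
    return (new_r, new_c, score)


heatmap = [
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.2, 0.9, 0.2],
    [0.1, 0.1, 0.2, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]
proposals = decode_heatmap(heatmap)
shifted = [jitter_proposal(p, 1, random.Random(0), (4, 4)) for p in proposals]
```

Bounding the shift keeps the jittered proposal close to the true foreground, so the second stage sees varied but still valid crops.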
The goal of CNN2 is to classify independent ROI [13] regions and refine their location regression, serving as a feature extractor for individual objects. CNN2 adopts ResNet-34 [14] as the network backbone. After the flattening layer and fully connected layer, the output is divided into two branches: classification and location regression. The final output is obtained through the softmax activation function. This structure extracts discriminative features of the ROI region for precise classification and regression localization. Figure 3 shows the input-output connection structure of CNN2. The mature ResNet-34 network is used as the feature extractor, and flattening and fully connected layers are added to obtain classification and regression results. This end-to-end approach fully utilizes the network’s modelling capabilities and extracts expressive features of the ROI area, which ensures classification accuracy and improves coordinate localization precision.
Figure 3. Input and output of CNN2.
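The two-branch output of CNN2 can be sketched as follows. This is a toy illustration under stated assumptions: the ResNet-34 backbone is replaced by a small pre-computed feature map, the weight matrices are hypothetical, and softmax is applied to the classification branch only, which is an interpretation of the description above rather than a confirmed detail.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def roi_head(features, cls_w, cls_b, reg_w, reg_b):
    """Flatten ROI features, then split into a classification branch
    (softmax probabilities) and a location-regression branch (raw offsets)."""
    flat = [v for row in features for v in row]  # flattening layer
    class_probs = softmax(linear(flat, cls_w, cls_b))
    box_offsets = linear(flat, reg_w, reg_b)
    return class_probs, box_offsets


# Hypothetical 2x2 feature map standing in for the backbone output.
features = [[1.0, 2.0], [3.0, 4.0]]
cls_w = [[0.0] * 4, [0.0] * 4]          # two classes: normal, defective
reg_w = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
probs, offsets = roi_head(features, cls_w, [0.0, 0.0], reg_w, [0.0] * 4)
```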
En-GAN
The training objective of the Generative Adversarial Network (GAN) [15] model is a dynamic game between the generator and the discriminator. Ideally, the discriminator constantly improves its ability to tell real samples from fake ones, while the generator continually learns to deceive the discriminator. In practice, however, it is often difficult to coordinate the training progress of the two, which frequently leads to over-training of the discriminator while the generator struggles to improve. To achieve collaborative training of the GAN model, this study proposes an improved discriminator network structure.
As shown in Figure 4, the structure contains an upper and a lower branch. The upper branch uses residual convolutional downsampling, accounts for most of the network’s parameters, and extracts detailed features from high-resolution images. The lower branch employs fast downsampling, which allows rapid convergence early in training. The two branches merge at a specific layer to model multi-resolution images. Overall, residual convolution is responsible for feature extraction and fast downsampling speeds up convergence; together they enable progressive training of the discriminator. This design balances the training depth of the discriminator and supports collaborative optimization of the GAN model. Figure 5 shows the traditional structure. In comparison, the improved discriminator includes a residual convolutional branch and a fast downsampling branch that merge to handle multi-resolution images.
Figure 4. Network structure of cross-layer fusion discriminator.
Figure 5. Conventional discriminator network structure.

The structure of the generator in the proposed generative adversarial network is shown in Figure 6. It draws on the control strategies of StyleGAN [16]. The model explicitly separates the input latent vector from the upsampling process and transforms it into multi-level style control vectors through a series of fully connected layers. At the top of the architecture, the latent vector is first mapped through fully connected layers. By producing better-behaved vector distributions, this mapping significantly improves the continuity of interpolation between outputs, which helps the generated images look realistic. The mapped style vectors S [17] are further processed and incorporated into the convolution operations as single-layer transformations. They act as adjustment variables for the filter parameters and therefore strongly influence generation during both the upsampling feature extraction phase and the style mapping output phase. The middle layers of the architecture consist mainly of a series of upsampling feature extraction modules which, based on input constants and style vectors, produce residual outputs at various levels. The bottom of the architecture consists of residual input layers formed by the upsampling modules and style mapping outputs; their transformations accumulate to form the final composite output. The network structure of traditional generators is shown in Figure 7; its main disadvantage is that results are uncontrollable because the input is a random vector. During inference, the proposed generator can manipulate the characteristics of generated images simply by using the input random seed vector as the style vector, which avoids the additional computational cost of the mapping process. With this generator architecture, it is only necessary to adjust the style vectors S at the appropriate levels to finely control the stylistic attributes of the final generated image. The training of En-GAN is also based on the adversarial learning framework, in which the generator produces realistic images and the discriminator strives to distinguish them from real images. En-GAN enhances fine control over the style and details of the generated images by introducing an improved mapping network and style modulation mechanism.
During training, En-GAN employs a progressive learning strategy, also used in StyleGAN, that starts at low resolutions and gradually increases to higher resolutions, which improves the network’s training efficiency and stability.
Figure 6. Network structure of style-controlled generator.
Figure 7. Traditional generator’s network structure.

Overall architecture
The implementation of semi-supervised learning requires three elements: labeled data, a teacher model, and unlabeled data [18]. Here, the labeled data are the original dataset, the teacher model is the insulator detector trained with the two-stage detection net described above, and the unlabeled data are synthesized using the generative adversarial network (En-GAN). Semi-supervised learning uses unlabeled samples to train models, so the student model’s network structure can remain consistent with that of a conventional detector.
In the self-training semi-supervised detection net, noise is introduced into both the model and the samples during training [19]. The purpose is to increase the robustness of the model while also exploiting the regression signals from detection through consistency regularization. With semi-supervised learning, the model can learn more features from the unlabeled data, which improves the accuracy and robustness of the diagnostic model.
This combined network is referred to as the Noise-Self-Training Insulator Detection Network (NS-IDNet), and the model structure is shown in Figure 8. NS-IDNet exploits the additional information available in unlabeled data to train a more effective and robust detection net than could be achieved with labeled data alone, improving both performance and generalizability in insulator diagnosis and detection.
Figure 8. Overall architecture of NS-IDNet.
Compared to traditional semi-supervised learning methods, the advantages of the self-training model are as follows:
First, it utilizes a generative model to systematically address the issue of data imbalance in the samples. This means that it can create synthetic data to augment the original dataset, especially in classes or cases where the available labeled data is scarce.
Secondly, by combining self-training and consistency regularization learning objectives, it enables a more comprehensive utilization of feature information in unlabeled data. This dual approach strives to ensure that the predictions of the model are consistent across both labeled and unlabeled data, leveraging the unlabeled data to reinforce what the model can predict from the labeled data, and thus improving overall learning outcomes.
In summary, the self-training model not only mitigates the imbalance in data distribution but also capitalizes on the intrinsic value of unlabeled data through an integrated approach to enhance feature extraction and model robustness.
Self-training process
For datasets containing both labeled and unlabeled data, the workflow for semi-supervised object detection via sample-noise self-training proceeds as follows:
1. Train a teacher model on the labeled data.
2. Allow the teacher model to predict labels for the unlabeled data.
3. Introduce noise to both the student model and the samples, and train the student model on the labeled and pseudo-labeled data.
4. Replace the teacher model with the converged student model, return to step 2 for self-training, and repeat the process several times.
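The four-step workflow above can be sketched with a deliberately tiny stand-in model. The code below is an illustration only: a one-dimensional threshold classifier replaces TPD-Net, sample noise is a small random shift of the input, model noise (Dropout) is omitted, and the names `train_threshold`, `predict_soft`, and `noise_self_training` are hypothetical.

```python
import math
import random


def train_threshold(samples):
    """Fit a 1-D decision threshold from (x, p) pairs, where p is the
    (possibly soft) probability that x belongs to the positive class."""
    weight_pos = sum(p for _, p in samples)
    weight_neg = sum(1.0 - p for _, p in samples)
    mean_pos = sum(x * p for x, p in samples) / weight_pos
    mean_neg = sum(x * (1.0 - p) for x, p in samples) / weight_neg
    return (mean_pos + mean_neg) / 2.0


def predict_soft(threshold, x):
    """Soft label: sigmoid of the signed distance to the threshold."""
    return 1.0 / (1.0 + math.exp(-(x - threshold)))


def noise_self_training(labeled, unlabeled, cycles=3, noise=0.1, seed=0):
    rng = random.Random(seed)
    teacher = train_threshold(labeled)                    # step 1
    for _ in range(cycles):
        # Step 2: the teacher assigns soft pseudo-labels to unlabeled data.
        pseudo = [(x, predict_soft(teacher, x)) for x in unlabeled]
        # Step 3: perturb the samples, then train the student on both sets.
        noisy = [(x + rng.uniform(-noise, noise), p) for x, p in pseudo]
        student = train_threshold(labeled + noisy)
        teacher = student                                 # step 4
    return teacher


labeled = [(0.0, 0.0), (1.0, 0.0), (4.0, 1.0), (5.0, 1.0)]
unlabeled = [0.5, 1.5, 2.0, 3.0, 3.5, 4.5]
final_threshold = noise_self_training(labeled, unlabeled)
```

The labeled set is reused in every cycle, so errors in the pseudo-labels cannot fully displace the original supervision.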
The procedure for noise self-training is shown in Figure 9. Detection uses the TPD-Net detection net, and sample synthesis uses the En-GAN generation model. In the student model trained by NS-IDNet, the number of channels in the classification network of TPD-Net is doubled to accommodate the training demands of a larger dataset. During cyclic self-training, the student model’s capacity is proportionally increased across the repeated self-training sessions. During student training, sample noise is introduced through image augmentation techniques such as translation, rotation, and flipping, while model noise is introduced by randomly dropping connections between the output layers of the classification network (Dropout) [20]. Random noise helps to filter out weaker supervision signals from the original supervised information, which makes the student model more robust. When the noise-augmented student model uses soft labels from self-training, it selects the proposals with higher confidence according to a confidence threshold. Compared with using hard labels or not filtering at all, this approach is more beneficial for optimizing the detection net.
Figure 9. Workflow of noise self-training method.
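A minimal sketch of confidence-based selection of soft labels is given below. The threshold value of 0.7, the box format, and the function names are assumptions for illustration; the source does not state the threshold used.

```python
def filter_soft_labels(proposals, confidence_threshold=0.7):
    """Keep proposals whose highest class probability reaches the
    threshold; the full probability vector is kept as the soft label."""
    kept = []
    for box, class_probs in proposals:
        if max(class_probs) >= confidence_threshold:
            kept.append((box, class_probs))
    return kept


def to_hard_label(class_probs):
    """Hard-label alternative: index of the most probable class."""
    return class_probs.index(max(class_probs))


# Hypothetical teacher outputs: (box, [p_normal, p_defective]).
teacher_outputs = [
    ((10, 20, 50, 60), [0.05, 0.95]),
    ((30, 40, 70, 90), [0.55, 0.45]),   # ambiguous, filtered out
    ((15, 25, 45, 55), [0.80, 0.20]),
]
selected = filter_soft_labels(teacher_outputs)
```

Keeping the probability vector rather than its argmax preserves the teacher's uncertainty, which is the property the soft-label ablation later in the paper examines.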
During the training of the student model, a consistency regularization [21] term is introduced into the training objective. The purpose is to ensure that the model not only fits the pseudo-labels but also keeps its predictions for a noise-augmented sample close to its predictions for the original sample. As shown in Figure 10, the consistency regularization contains two components, regression regularization and classification regularization, which act on different stages of model training.
Figure 10. Content of contrastive consistency regularization.
The regression regularization term is the pixel-wise L2 distance [22] between the two foreground prediction heat maps that the RPN produces for a sample pair, that is, the same sample with and without added noise:
In the formula,
The classification regularization requires matching noise-augmented sample pairs with specific detection targets for implementation. The ROI is calculated based on the predicted positions of both parties during the matching process. Consistency regularization is computed only for the matched target pairs, and there is no loss calculated for targets that are not matched. The classification regularization term is the cross-entropy between the predicted classification vectors of the matched target pairs:
In the formula,
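Since the formulas themselves are not reproduced here, the two regularization terms can be illustrated with a short sketch based only on the verbal description. It assumes that the clean and noisy heat maps have already been spatially aligned, that the L2 term is averaged over pixels, and that the cross-entropy is averaged over matched target pairs; these normalizations and the function names are assumptions, not the paper's definitions.

```python
import math


def regression_consistency(heatmap_clean, heatmap_noisy):
    """Mean squared pixel-wise difference between two foreground heat maps
    (assumes both maps share the same shape and alignment)."""
    total, count = 0.0, 0
    for row_clean, row_noisy in zip(heatmap_clean, heatmap_noisy):
        for a, b in zip(row_clean, row_noisy):
            total += (a - b) ** 2
            count += 1
    return total / count


def classification_consistency(matched_pairs, eps=1e-12):
    """Mean cross-entropy between predicted class vectors of matched
    target pairs; unmatched targets contribute no loss."""
    if not matched_pairs:
        return 0.0
    total = 0.0
    for probs_clean, probs_noisy in matched_pairs:
        total += -sum(p * math.log(q + eps)
                      for p, q in zip(probs_clean, probs_noisy))
    return total / len(matched_pairs)


reg_term = regression_consistency([[0.0, 1.0]], [[1.0, 1.0]])
cls_term = classification_consistency([([1.0, 0.0], [0.5, 0.5])])
```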
Loss function
In addition to the regularization terms, the loss function for the semi-supervised detection net also includes the loss functions of the two network stages. As in TPD-Net, the first-stage network mainly serves as an RPN (Region Proposal Network) to regress foreground proposals. For the classification loss, Focal Loss [23] is used to improve the detection rate of foreground objects. The localization loss is the standard L1 loss. The two are given in equations (6) and (7), respectively:
The classification loss in the second stage comes from the confidence loss of the prediction for each detected target in the sample. Unlike PCM-Net, which uses an improved class-correlation loss function, the training set created for semi-supervised learning has balanced sample categories. Therefore, the classification loss uses the cross-entropy loss, as shown in equation (8):
The regression loss in the second stage is the position prediction refinement for positive sample proposals, using the smooth L1 loss, which is defined in equations (9) and (10).
In conclusion, the overall loss function of NS-IDNet is defined as shown in equation (11).
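The loss components named in this section can be sketched in their standard textbook forms. The following is an assumption-laden illustration: the focal loss uses the commonly cited defaults alpha = 0.25 and gamma = 2, the smooth L1 loss uses a transition point of 1, and the total is a weighted sum with hypothetical unit weights, because equations (6) to (11) are not reproduced in the text.

```python
import math


def focal_loss(p, target, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss for a predicted foreground probability p."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t + eps)


def l1_loss(pred, target):
    """Standard L1 localization loss for one coordinate."""
    return abs(pred - target)


def cross_entropy(class_probs, target_index, eps=1e-12):
    """Cross-entropy for a one-hot target given predicted probabilities."""
    return -math.log(class_probs[target_index] + eps)


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1: quadratic near zero, linear for large errors."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def total_loss(terms, weights=None):
    """Weighted sum of named loss terms (weights are hypothetical)."""
    if weights is None:
        weights = {name: 1.0 for name in terms}
    return sum(weights[name] * value for name, value in terms.items())


terms = {
    "stage1_cls": focal_loss(0.9, 1),
    "stage1_loc": l1_loss(10.0, 12.0),
    "stage2_cls": cross_entropy([0.2, 0.8], 1),
    "stage2_reg": smooth_l1(0.5, 0.0),
}
loss = total_loss(terms)
```

The focal loss down-weights easy examples, which is why it suits the first stage, where background locations vastly outnumber foreground ones.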
Experiment
Dataset and implementation details
The dataset for semi-supervised learning is divided into two parts: labeled data from the annotated dataset and unlabeled data generated by En-GAN. The unlabeled dataset consists of 8000 samples, approximately ten times the size of the original dataset’s training set. The Shape Complexity (SC) coefficient is sampled 1000 times from a uniform distribution over (0, 0.6), and eight samples are synthesized for each SC value, so that the different categories are represented as uniformly as possible. Figure 11 shows the distribution of the SC settings for the unlabeled dataset together with the predicted distribution: the blue histogram represents the SC set values, and the red line represents the discriminator’s prediction of the sample SC values. The SC set values essentially satisfy uniform sampling, but the predicted SC values are concentrated in the middle with fewer high values, which may result from both the synthesis model and the discriminator model avoiding large regression losses. Overall, compared with the long-tail distribution of real data, the synthetic dataset is much better balanced in terms of sample severity.
Figure 11. Unlabeled dataset SC setting distribution versus predicted distribution chart.
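The sampling scheme for the SC coefficient can be sketched as follows. The function name, the fixed seed, and the returned list format are assumptions; only the counts (1000 values, eight samples each) and the range (0, 0.6) come from the text.

```python
import random


def sample_sc_settings(num_values=1000, samples_per_value=8,
                       low=0.0, high=0.6, seed=0):
    """Draw SC values uniformly and repeat each one for every sample
    that would be synthesized with it."""
    rng = random.Random(seed)
    settings = []
    for _ in range(num_values):
        sc = rng.uniform(low, high)
        settings.extend([sc] * samples_per_value)
    return settings


sc_settings = sample_sc_settings()
total_samples = len(sc_settings)  # 1000 values x 8 samples each
```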
En-GAN synthesizes unlabeled data samples only, as illustrated in Figure 12.
Figure 12. Restored synthetic samples.
The experimental hardware environment consists of an Intel i9-9900K CPU and two NVIDIA GeForce RTX 2080 Ti graphics cards. The implementation language is Python 3.7 and the deep learning framework is TensorFlow 1.15.0. The teacher model performs predictive inference on the unlabeled data to obtain non-one-hot soft labels. During training, consistency regularization noise is introduced by perturbing images through translation, flipping, and scaling. Noise and consistency regularization are added on top of TPD-Net for training on the large dataset. Network parameters are optimized with the Adam optimizer for 50 epochs with a batch size of 2, iterating over both the labeled training samples and the synthesized samples in each epoch. A decaying schedule is also established for the learning rate, which is set at
Evaluation metrics
To quantify the model’s performance on insulator detection, the paper uses three metrics: Average Precision (AP), Accuracy (ACC), and Frames Per Second (FPS).
ACC is a metric commonly used in machine learning, especially for classification tasks. It measures the model’s ability to make correct predictions, that is, the ratio of correctly predicted samples to the total number of samples.
The formula for accuracy is as follows:
AP is a measure widely used for evaluating the performance of a model in information retrieval and computer vision tasks. To calculate AP, it is necessary to understand and apply several basic formulas that involve precision and recall calculations.
FPS is another key indicator of model performance, representing the speed at which the model processes images, that is, how many frames per second the model can process. FPS is essential for real-time applications such as video monitoring and autonomous driving, as these require the target detection nets to analyze video frames in real time and make quick responses. FPS reflects the speed of algorithmic image processing and is directly related to the real-time performance of the model.
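The three metrics can be sketched in code as follows. The accuracy and FPS definitions follow the descriptions above directly. The AP function uses one common non-interpolated definition (the mean of the precision values at the ranks of the true positives); the paper's own AP formula is not reproduced in the text, so this choice is an assumption.

```python
def accuracy(num_correct, num_total):
    """Ratio of correctly predicted samples to all samples."""
    return num_correct / num_total


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)


def average_precision(scored_detections, num_ground_truth):
    """Non-interpolated AP over detections given as
    (confidence, is_true_positive) pairs."""
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(scored_detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / num_ground_truth


def frames_per_second(num_frames, elapsed_seconds):
    """Number of frames processed per second of wall-clock time."""
    return num_frames / elapsed_seconds


detections = [(0.9, True), (0.8, False), (0.7, True)]
ap = average_precision(detections, num_ground_truth=2)
```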
Comparative experiment
Performance comparison of NS-IDNet and benchmark models on insulator test set.
The comparative results in the table show that NS-IDNet achieves the best overall accuracy, followed by the semi-supervised Noisy-Student method. Semi-supervised models are generally superior to supervised models trained on limited data alone. Noisy-Student performs worse than the proposed method in AP and ACC, and its processing speed is also slower. This is because, in semi-supervised detection nets, the constraint of consistency regularization is very important in addition to self-training. In terms of speed, all models meet real-time requirements, with TPD-Net achieving the highest FPS owing to its smaller model size. The accuracy of NS-IDNet for each kind of target is higher than that of the other models, probably because the semi-supervised samples increase the number of samples in each class, which leads to better training.
To show the effectiveness of the algorithm more intuitively, insulator string faults under four different backgrounds are selected: iron towers, lawns, rivers, and forests. Figure 13 shows the fault localization results for these insulators. NS-IDNet recognizes extremely small defect areas such as insulator drop faults and precisely locates the drop positions against complex natural backgrounds, with no missed detections or false alarms.
Figure 13. Faulty insulator detection result.
Ablation study
Ablation of self-training labels
Influence of soft labels on the detection performance of NS-IDNet.
The table indicates that models trained with soft labels have superior AP and ACC metrics compared to the comparison model. A likely reason for the decline in performance due to hard labels is the accumulation of overconfident mistakes from self-training, leading the model to learn in the wrong direction and ultimately resulting in lower overall performance. Soft label supervision is gentler and more tolerant of incorrect labels, thus maintaining higher performance levels.
Ablation of model capacity
Impact of different capacity configurations on the detection performance of NS-IDNet.
Ablation of cyclic training
Influence of the number of self-training cycles on the detection performance of NS-IDNet.
Conclusion
To address the problem of insufficient data in existing supervised methods, we propose a semi-supervised object detection net, named NS-IDNet, that achieves higher accuracy in detecting damage to electric power line insulators. NS-IDNet is built on the TPD-Net model and the En-GAN model. First, NS-IDNet uses an object detection net to predict weak labels for the unlabeled data generated by the synthesis model. Second, through noise self-training and consistency regularization, NS-IDNet learns thoroughly from the features of the unlabeled data. Finally, the parameters of NS-IDNet are stabilized through repeated model replacement and cyclic training. Experiments show that NS-IDNet significantly outperforms traditional supervised models on the insulator fault dataset and is also superior to traditional semi-supervised models. Additionally, the ablation experiments support the effectiveness of the proposed model. In practice, training an insulator fault detection model with NS-IDNet can mitigate the problem of insufficient existing datasets while also helping to expand the insulator detection dataset.
Statements and declarations
Footnotes
Conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors acknowledge the China Southern Power Grid Company Limited (Grant: 090000KK52222180).
