Intelligent detection for industrial electrical automation based on stereo vision and feature fusion

Abstract

Fault detection in industrial electrical automation is essential to the stable operation of industrial systems. To address the low detection accuracy and insufficient depth perception of existing methods, this paper establishes a multi-view stereo vision measurement model of electrical equipment and stereo-rectifies the equipment images. A two-channel convolutional neural network (CNN) is then designed to capture global and local features of the electrical equipment images. Finally, YOLOv8 is adopted as the baseline model and the feature fusion part of the network is redesigned. The detection head is replaced by the Dyhead dynamic detection head, and the loss function is redesigned using Focaler-IoU to mitigate the fitting problem caused by sample imbalance. Experimental results indicate that the proposed technique improves the mean Average Precision (mAP) by at least 5.42%, which significantly improves detection efficiency and reliability.

Keywords

industrial electrical automation, intelligent inspection, stereo vision, YOLOv8, feature fusion

Introduction

Industrial electrical automation systems encompass a large number of electrical devices, circuits, and complex control systems, and the integrity of their operation is a determining factor in production stability and safety.¹ Currently, industrial electrical automation testing mainly relies on traditional testing methods. These approaches are inefficient and depend heavily on human operators, creating unacceptable risks for testing timelines and data integrity.² As the scale of industrial production and the complexity of production processes continue to grow, the integration level of electrical equipment keeps rising and the coupling between systems becomes stronger, making the manifestations of electrical faults more complex and diverse. Traditional testing methods struggle to locate the cause and position of a fault quickly and accurately, which affects the timeliness of fault handling and may lead to production interruptions, equipment damage, or even safety accidents, causing significant economic losses to enterprises.³ Therefore, in-depth research on efficient intelligent testing technologies for industrial electrical automation has important practical value for promoting the intelligent upgrading of industrial testing technology, enhancing the level of industrial automation, and improving quality control capabilities.⁴

Early industrial electrical automation detection algorithms mainly used template matching. This technique calculates the similarity between the template and the test image using methods such as color histograms,⁵ structural similarity (SSIM),⁶ and cosine similarity,⁷ and sets a reasonable threshold to distinguish defective samples from normal samples. Luo et al.⁸ adopted a distance function as a descriptor and used the geometric perimeter to calculate the shape distance between the template contour and the contour of the test image, achieving defect detection of electrical equipment. Ren et al.⁹ used a geometric invariant moment matching method to achieve fault detection in industrial electrical systems, overcoming the poor recognition capability of shape matching. Zhang and Zhu¹⁰ adopted image segmentation and calculated the position information of white line contours in the template and the test image, achieving defect detection of electrical equipment.

Compared to template matching, machine learning methods have stronger robustness and better accuracy. They learn complex functional relationships between data features and labels, and predict new samples through the learned mapping between features and labels. Machine learning mainly includes two stages: feature extraction and classification. Feature extraction, which includes gradient histograms,¹¹ local binary patterns,¹² the scale-invariant feature transform (SIFT),¹³ and speeded-up robust features (SURF),¹⁴ is an important step in machine learning. Commonly employed classification techniques range from support vector machines (SVM) to K-nearest neighbors and decision trees.¹⁵ Because machine learning depends heavily on feature extraction methods designed by engineers, it does not require large amounts of data. One study¹⁶ achieved an accuracy of 99.8% for electrical fault patterns by combining gradient histograms with SVM. Ullah et al.¹⁷ first extracted component features through SIFT and then implemented defect detection in industrial electrical automation by combining image pyramid matching with SVM.

Traditional machine learning algorithms, which rely on one or several fixed hand-designed feature extraction methods, struggle to fully cover the features in these complex scenarios. Deep learning methods use data-driven approaches to adaptively extract inconspicuous features and can learn complex relationships between large amounts of data and the corresponding labels. Tang and Jian¹⁸ used a binocular stereo vision algorithm to collect fault images of electrical equipment, built a dual-branch CNN architecture, and explored the role of feature fusion in improving detection accuracy. Zhang et al.¹⁹ demonstrated that coupling Faster R-CNN with a one-class neural network effectively detects defects in industrial electrical systems. Thomas et al.²⁰ applied Transformer to feature fusion, extracting spectral and spatial features from images in a dual-branch network and exchanging information between modalities to attend to global feature information, although the local feature extraction of images remained insufficient. Yoon and Yoon²¹ developed a high-precision electrical equipment fault image collection system using a binocular camera, obtained multi-scale image features through Transformer, and fused the features through cross-attention mechanisms to improve detection accuracy. Bellou et al.²² introduced a cross-stage local module into the feature pyramid network derived from YOLOv8, introduced a path aggregation model into the feature fusion structure, and achieved effective detection of electrical equipment images through an improved regression loss.

According to the analysis of existing research, fault detection in industrial electrical automation systems has long relied on manual inspections, which are inefficient and unsafe. With the continuous development of image recognition technology, it has become possible to determine whether electrical equipment is faulty based on image information. However, the fault areas of power equipment generally pose practical problems such as complex patterns and difficult feature extraction. To this end, this article puts forward an intelligent detection technology for industrial electrical automation based on stereo vision and feature fusion. First, to address the problem that traditional visual imaging systems cannot accurately obtain the 3D information of the target, a multi-view stereo vision measurement model is established, and the projection transformation relationship between a target point in space and its image points in the stereo image pair is analyzed. The tensor-based planar calibration method is used to calibrate the equipment parameters, and stereo rectification of the image pair of the electrical equipment is performed using the calibration parameters. Then, a CNN structure with parallel branches is designed based on ResNet-50. After each convolutional stage, the network is divided into two branches. The right branch continues with the original feature extraction backbone, performing successive convolutions to aggregate global features, while the left branch serves as the local feature extraction branch. Finally, the YOLOv8 architecture serves as the foundation for optimizing the feature fusion components, strengthening the algorithm’s multi-scale feature aggregation. The detection head is refined by integrating Dyhead, strengthening its capacity to handle occluded features. The loss function is redesigned using Focaler-IoU, which mitigates the fitting problem caused by sample imbalance. Experimental results indicate that the Precision and Recall of the proposed technology are 93.64% and 91.52%, respectively, so that faults in industrial electrical equipment can be detected accurately.

Related theory

Binocular camera imaging principle

Unlike single-camera systems, binocular vision employs two monocular cameras to achieve stereoscopic perception. It acquires multi-perspective images of the target via two cameras mounted on the same horizontal plane and obtains information about the object in the 3D world by analyzing the disparity between the images, simulating the way human eyes capture external information.²³ Monocular stereo vision uses a single camera to infer depth information from a single image or a series of consecutive frames, combined with prior knowledge or deep learning models. It is suitable for cost-sensitive scenarios with low requirements for depth accuracy, where motion or prior information can be relied upon. Multi-camera stereo vision extends the principles of binocular vision by using multiple cameras to capture images from different angles. It leverages multi-view geometry to enhance the robustness and accuracy of depth estimation, making it suitable for scenarios requiring high precision, a wide depth range, or complex environments. Binocular stereo vision simulates human vision by using two cameras to capture the same scene from different angles. It calculates the displacement between corresponding pixels using the principle of parallax and applies triangulation to recover depth information. This is suitable for scenarios requiring moderate accuracy, high real-time performance, and moderate depth ranges. Camera imaging models are divided into linear and nonlinear models, which are introduced as follows.

(1) Linear model. Camera imaging is the process of mapping a 3D object in the real world onto a 2D image through the camera’s projection. The imaging process is divided into a linear model without distortion and a nonlinear model with lens distortion. The linear model can be simplified to pinhole imaging. The distance between the image $x^{'}$ and the pinhole is called the focal length f, and the target $x$ appears inverted on the image plane.

(2) Nonlinear model. A camera lens is usually composed of multiple groups of optical elements. These elements affect the propagation of light, so that the image is deformed and points shift in position when a three-dimensional object is projected through the lens onto a two-dimensional image. This phenomenon is called distortion. The geometric aberrations of camera lenses are classified as either radial or tangential distortion. The second-order radial distortion coefficient typically produces barrel distortion, while higher-order terms may lead to pincushion distortion.²⁴ Pincushion distortion is concave inward, while barrel distortion is convex outward. The distortion arises because non-ideal lens curvature violates the paraxial approximation. Light passing farther from the center of the lens is bent more strongly, causing a radial distortion distributed along the radius of the lens that makes straight lines appear curved. The deformation grows with radial distance from the optical center.
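The radial and tangential terms can be illustrated with a small numerical sketch. The polynomial form below is the widely used Brown-Conrady convention, which is an assumption here since the text does not state which distortion model is adopted; the function name `distort` and the coefficient names `k1`, `k2`, `p1`, `p2` are likewise illustrative.

```python
def distort(x, y, k1, k2, p1, p2):
    """Apply radial (k1, k2) and tangential (p1, p2) distortion.

    x, y are normalized image coordinates measured from the optical center.
    The polynomial form follows the common Brown-Conrady convention (assumed).
    """
    r2 = x * x + y * y                      # squared distance from the optical center
    radial = 1.0 + k1 * r2 + k2 * r2 * r2   # radial scaling grows with r
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d


# With a negative k1, points are pulled toward the center, and the shift
# is larger for points that lie farther from the optical center.
near = distort(0.2, 0.0, -0.2, 0.0, 0.0, 0.0)
far = distort(0.6, 0.0, -0.2, 0.0, 0.0, 0.0)
print(0.2 - near[0], 0.6 - far[0])
```

In this convention the point at the optical center is never displaced, which matches the statement that deformation grows with radial distance.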

YOLOv8 target detection algorithm

Industrial electrical automation detection places high demands on the accuracy and computational cost of the model, so an appropriate baseline object detection framework must be selected for the design of the detection model. YOLOv8 is a recent generation of object detection algorithm with the advantages of high speed, light weight, and high accuracy.²⁵ YOLOv8 incorporates large-scale datasets such as Objects365 for pre-training, significantly enhancing the model’s generalization capabilities. It extends to instance segmentation and image classification tasks through a modular design. Compared to other YOLO models, YOLOv8 stands out for a more efficient architecture, a more general detection mechanism, higher accuracy, and stronger task scalability. It particularly excels in small object detection, robustness in complex scenes, and ease of engineering deployment, making it a mainstream choice in both industry and academia. The YOLOv8 network framework follows the tradition of the YOLO series and is divided into four components: input end, Backbone, Neck, and Head.

The design paradigm of YOLOv8’s backbone and neck modules follows YOLOv5’s proven framework.²⁶ The input pipeline of YOLOv8 integrates three core components: Mosaic data augmentation, adaptive anchor computation, and adaptive grayscale padding. The backbone network is based on the idea of segmenting and independently processing feature maps using CSP. This backbone network structure integrates convolution (Conv), cross-stage feature fusion (C2f), and simplified spatial pyramid pooling (SPPF) modules. This architecture simultaneously augments characteristic representation capacity while maintaining computational efficiency.

The Neck module of YOLOv8 is responsible for feature fusion, improving cross-level feature integration. The Head module uses a decoupled head that separates classification from bounding box regression and computes and outputs the detection results. In terms of loss functions, the network employs binary cross-entropy (BCE) for classification, while adopting Distribution Focal Loss (DFL) and Complete Intersection over Union (CIoU) for bounding box regression.²⁷ YOLOv8 further improves detection capability while retaining the lightweight and efficient tradition of the YOLO series, and has found extensive applications across diverse domains with high real-time requirements.
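To make the CIoU term concrete, the following is a minimal sketch of the standard CIoU definition (IoU minus a normalized center-distance penalty and an aspect-ratio penalty). It is written for a single pair of boxes in `(x1, y1, x2, y2)` form; the function name `ciou` and the `eps` stabilizer are illustrative choices and are not taken from the YOLOv8 code base.

```python
import math


def ciou(box_a, box_b, eps=1e-9):
    """Complete IoU between two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1

    # Plain IoU.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    iou = inter / (wa * ha + wb * hb - inter + eps)

    # Squared center distance over squared diagonal of the enclosing box.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw * cw + ch * ch + eps
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(wb / (hb + eps)) - math.atan(wa / (ha + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v


# The regression loss is typically taken as 1 - CIoU.
print(1.0 - ciou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Unlike plain IoU, the value stays informative for non-overlapping boxes because the center-distance penalty is still non-zero.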

Industrial electrical equipment image acquisition based on stereo vision technology

The multi-view stereo vision system based on visual imaging principles is mainly composed of multiple cameras arranged in the same direction and at the same distance. In the detection of industrial electrical automation equipment, it is necessary to collect 3D information of the surface of electrical equipment. This system is used to take photos of the electrical equipment to be inspected. 3D measurement based on multi-view stereo vision refers to the 3D stereoscopic imaging of the target from multiple angles by multiple cameras, and the stereo projection on the camera image plane forms a pair of stereo images. The multi-angle matching algorithm is used to obtain the pixel coordinates of corresponding picture points, and then spatial feature points are measured using the 3D measurement model, thereby correcting the image of industrial electrical equipment.

In principle, the rays from multiple cameras that correspond to the same target must intersect at one spatial point. In practice, because of point measurement errors and camera calibration errors, the rays do not meet exactly at one point but only pass close to it. Because the distance from the imaging system to the target is substantially larger than the focal length, under central perspective even a small error in the camera parameters or in the extracted pixel positions causes a slight deviation of the imaging rays, which leads to a large positioning error of the spatial point. To improve the precision of the imaging relationship, an initial value is usually computed first, and bundle adjustment is then used for optimization. In bundle adjustment, the measurement noise of the pixel points is assumed to be independent, isotropic, zero-mean Gaussian. Generally, once the camera parameters and spatial point coordinates have been solved, the corresponding image coordinates are obtained from the imaging model. The relationship between the coordinates of spatial points and those of image points is described below.

l I_{i j} = φ_{j} [r_{j} | t_{j}] I_{i}

(1)

where $l$ represents the Euclidean distance from the spatial point to the camera’s projection center, $I_{i j}$ is the image point of the i-th spatial point $I_{i}$ in the j-th camera, $φ_{j}$ is the equivalent focal length, $r_{j}$ is the rotation matrix, and $t_{j}$ is the translation vector of the j-th camera.

For the spatial points of the detection model composed of multi-view stereo vision, the solution usually uses the collinearity equation intersection method. Once the intrinsic and extrinsic parameters of the cameras have been determined, let $(x_{i}, y_{i})$ be the image point of the spatial point $I_{i} (X_{i}, Y_{i}, Z_{i})$ captured by a camera. Under ideal conditions, the optical center, the observed image point, and the spatial point lie on the same straight line. Based on this, the collinearity equation can be expressed as follows.

\begin{cases} \frac{x_{i}}{η_{x}} = \frac{r_{11} X_{i} + r_{12} Y_{i} + r_{13} Z_{i} + t_{x}}{r_{31} X_{i} + r_{32} Y_{i} + r_{33} Z_{i} + t_{z}} \\ \frac{y_{i}}{η_{y}} = \frac{r_{21} X_{i} + r_{22} Y_{i} + r_{23} Z_{i} + t_{y}}{r_{31} X_{i} + r_{32} Y_{i} + r_{33} Z_{i} + t_{z}} \end{cases}

(2)

After rearranging equation (2), the linear equation system for point $I_{i} (X_{i}, Y_{i}, Z_{i})$ is as follows.

\begin{cases} [x_{i} r_{31} - η_{x} r_{11}] X_{i} + [x_{i} r_{32} - η_{x} r_{12}] Y_{i} + [x_{i} r_{33} - η_{x} r_{13}] Z_{i} + [x_{i} t_{z} - η_{x} t_{x}] = 0 \\ [y_{i} r_{31} - η_{y} r_{21}] X_{i} + [y_{i} r_{32} - η_{y} r_{22}] Y_{i} + [y_{i} r_{33} - η_{y} r_{23}] Z_{i} + [y_{i} t_{z} - η_{y} t_{y}] = 0 \end{cases}

(3)

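Each camera contributes the two equations in equation (3), so with N cameras the system contains 2N linear equations in the three unknowns $X_{i}$, $Y_{i}$, and $Z_{i}$, which can be solved in the least-squares sense. In this way, the 3D acquisition results of industrial electrical equipment are obtained.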
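As an illustration of how equation (3) can be used, the sketch below stacks the two linear equations from each camera and solves for the spatial point by ordinary least squares with NumPy. The helper names `triangulate` and `project`, the tuple layout `(R, t, eta_x, eta_y)`, and the two-camera test geometry are assumptions made for this example; they are not part of the paper's implementation.

```python
import numpy as np


def project(R, t, eta_x, eta_y, X):
    """Collinearity equation (2): image coordinates of the spatial point X."""
    Xc = R @ X + t
    return eta_x * Xc[0] / Xc[2], eta_y * Xc[1] / Xc[2]


def triangulate(cameras, observations):
    """Solve equation (3), stacked over all cameras, for (X, Y, Z).

    cameras: list of (R, t, eta_x, eta_y); observations: list of (x, y).
    """
    A, b = [], []
    for (R, t, eta_x, eta_y), (x, y) in zip(cameras, observations):
        # Row from the x equation: (x*r3 - eta_x*r1) . X = -(x*t_z - eta_x*t_x)
        A.append(x * R[2] - eta_x * R[0])
        b.append(-(x * t[2] - eta_x * t[0]))
        # Row from the y equation: (y*r3 - eta_y*r2) . X = -(y*t_z - eta_y*t_y)
        A.append(y * R[2] - eta_y * R[1])
        b.append(-(y * t[2] - eta_y * t[1]))
    point, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
    return point


# Hypothetical two-camera rig with a 0.5 m horizontal baseline.
R = np.eye(3)
cams = [(R, np.zeros(3), 800.0, 800.0),
        (R, np.array([-0.5, 0.0, 0.0]), 800.0, 800.0)]
X_true = np.array([0.2, -0.1, 4.0])
obs = [project(*cam, X_true) for cam in cams]
print(triangulate(cams, obs))
```

With noisy observations the same least-squares solution would serve only as the initial value that bundle adjustment then refines.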

Feature extraction of industrial electrical equipment based on multi-scale CNN

Based on the industrial electrical equipment images obtained in the previous section, this paper designs a multi-scale CNN to extract features. The commonly used convolutional neural networks for image classification are AlexNet,²⁸ VGG-16,²⁹ ResNet,³⁰ etc. Among them, ResNet-50 can achieve the highest detection accuracy with a low number of parameters, even exceeding the VGG-16 with the most parameters. Therefore, this paper uses ResNet-50 as the basic network structure for the algorithm research.

This paper designs a CNN structure with parallel branches based on ResNet-50. After each convolutional structure, it is divided into two branches. The right branch is the original feature extraction main network, continuing to perform convolution operations to extract features. The left branch is the scale feature extraction branch, flattening the features of this scale into a one-dimensional feature vector and saving it. After all scale feature extraction branches have extracted their respective scale features, they are concatenated into a combined feature for the next step of fusion.

The network structure used for multi-scale extraction is called the feature extraction branch module (FEM). It is composed of convolution modules and pooling modules. The convolution module consists of multiple $3 \times 3$ convolution kernels and batch normalization layers, and its output is finally reduced to a one-dimensional feature vector by a global average pooling layer. Before reducing the feature maps, FEM still applies multiple convolution layers so that the features at this scale have a better distribution; through the final concatenation layer, the features at this scale are learned by backpropagation. These convolution layers use small $3 \times 3$ kernels in place of large $7 \times 7$ and $5 \times 5$ kernels. On the one hand, this reduces the number of parameters; on the other hand, it concentrates more effectively on local fine-grained features, retaining as much spatial context as possible. Each convolution layer is followed by a batch normalization (BN) layer, which normalizes the features output by the convolution layer to zero mean and unit variance. This reduces the impact of overfitting while accelerating the convergence of network training. After multiple convolution and batch normalization layers, the features at this scale are well distributed. A global average pooling operation then reduces the feature maps at this scale directly to a one-dimensional feature vector, which facilitates subsequent concatenation with features from other scales into a fused feature.
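The tail of the FEM branch (normalize, pool, concatenate) and the parameter argument for small kernels can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: the learned $3 \times 3$ convolutions and the learnable BN scale and shift are omitted, and the function names and toy tensor shapes are hypothetical.

```python
import numpy as np


def batch_norm(x, eps=1e-5):
    """Normalize each channel of a (N, C, H, W) batch to zero mean, unit variance."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def global_average_pool(x):
    """Reduce (N, C, H, W) feature maps to one (N, C) vector per sample."""
    return x.mean(axis=(2, 3))


def fuse_scales(feature_maps):
    """Pool every scale to a vector and concatenate along the channel axis."""
    vectors = [global_average_pool(batch_norm(f)) for f in feature_maps]
    return np.concatenate(vectors, axis=1)


def conv_params(kernel, channels, layers=1):
    """Weight count of `layers` stacked kernel x kernel convolutions (no bias)."""
    return layers * kernel * kernel * channels * channels


rng = np.random.default_rng(0)
scales = [rng.normal(size=(2, 8, 16, 16)),   # shallow, high-resolution scale
          rng.normal(size=(2, 16, 8, 8)),
          rng.normal(size=(2, 32, 4, 4))]    # deep, low-resolution scale
print(fuse_scales(scales).shape)             # one fused vector per sample

# Three stacked 3x3 layers cover a 7x7 receptive field with fewer weights.
print(conv_params(3, 64, layers=3), conv_params(7, 64))
```

Because global average pooling removes the spatial dimensions, feature maps of different resolutions can be concatenated without any resizing step.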

Industrial electrical automation intelligent detection based on improved YOLOv8

SD-FPN feature fusion module

To handle challenges such as scale variation and occluded targets in industrial electrical automation detection, this paper improves the network on the basic framework of YOLOv8 to meet the requirements of intelligent electrical automation detection in complex industrial environments. The structure of the intelligent detection model for industrial electrical automation based on the improved YOLOv8 is shown in Figure 1. To address the shortcomings of the feature fusion part, an SD-FPN feature fusion module is designed to replace the Neck of the original network, significantly enhancing the ability to learn cross-scale and small-target features. The dynamic detection head Dyhead is used to enhance the feature representation capability of the network, and a recent dynamic convolution is integrated to improve feature detection efficiency. The introduced Focaler-PIoU loss replaces CIoU in the baseline, addressing class imbalance via dynamic gradient modulation.

Figure 1. The structure of the intelligent detection model based on the improved YOLOv8.

To address the fact that YOLOv8 does not fully solve cross-scale feature fusion, this paper proposes a cross-layer dynamic fusion module, SD-FPN, which replaces the original network’s PAN-FPN structure to boost performance in detecting objects across varying scales, especially tiny instances. The design of the SD-FPN module builds on the cross-layer connections proposed in BiFPN.³¹ A dedicated small-object detection head is incorporated into the network architecture to improve the identification of distant, low-resolution objects, and the DySample lightweight dynamic upsampling module³² is used for upsampling, improving the feature propagation efficiency of the module. Feature information from different layers is fused explicitly by the Semantics and Detail Infusion (SDI) multi-level feature fusion module.

Due to the large down-sampling factor of YOLOv8, deeper feature maps have difficulty capturing the feature information of small objects, and the original BiFPN structure offers limited improvement in feature extraction for small targets. This paper restructures the Neck of YOLOv8 following the idea of BiFPN and augments BiFPN with a dedicated small-target detection branch to compensate for the baseline network’s deficiency in small-object feature extraction. Through cross-layer connections, the feature information of each level is fully fused, improving small-target detection across scales. The lightweight design of BiFPN also reduces the computational cost introduced by the small-target detection head, improving the overall efficiency of the feature fusion module. Because industrial electrical automation detection is information-intensive and has deployment requirements, this paper uses the DySample lightweight dynamic upsampling module as the upsampling operator in the Neck structure. DySample replaces dynamic kernel generation and dynamic convolution with point sampling, which reduces the computational burden and simplifies the entire upsampling process. DySample matches or exceeds the performance of traditional methods while substantially reducing both parameter count and computational latency. Its main process is to generate sampling points with a sampling point generator and to resample the input features with a grid sampling function. The specific process is detailed in Ref. 33.
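The point-sampling idea can be sketched in NumPy as below. This is not the DySample implementation: in DySample the offsets are predicted from the input features by a learned generator, whereas here they are passed in as a plain array, and the names `bilinear_sample` and `point_sample_upsample` are hypothetical. With all offsets set to zero, the sketch reduces to ordinary bilinear upsampling.

```python
import numpy as np


def bilinear_sample(feat, xs, ys):
    """Sample a (C, H, W) map at float pixel coordinates (border-clamped)."""
    _, H, W = feat.shape
    xs = np.clip(xs, 0, W - 1)
    ys = np.clip(ys, 0, H - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx, wy = xs - x0, ys - y0
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def point_sample_upsample(feat, offsets, scale=2):
    """Upsample by resampling the input at (base grid + offsets).

    offsets has shape (2, H*scale, W*scale): x and y shifts in input pixels.
    """
    _, H, W = feat.shape
    ys, xs = np.meshgrid((np.arange(H * scale) + 0.5) / scale - 0.5,
                         (np.arange(W * scale) + 0.5) / scale - 0.5,
                         indexing="ij")
    return bilinear_sample(feat, xs + offsets[0], ys + offsets[1])


feat = np.tile(np.arange(5, dtype=float), (1, 4, 1))   # (1, 4, 5) ramp along x
zero = np.zeros((2, 8, 10))
print(point_sample_upsample(feat, zero).shape)
```

No convolution kernel is generated per output location; the only per-pixel quantities are the two offset values, which is where the savings in parameters and latency described above come from.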

In the original feature fusion of YOLOv8, the network uses the concat operation to concatenate feature map tensors along a specified dimension, fusing information between feature maps of different layers. Although this method is effective in simple scenarios, its fusion capability is insufficient in complex scenarios. During feature fusion, the input features of different layers have different resolutions, and their contributions to the fused result are unequal. This work replaces the conventional concatenation operation in feature fusion with the SDI module, significantly improving the joint representation of high-level semantics and low-level details. The SDI module was first proposed in U-Net v2. It refines the features at each level through spatial and channel attention mechanisms and explicitly integrates hierarchical features through element-wise Hadamard multiplication of semantic and detail feature maps. This design enhances the interaction between features while further improving the efficiency of feature fusion.

The feature fusion of the SDI module is implemented through the following steps. For the hierarchical feature maps generated by the encoder, spatial and channel attention are applied to the features $f_{i}^{0}$ at every level. The processed features jointly encode fine-grained spatial details and holistic channel relationships, as shown below.

f_{i}^{1} = ϕ_{c i} (ϕ_{s i} (f_{i}^{0}))

(4)

where $f_{i}^{1}$ stands for the processed feature map at the i-th layer, and $ϕ_{s i}$ and $ϕ_{c i}$ denote the parameters of spatial and channel attention at the i-th layer, respectively. Through a $1 \times 1$ convolution, the number of channels of $f_{i}^{1}$ is reduced to c, in which c is a hyperparameter. The resulting feature map is represented as $f_{i}^{2}$, of size $H_{i} \times W_{i} \times c$, in which $H_{i}$, $W_{i}$, and c stand for the height, width, and number of channels of $f_{i}^{2}$, respectively. The refined feature representations are propagated to the decoder network. At every layer i of the decoder, this paper uses $f_{i}^{2}$ as the target reference and resizes the feature maps at every layer j to the same resolution as $f_{i}^{2}$, as shown below.

f_{i j}^{3} = \begin{cases} D (f_{j}^{2}, (H_{i}, W_{i})) & \text{if } j < i \\ I (f_{j}^{2}) & \text{if } j = i \\ U (f_{j}^{2}, (H_{i}, W_{i})) & \text{if } j > i \end{cases}

(5)

where D, I, and U stand for adaptive average pooling, identity mapping, and bilinear interpolation of $f_{j}^{2}$ to the resolution $H_{i} \times W_{i}$, respectively, and $1 \leq i, j \leq M$. Thereafter, to smooth each resized feature map $f_{i j}^{3}$, a $3 \times 3$ convolution is applied, as shown below.

f_{i j}^{4} = θ_{i j} (f_{i j}^{3})

(6)

where $θ_{i j}$ denotes the parameters of the smoothing convolution, and $f_{i j}^{4}$ is the j-th smoothed feature map at the i-th level. After all feature maps at the i-th level have been resized to a uniform resolution, this paper applies the element-wise Hadamard product to the resized feature maps to enrich the semantic and detail information of the i-th level features, as shown below.

f_{i}^{5} = H [f_{i 1}^{4}, f_{i 2}^{4}, . . ., f_{i M}^{4}]

(7)
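The resize-and-multiply logic of equations (5) and (7) can be sketched as follows. This is a deliberately reduced illustration: the attention of equation (4) and the smoothing convolution of equation (6) are treated as identity, feature maps are assumed square with power-of-two size ratios and a common channel count, block averaging stands in for adaptive average pooling, and nearest-neighbour repetition stands in for bilinear interpolation. The names `resize_to` and `sdi_fuse` are hypothetical.

```python
import numpy as np


def resize_to(feat, size):
    """Resize a (C, H, H) map to (C, size, size) for power-of-two ratios."""
    C, H, _ = feat.shape
    if H == size:                       # I: identity when j == i
        return feat
    if H > size:                        # D: average pooling when j < i
        k = H // size
        return feat.reshape(C, size, k, size, k).mean(axis=(2, 4))
    k = size // H                       # U: upsampling when j > i
    return feat.repeat(k, axis=1).repeat(k, axis=2)


def sdi_fuse(features, i):
    """Equation (7): Hadamard product of all levels resized to level i."""
    target = features[i].shape[1]
    fused = np.ones_like(features[i])
    for f in features:
        fused = fused * resize_to(f, target)
    return fused


# Level 0 is the highest-resolution (most detailed) map, level 2 the deepest.
levels = [np.full((4, 16, 16), 2.0),
          np.full((4, 8, 8), 3.0),
          np.full((4, 4, 4), 0.5)]
print(sdi_fuse(levels, 1).shape)
```

Because the fusion is multiplicative rather than concatenative, the output at level i keeps the channel count c regardless of how many levels M are combined.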

Detection head Dyhead based on dynamic convolution and attention mechanism

The detection head of the YOLOv8-based industrial electrical equipment detection model does not fully utilize the feature information at various scales and has limited capability in processing complex spatial information, which can result in false negatives or false positives when detecting occluded objects. Because the features fed to the detection head can be characterized as a rank-3 tensor of shape (L, S, C), where L is the number of scale levels, S is the spatial extent, and C is the number of feature channels, the ability of the detection head to recognize occluded objects can be improved by applying attention mechanisms along each of the three tensor dimensions. The dynamic detection head Dyhead³⁴ integrates attention mechanisms across all feature tensor dimensions of the detection head through unified scale-aware attention, spatial-aware attention, and task-aware attention.

Dyhead unifies the object detection head by combining several attention mechanisms, substantially improving the representational capacity of the detection head while maintaining computational efficiency. Dyhead first constructs a feature pyramid from the features produced by the SD-FPN part and rescales the feature levels to a common resolution, constructing a three-dimensional feature tensor F of shape (L, S, C), where L, S, and C stand for the dimensions of scale, space, and channel, respectively. Dyhead then applies scale-aware, spatial-aware, and task-aware attention to this tensor in sequence.

W (F) = π_{C} (π_{S} (π_{L} (F) \cdot F) \cdot F) \cdot F

(8)

where $π_{L} (\cdot)$, $π_{S} (\cdot)$, and $π_{C} (\cdot)$ are three separate attention functions applied to the dimensions L, S, and C, respectively: scale-aware attention, spatial-aware attention, and task-aware attention. The scale-aware attention module dynamically recalibrates multi-scale features along the level dimension of the tensor and is represented by the attention function $π_{L} (\cdot)$ as follows.

π_{L} (F) \cdot F = σ (f (\frac{1}{S C} \sum_{s, c} F)) \cdot F

(9)

where $σ (x) = \max (0, \min (1, \frac{x + 1}{2}))$ is the hard sigmoid function, and $f (\cdot)$ is a linear function implemented by a $1 \times 1$ convolution.
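Equation (9) can be sketched in a few lines of NumPy. The scalar `weight` and `bias` below are hypothetical stand-ins for the learned $1 \times 1$ convolution $f (\cdot)$, and the toy tensor is chosen only to make the gating visible; none of this is taken from the Dyhead source code.

```python
import numpy as np


def hard_sigmoid(x):
    """sigma(x) = max(0, min(1, (x + 1) / 2))."""
    return np.maximum(0.0, np.minimum(1.0, (x + 1.0) / 2.0))


def scale_aware_attention(F, weight=1.0, bias=0.0):
    """Equation (9) for a tensor F of shape (L, S, C).

    Each level is averaged over space and channels, passed through a
    linear map (a scalar stand-in for the 1x1 convolution) and the hard
    sigmoid, and the resulting gate rescales the whole level.
    """
    pooled = F.mean(axis=(1, 2))                  # shape (L,)
    gate = hard_sigmoid(weight * pooled + bias)   # one gate per level
    return gate[:, None, None] * F


# Three levels with constant activations -2, 0, and 2.
F = np.stack([np.full((4, 2), v) for v in (-2.0, 0.0, 2.0)])
print(scale_aware_attention(F)[:, 0, 0])
```

The gate is a single scalar per level, so this attention can suppress or keep an entire scale but cannot distinguish locations within it; that is left to the spatial-aware attention.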

The spatial-aware attention discriminates among spatial locations shared across feature levels. To further improve feature extraction for occluded objects, this article uses the deformable convolution DCNv4³⁵ in the spatial-aware attention mechanism to make the attention sparse, and then aggregates features from the various levels at the same spatial locations. It is represented by the attention function $π_{S} (\cdot)$ below, where K is the number of sparse sampling locations, $p_{k} + Δ p_{k}$ is the k-th sampling location shifted by the learned offset $Δ p_{k}$, $Δ m_{k}$ is a learned importance weight at location $p_{k}$, and l and c are the level and channel indices, respectively.

π_{S} (F) \cdot F = \frac{1}{L} \sum_{l = 1}^{L} \sum_{k = 1}^{K} w_{l k} \cdot F (l; p_{k} + Δ p_{k}; c) \cdot Δ m_{k}

(10)

The task-conditioned attention module selectively activates feature channels in a task-dependent manner.

π c (F) \cdot F = \max (α^{1} (F) \cdot F_{c} + β^{1} (F), α^{2} (F) \cdot F_{c} + β^{2} (F))

(11)

where $θ (\cdot) = {[α^{1}, α^{2}, β^{1}, β^{2}]}^{T}$ is a hyperfunction that learns to control the activation thresholds, while $F_{c}$ is the feature slice of the c-th channel. The three attention modules (scale-aware, spatial-aware, and task-aware) are integrated into the detection head in sequence, significantly improving robustness against occlusion and the detection of small defects.
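Equation (11) is a channel-wise maximum of two affine functions of the feature values. The short Python sketch below evaluates it for one channel slice; the four coefficients are passed in directly, whereas in Dyhead they are produced from the input features by the hyperfunction $θ (\cdot)$. The function name is hypothetical.

```python
def task_aware_attention(channel_slice, alpha1, beta1, alpha2, beta2):
    """Apply max(alpha1 * x + beta1, alpha2 * x + beta2) to every value of one channel."""
    return [
        max(alpha1 * value + beta1, alpha2 * value + beta2)
        for value in channel_slice
    ]


# With (alpha1, beta1, alpha2, beta2) = (1, 0, 0, 0) the activation reduces to ReLU.
print(task_aware_attention([-2.0, 3.0], 1.0, 0.0, 0.0, 0.0))

# A small positive alpha2 keeps a weak response for negative inputs instead.
print(task_aware_attention([-2.0, 3.0], 1.0, 0.0, 0.1, 0.0))
```

Because the coefficients depend on the input, each channel can be switched on, switched off, or rescaled according to the task, which is the behaviour described for the task-aware module.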

Focaler-PIoU loss function

The original YOLOv8 uses CIoU as its bounding box regression loss. Although CIoU reflects the model’s prediction accuracy in most scenarios, it still has shortcomings such as vanishing gradients and limited sensitivity. This article therefore replaces the original CIoU loss function with the Powerful-IoU³⁶ loss function. Powerful-IoU (PIoU) converges faster and reaches higher accuracy than CIoU by combining a penalty factor that adapts to the target size with a gradient adjustment function based on anchor box quality, and it further strengthens the focus on medium-quality anchor boxes through a non-monotonic attention mechanism.

The PIoU loss function comprises a size-adaptive penalty factor and a gradient adjustment function based on anchor box quality. First, a penalty factor P that adapts to the target size is introduced:

P = \frac{1}{4} (\frac{| x_{1} - x_{1}^{'} |}{w} + \frac{| x_{2} - x_{2}^{'} |}{w} + \frac{| x_{3} - x_{3}^{'} |}{h} + \frac{| x_{4} - x_{4}^{'} |}{h})

(12)

where $x_{1}$, $x_{2}$, $x_{3}$, and $x_{4}$ are the four boundaries of the predicted box, $x_{1}^{'}$, $x_{2}^{'}$, $x_{3}^{'}$, and $x_{4}^{'}$ are the four boundaries of the target box, and w and h are the width and height of the target box. Because it is normalized by the size of the target box, this penalty factor avoids unnecessary enlargement of the anchor box during regression. Next, the gradient adjustment function $Φ$ is defined as

Φ (x) = 1 - e^{- x^{2}}

(13)

This function yields a small gradient when the anchor box quality is poor and the largest gradient when the quality is moderate, which helps the anchor box regress quickly and accurately to the target box. The final expression of the PIoU loss is as follows.

L_{PIoU} = 1 - IoU + Φ (P)

(14)
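The quantities in equations (12) to (14) can be evaluated with a few lines of Python. The sketch below assumes boxes given as (x_min, y_min, x_max, y_max), assumes that $x_{1}$ and $x_{2}$ are the left and right edges and $x_{3}$ and $x_{4}$ the top and bottom edges, and assumes the published Powerful-IoU form of the loss, in which the penalty term is added to the ordinary IoU loss. The function names are illustrative.

```python
import math


def box_iou(pred, target):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(pred[2], target[2]) - max(pred[0], target[0]))
    inter_h = max(0.0, min(pred[3], target[3]) - max(pred[1], target[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_target = (target[2] - target[0]) * (target[3] - target[1])
    union = area_pred + area_target - inter
    return inter / union if union > 0 else 0.0


def piou_penalty(pred, target):
    """Penalty factor P of equation (12): edge distances scaled by the target size."""
    w = target[2] - target[0]
    h = target[3] - target[1]
    if w <= 0 or h <= 0:
        raise ValueError("target box must have positive width and height")
    return 0.25 * (
        abs(pred[0] - target[0]) / w    # left edges
        + abs(pred[2] - target[2]) / w  # right edges
        + abs(pred[1] - target[1]) / h  # top edges
        + abs(pred[3] - target[3]) / h  # bottom edges
    )


def phi(x):
    """Gradient adjustment function of equation (13)."""
    return 1.0 - math.exp(-x * x)


def phi_grad(x):
    """Derivative of phi: 2 * x * exp(-x^2)."""
    return 2.0 * x * math.exp(-x * x)


def piou_loss(pred, target):
    """Assumed PIoU loss: ordinary IoU loss plus the adjusted penalty."""
    return 1.0 - box_iou(pred, target) + phi(piou_penalty(pred, target))


target_box = (0.0, 0.0, 2.0, 2.0)
shifted_box = (1.0, 0.0, 3.0, 2.0)
print(piou_penalty(shifted_box, target_box), piou_loss(shifted_box, target_box))
```

Differentiating equation (13) gives Φ′(x) = 2x e^{−x²}, which peaks at x = 1/√2 and becomes small both near x = 0 and for large x. This matches the behaviour described above: a small gradient for poor-quality anchor boxes with a large penalty, and the largest gradient for anchor boxes of moderate quality.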

To improve the loss function’s ability to adapt its regression to samples of varying difficulty, PIoU is nested with Focaler-IoU to form Focaler-PIoU. Focaler-IoU reconstructs the IoU loss through a piecewise linear mapping of the intersection-over-union value, so that bounding box regression can focus on samples of different difficulty, as shown in equation (15).

IoU_{focaler} = \begin{cases} 0, & if IoU < d \\ \frac{IoU - d}{u - d}, & if d \leq IoU \leq u \\ 1, & if IoU > u \end{cases}

(15)

where d and u are preset thresholds in the range [0, 1] that delimit the intervals of samples with different difficulty. When $IoU < d$, the loss encourages the model to improve the overlap with the ground-truth bounding box, while when $IoU > u$ it places higher demands on the model’s improvement, to further enhance the detection accuracy.
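The piecewise mapping of equation (15) is easy to check numerically. The sketch below implements it and then combines it with a base IoU-type loss using the rule L + IoU − IoU_focaler, which is how the Focaler-IoU formulation is applied to other IoU-based losses; applying the same rule to PIoU is an assumption here, since the nested expression is not written out in the text. The threshold values in the example are hypothetical.

```python
def focaler_iou(iou, d, u):
    """Piecewise linear remapping of IoU from equation (15)."""
    if not 0.0 <= d < u <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= d < u <= 1")
    if iou < d:
        return 0.0
    if iou > u:
        return 1.0
    return (iou - d) / (u - d)


def focaler_loss(base_loss, iou, d, u):
    """Assumed nesting rule: base loss plus the gap between IoU and its remapped value."""
    return base_loss + iou - focaler_iou(iou, d, u)


# Hypothetical thresholds d = 0.2 and u = 0.8.
for iou in (0.1, 0.5, 0.9):
    print(iou, focaler_iou(iou, d=0.2, u=0.8))
```

Moving d and u changes which IoU range receives a non-zero slope, and therefore which samples contribute a gradient through the remapped term.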

Experimental design and analysis of results

Analysis of industrial electrical automation detection results

This paper uses the industrial electrical cabinet fault dataset collected in Ref. 37, which includes periodic vibration signals collected at a constant speed of 800 r/min under no-load conditions with a sampling frequency of 15 kHz, as well as 8637 electrical cabinet fault images. The experiments were run on Ubuntu 18.04 with an Intel Core i7-8700K processor, 16 GB of memory, and an NVIDIA GeForce GTX 1080 Ti graphics card. The deep learning framework was TensorFlow 2.0. The number of epochs was set to 500, the batch size to 12, and the learning rate to 0.001; the optimizer was SGD, and the training mode was parallel training.

Three groups were set up, with the group using the approach proposed in this article as the experimental group. The vibration waveforms of the fault detection results of the three groups under different sample data volumes are plotted in Figure 2. The vibration amplitudes of the two control groups were generally larger, which increased the complexity and uncertainty of fault detection to some extent. Furthermore, the two control groups produced various abnormal detection results during detection, including cases in which the abnormality could not be clearly detected. These phenomena directly degraded the overall detection performance, indicating that traditional detection methods have limited ability to capture and identify subtle faults without specific processing or optimization. In contrast, when the experimental group reached 600 sample points, the vibration amplitude dropped abruptly to below −2 mm, which clearly indicated a fault at this point. This pronounced change demonstrates the effectiveness of the proposed approach and its advantage in accurately identifying faults in complex vibration data.

Figure 2.

Vibration waveforms for electrical equipment fault detection under different data quantities.

In the analysis of detection performance in industrial electrical automation, the loss curve represents the difference between the actual target value and the predicted target value. All parameters of the proposed method (OURS) were further optimized, allowing the loss value to decrease continuously and yielding a more accurate network model. As shown in Figure 3(a), the model converged well owing to the optimized loss function. IoU is another vital indicator for evaluating the detection performance of the model: the closer the predicted box is to the annotated box, the higher the overlap between the two. As shown in Figure 3(b), as the number of batches increased, the predicted rectangular box came to overlap the target device almost completely.

Figure 3.

Loss curve and IOU curve.

Quantitative analysis of detection performance

To further verify the detection performance of the proposed method (OURS), this paper uses Precision, Recall, mean average precision (mAP), and AUC as quantitative evaluation indicators to compare OURS objectively with SIFT-SVM,¹⁷ SV-DCNN,¹⁸ CA-Trans,²¹ and YOLOv8,²² as shown in Table 1. The Precision and Recall of OURS are 93.64% and 91.52%, respectively, which are at least 4.26 and 4.38 percentage points higher than those of the baseline methods. The mAP of OURS is 93.76%, which is 13.65, 10.07, 8.49, and 5.42 percentage points higher than those of SIFT-SVM, SV-DCNN, CA-Trans, and YOLOv8, respectively. The AUC of the ROC curve for OURS is 0.9848, which is the closest to 1 and higher than that of every baseline method, indicating that the detection accuracy of OURS is high. OURS not only obtains relatively clear images of industrial electrical equipment through stereo vision algorithms, but also uses an improved CNN to extract multi-scale features of the images. In addition, the YOLOv8 algorithm is improved: the proposed SD-FPN module effectively compensates for the weakness of the YOLOv8 model in cross-layer feature fusion, so that the method can meet the needs of high-precision detection in complex industrial electrical automation environments.

Table 1.

Comparison of detection performance of different technologies.

Method	Precision/%	Recall/%	mAP/%	AUC
SIFT-SVM	82.59	79.57	80.11	0.8926
SV-DCNN	83.61	81.48	83.69	0.9153
CA-Trans	87.12	85.93	85.27	0.9268
YOLOv8	89.38	87.14	88.34	0.9629
OURS	93.64	91.52	93.76	0.9848
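The gains quoted in the text can be recomputed directly from Table 1. The following sketch does so and separates absolute gains, in percentage points, from relative gains, in percent of the baseline value; the variable and function names are illustrative only.

```python
# Values copied from Table 1 (Precision, Recall, and mAP in %, AUC unitless).
TABLE_1 = {
    "SIFT-SVM": {"precision": 82.59, "recall": 79.57, "mAP": 80.11, "auc": 0.8926},
    "SV-DCNN": {"precision": 83.61, "recall": 81.48, "mAP": 83.69, "auc": 0.9153},
    "CA-Trans": {"precision": 87.12, "recall": 85.93, "mAP": 85.27, "auc": 0.9268},
    "YOLOv8": {"precision": 89.38, "recall": 87.14, "mAP": 88.34, "auc": 0.9629},
    "OURS": {"precision": 93.64, "recall": 91.52, "mAP": 93.76, "auc": 0.9848},
}


def gains_over_baselines(table, metric="mAP", ours="OURS"):
    """Return {method: (absolute gain, relative gain in %)} of `ours` for one metric."""
    reference = table[ours][metric]
    result = {}
    for name, row in table.items():
        if name == ours:
            continue
        absolute = reference - row[metric]         # difference in the metric's own units
        relative = 100.0 * absolute / row[metric]  # percent of the baseline value
        result[name] = (round(absolute, 2), round(relative, 2))
    return result


for method, (absolute, relative) in gains_over_baselines(TABLE_1).items():
    print(f"{method}: +{absolute} points, +{relative}% relative")
```

The smallest absolute mAP gain, 5.42 percentage points over YOLOv8, corresponds to a relative gain of roughly 6.1% of the YOLOv8 value, so the two ways of reporting the improvement should not be confused.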

Conclusion

To address the low detection accuracy and insufficient feature extraction of existing industrial electrical automation detection technologies, this article proposes an intelligent detection technology for industrial electrical automation based on stereo vision and feature fusion. First, because traditional visual imaging systems cannot accurately obtain three-dimensional information about the target, a multi-view stereo vision measurement model is established, and the projective transformation between a target point in space and its image points in the stereo image pair is analyzed; the tensor-based planar calibration method is used to calibrate the equipment parameters, and stereo correction of the image pairs of electrical equipment is performed using the calibrated parameters. Then, a CNN structure with parallel branches is designed based on ResNet-50. After each convolutional structure, the network splits into two branches: the right branch continues the convolution operations of the original feature extraction backbone to extract global features, while the left branch extracts local features. Finally, the improved YOLOv8 algorithm is used for fault detection of industrial electrical equipment. Several improvements address the shortcomings of YOLOv8 in fault detection of electrical equipment. The feature fusion part of the network is redesigned to enhance the model’s ability to fuse features of targets at various scales. The Dyhead dynamic detection head is introduced to improve the efficiency of the network in extracting features of occluded targets. The loss function is redesigned using Focaler-IoU, which effectively addresses the problem of sample imbalance. Experimental results show that the mAP of the proposed technology is 93.76%, which is 5.42 to 13.65 percentage points higher than that of the compared methods. The proposed technology therefore achieves relatively accurate fault detection for industrial electrical automation and meets the demand for high-precision detection in complex industrial electrical automation environments.

The industrial electrical automation fault detection algorithm proposed in this paper significantly improves detection accuracy on the test dataset compared with the original model, but the model has not yet been made sufficiently lightweight, which may hinder deployment on mobile devices. In future work, model compression methods such as pruning and distillation can be considered to reduce the number of model parameters with minimal loss of precision.

Footnotes

ORCID iD

Yun Liu

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Babayigit, Abubaker. Industrial internet of things: a review of improvements over traditional scada systems for industrial automation. IEEE Syst J 2023; 18: 120–133.

2. Muttaqi, Aghaei, Ganapathy, et al. Technical challenges for electric power industries with implementation of distribution system automation in smart grids. Renew Sustain Energy Rev 2015; 46: 129–142.

3. Cheng. Thermal fault detection and severity analysis of mechanical and electrical automation equipment. International Journal of Heat & Technology 2022; 40: 15–22.

4. Kareem, Kalra, et al. Optimization of electric automation control model based on artificial intelligence algorithm. Wireless Commun Mobile Comput 2022; 20: 62–77.

5. Han J-H, Yang, Lee B-U. A novel 3-D color histogram equalization method with uniform 1-D gray scale histogram. IEEE Trans Image Process 2010; 20: 506–512.

6. Wang, Zhang, et al. An intelligent detection method for approach distances of large construction equipment in substations. Electronics 2023; 12: 35–43.

7. Dang, Liu, Lee S-K. State evaluation of electrical equipment in substations based on data mining. Appl Sci 2024; 14: 48–56.

8. Luo, et al. Image recognition technology with its application in defect detection and diagnosis analysis of substation equipment. Sci Program 2021; 2: 21–30.

9. Ren, Zheng, Yang, et al. A cross-condition fault disentangling matching network for industrial early fault enhancement. IEEE Trans Ind Electron 2024; 72: 7585–7594.

10. Zhang, Zhu. Research on algorithm for improving infrared image defect segmentation of power equipment. Electronics 2023; 12: 15–28.

11. Zuo, Zhang, Song, et al. Gradient histogram estimation and preservation for texture enhanced image denoising. IEEE Trans Image Process 2014; 23: 2459–2472.

12. Ke-Chen, Yun-Hui, Wen-Hui. Research and perspective on local binary pattern. Acta Autom Sin 2013; 39: 730–744.

13. Van Noord, Postma. Learning scale-variant and scale-invariant features for deep image classification. Pattern Recogn 2017; 61: 583–592.

14. Jeong S-H, Park J-W, Kim H-S. Deep neural network-based lifetime diagnosis algorithm with electrical capacitor accelerated life test. J Power Sources 2024; 599: 23–41.

15. Naeem, Ali, Anam, et al. An unsupervised machine learning algorithms: comprehensive review. International Journal of Computing and Digital Systems 2023; 4: 71–78.

16. Manikonda, Gaonkar. Islanding detection method based on image classification technique using histogram of oriented gradient features. IET Gener Transm Distrib 2020; 14: 2790–2799.

17. Ullah, Khan, Yang, et al. Deep learning image-based defect detection in high voltage electrical equipment. Energies 2020; 13: 39–40.

18. Tang, Jian. Thermal fault diagnosis of complex electrical equipment based on infrared image recognition. Sci Rep 2024; 14: 47–55.

19. Zhang, Chang, Meng, et al. Equipment detection and recognition in electric power room based on faster R-CNN. Procedia Comput Sci 2021; 183: 324–330.

20. Thomas, Chaudhari, Shihabudheen, et al. CNN-based transformer model for fault detection in power system networks. IEEE Trans Instrum Meas 2023; 72: 1–10.

21. Yoon D-H, Yoon. Development of a real-time fault detection method for electric power system via transformer-based deep learning model. Int J Electr Power Energy Syst 2024; 159: 12–25.

22. Bellou, Pisica, Banitsas. Aerial inspection of high-voltage power lines using YOLOv8 real-time object detector. Energies 2024; 17: 25–35.

23. Tian, Liu, Wang, et al. High quality 3D reconstruction based on fusion of polarization imaging and binocular stereo vision. Inf Fusion 2022; 77: 19–28.

24. Tang, Von Gioi, Monasse, et al. A precision analysis of camera distortion models. IEEE Trans Image Process 2017; 26: 2694–2704.

25. Liu, et al. Layn: lightweight multi-scale attention yolov8 network for small object detection. IEEE Access 2024; 12: 29294–29307.

26. Dong. YOLO-SE: improved YOLOv8 for remote sensing object detection and recognition. Appl Sci 2023; 13: 12–27.

27. Vakili, Karimian, Shoaran, et al. Valid-IoU: an improved IoU-based loss function and its application to detection of defects on printed circuit boards. Multimed Tool Appl 2024; 2: 21–24.

28. Improvement of the AlexNet networks for large-scale recognition applications. Iran J Sci Technol Trans Electr Eng 2021; 45: 493–503.

29. Tammina. Transfer learning using vgg-16 with deep convolutional neural network for classifying images. International Journal of Scientific and Research Publications (IJSRP) 2019; 9: 143–150.

30. Y-L, Zhu. ResNet and its application to medical image processing: research progress and challenges. Comput Methods Progr Biomed 2023; 240: 10–16.

31. Doherty, Gardiner, Kerr, et al. BiFPN-yolo: one-stage object detection integrating Bi-directional feature pyramid networks. Pattern Recogn 2025; 160: 12–29.

32. Liu, Wang, et al. Bearing-detr: a lightweight deep learning model for bearing defect detection based on rt-detr. Sensors 2024; 24: 1–14.

33. Zhou, Yang, Wang, et al. QCF-YOLO: a light weight model of surface defect detection for quick-connect fittings. IEEE Sens J 2024; 25: 1716–1731.

34. Shen, Lang, Song. Infrared object detection method based on DBD-YOLOv8. IEEE Access 2023; 11: 145853–145868.

35. Hua, Chen. A new lightweight network for efficient UAV object detection. Sci Rep 2024; 14: 13–19.

36. Nai, et al. Siamese target estimation network with AIoU loss for real-time visual tracking. J Vis Commun Image Represent 2021; 77: 10–21.

37. Verma, Nagpal, Desai, et al. An efficient neural-network model for real-time fault detection in industrial machine. Neural Comput Appl 2021; 33: 1297–1310.