Performance optimization of rail inspection robot system based on deep vision and machine learning

Abstract

The inspection of railway infrastructure faces significant challenges due to heterogeneous environmental conditions and non-uniform illumination, which lead to suboptimal detection performance in conventional robotic systems. This study develops a multi-stage image enhancement pipeline that incorporates adaptive target segmentation and stereoscopic correspondence matching. A cross-sensor calibration protocol establishes precise spatial coordinates for defect localization through binocular disparity analysis. The proposed framework integrates an enhanced YOLOv5 architecture with context-aware attention modules, yielding a hierarchical feature learning architecture that combines pyramidal representation with bidirectional multi-scale feature fusion layers. Experimental validation demonstrates 91.5% accuracy in missing-fastener detection with optimized computational efficiency, indicating substantial improvements in automated rail defect diagnostics compared to baseline systems.

Keywords

rail inspection robot, machine learning, deep vision, YOLOv5 model, attention mechanism

Introduction

In recent years, with the rapid development of rail transit, the safe operation and maintenance of railway and urban rail systems have faced increasingly severe challenges. Defects such as cracks, wear, and loose bolts in the track structure, as well as abnormal conditions of critical equipment such as contact lines and switches, can all lead to major safety accidents.¹ Traditional manual inspection methods rely on inspectors walking along track lines at regular intervals or using simple tools to conduct visual inspections, making it difficult to identify and locate defects comprehensively, promptly, and accurately.² In addition, manual inspections mostly rely on paper documents or basic electronic spreadsheets, lacking efficient information management methods,³ and are unable to meet the refined and intelligent requirements of modern rail transit for safe operation and maintenance. Breakthroughs in deep learning and computer vision technologies have provided new solutions for rail inspection.⁴ Deep learning-based image recognition algorithms can efficiently detect subtle defects on track surfaces, effectively overcoming the limitations of traditional manual inspections, significantly improving inspection efficiency and defect identification accuracy,⁵ and providing strong technical support for the safe operation of rail transit systems.

The track inspection robot system is primarily designed to detect defects such as cracks, wear, and loose bolts in track structures. Hongbo et al.⁶ located the fastener area based on grayscale and gradient features, extracted Haar features from the track image, and used the AdaBoost algorithm⁷ to classify the fastener images, but the detection results were easily affected by lighting. Fan et al.⁸ extracted HOG features from fastener images and trained a support vector machine (SVM) classifier to identify them, which improved the detection rate but requires high positioning accuracy for the fastener images. Biswas et al.⁹ extracted track image features by fusing Shi-Tomasi and Harris-Stephens features and classified fasteners using an improved SVM, achieving a detection success rate of 81.25%. Manikandan et al.¹⁰ combined local binary pattern features, gray-level co-occurrence matrix features, and the discrete wavelet transform as image features and used decision trees for recognition and classification, but only for track crack detection. Han et al.¹¹ adopted a top-down approach, utilizing edge density maps and the RANSAC algorithm¹² to perform coarse localization of the fastener region, and then used support vector regression (SVR) to determine whether fasteners were missing, achieving a fastener detection success rate of 85.6%.

Deep learning-based track image detection methods can extract more generalized track image features, offering higher algorithm robustness and detection accuracy. The first type is the two-stage detection network for target localization and classification. This type of network first generates a series of candidate regions as detection samples and then predicts the location and class of the target in each region.¹³ Chen et al.¹⁴ used a material classification and semantic segmentation algorithm based on deep convolutional neural networks (CNN) to achieve a material classification accuracy of 93.35%. Because defective samples are scarce in track images, Gibert¹⁵ applied multiple detectors such as SVM and CNN in a multi-task learning framework to improve detection accuracy; compared with traditional feature analysis techniques, this approach generalized better and improved detection efficiency. Wei et al.¹⁶ used the Fast R-CNN method to identify and detect defects in track images, but the detection error was relatively high. Aydin et al.¹⁷ utilized a CNN and restricted Boltzmann machines¹⁸ to extract edge and texture features from track fastener images, fused these features, and detected defects using the Mahalanobis distance as a similarity measure, achieving a fastener defect detection rate of 85.06%.

The second type integrates target detection and localization into a unified detection network. This type of network generally does not generate candidate regions; instead, it treats detection as a regression problem and directly predicts the bounding boxes and categories of the targets. Guo et al.¹⁹ proposed a track image detection method based on the Transformer architecture, which achieved a crack detection accuracy of 86.93% through an encoder-decoder structure. Brintha et al.²⁰ trained YOLO as a deep learning model using the Darknet-53 prediction structure to improve the network’s classification and recognition capabilities, achieving an average detection success rate of 87.08%. Cai et al.²¹ used a residual CNN structure and introduced visual image color mixing enhancement technology to improve detection speed.

In summary, under complex and changing track conditions and unstable lighting, current track inspection robot systems are insufficiently intelligent and lack adequate environmental sensing and adaptation, making it difficult to detect track structural defects efficiently. In response, this paper proposes an optimization method for track inspection robot systems based on deep learning and image processing and conducts performance analysis experiments. First, discriminant analysis is used to determine the optimal image segmentation threshold. The gray values of all pixels are then compared with this threshold and reassigned, so that binarization removes the background information, makes the target information more prominent, and yields a high-quality track structure image. Subsequently, binocular vision calibration is performed on the cameras, and the depth of corresponding points is calculated to obtain the three-dimensional spatial information of the defective parts on the track, thereby determining the specific location of the track defects. On this basis, a track defect image detection model based on an improved YOLOv5 model and an attention mechanism is designed. This model uses an improved FPN structure combined with an attention mechanism as the backbone network, an improved FPN combined with a path aggregation network as the neck structure, and a decoupled dual-path YOLO detection head. The convolution modules of the backbone network, neck structure, and detection head are all designed to be lightweight. Finally, performance analysis experiments are conducted. The comparison experiments show that the proposed method achieves a detection accuracy of over 90% for track fastener defects, outperforming the comparison methods, and the ablation experiments confirm the effectiveness of all components in the model.

Methods and materials

Overview of the rail inspection robot system

The rail inspection robot system is an intelligent detection platform that integrates mechanical automation, sensor technology, computer vision, and artificial intelligence algorithms. The existing deep learning-based track inspection robot system can automatically extract features and accurately identify potential risks such as track surface defects and foreign object intrusion compared to traditional manual inspection methods. By collecting a large amount of inspection data, deep learning is utilized for trend analysis and predictive maintenance. The system mainly consists of edge computing devices, algorithm models, databases, and web applications,²² as shown in Figure 1.

(1) Edge computing devices. These are computer systems that provide data communication services and run the detection algorithms locally.²³ During inspections, they perform image recognition and the corresponding logical operations, and transmit the inspection results over the enterprise local area network.

(2) Algorithm models. Intelligent algorithms perform image recognition on the images sampled during patrol inspections and are mainly used for knob switch status detection, track structure damage detection, and other functions.²⁴

(3) Web application. Presents daily inspection data and results to users in a visual manner, and provides interactive features such as viewing inspection progress and querying historical inspection results.

(4) Database. Used to store and manage data, save the results of each inspection and the sample images that generated error identifications, for later data tracing and model iteration updates.

Figure 1.

Rail inspection robot system architecture.

Convolutional neural network

A CNN is a deep learning model specifically designed to process data with a grid structure (such as images, videos, and speech).²⁵ Neurons in a convolutional layer are connected only to local regions of the input (local receptive fields) rather than to the entire input. This local connectivity reduces the number of parameters and captures local spatial features, and weight sharing makes the network robust in recognizing similar features at different locations. In contrast, traditional neural networks usually use a fully connected architecture in which each neuron is connected to all neurons in the previous layer; the resulting large number of parameters can easily lead to overfitting, especially with high-dimensional data. CNNs mainly consist of convolutional layers, pooling layers, and fully connected layers,²⁶ as described below.

(1) Convolutional layer. By introducing local receptive fields and weight sharing,²⁷ the number of parameters required for feature extraction is significantly reduced. If the input of the convolutional layer is an N-dimensional feature map, the output features are calculated as follows.

y_{j}^{l} = f\left(\sum_{i = 1}^{N} w_{i j}^{l} * y_{i}^{l - 1} + b_{j}^{l}\right), \quad j = 1, 2, \ldots, M

(1)

where $y_{j}^{l}$ is the $j$-th output feature map of the $l$-th convolutional layer, $f(\cdot)$ is the activation function, $w_{i j}^{l}$ is the parameter matrix (convolution kernel) linking the $i$-th input channel to the $j$-th output channel in the $l$-th layer, $y_{i}^{l - 1}$ is the $i$-th channel of the input feature map from the $(l - 1)$-th layer, $*$ is the convolution operation, $b_{j}^{l}$ is the $j$-th bias of the $l$-th convolutional layer, and $M$ is the number of output feature channels.

(2) Pooling layer. By reducing the size of network features, redundant image features are eliminated based on various image downsampling rules. The output features corresponding to the pooling layer are as follows, where $\mathrm{down}(\cdot)$ is the downsampling operation.

y^{l} = \mathrm{down}(y^{l - 1})

(2)

(3) Fully connected layer. This layer integrates features and transforms their dimension, eliminating redundant feature information as far as possible and retaining only the required target information. It is calculated as follows.

y_{i}^{l} = f\left(\sum_{j = 1}^{N} w_{i j}^{l} \times x_{j}^{l - 1} + b_{i}^{l}\right)

(3)

where $y_{i}^{l}$ is the $i$-th feature value in the output features of the $l$-th layer, $x_{j}^{l - 1}$ is the $j$-th feature value in the input features from the $(l - 1)$-th layer, $w_{i j}^{l}$ is the parameter weight corresponding to each feature value, and $b_{i}^{l}$ is the bias term.
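The three layer types in equations (1) to (3) can be illustrated with a minimal NumPy sketch. It is a toy forward pass under stated assumptions rather than the network used in this paper: ReLU is assumed for $f(\cdot)$, max pooling is assumed for the downsampling rule, and the function names `conv_layer`, `max_pool`, and `fully_connected` are hypothetical.

```python
import numpy as np

def conv_layer(x, w, b):
    """Eq. (1): x is (N, H, W), w is (M, N, k, k), b is (M,).

    Uses 'valid' sliding-window correlation (the usual deep learning
    convention for convolution) and sums over the N input channels.
    ReLU is assumed as the activation f.
    """
    n_out, _, k, _ = w.shape
    h_out, w_out = x.shape[1] - k + 1, x.shape[2] - k + 1
    y = np.zeros((n_out, h_out, w_out))
    for j in range(n_out):
        for r in range(h_out):
            for c in range(w_out):
                y[j, r, c] = np.sum(w[j] * x[:, r:r + k, c:c + k]) + b[j]
    return np.maximum(y, 0.0)

def max_pool(y, size=2):
    """Eq. (2): non-overlapping max pooling as one possible down(.) rule."""
    ch, h, w = y.shape
    h2, w2 = h // size, w // size
    y = y[:, :h2 * size, :w2 * size]
    return y.reshape(ch, h2, size, w2, size).max(axis=(2, 4))

def fully_connected(x, w, b):
    """Eq. (3): x is (N,), w is (out, N), b is (out,). ReLU assumed for f."""
    return np.maximum(w @ x + b, 0.0)

if __name__ == "__main__":
    image = np.ones((1, 4, 4))                      # one input channel
    features = conv_layer(image, np.ones((2, 1, 3, 3)), np.zeros(2))
    pooled = max_pool(features)
    print(fully_connected(pooled.ravel(), np.ones((3, 2)), np.zeros(3)))
```

The explicit loops keep the correspondence with equation (1) visible; a practical implementation would rely on a framework's optimized convolution.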

Rail image preprocessing and spatial location acquisition of defects

Track image preprocessing

To detect defects in rail-related components precisely, the collected images are first preprocessed through target segmentation and binocular target matching,²⁸ which extracts and enhances the useful information in the images. Binocular vision calibration is then performed on the cameras, and the depth of corresponding (homonymous) points is calculated to obtain the three-dimensional spatial position of defective track fasteners, laying the groundwork for the subsequent detection of defective track structures. The track structure image preprocessing flow is shown in Figure 2.

Figure 2.

Track structure image preprocessing flow.

Before the collected track environment images are binarized, an image segmentation threshold must be selected. A threshold that is too large loses necessary information, while a threshold that is too small extracts unnecessary information. Discriminant analysis is used to determine the optimal threshold. The specific steps are as follows.

(1) Analyze the gray-level distribution of all pixels in the image. Let $m$ represent the total number of pixels in the image and $m_{i}$ the number of pixels with gray value $i$ obtained by histogram analysis, where $i$ ranges over the set $S$ of gray-level values. The zeroth-order moment of the gray distribution²⁹ at gray level $l$ is defined in equation (4), and the first-order moment at gray level $l$ is defined in equation (5).

φ (l) = \sum_{i = 1}^{l} \frac{m_{i}}{m}

(4)

τ (l) = \sum_{i = 1}^{l} \frac{i m_{i}}{m}

(5)

(2) Calculate the interclass variance from the zeroth-order and first-order moments of the gray distribution at gray level $l$. The value of $l$ at which the interclass variance reaches its maximum is the optimal segmentation point, that is, the optimal threshold $l^{″}$, as shown below.

l^{″} = \arg\max_{l} {[φ (l) - τ (l)]}^{2}

(6)

By applying binarization processing,³⁰ background information in the image can be removed, making the target information more prominent. After obtaining the optimal segmentation threshold, the grayscale values of all pixels are compared with this value, and the grayscale values of the pixels are reassigned accordingly. Finally, the entire image is redrawn. The assignment rules are as follows.

T_{f} (a^{″}, b^{″}) = \begin{cases} 1, & s (a^{″}, b^{″}) ⩾ l^{″} \\ 0, & s (a^{″}, b^{″}) < l^{″} \end{cases}

(7)

where $s (a^{″}, b^{″})$ is the original pixel grayscale and $T_{f} (a^{″}, b^{″})$ is the grayscale of the processed image.
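The threshold selection and the assignment rule of equation (7) can be sketched in NumPy as follows. This is a minimal illustration, assuming 8-bit grayscale input and using the standard Otsu form of the interclass variance, $[μ_{T} φ(l) - τ(l)]^{2} / (φ(l) [1 - φ(l)])$, which is normalized differently from the expression printed in equation (6). The function names `otsu_threshold` and `binarize` are hypothetical.

```python
import numpy as np

def otsu_threshold(gray):
    """Pick the threshold that maximizes the interclass variance.

    gray: integer array with values in [0, 255] (assumption: 8-bit image).
    Returns a threshold t such that pixels >= t form the bright class.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    levels = np.arange(256)
    phi = np.cumsum(p)             # zeroth-order moment, cf. eq. (4)
    tau = np.cumsum(levels * p)    # first-order moment, cf. eq. (5)
    mu_total = tau[-1]
    denom = phi * (1.0 - phi)
    sigma_b = np.zeros(256)
    valid = denom > 0              # skip levels where one class is empty
    sigma_b[valid] = (mu_total * phi[valid] - tau[valid]) ** 2 / denom[valid]
    # argmax gives the last gray level of the dark class, so the bright
    # class starts one level higher, matching the ">=" rule in eq. (7).
    return int(np.argmax(sigma_b)) + 1

def binarize(gray, threshold):
    """Eq. (7): 1 where the gray value is >= threshold, 0 elsewhere."""
    return (gray.astype(np.int32) >= threshold).astype(np.uint8)

if __name__ == "__main__":
    image = np.array([[50, 50, 200, 200]] * 4, dtype=np.uint8)
    t = otsu_threshold(image)
    print(t, binarize(image, t))
```

Cumulative sums make the search over all 256 candidate thresholds a single vectorized pass over the histogram.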

Obtaining the spatial location of track structural defects

The spatial positions of the defective track components are obtained from the preprocessed images described above. The image segmentation and binocular target matching performed during preprocessing extract and enhance the useful information in the images, so that the spatial positions of track defects can be obtained more accurately. Binocular vision calibration is performed on the cameras, and the depth of corresponding (homonymous) points³¹ is calculated to obtain the three-dimensional spatial information of the defective track parts. For the depth calculation, the upper left corner of the image is set as the origin; the positive X-axis extends horizontally to the right from this point, and the positive Y-axis extends vertically downward. Every pixel in the image, including the target point, then has a unique coordinate.

The first step in the calculation is to obtain the coordinates of the corresponding points in the two images, denoted by $L_{1} (x_{1}, y_{1})$ and $L_{2} (x_{2}, y_{2})$, respectively. Let $w$ be the angle between the optical axis and the line connecting the physical point in the relative coordinate system to the pixel point. The vertical angle $v$ and horizontal angle $\tilde{ω}$ of the target point are calculated from the calibration results, as shown in equations (8) and (9), respectively.

v = \arctan\left(\frac{(y - \frac{E}{2}) \cdot w}{u \cdot α \cdot β}\right)

(8)

\tilde{ω} = \arctan\left(\frac{(x - \frac{Q}{2}) \cdot w}{u \cdot α \cdot β}\right)

(9)

where $y$ is the vertical coordinate of the target point, $E$ is the height of the captured image, $u$ is the calibrated focal length of the camera, $Q$ is the width of the track image, $x$ is the horizontal coordinate of the target point, $α$ is the calibrated horizontal field of view of the camera, and $β$ is the calibrated vertical field of view of the camera.

Based on the triangular relationship formed by the two pixel points and the physical point, the preprocessed image information is used to obtain the final relative coordinates of the target point, as shown in equation (10), where $N$ is the distance between the two cameras, ${\tilde{ω}}_{1}$ is the horizontal angle of $L_{1} (x_{1}, y_{1})$, ${\tilde{ω}}_{2}$ is the horizontal angle of $L_{2} (x_{2}, y_{2})$, and $v$ is the vertical angle of $L_{1} (x_{1}, y_{1})$ and $L_{2} (x_{2}, y_{2})$. Substituting $L_{1} (x_{1}, y_{1})$ and $L_{2} (x_{2}, y_{2})$ into this equation yields the three-dimensional spatial position of the target point, thereby enabling damaged components on the track to be located.

\begin{cases} z^{'} = \frac{T_{f} (a^{″}, b^{″}) N}{\tan {\tilde{ω}}_{1} - \tan {\tilde{ω}}_{2}} \\ x^{'} = z^{'} \tan \tilde{ω} - \frac{N}{2} \\ y^{'} = z^{'} \tan v \end{cases}

(10)
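The triangulation in equation (10) can be checked numerically with a short sketch. It rests on several assumptions that are not stated in the text: the two cameras have parallel optical axes and sit at $x = -N/2$ and $x = +N/2$, the binarization factor $T_{f}$ equals 1 for the foreground target pixel, and the unsubscripted horizontal angle in the $x^{'}$ expression is taken to be ${\tilde{ω}}_{1}$. The function name `triangulate` is hypothetical.

```python
import math

def triangulate(omega1, omega2, v, baseline):
    """Recover (x', y', z') from two horizontal angles and one vertical angle.

    omega1, omega2: horizontal angles of the matched point in camera 1 and
                    camera 2, in radians.
    v:              vertical angle of the matched point, in radians.
    baseline:       distance N between the two cameras.
    Assumes T_f = 1 for the target pixel and uses omega1 in the x' term.
    """
    disparity = math.tan(omega1) - math.tan(omega2)
    if abs(disparity) < 1e-12:
        raise ValueError("angles are equal: point is at infinity")
    z = baseline / disparity
    x = z * math.tan(omega1) - baseline / 2.0
    y = z * math.tan(v)
    return x, y, z

if __name__ == "__main__":
    # Synthetic check: project a known point, then recover it.
    px, py, pz, n = 0.3, -0.2, 2.0, 0.5
    w1 = math.atan((px + n / 2.0) / pz)   # seen from the camera at -N/2
    w2 = math.atan((px - n / 2.0) / pz)   # seen from the camera at +N/2
    print(triangulate(w1, w2, math.atan(py / pz), n))
```

Under these assumptions the difference of the two tangents equals $N / z^{'}$, which is why the depth falls out of a single division.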

Optimization of a rail inspection robot system based on deep learning and image processing

Optimized model design for rail inspection robot system

Under complex track conditions and uneven lighting, existing track inspection robot systems are unable to meet intelligent inspection requirements, resulting in low efficiency in detecting track structural defects. To this end, after locating the spatial position of the track defect, we propose a detection and recognition method for track inspection robots based on improved YOLOv5³² and attention mechanism to improve the accuracy of existing inspection robot systems in detecting track structural defects. The overall model is shown in Figure 3.

Figure 3.

Rail inspection robot detection model.

First, a feature extraction network based on an improved FPN³³ and an attention mechanism is constructed to fuse the features extracted at different levels. The traditional FPN is unidirectional, transferring information only from high-level feature maps to low-level feature maps. Its feature fusion is therefore limited and information does not circulate sufficiently, which leads to insufficient and inefficient fusion between features of different scales and restricts the model’s ability to detect multi-scale targets. In this paper, the convolution module in FPN is designed to be lightweight so that the detection ability of the model can be improved, and an attention mechanism is applied to the image features extracted from the data collected by the track inspection robot to further enhance feature extraction. Finally, an improved YOLOv5 model is used to build the track structure defect detection model, in which the Neck adopts an improved FPN combined with a path aggregation network to improve detection speed and accuracy.

Rail image feature extraction based on improved feature pyramid network and attention mechanism

Since the traditional FPN uses multiple convolution modules for feature extraction, it consumes a large amount of computing resources. To address this issue, the convolution modules of FPN are made lightweight by using an inverted residual structure (referred to here as the reverse residual structure) as the convolution module. A $1 \times 1$ convolutional layer first increases the number of channels, a depthwise separable convolution then handles spatial filtering and channel mixing, and another $1 \times 1$ convolutional layer finally reduces the number of channels. The standard reverse residual structure is given by equation (11).

\tilde{x} = \begin{cases} I (H (G (x))) + x, & \dim (x) = \dim (\tilde{x}) \\ I (H (G (x))), & \dim (x) \neq \dim (\tilde{x}) \end{cases}

(11)

where $x$ is the input feature, $\tilde{x}$ is the output feature, and $G (\cdot)$, $H (\cdot)$, and $I (\cdot)$ are the dimension-increasing transformation, feature extraction transformation, and dimension-reducing transformation, respectively.
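To make the cost of the three transformations in equation (11) concrete, the sketch below counts convolution weights for each stage. The expansion ratio of 6 and the channel counts are illustrative assumptions, not values reported in this paper; biases and normalization parameters are ignored, and the function names are hypothetical.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of an ordinary k x k convolution from c_in to c_out channels."""
    return k * k * c_in * c_out

def reverse_residual_params(c_in, c_out, expand=6, k=3):
    """Weights of the three stages G, H, I in eq. (11).

    expand is an assumed expansion ratio (the paper does not state one).
    """
    c_mid = c_in * expand
    return {
        "G_pointwise_expand": c_in * c_mid,    # 1x1 convolution, channels up
        "H_depthwise": k * k * c_mid,          # one k x k filter per channel
        "I_pointwise_project": c_mid * c_out,  # 1x1 convolution, channels down
    }

if __name__ == "__main__":
    stages = reverse_residual_params(32, 32)
    c_mid = 32 * 6
    separable = stages["H_depthwise"] + stages["I_pointwise_project"]
    print(stages, sum(stages.values()))
    # Depthwise + pointwise versus one ordinary 3x3 conv on the same widths.
    print(separable, standard_conv_params(c_mid, 32))
```

With these assumed sizes, the depthwise and projection stages together use 7872 weights, against 55296 for a single ordinary $3 \times 3$ convolution mapping the same expanded width to the same output width.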

In the reverse residual structure, the dimension-increasing transformation uses pointwise convolution to fuse low-dimensional channel information and increase the channel dimension. The feature extraction transformation uses a $3 \times 3$ depthwise convolution to extract features from the expanded feature map. The dimension-reducing transformation integrates the high-dimensional feature information obtained from feature extraction and maps it to a low-dimensional space. Because the dimension-increasing transformation in the standard structure is highly redundant, it is replaced by a low-cost dimension-increasing transformation, as shown below.

C (J (x)) = [ϕ_{1} (x), ϕ_{2} (x), \ldots, ϕ_{s} (x)]

(12)

where $J (\cdot)$ is the fusion transformation, $C (\cdot)$ is the low-cost expansion transformation, $ϕ_{s} (x)$ is the low-cost operation, and $s$ is the expansion ratio of the low-cost dimension-increasing transformation.

To ensure that the low-cost dimension expansion can effectively replace the dimension-increasing transformation of the standard reverse residual structure, a feature extraction branch is added to the lightweight reverse residual structure. This branch reuses part of the output of the pointwise convolution. The lightweight reverse residual structure is shown in equation (13).

F_{I R B}^{'} (x) = [x_{1 : C_{i} / 2, :, :}, \; I (H (C (x^{'})))]

(13)

where $x^{'}$ is the multiplexed (reused) feature in the branch and $F_{I R B}^{'} (x)$ is the lightweight reverse residual structure.

To enhance the expression of image features, SAM is used to enhance the output of FPN. The FPN output enhancement module is divided into layer attention (LAM), spatial attention (SPAM), and channel attention (CAM) units. LAM distinguishes shallow, middle, and deep features in the fused features and operates as follows.

φ_{H} (T) = σ^{*} (δ (f_{1 \times 1} (T_{\mathrm{GAP}}^{H})))

(14)

where $φ_{H} (T)$ is the LAM feature, $T$ is the output of the feature fusion module, $σ^{*} (\cdot)$ is the sigmoid function, $f_{1 \times 1} (\cdot)$ is the $1 \times 1$ convolution (a linear function), $δ (\cdot)$ is the ReLU function, and $T_{\mathrm{GAP}}^{H}$ is the average pooling of the fused features.

LAM first applies average pooling to the output of the feature fusion module, then uses $f_{1 \times 1} (\cdot)$ to reduce the dimensions, applies $δ (\cdot)$ to ensure that the reduced-dimension output is non-negative, and finally passes the result through $σ^{*} (\cdot)$ to obtain $φ_{H} (T)$.

The SPAM unit is used to distinguish the spatial locations of different features in images. The output of LAM is first processed by average pooling and maximum pooling. The pooled results are then merged and reduced in dimension using $f_{3 \times 3} (\cdot)$, and the result is used to generate the SPAM features, as shown in equation (15).

φ_{s} (T^{'}) = σ (f_{3 \times 3} ([T_{\mathrm{GAP}}^{' S}; T_{\mathrm{GMP}}^{' S}]))

(15)

where $T^{'}$ is the refined feature of the LAM unit, $φ_{s} (T^{'})$ is the SPAM feature, $f_{3 \times 3} (\cdot)$ is the $3 \times 3$ convolution, $[T_{\mathrm{GAP}}^{' S}; T_{\mathrm{GMP}}^{' S}]$ is the merged pooling result, $T_{\mathrm{GMP}}^{' S}$ is the maximum pooling of the hierarchical refined features, and $T_{\mathrm{GAP}}^{' S}$ is the average pooling of the hierarchical refined features.

The SPAM output features first undergo average pooling, are then processed by fully connected layers to further fuse all features, and finally pass through the sigmoid function $σ (\cdot)$ to obtain the CAM features, as shown below.

φ_{C} (T^{″}) = σ (Q_{2} δ (Q_{1} T_{\mathrm{GAP}}^{″ C}))

(16)

where $T^{″}$ is the spatial refinement feature, $φ_{C} (T^{″})$ is the CAM feature, $Q_{i}$ is the $i$-th fully connected layer, and $T_{\mathrm{GAP}}^{″ C}$ is the average pooling.
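Equation (16) can be illustrated with a minimal NumPy sketch of the channel attention unit. The weight shapes, the reduction to `C // r` hidden units, and the final step of rescaling the feature map by the attention weights are assumptions made for illustration; the text specifies only how $φ_{C}$ is computed. The function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(t, q1, q2):
    """Eq. (16): per-channel weights from a feature map t of shape (C, H, W).

    q1: (C // r, C) weights of the first fully connected layer (assumed shape).
    q2: (C, C // r) weights of the second fully connected layer (assumed shape).
    """
    gap = t.mean(axis=(1, 2))             # global average pooling per channel
    hidden = np.maximum(q1 @ gap, 0.0)    # ReLU, the delta(.) in eq. (16)
    return sigmoid(q2 @ hidden)           # sigma(.) maps weights into (0, 1)

def apply_channel_attention(t, q1, q2):
    """Rescale each channel by its weight (an assumed way of applying phi_C)."""
    weights = channel_attention(t, q1, q2)
    return t * weights[:, None, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feature_map = rng.normal(size=(8, 5, 5))
    q1 = rng.normal(size=(2, 8))
    q2 = rng.normal(size=(8, 2))
    print(channel_attention(feature_map, q1, q2))
```

The layer and spatial units in equations (14) and (15) follow the same pooling, projection, and sigmoid pattern, with convolutions in place of the fully connected layers.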

Rail structure defect detection based on improved YOLOv5

The general structure of the YOLOv5-based detection algorithm is divided into three parts: Backbone, Neck, and Head.³⁴ The Backbone is the lightweight feature extraction module designed in the previous section. To keep the detection model lightweight, the Neck also adopts the improved FPN combined with a path aggregation network, as shown in Figure 4. In the Neck structure, the Detection Block is replaced with the lightweight convolution module. The lightweight Neck also employs cross-stage connections to further reduce the computational load when the number of feature channels is reduced.

Figure 4.

Improved YOLOv5 model.

In the improved YOLOv5 Head structure, there are two paths: one is responsible for feature classification, and the other is responsible for feature localization. When labels are assigned in the result output, the simplified optimal transport assignment (SimOTA) algorithm³⁵ dynamically allocates the number of positive samples to adapt to different targets, thereby improving target detection performance. The model loss function comprises three terms: confidence loss, prediction box regression loss, and classification loss.

(1) Confidence loss. Focal Loss³⁶ is used to address the class imbalance problem by reducing the weight of easily classified samples, thereby enabling the model to focus on hard-to-classify samples, as shown below.

\mathrm{FL} (p_{t}) = - α_{t} {(1 - p_{t})}^{γ} \log (p_{t})

(17)

where $\mathrm{FL} (p_{t})$ is the confidence loss, $p_{t}$ is the probability of the sample being predicted as positive, $α_{t}$ is the balancing factor, and $γ$ is the adjustment factor.

(2) Prediction box regression loss. An efficient intersection-over-union (EIoU) loss function is used, which improves on the traditional intersection-over-union loss by taking into account the overlapping area, the center point distance, and the similarity of the widths and heights, as shown below.

\mathrm{EIoU}_{\mathrm{Loss}} = 1 - \mathrm{IoU} + \frac{ρ^{2} (w, w^{g t})}{C_{w}^{2}} + \frac{ρ^{2} (h, h^{g t})}{C_{h}^{2}}

(18)

where $\mathrm{EIoU}_{\mathrm{Loss}}$ is the regression loss of the predicted box, $\mathrm{IoU}$ is the intersection over union between the predicted bounding box and the true bounding box, $ρ^{2} (w, w^{g t})$ is the squared Euclidean distance between the widths of the predicted and true bounding boxes, $C_{w}$ is the width of the minimum bounding region, $ρ^{2} (h, h^{g t})$ is the squared Euclidean distance between the heights of the predicted and true bounding boxes, and $C_{h}$ is the height of the minimum bounding region.

(3) Classification loss. Binary cross-entropy loss measures the difference between the probability distribution of the model’s predictions and the probability distribution of the true labels,³⁷ so it is used as the classification loss for the target detection model, as shown in equation (19), where $\mathrm{BCE}_{\mathrm{Loss}}$ is the classification loss and $y$ is the true label. The final loss function $L$ of the track structure defect detection model is shown in equation (20).

\mathrm{BCE}_{\mathrm{Loss}} = - (y \cdot \log (p_{t}) + (1 - y) \cdot \log (1 - p_{t}))

(19)

L = \mathrm{FL} (p_{t}) + \mathrm{EIoU}_{\mathrm{Loss}} + \mathrm{BCE}_{\mathrm{Loss}}

(20)
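The three loss terms and their sum in equation (20) can be sketched for a single prediction as follows. Several choices here are assumptions rather than settings reported in this paper: the natural logarithm is used, the defaults $α_{t} = 0.25$ and $γ = 2$ are the values commonly used with Focal Loss, boxes are given as corner coordinates, and the center-distance term of the commonly published EIoU formulation, $ρ^{2} (b, b^{g t}) / (C_{w}^{2} + C_{h}^{2})$, is included by default because the text mentions center point distance (it can be switched off to match the width and height terms printed in equation (18)). The function names are hypothetical.

```python
import math

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """Eq. (17). p_t must lie in (0, 1]; alpha_t and gamma are assumed defaults."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def bce_loss(y, p_t):
    """Eq. (19) for one label y in {0, 1}; p_t must lie strictly in (0, 1)."""
    return -(y * math.log(p_t) + (1.0 - y) * math.log(1.0 - p_t))

def eiou_loss(box, gt, use_center_term=True):
    """EIoU loss for boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    iy = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = ix * iy
    w, h = box[2] - box[0], box[3] - box[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    iou = inter / (w * h + wg * hg - inter)
    # Width and height of the smallest box enclosing both boxes.
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])
    loss = 1.0 - iou + (w - wg) ** 2 / cw ** 2 + (h - hg) ** 2 / ch ** 2
    if use_center_term:
        dx = (box[0] + box[2] - gt[0] - gt[2]) / 2.0
        dy = (box[1] + box[3] - gt[1] - gt[3]) / 2.0
        loss += (dx ** 2 + dy ** 2) / (cw ** 2 + ch ** 2)
    return loss

def total_loss(p_conf, box, gt, y_cls, p_cls):
    """Eq. (20): unweighted sum of the three terms."""
    return focal_loss(p_conf) + eiou_loss(box, gt) + bce_loss(y_cls, p_cls)

if __name__ == "__main__":
    print(total_loss(0.8, (0, 0, 2, 2), (1, 1, 3, 3), 1, 0.7))
```

A training implementation would compute these terms over batches of tensors and average them; the scalar form is kept here to stay close to the equations.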

Experimental results and analyses

This paper uses the image data collected by the track robot platform designed in Reference 38 on the VTIII-2 model subway track fasteners as the experimental dataset. This dataset contains 2094 fastener images with a resolution of $2048 \times 2048$. The training set, validation set, and test set in the dataset are produced in a ratio of 7:1.5:1.5, and each type of defect in the fastener images is also divided according to this ratio. The distribution of different types of defects in the dataset is shown in Table 1.

Table 1.

Distribution of fastener defect image data sets.

Defect type	Training set	Validation set	Test set
Missing spring clip	398	85	85
Missing nut	392	73	72
Insulation block missing	234	69	69
Normal state	695	437	439
Total number	1719	664	665
Data set size ratio	70%	15%	15%

This experiment uses the Linux operating system Ubuntu 18.04. The hardware configuration includes an Intel i7-8700K processor and an NVIDIA RTX 2080 Super graphics card. CUDA 10.1 and OpenCV 3.4.9 are installed, and the graphics card accelerates model training and inference. The model is developed in TensorFlow, a Python-based framework. The experiment was conducted with 50 iterations, using the Adam optimizer and a learning rate of 0.01.

For comparison, this paper selects CNN-RBM [17], TransAE [19], and FOD-YOLO [20]. The confusion matrices of the detection results of the different methods for the four fastener categories are shown in Figure 5. As shown in Figure 5, CNN-RBM has the lowest detection accuracy, below 85% for all types of track fastener defects. The detection accuracy of TransAE and FOD-YOLO is above 85%, but neither exceeds 90%. The proposed method, denoted DLIP, achieves its highest detection accuracy, 92.8%, for missing insulation blocks, and a detection accuracy of 90.8% for the normal state. Overall, DLIP achieves detection accuracy above 90% for defects in rail fasteners.

Figure 5.

Confusion matrices of the detection results for different methods.

The precision-recall (PR) curves and AUC values for different methods are shown in Figure 6. Precision and recall generally trade off against each other: as the decision threshold is lowered, recall tends to rise while precision tends to fall. The PR curve therefore shows both the detection accuracy at specific operating points and the overall average detection performance. As can be seen from the figure, the area enclosed by the DLIP curve and the coordinate axes is greater than that of the comparison methods, indicating that DLIP has better overall detection performance. In addition, the AUC values of CNN-RBM, TransAE, FOD-YOLO, and DLIP are 0.866, 0.903, 0.950, and 0.963, respectively, with DLIP having the highest.

Figure 6.

PR curves for different methods.
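The way a PR curve and its enclosed area are obtained from ranked detections can be sketched in a few lines. This is an illustration with made-up scores, assuming trapezoidal integration and a starting point of recall 0 and precision 1; the procedure actually used to compute the AUC values reported here is not specified. The function names are hypothetical.

```python
def pr_curve(scores, labels):
    """Return (recall, precision) points by sweeping the score threshold.

    scores: confidence for each detection; labels: 1 for a true defect, 0 otherwise.
    Requires at least one positive label.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    tp = fp = 0
    points = []
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        points.append((tp / n_pos, tp / (tp + fp)))
    return points

def area_under_curve(points):
    """Trapezoidal area under the PR points, starting from (0, 1) by convention."""
    area, prev_r, prev_p = 0.0, 0.0, 1.0
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2.0
        prev_r, prev_p = r, p
    return area

if __name__ == "__main__":
    demo_scores = [0.9, 0.8, 0.6, 0.3]   # illustrative values only
    demo_labels = [1, 0, 1, 0]
    curve = pr_curve(demo_scores, demo_labels)
    print(curve, area_under_curve(curve))
```

A detector that ranks every true defect above every false alarm reaches an area of 1.0 under this convention.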

To further analyze the effectiveness of each component in the DLIP method, ablation experiments were conducted, with accuracy, F1, and mean average precision (mAP) as the evaluation metrics. Four variants were tested. DLIP/RRC removes the reverse residual structure from FPN and uses only ordinary convolution modules. DLIP/LAM removes LAM from the output enhancement module and retains SPAM and CAM. DLIP/CAM removes CAM and retains SPAM and LAM. DLIP/SPAM removes SPAM and retains CAM and LAM. The ablation results are shown in Table 2.

Table 2.

Ablation results.

Method	Accuracy	F1	mAP
DLIP/RRC	81.6%	83.7%	84.5%
DLIP/LAM	85.7%	84.9%	85.2%
DLIP/CAM	88.5%	86.3%	88.1%
DLIP/SPAM	86.2%	87.8%	87.3%
DLIP	91.5%	93.6%	91.9%

DLIP/RRC has the lowest detection accuracy, only 81.6%, indicating that the reverse residual structure contributes substantially to detection performance. The detection accuracies of DLIP/LAM, DLIP/CAM, and DLIP/SPAM lie between 85.7% and 88.5%, all below that of the full model, indicating that removing any one of the attention mechanisms noticeably degrades detection performance. DLIP has the highest detection accuracy, 91.5%, indicating that the model integrating all components performs best. In terms of mAP, DLIP achieves 91.9%, which is 7.4, 6.7, 3.8, and 4.6 percentage points higher than DLIP/RRC, DLIP/LAM, DLIP/CAM, and DLIP/SPAM, respectively. This demonstrates that every component of DLIP contributes materially to detection performance.
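The mAP differences quoted above follow directly from Table 2. The short sketch below reproduces the arithmetic; the mAP values are copied from the table, while the variable and function names are hypothetical.

```python
# mAP values (in percent) copied from Table 2.
ABLATION_MAP = {
    "DLIP/RRC": 84.5,
    "DLIP/LAM": 85.2,
    "DLIP/CAM": 88.1,
    "DLIP/SPAM": 87.3,
}
FULL_MAP = 91.9  # mAP of the complete DLIP model


def map_gains(full, variants):
    """Gain of the full model over each variant, in percentage points."""
    return {name: round(full - value, 1) for name, value in variants.items()}


gains = map_gains(FULL_MAP, ABLATION_MAP)
for name, gain in gains.items():
    print(f"{name}: +{gain} percentage points")
```

The differences are absolute gaps in percentage points, not relative improvements; removing the reverse residual structure costs the most mAP and removing CAM costs the least.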

Conclusion

Under harsh working conditions, such as complex and changing track environments and uneven lighting, the intelligent inspection capabilities of existing track inspection robot systems are severely challenged, and their defect detection efficiency falls short of practical needs. To address this issue, we propose an optimization method for track inspection robot systems based on deep learning and image processing, and evaluate it through performance analysis experiments. The method first segments the original track image using a threshold selected by discriminant analysis. Binarization compares the grayscale value of every pixel with this threshold and reassigns it accordingly, which removes the background information and yields a high-quality track structure image. After binocular vision calibration of the cameras, the depth of homologous points is calculated to obtain the three-dimensional spatial location of track defects. On this basis, the FPN is optimized with a reverse residual structure, and the backbone of the improved YOLOv5 model is built as a lightweight feature extraction network that integrates the FPN and attention mechanisms. The Neck combines the improved FPN with a path aggregation network to improve detection speed and accuracy. Performance analysis experiments show that the proposed method attains an AUC value of 0.963, indicating that it can improve the accuracy and efficiency of track defect detection in complex environments.

Footnotes

ORCID iD

Hongming Shen

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research received financial assistance from the research and development service contract of unmanned intelligent inspection system for offshore wind farm booster station and metering station of Huaneng Jiaxing No. 2 offshore wind power project [HN-52CO-202200035-FWQT00015].

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Jing, Qin, Wang, et al. Developments, challenges, and perspectives of railway inspection robots. Autom ConStruct 2022; 138: 1–24.

2. Ryu, Lee. Development of an inspection robot operating on a single square rail track. J Korea Robot Soc 2022; 17: 216–220.

3. Rahman, Liu, Masri, et al. A railway track reconstruction method using robotic vision on a mobile manipulator: a proposed strategy. Comput Ind 2023; 148: 1–11.

4. Yang, Wang, et al. Deep learning and machine vision-based inspection of rail surface defects. IEEE Trans Instrum Meas 2021; 71: 1–14.

5. Chen, Qin, et al. Automatic railroad track components inspection using hybrid deep learning framework. IEEE Trans Instrum Meas 2023; 72: 1–15.

6. Hongbo, Weiming, Jing, et al. Detection system of multi-track intelligent inspection robot for urban rail vehicles. Electr Drive Locomot 2024; 1: 45–52.

7. Ding, Zhu, Chen, et al. An efficient AdaBoost algorithm with the multiple thresholds classification. Appl Sci 2022; 12: 23–36.

8. Fan, Jiao, Shuai, et al. Application research of image recognition technology based on improved SVM in abnormal monitoring of rail fasteners. J Comput Methods Sci Eng 2023; 23: 1307–1319.

9. Biswas, Khan, Islam, et al. A novel approach to detect and classify the defective of missing rail anchors in real-time. International Journal of Emerging Technology and Advanced Engineering 2016; 6: 270–276.

10. Manikandan, Balasubramanian, Palanivel. Machine vision based missing fastener detection in rail track images using SVM classifier. Int J Smart Sens Intell Syst 2017; 10: 574–589.

11. Han, Wang, Madan, et al. Intelligent detection of loose fasteners in railway tracks using distributed acoustic sensing and machine learning. Eng Appl Artif Intell 2024; 134: 20–39.

12. Raguram, Chum, Pollefeys, et al. USAC: a universal framework for random sample consensus. IEEE Trans Pattern Anal Mach Intell 2012; 35: 2022–2038.

13. Sun, Zhao, et al. Semantic-segmentation-based rail fastener state recognition algorithm. Math Probl Eng 2021; 20: 1–15.

14. Chen, Liu, Wang, et al. Automatic defect detection of fasteners on the catenary support device using deep convolutional neural network. IEEE Trans Instrum Meas 2017; 67: 257–269.

15. Gibert, Patel, Chellappa. Deep multitask learning for railway track inspection. IEEE Trans Intell Transport Syst 2016; 18: 153–164.

16. Wei, Yang, Liu, et al. Railway track fastener defect detection based on image processing and deep learning techniques: a comparative study. Eng Appl Artif Intell 2019; 80: 66–81.

17. Aydin, Sevi, Salur, et al. Defect classification of railway fasteners using image preprocessing and a lightweight convolutional neural network. Turk J Electr Eng Comput Sci 2022; 30: 891–907.

18. Zhang, Ding, Zhang, et al. An overview on restricted Boltzmann machines. Neurocomputing 2018; 275: 1186–1199.

19. Guo, Liu, Qian, et al. Rail surface defect detection using a transformer-based network. J Indus Inform Intg 2024; 38: 117–126.

20. Brintha, Joseph Jawhar. FOD-YOLO NET: fasteners fault and object detection in railway tracks using deep yolo network. J Intell Fuzzy Syst 2024; 46: 8123–8137.

21. Cai, Tao, et al. Fast rail fastener screw detection for vision-based fastener screw maintenance robot using deep learning. Appl Sci 2024; 14: 16–30.

22. Tang, Zhou, Gao, et al. A novel rail inspection robot and fault detection method for the coal mine hoisting system. IEEE Intell Transport Syst Mag 2019; 11: 110–121.

23. Iyer, Velmurugan, Gandomi, et al. Structural health monitoring of railway tracks using IoT-based multi-robot system. Neural Comput Appl 2021; 33: 5897–5915.

24. Niu. Railway train inspection robot based on intelligent recognition technology. Int J Syst Assur Eng Manag 2023; 14: 648–656.

25. Kuo C-CJ. Understanding convolutional neural networks with a mathematical model. J Vis Commun Image Represent 2016; 41: 406–413.

26. Namatēvs. Deep convolutional neural networks: structure, feature extraction and training. Inf Technol Manag Sci 2017; 20: 40–47.

27. Cong, Zhou. A review of convolutional neural network architectures and their optimizations. Artif Intell Rev 2023; 56: 1905–1969.

28. Hui, Yang, Hui, et al. Research on identify matching of object and location algorithm based on binocular vision. J Comput Theor Nanosci 2016; 13: 2006–2013.

29. Demi. On the gray-level central and absolute central moments and the mass center of the gray-level variability in low-level image processing. Comput Vis Image Understand 2005; 97: 180–208.

30. Kang, Iwana, Uchida. Complex image processing with less data: document image binarization by integrating multiple pre-trained U-Net modules. Pattern Recogn 2021; 109: 25–38.

31. Xia, Mao, Wang, et al. A subway tunnel image stitching method based on point cloud mapping relationships and high-resolution image. Eng Res Express 2024; 6: 025102.

32. Qiu, Wei, et al. Online rail fastener detection based on YOLO network. Comput Mater Continua (CMC) 2022; 12: 59–72.

33. Zhou, Zhang. SA-FPN: an effective feature pyramid network for crowded human detection. Appl Intell 2022; 52: 12556–12568.

34. Target tracking and detection based on YOLOv5 algorithm. Appl Comput Eng 2023; 16: 75–85.

35. Terven, Córdova-Esparza D-M, Romero-González J-A. A comprehensive review of yolo architectures in computer vision: from yolov1 to yolov8 and yolo-nas. Mach Learn Knowl Extr (2019) 2023; 5: 1680–1716.

36. Mukhoti, Kulharia, Sanyal, et al. Calibrating deep neural networks using focal loss. Adv Neural Inf Process Syst 2020; 33: 15288–15299.

37. Zhang, Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. Adv Neural Inf Process Syst 2018; 31: 12563–12571.

38. Cui, et al. Real-time inspection system for ballast railway fasteners based on point cloud deep learning. IEEE Access 2019; 8: 61604–61614.