Optimization of target detection model based on Yolov5 network with small samples

Abstract

As an important part of automotive shock absorber, the columnar parts in automotive shock absorber will inevitably have machining defects during the process, which will not only degrade the performance of the parts, but also degrade or even fail the performance of the final shock absorber after assembly. Yolov5, as a target detection algorithm, has received much attention due to its high accuracy and fast operation speed. However, the algorithm faces some challenges when applied in a practical industrial environment. In this paper, improvement measures are proposed to address the limitations of sample collection and the high speed of pipeline recognition in industrial environments. The network model is optimized and designed. Firstly, the ASPP module is replaced by the SPP module thus improving the viewability throughout the process providing recognition accuracy. Secondly, the Conv and C3 layers of Yolov5s are replaced by Transformer to obtain higher recognition accuracy. By improving and optimizing the above methods, we can better cope with the improvement of detection accuracy under small sample conditions. Experiments show that the method can significantly improve the detection accuracy and operation speed of Yolov5s under the hardware condition of lower computing power, which is more suitable for industrial scenario application scenarios.

Keywords

Target detection YOLOv5 SPP module APPF module Transformer

1. Introduction

The production process of columnar parts in automotive shock absorbers requires multiple machining processes. Due to the special structure of the cylindrical surface, the manual observation method was used in the previous quality inspection for the full-side quality inspection, which is slow and strongly influenced by subjective consciousness. Therefore, it is important to realize a high precision algorithm for detecting surface defects of columnar parts in automotive shock absorbers to improve the product surface quality. The research methods of computer vision technology for defect detection by domestic and foreign scholars are mainly divided into methods based on machine learning and methods based on deep learning.Machine learning-based methods are mainly based on artificial design to extract features [1]. Celik et al. proposed a simple detection system based on wavelet transform, grayscale co-generation matrix (GLCM) and feedforward neural network, but the method cannot locate the specific location of defects [2]; Querqing et al. proposed a Gabor filter optimization algorithm based on the evolution of composite difference for the problem of many parameters of traditional Gabor wavelet filter and poor real-time algorithm [4]. The detection speed of the optimized model is 11 fps, which is still lower than the requirement of real-time detection; Liu Yuan et al. proposed a surface defect image edge detection algorithm based on Sobel operator by combining bilateral filtering and Hilditch refinement algorithm. However, the algorithm is affected by the light environment and is sensitive to noise. The detection method based on machine learning is simple and easy to implement, but it is influenced by human extracted features and has poor adaptability and robustness [5]. Deep learning based methods are mainly divided into YOLO, SSD in the first stage and Faster-RCNN algorithm in the second stage. He et al. proposed a defect detection system based on Faster-RCNN, and the mAP value of the defect detection task on NEU/DET dataset was 82.3%, but the detection speed of this method is slow and not suitable for industrial scenario applications [6]. The mAP of the improved algorithm reaches 80% and the detection speed is maintained at 50 fps, but the model has a large number of parameters and is not easy to deploy [7]. The mAP is 78. 25% on the NEU-DET dataset, but the accuracy of this model is low [8]. There are many algorithms for surface defect detection, but the algorithms that meet the requirements of real-time detection still have problems of low accuracy and real-time performance [9]. The above studies are inadequate for small samples and high-speed detection. In this paper, we propose a network structure ATYolo based on YOLOv5, which firstly combines Transformer and YOLOv5 model to improve the recognition accuracy, and secondly introduces APPF module instead of SPP module to optimize the training and inference speed of the model [10]. This method has good performance in the task of detecting surface defects of parts and can further meet the industrial deployment requirements.

2. Introduction of Yolov5 algorithm

Yolov5 uses a single-stage detection-based approach that divides target detection into three stages: feature extraction, prediction, and post-processing. In the feature extraction stage, a lightweight convolutional neural network is used, which can quickly extract feature information from images. In the prediction stage, an Anchor-based method is used to predict the position and size of the target. In the post-processing stage, non-maximal suppression (NMS) is used to reject overlapping bounding boxes and retain the bounding box with the highest confidence. However, despite its high accuracy and fast running speed, Yolov5 still faces some challenges in practical applications. For example, when dealing with some complex scenes, the detection accuracy is low and the running speed is slow [11]. The different number of feature extraction modules and convolutional kernels for specific network locations are the main differences of the YOLOv5 model. The model size and the number of model parameters increased sequentially for the five versions. Later more complex models YOLOv7 were developed, YOLOv8 is not suitable for industrial scenario applications [12]. Since defect detection requires high performance in real time and lightweight, this study improves the design of the strip surface defect detection network based on YOLOv5s model by considering the accuracy, efficiency and size of the recognition model. The network structure is shown in Fig. 1.

Figure 1.

Yolov5 model structure.

3. Network structure optimization study

3.1 Optimization of the SPP module

SPP (Spatial Pyramid Pooling) is a feature processing method commonly used in the field of neural network computer vision, whose main function is to extract features from input images of different sizes and scales in a hierarchical manner, so that neural networks can classify and recognize features from different parts of the input image more effectively. The core idea of SPP is to divide the original image into multiple layers and generate a grid of corresponding sizes. Then, the feature vectors in each grid are subjected to a maximum pooling operation to obtain the output feature vectors with fixed dimensions. Finally, the vectors of each layer are stitched together to form a fixed-length feature vector as the input of the neural network. This method can not only handle input images of different sizes and scales, but also reduce the size of the input images without losing important information. the advantages of SPP are that it can effectively handle different sizes, scales, and input image position transformations, and it can reduce the computational effort while keeping the size of the input images constant before and after pooling [13]. However, the disadvantages are that there may be information loss and overfitting problems, and more computational resources are required when processing high-resolution input images. the SPP module can lead to a huge amount of computation when performing pooling operations because of the need to process subregions of different sizes. Especially when the number of sub-regions is large, the computation will be very large, which limits the speed of SPP module in industrial scenarios. the SPP structure is shown in Fig. 2.

Figure 2.

SPP module structure diagram.

ASPP (Atrous Spatial Pyramid Pooling) is an improved SPP method that mainly solves the problem that SPP is computationally intensive when processing high resolution input images. It can increase the perceptual field without affecting the input image size by using a dilated convolution model instead of the traditional convolution operation [14]. The advantage of ASPP is that it can improve the classification accuracy of the model while using the same computational resources, and it can effectively handle the feature extraction task for high-resolution images. Compared with SPP, the advantages of ASPP are: ASPP can obtain richer perceptual field information. while SPP can only perform multi-scale pooling at some level of the input feature mapping, ASPP can provide information at various scales and has better adaptability for targets of different sizes. ASPP reduces the impact of scale variation on feature learning [15]. In SPP, features at different scales are processed by different pooling layers, while ASPP can compress features at different scales when feature extraction is performed by null convolution, reducing the error between features with different scales. This module adds a BN layer to incorporate depth-separable convolution, but the basic idea remains consistent relative to SPP. Convolutional kernels with different receptive fields are constructed by different null rates and used to obtain multi-scale object information.ASPP can effectively solve the over-fitting problem.

In summary, compared with SPP, ASPP can better cope with multi-scale image feature extraction tasks and can effectively reduce the occurrence of overfitting problems, which is of high practical value for neural networks in classification and recognition tasks. The structure of ASPP is shown in Fig. 3.

Figure 3.

ASPP module structure diagram.

3.2 Optimization of convolutional and C3 layers

In YOLOv5, the convolutional layer is mainly used to build the skeleton network and extract features from the input image. C3 layer is implemented by combining the convolutional layer and the channel rearrangement module and is mainly used in the downsampling process of the YOLOv5 network. C3 layer can reduce the size of the feature map while retaining more feature information. In the C3 layer, the number of channels in the feature map is first reduced by 1 $\times$ 1 convolution, then the feature map is processed by 3 $\times$ 3 convolution, and finally the channel rearrangement module rearranges the channel order. The core principle of both the convolution layer and C3 layer is to use convolution operation to process the input feature map and extract the features of the image [15]. In the convolution layer, hyperparameters such as convolution kernel size, step size, and boundary fill need to be set to control the effect of the convolution operation. The C3 layer, on the other hand, achieves downsampling by combining convolutional layers and channel rearrangement modules, and reduces model parameters through parameter sharing and local connectivity to improve the efficiency and accuracy of the model. However,such a mechanism has limitations in recognition accuracy in industrial scenarios due to the limited number of samples collected and inconspicuous defect features [16].

YOLOv5 backbone feature extraction network adopts C3 structure, which brings larger number of parameters, slower detection speed and limited application, and it is difficult to be applied in some real application scenarios such as mobile or embedded devices with such a large and complex model. Firstly, the model is too large and faces the problem of memory shortage, and secondly, these scenarios require low latency. Therefore, it is crucial to study small and efficient CNN models in these scenarios.YOLOv5 lacks the ability to model globally for convolutional neural networks, Transformer has the ability to obtain global information, which has been improved by introducing the multi-head attention mechanism earlier.

The Transformer model was firstly applied in Natural Language Processing (NLP), which mainly consists of Encoder and Decoder. ViT model firstly applied Transformer encoder module to computer vision. By combining Transformer with YOLOv5s, we introduce Transformer in the backbone network and replace some Convolutional blocks and C3 modules in the original YOLOv5s model with Transformer encoder modules. The Transformer encoder module is composed of a multi-head attention mechanism and a feedforward neural network (MLP). The specific structure is shown in Fig. 4. To begin with, the process starts by dividing the image into patches of a predetermined size. Each patch is then transformed into a one-dimensional vector by incorporating position encoding. This vector serves as the input to the Encoder module. A significant component of the Encoder module is the multi-head attention mechanism, which enables the simultaneous processing of multiple sets of input data. The self-attention mechanism plays a crucial role in computing the self-attentive feature output through dot product attention, as depicted in Eq. (1).

$\displaystyle\textit{Attention}(Q,K,V)=\textit{softmax}\left(\frac{QK^{T}}{% \sqrt{d_{k}}}\right)V$ (1)

Figure 4.

The architecture of transformer encoder.

Figure 5.

ATYolo model structure.

The correlation matrix of the vectors is generated by multiplying the transposed points of the query ( $Q$ ) and key ( $K$ ) vectors. To prevent the gradient from vanishing due to the softmax activation function, a normalization technique is applied. Specifically, the correlation values are divided by the square root of the dimensionality ( $d_{k}$ ). This normalized matrix is then multiplied with the value ( $V$ ) matrix to produce the weighted output. As a result, the multi-head attention mechanism module not only emphasizes the current pixel but also incorporates information from other pixels in the context. This enables the module to capture the contextual features beyond the local neighborhood. The Layer Norm and Dropout layers can help the network converge better and prevent overfitting, and the Transformer encoder module not only increases the ability to extract image feature information, but also can use the self-attentive mechanism to explore potential feature information. In the backbone network of STYOLO model, the Transformer encoder module is applied several times to facilitate the global extraction of deep and shallow image features, thus improving the detection rate [17].

Figure 6.

Live sampling chart.

Figure 7.

Network model index mAP comparison.

4. Experiments and results analysis

After constructing the neural network model for surface detection of the finished part, the model training was started. Based on the training set of workpiece surface defects, the constructed neural network model and the yolov5s target detection model used for comparison experiments were trained separately, and then the trained models were tested using the test set. The hardware configuration used for this neural network model training is as follows: CPU: i7-10700/GPU: GTX3080/video memory: 10G/OS: Windows 10/ Python 3.8/ PyTorch. high-speed camera sampling is shown in Fig. 6. In this study, the Adam optimizer is used for training with an initial learning rate (Lr) of 0.001; the BatchSize is set to 16, i.e., 16 images are trained in one iteration with a training cycle of 300 rounds.

In this study, the effectiveness of the designed surface defect detection model was experimentally evaluated by comparing it with the yolov5s network model. The trained surface defect detection model was compared with the yolov5s network model that underwent similar training. The evaluation metrics, specifically the mean average precision (MAP) values, were statistically analyzed to assess the performance of the models. The results of this analysis are presented in Fig. 7, providing insights into the comparative performance of the two models.

The above figure shows that the neural network model designed for the detection of defects in ring workpieces has a certain improvement over the Yolov5s neural network model, with an increase of about 5.3% in MAP value.In this defect detection neural network, four image features are mainly detected in the collected images, which represent the painted workpiece (component), defective wound (SK), drive rod head (Top), and missed sprayed workpiece (blank), and the accuracy (precision) of these four image features are shown in Fig. 8.

Figure 8.

Precision rate results display.

Figure 8 illustrates the changes in the accuracy rate with respect to the confidence level during the training process of both the designed network model and the yolov5 network model. The graph demonstrates that the accuracy rate of ATYolo, the designed network model, increases more smoothly with the progression of training iterations compared to Yolov5s. This indicates that ATYolo exhibits a more stable and consistent improvement in accuracy as the training proceeds, outperforming Yolov5s in terms of training efficiency and accuracy, and is slightly higher than Yolov5s. In the comparison, when the confidence level is greater than 0.5, the accuracy rate of ATYolo is two to three percentage points higher than that of Yolov5s. In the overall model, when the confidence level is greater than 0.5, the accuracy of the ATYolo model finally stays above 0.899 compared to 0.875 of Yolov5s, which is an increase. Also, our observation of Fig. 8 reveals that our SK convergence curves throughout in yolov5s show a lot of jumps etc. due to the smaller number of samples, however in the new network structure we find that a better smooth curve is obtained in ATYolo though with the same number of samples. It means that the network has a better generalization ability.

5. Conclusion

Among the target detection algorithms, Yolo has attracted much attention for its high accuracy and fast operation speed. However, when applied in industrial scenes, the algorithm faces the problem of low defect recognition rate due to the limited number of industrial scene samples. To solve these problems, two improvements are proposed to improve the network structure by replacing the SPP module with an ASPPF module to increase the size of the field of view. The high recognition rate of Transformer layer is also used to replace the original convolutional layer and C3 layer to improve the overall recognition capability. Experimental results show that the improvements proposed in this paper can significantly improve the detection accuracy of Yolov5 algorithm and make it more suitable for practical application scenarios. In subsequent experiments, we can try different Transformer layer depths to further optimize the impact on network recognition accuracy and computing speed. Meanwhile, readers can try to sample traditional image processing methods in a practical industrial environment to improve the quality of training images way to improve the recognition effect.

Footnotes

Acknowledgments

Supported by the Open Research Fund of Key Laboratory of Manufacturing and Automation, Xihua University (SZJJ2012-024).

References

Xiao

Zhang

Tao

Jiang

. An inverse kinematics method for robots after geometric parameters compensationMech. Mach Theory. 2022; 174: 104903. doi: 10.1016/j.mechmachtheory.2022.104903.

Karimanzira

Renkewitz

Shea

Albiez

. Object detection in sonar images. Electronics 9(7): 1180.

Tian

Cheng

Sun

, et al. Gesture recognition based on multilevel multimodal feature fusion. JIntell Fuzzy Syst 2020; 38(3): 2539-2550.

Liang

Shen

Feng

Yan

. Scale-aware fast r-CNN for pedestrian detection. IEEE Trans Multimedia. 2017; 20(4): 985-996.

Liao

, et al. Occlusion gesture recognition based on improved SSD. Concurr Comput PractExp. 2021; 33(6): e6063. doi: 10.1002/cpe.6063..

Zhiyong

. A hyperspectral target detection framework with subtraction pixel pair features. IEEE Access. 2018; 6: 45562-45577.

Jiang

Sun

Kong

Tao

. Gesture recognition based on skeletonization algorithm and CNN with ASL databaseMultimed. Tools Appl. 2019; 78(21): 29953-29970. doi: 10.1007/s11042-018-6748-0.

Hashjin

Boloorani

Khazai

Kakroodi

. Selecting optimal bands for sub-pixel target detection in hyperspectral images based on implanting synthetic targets. IET Image Process. 2018; 13(2): 323-331.

Zhao

Z-Q

Zheng

S-T

. Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems. 2019; 30(11): 3212-3232.

10.

Sung

Kim

Lee

Kim

. Realistic sonar image simulation using deep learning for underwater object detection. Int J Control Autom Syst. 2020; 18(3): 523-534.

11.

Alex

Ilya

Geoffrey

. ImageNet classification with deep convolutional neural networks. Commun. ACM. 2017; 60(6): 84-90.

12.

Shirley

Duncan

Campos Varillas

Hurley

Małek

Roehlly

Smith

MWL

Aussel

Bakx

Buat

, et al. HELP: the herschel extragalactic legacy project. Monthly Notices of the Royal Astronomical Society. 2021; 507(1): 129-155.

13.

Foster

Amano

. Hyperspectral imaging in color vision research: a tutorial. Journal of the Optical Society of America A. 2019; 36: 606-627.

14.

Aghaziyarati

Moradi

Talebi

. Small infrared target detection using absolute average difference weighted by cumulative directional derivatives. Infrared Phys. Technol. 2019; 101: 78-87.

15.

Binotti

Koseoglu

. Novel Parameters to Assess the Severity of Corneal Neovascularization Using Anterior Segment Optical Coherence Tomography Angiography. American Jouranal of Ophthalmology, 2020, 206-217.

16.

Meerdink

Hook

Roberts

Abbott

. The ECOSTRESS spectral library version 10. Remote Sens Environ. 2019; 230(111196): 1-8.

17.

Shi

Wang

Gamba

. Target detection with unconstrained linear mixture model and hierarchical denoising autoencoder in hyperspectral imagery. IEEE Trans Image Process. 2022; 31: 1418-1432.