Abstract
Manual inspection of mobile phone screen defects is inconsistent, and traditional machine learning, whose image feature extraction is usually set from experience, gives unsatisfactory detection results. This paper therefore proposes a mobile phone screen defect detection model (Ghostbackbone) based on YOLOv5s and Ghostbottleneck. The bottleneck of Ghostbackbone adopts and improves the Ghostbottleneck of GhostNet. The attention module of Ghostbackbone uses Coordinate Attention, and Depthwise Separable Convolution is used for parameter reduction. Finally, Ghostbackbone uses YOLOv5 as the object detector and is trained on a mobile phone screen defect dataset. The experimental results show that the parameter count of Ghostbackbone is 24% of that of YOLOv5s, the average time to detect a single image is only 2% lower than that of YOLOv5s, and the mAP0.5:0.95 is 2% higher than that of MobileNetV3s.
Introduction
Manufacturing defects inevitably occur on mobile phone production lines. In traditional machine learning, most image feature extraction is set manually based on experience [1], while manual inspection is prone to inconsistent detection standards. As a result, neither traditional machine learning detection systems nor manual inspection can keep up with increasingly large production lines. With the growth of computing power, deep learning has advanced considerably in object detection. Two-stage object detection algorithms offer high precision and can detect small targets, while one-stage algorithms reach detection speeds that two-stage algorithms cannot. A deep-learning-based mobile phone defect detection algorithm can, for now, meet the speed and accuracy requirements of current mobile phone manufacturing lines.
The main components of a mobile phone screen are the cover glass (screen), the touch module, and the display module. Of these, the cover glass is the part most vulnerable to manufacturing defects, so it is necessary to inspect it. In current computer vision, deep-learning-based defect detection mainly follows three approaches: semantic segmentation [2, 28], object detection [5–10], and GAN [1]. Among semantic segmentation methods, reference [28] proposes an algorithm for small datasets and limited computing power that consists of a segmentation network and a decision network. Although this network needs few training images and little computing power, it cannot effectively detect small objects and its false detection rate remains high. Reference [3] designs a modified FCN-based algorithm that can be used for detection against complex backgrounds; it modifies the VGG-based FCN structure and the filter size, removes the abstaining (dropout) neurons in the fully connected layer, and increases the network depth. In object detection, reference [4] points out that classical algorithms fall into two types: two-stage algorithms, represented by the R-CNN [5–7] series, and one-stage algorithms, represented by the SSD [8] and YOLO [9–12] series. Reference [13] designs an improved algorithm based on Fast R-CNN, realized mainly by combining VGG16 with Fast R-CNN and increasing the aspect ratio of the detection box. Reference [14] uses a pre-trained YOLOv2 to compute candidate boxes, taking its detection results as the candidate boxes so that they are obtained more quickly.
Finally, the candidate boxes are fed into a VGG16 network for classification to obtain the result boxes, which are then used as corrected boxes for retraining YOLOv2. There is also a GAN-based application [1]. Reference [1] argues that defects occur with low probability and are difficult to collect; its approach does not require manual labeling of the dataset, but it performs poorly on datasets with complex backgrounds. Beyond these classical applications, reference [2] designs an image recognition model based on a multi-scale convolutional neural network, implemented mainly by adding Difference of Gaussians (DoG) to a traditional CNN.
In addition to these deep learning methods, the traditional support vector machine (SVM) [15, 16] has also been used.
The mobile phone production line still needs a detection model that is fast and easy to port to various devices. This paper therefore uses YOLOv5, which is capable of high-speed detection, as the detector, and proposes Ghostbackbone, an improved lightweight object detection network based on the YOLOv5s object detection network. The contribution of this paper is a lightweight YOLOv5 backbone that greatly reduces the network parameters while maintaining the mAP of YOLOv5s in mobile phone screen defect detection.
Related work
Some deep-learning-based mobile phone defect detection algorithms are already used on production lines, but deficiencies remain: there is still much room for improvement in speed and ease of use; each algorithm defines defects differently; each algorithm makes its own trade-off between detection speed and accuracy, and few are well balanced; and, in terms of development framework, TensorFlow, which is difficult to use, is preferred in academia and industry. Scholars at home and abroad have therefore made considerable efforts toward high-precision, high-speed, and easily deployable defect detection.
GhostNet
The Huawei Noah team argues that not all feature maps need to be obtained by convolution; some can be obtained through cheap linear operations. They therefore designed the lightweight network GhostNet [17], whose main competitor is MobileNet [18–20]. In the source code of YOLOv5, GhostNet is implemented as GhostConv and Ghostbottleneck. GhostConv mainly realizes the pointwise expansion and pointwise linear projection of Ghostbottleneck. Comparing the Ghostbottleneck in YOLOv5 (shown in Table 1) with Huawei's open-source Ghostbottleneck shows that the code of the SE [22] (Squeeze-and-Excitation) module has been removed. Since YOLOv5 currently has no formal paper, this can only be analyzed from the code: the SE module can be added to the backbone by the user instead of being fixed inside the bottleneck and its parameter schedule.
Overview of the implemented Ghost Bottleneck architecture in YOLOv5
For a feature layer, the Ghost Module generates only part of it, the real feature layer, by convolution; the remainder is called the ghost feature layer. The ghost feature layer is not generated by convolution but obtained by a linear operation on the real feature layer, and the real and ghost feature layers are then combined into the complete feature layer. Compared with previous networks, the Ghost Module therefore performs fewer convolution computations.
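The saving described above can be illustrated with a rough parameter count. The sketch below is not taken from the paper: it assumes the formulation of the GhostNet paper [17], in which a primary convolution with a k x k kernel produces 1/s of the output channels and a d x d depthwise "cheap operation" produces the rest. The function names and the choice s = 2, d = 3 are illustrative assumptions, and biases and batch normalization are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a plain k x k convolution (no bias)."""
    return c_in * c_out * k * k


def ghost_module_params(c_in, c_out, k=1, d=3, s=2):
    """Weights of a Ghost Module under the assumptions stated above.

    c_out // s "real" channels come from an ordinary k x k convolution;
    the remaining "ghost" channels come from a d x d depthwise operation,
    which needs one d x d filter per ghost channel.
    """
    assert c_out % s == 0, "sketch assumes c_out is divisible by s"
    real = c_out // s
    ghost = c_out - real
    primary = c_in * real * k * k   # ordinary convolution
    cheap = ghost * d * d           # depthwise "linear" operation
    return primary + cheap


if __name__ == "__main__":
    std = standard_conv_params(64, 128, 3)
    ghost = ghost_module_params(64, 128, k=3, d=3, s=2)
    print(std, ghost, round(std / ghost, 2))
```

With s = 2 the count is roughly halved, which is the effect the Ghost Module relies on; the exact kernel sizes used by GhostConv in YOLOv5 may differ from the values assumed here.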
Ghostbottleneck borrows the residual block structure of ResNet and comes in two types (G-neck): a bottleneck with a stride of 1 (G-neckS1) and a bottleneck with a stride of 2 (G-neckS2). In both, the first Ghost Module serves as an expansion layer that increases the number of channels, and the second Ghost Module reduces the number of channels to match that of the shortcut.
Depthwise Separable Convolution [22] (DWSConv) is best known from the lightweight MobileNetV1 [18], where it replaces the traditional convolution module. This convolution reduces the redundancy of the convolution kernels and greatly reduces the number of network parameters and computations. However, the Huawei Noah team points out in [17] that although MobileNet and ShuffleNet introduced Depthwise Separable Convolution, the pointwise convolution in its second half still accounts for much of the memory and FLOPs. They therefore removed the pointwise convolution in the Ghost Module and use only the depthwise convolution, which is also reflected in YOLOv5. Because this trades accuracy for space, this paper examines whether pointwise convolution should be used.
YOLOv5 (5th Generation of YOLO)
The YOLOv5 paper [24] has not been published, and many scholars debate whether it deserves the name YOLOv5. Nevertheless, YOLOv5 is not inferior to YOLOv4 in many fields, especially when deployed to Internet of Things devices, and it may have advantages for mobile phone defect detection and production line deployment. An analysis of the first-generation YOLOv5 code shows few advantages over YOLOv4 (it adopts data augmentation methods [29, 30] similar to those of YOLOv4, i.e., the “BoF / BoS” idea of YOLOv4 [11]). However, YOLOv5 is implemented in PyTorch, which means that more people can build on it.
Unlike YOLOv4, YOLOv5 also has two characteristic functions: adaptive anchor box calculation and adaptive image scaling. Adaptive anchor box calculation is implemented in autoanchor.py by the function check_anchors, whose input parameters are the dataset, the model, a threshold, and the image size used for inference. Inside check_anchors, the metric function decides whether to recalculate the anchors: when the BPR (best possible recall) and AAT (anchors above threshold) computed by the metric function are below the set threshold, the anchors are recalculated.
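The anchor check can be sketched in plain Python as follows. This is a simplified reading of the idea, not the code in autoanchor.py: the function name anchor_metric, the default thr = 4.0, and the recalculation threshold of 0.98 are assumptions made for illustration. Each label box is compared with each anchor by the worst width or height ratio, and a box counts as matchable if at least one anchor is within a factor of thr.

```python
def anchor_metric(box_whs, anchor_whs, thr=4.0):
    """Return (bpr, aat) for label boxes (w, h) against anchors (w, h).

    bpr: fraction of boxes whose best anchor is within a factor `thr`.
    aat: average number of anchors per box within that factor.
    """
    best_scores = []
    above_counts = []
    for bw, bh in box_whs:
        scores = []
        for aw, ah in anchor_whs:
            rw, rh = bw / aw, bh / ah
            # Worst-case agreement over width and height, in (0, 1].
            scores.append(min(rw, 1 / rw, rh, 1 / rh))
        best_scores.append(max(scores))
        above_counts.append(sum(s > 1 / thr for s in scores))
    n = len(box_whs)
    bpr = sum(b > 1 / thr for b in best_scores) / n
    aat = sum(above_counts) / n
    return bpr, aat


if __name__ == "__main__":
    boxes = [(10, 10), (100, 100)]
    anchors = [(10, 10), (20, 20)]
    bpr, aat = anchor_metric(boxes, anchors)
    needs_recalc = bpr < 0.98  # hypothetical threshold for this sketch
    print(bpr, aat, needs_recalc)
```

In this toy example the large box has no anchor within a factor of 4, so the recall is low and the anchors would be recalculated.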
The code of adaptive image scaling is in the letterbox function of datasets.py. Its function is to unify the image resolution. Its main implementation logic is as follows:
Step 1: compute the ratio of the target size to the input size for the length (L) and the width (W) separately, and take the smaller of the two values as the coefficient K, as shown in Equation 1:
$$K = \min\left(\frac{L_{out}}{L_{in}}, \frac{W_{out}}{W_{in}}\right) \quad (1)$$
Step 2: multiply the input length and width by the coefficient K to obtain the scaled (minimum) length and width, as shown in Equation 2:
$$L_{min} = K \cdot L_{in}, \qquad W_{min} = K \cdot W_{in} \quad (2)$$
Step 3: from the smaller of the two scaled values, compute the padding value n that is added to reach the target size (Equation 3).
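The three steps can be sketched as a small geometry helper. This is an illustrative reconstruction rather than the letterbox function itself: the function name, the stride of 32, and the "minimal" padding mode (padding only up to the next multiple of the stride) are assumptions, and only the sizes are computed, not the image.

```python
def letterbox_geometry(in_h, in_w, out_h=640, out_w=640, stride=32, minimal=True):
    """Return (k, scaled_h, scaled_w, pad_h, pad_w) for an aspect-preserving resize.

    pad_h and pad_w are the padding added on EACH side of the image.
    """
    # Step 1: K is the smaller of the two output/input ratios.
    k = min(out_h / in_h, out_w / in_w)
    # Step 2: scale both sides by K.
    scaled_h, scaled_w = int(round(in_h * k)), int(round(in_w * k))
    # Step 3: total padding needed to reach the target size.
    pad_h, pad_w = out_h - scaled_h, out_w - scaled_w
    if minimal:
        # Assumed behaviour: pad only up to the next multiple of the stride.
        pad_h, pad_w = pad_h % stride, pad_w % stride
    return k, scaled_h, scaled_w, pad_h / 2, pad_w / 2


if __name__ == "__main__":
    # A 500 * 266 image (width * height) scaled toward 640 * 640.
    print(letterbox_geometry(266, 500))
```

For a 500 * 266 image, K is 1.28, the scaled size is 640 * 340, and under the assumed minimal mode 6 rows are added above and below, giving 640 * 352.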
At present, YOLOv5 has five versions, in which version 3.0 embeds the Hardswish function into the Conv module, and version 4.0 replaces the previous LeakyReLU and Hardswish functions with the SiLU function.
YOLOv5 has several network models, such as YOLOv5s, YOLOv5m, and YOLOv5l, which share many similarities. YOLOv5s is used here as an example.
As shown in Table 2, YOLOv5s is composed of four main modules: Focus, Conv (convolution), C3 (CSP bottleneck with 3 convolutions), and SPPF.
Overview of the backbone architecture of yolov5 to be improved
C3 denotes a BottleneckCSP with 3 convolution layers, a module used to extract image features. Its main components are Conv and Bottleneck: the first Conv2d, with a kernel size of 1 * 1, is followed by BN+SiLU to form a convolution layer; the second is built in the same way, with a kernel size of 3 * 3.
Before version 4.0, YOLOv5 was built from BottleneckCSP; in version 4.0 it was rebuilt into a new module called C3 (see the YOLOv5 update log) in order to trim parameters. The composition of C3 is shown in Fig. 1, where the kernel size of each Conv is 1 * 1:

The structure of C3 in YOLOv5.
The network model designed in this paper runs on GitHub open source YOLOv5 (Release 4.0). This version integrates training, testing, and detection code, and the model file can be modified through the configuration file (YAML). The training and detection steps are as follows:
If a model needs to be customized, add the code that implements the new function to experimental.py. Then set its parameters in the relevant function in yolo.py, and run yolo.py to check that the configuration works.
In summary, YOLOv5 optimizes the dataset input by adaptively adjusting the image resolution; proposes the C3 structure; uses PANet [27] and FPN as the neck; is developed in PyTorch, which makes the code more readable; and exposes the Width Multiplier and Depth Multiplier in the network configuration file, where they can be modified directly to change the network width and depth.
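How the two multipliers act on a layer definition can be sketched as below. The helper names are illustrative, and the rounding rules (repeat counts rounded with a minimum of 1, channel counts rounded up to a multiple of 8) are our reading of the YOLOv5 model parser and should be treated as an assumption rather than a specification.

```python
import math


def make_divisible(x, divisor=8):
    """Round x up to the nearest multiple of divisor."""
    return math.ceil(x / divisor) * divisor


def scale_depth(repeats, depth_multiple):
    """Scale the repeat count of a module; single modules are left as-is."""
    return max(round(repeats * depth_multiple), 1) if repeats > 1 else repeats


def scale_width(channels, width_multiple, divisor=8):
    """Scale the output channels of a module."""
    return make_divisible(channels * width_multiple, divisor)


if __name__ == "__main__":
    # Multipliers used for the small models in this paper's experiments.
    depth_multiple, width_multiple = 0.33, 0.50
    print(scale_depth(9, depth_multiple))      # a block listed with 9 repeats
    print(scale_width(1024, width_multiple))   # a layer listed with 1024 channels
```

Under these assumptions a block written with 9 repeats in the configuration file is instantiated 3 times, and a layer written with 1024 channels gets 512.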
For mobile phone production lines, which still need a detection model that is fast and easy to embed in various devices, this paper proposes using YOLOv5, which is capable of high-speed detection, as the detector, and replacing the C3 module of the YOLOv5s object detection network that comes with YOLOv5 with a GhostBottleneck that has fewer parameters. Compared with YOLOv5s and MobileNet, GhostBackbone is better suited to the low-power hardware found in real production environments. In addition, PyTorch is easy to use in practice, a growing number of papers describe YOLOv5-based deployment solutions, and bugs in YOLOv5 can be reported directly to the authors on GitHub for questions and answers.
However, some work [29, 30] has shown that deploying YOLOv5s to edge devices without CUDA support still does not yield a significant speedup, first because of their limited computing power and second because of the relatively large memory footprint of the algorithm itself, so work on making YOLOv5s more lightweight still needs to continue.
GhostBackbone
The idea behind GhostBackbone is to use DWSConv or DWConv as the convolution module and then use the improved Ghostbottleneck to form a complete GhostBackbone.
As shown in Table 3, two networks are used in this chapter: the basic GhostBackbone simple (hereinafter simple) and GhostBackbone deep (hereinafter deep). Deep is a deepened version of simple, with all other structures unchanged. As also shown in Table 3, two convolution modules are used, DWSConv and DWConv, giving four networks in total.
Configuration list of the modified model
The Ghostbottleneck already implemented in YOLOv5 [24], shown in Table 1, has no integrated SE module. The improved Ghostbottleneck therefore adds the SE module and CoordAtt as optional modules for comparison, as shown in Table 4:
Overview of improved Ghostbottleneck
Attention modules have rarely been applied in lightweight networks, because the computation of many attention modules is more than a lightweight network can afford. They are used, however, in networks such as MobileNet: the SE module is a widely used attention module, but it attends to channel information and ignores location information. CoordAtt differs in that it adds position information to the channel information, which makes it more computationally demanding while achieving better results.
The attention modules considered here are the SE module and CoordAtt. Their code is embedded in the Ghostbottleneck, with a variable named option indicating whether an attention module is used. When an attention module is used, its parameter count also matters; the parameter increment ratio reflects how many more parameters are added by using CoordAtt instead of the SE module in Fig. 2.

Improved bottleneck.
To address this, [21] gives a comparison diagram and a way of calculating the parameter counts. The parameter increment ratio from the SE module to CoordAtt is given in Equation 4, where R is the dimensionality reduction coefficient, C is the number of channels, and H and W are the two spatial directions:
It can be seen from Equation 4 that, after the replacement, the increment ratio depends mainly on the two parameters W and H, and the parameter count of CoordAtt is about n2 times that of the SE module.
DWSConv or DWConv is used as the convolution module in this network. Both have far fewer parameters than a standard convolution. Taking the parameter calculation of DWSConv as an example, N is the number of parameters, CW is the convolution kernel width, CH is the convolution kernel height, IW is the image width, IH is the image height, Nin is the number of input channels, and Nout is the number of output channels.
A standard convolution is shown in Equation 5:
A depthwise convolution is shown in Equation 6:
A pointwise convolution is shown in Equation 7:
After using Depthwise Separable Convolution instead of standard convolution, the parameter quantity can be significantly compressed as shown in Equation 8.
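The compression can be checked numerically with the usual weight-count formulas. The sketch below is an assumption about what Equations 5 to 8 express, since they are not reproduced here: it counts weights only (no bias, no batch normalization), reuses the symbols CW, CH, Nin, and Nout from the text, and uses illustrative function names.

```python
def standard_conv_params(cw, ch, n_in, n_out):
    """Standard convolution: one CW x CH x Nin filter per output channel."""
    return cw * ch * n_in * n_out


def depthwise_conv_params(cw, ch, n_in):
    """Depthwise convolution: one CW x CH filter per input channel."""
    return cw * ch * n_in


def pointwise_conv_params(n_in, n_out):
    """Pointwise convolution: a 1 x 1 convolution mixing channels."""
    return n_in * n_out


def dws_conv_params(cw, ch, n_in, n_out):
    """Depthwise separable convolution = depthwise followed by pointwise."""
    return depthwise_conv_params(cw, ch, n_in) + pointwise_conv_params(n_in, n_out)


if __name__ == "__main__":
    cw, ch, n_in, n_out = 3, 3, 64, 128
    std = standard_conv_params(cw, ch, n_in, n_out)
    dws = dws_conv_params(cw, ch, n_in, n_out)
    # The ratio equals 1 / Nout + 1 / (CW * CH).
    print(std, dws, dws / std)
```

For a 3 * 3 kernel the separable form needs roughly one ninth of the weights plus a small term that shrinks as Nout grows; DWConv alone drops the pointwise term as well, which is why it has even fewer parameters.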
Table 5 shows the most basic network architecture of GhostBackbone (GhostBackbone-Simple).
Overview of the backbone of GhostBackbone-simple
Deepening GhostBackbone simple yields the backbone of GhostBackbone deep, shown in Table 6: after the convolution layers, 9 Ghostbottlenecks and 3 Ghostbottlenecks are inserted.
Overview of the backbone of GhostBackbone-Deep
Experimental environment
In this paper, YOLOv5 (version 4.0) will be used to train and test the mobile screen dataset. Different computer hardware and software will affect the experimental results, so it is necessary to list the computer configuration, as shown in Table 7.
Main configuration of training
In mobile phone manufacturing, the LCD detector is the main equipment used to detect defects, so the collected dataset should be close to the images captured by an LCD detector. This paper takes 3 types of defects (scratches, jags, oil) as examples. The defects are photographed on two materials: toughened glass and the front screen of a mobile phone. During the manufacturing of mobile phone screens, scratches, jags, oil, and other residues on the screen are inevitable. Fig. 3 shows samples from the dataset used in the experiment (3 categories in total).

Three types of Screen Defects Samples.
Because glass is reflective, changes in light intensity, direction, and other conditions present the detector and the network with new cases to learn. Therefore, two different lighting arrangements are used when selecting the light source. Their advantage is that neither shines light directly into the lens, which would cause light pollution and specular reflection; the room should also be kept dark when shooting. Only a spatially parallel light beam gives the best shooting effect. As shown in Fig. 4, these two lighting arrangements minimize the reflection of the mobile phone screen and background light into the camera lens, so that the various screen defects can be captured clearly. With these two methods, three types of defects (jag, oil, screen scratch) can be labeled. With each type of defect randomly distributed, a total of 8919 images are constructed: 4993 for training, 2141 for validation, and 1785 for testing. Each image is 500 * 266 or 266 * 500 pixels.

Two ways of parallel lighting.
YOLOv5, as a detector, uses the bounding-box annotation format. Therefore, small defects (such as the cracks in the red inner frame) are fully enclosed by the annotation box, whereas for conjoined or large defects (such as the broken pits in the green inner frame) only the main features of the defect are enclosed in the box, as shown in parts C and D of Fig. 5:

Illustration of the labeling rules for the dataset.
Experiments without attention module
In this experiment, GhostBackbone-Simple is used for training and testing. The purpose is to evaluate its parameters, speed (inference time and NMS time for detecting each 640 * 640 image), mAP (computed as shown in Eq. 11), etc.
To calculate the mAP, first calculate P (precision) and R (recall), where TP means that a defect is correctly identified as a defect, FP means that a defect is incorrectly identified (a defect is mistaken for another type of defect), and FN means that a defect is not identified. In the mAP formula, C is the number of defect categories, N is the number of images in the test set, K is the IoU threshold, P(k) is the precision at the current IoU, and R(k) is obtained in the same way.
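As a rough illustration of these quantities, the sketch below uses the common definitions P = TP / (TP + FP) and R = TP / (TP + FN), computes the average precision of one class as the area under its precision-recall curve, and averages over classes and IoU thresholds. It is a generic sketch, not the paper's Eq. 11 or the implementation in test.py; the function names and the all-point interpolation are assumptions.

```python
def precision(tp, fp):
    """P = TP / (TP + FP); defined as 0 when nothing was predicted."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """R = TP / (TP + FN); defined as 0 when there are no ground truths."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(recalls, precisions):
    """Area under the precision-recall curve with a monotone precision envelope."""
    pts = sorted(zip(recalls, precisions))
    r = [0.0] + [x for x, _ in pts] + [1.0]
    p = [0.0] + [y for _, y in pts] + [0.0]
    # Make precision non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_table):
    """Mean of AP over classes and IoU thresholds: ap_table[class][iou_index]."""
    values = [ap for per_iou in ap_table for ap in per_iou]
    return sum(values) / len(values)


if __name__ == "__main__":
    print(precision(8, 2), recall(8, 8))
    print(average_precision([0.5, 1.0], [1.0, 0.5]))
    print(mean_average_precision([[0.75, 0.25], [0.5, 0.5]]))
```

For mAP0.5, ap_table would hold a single IoU column per class; for mAP0.5:0.95 it would hold one column per IoU threshold from 0.5 to 0.95.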
In this experiment, all models are trained for 130 epochs with depth_multiple = 0.33, width_multiple = 0.50, batch size = 32, and 1 dataloader worker. The DWSConv and DWConv modules are the main variables. MobileNetV3s [20] and YOLOv5s [24] are used for comparison with the four models shown in Table 3. After training, test.py is used to test the mAP and CPU detection time of each model. Three different mAPs (mAPval 0.5 : 0.95, mAPtest 0.5 : 0.95, mAPval 0.5) are tested with confidence = 0.001 and IOU = 0.65; CPU detection time is measured with confidence = 0.25 and IOU = 0.45.
Because these models might be deployed to Internet of Things devices, CPU detection time is taken as the speed metric. CPU detection time refers to the time (ms) required to detect a 500*266 or 266*500 image with the CPU.
In addition, because of how computers and hardware work, the detection speed measured for each model varies somewhat from run to run. The images are resized to 640*640, because the “img-size” in “test.py” is set to 640*640.
Table 8 shows that using Ghostbottleneck with DWConv or DWSConv reduces parameters effectively. Tables 8 and 9 show that DWConv has fewer parameters, but the pointwise convolution in DWSConv improves the mAP. In terms of parameter reduction, the models using Ghostbottleneck are better than MobileNetV3s and YOLOv5s. In terms of speed, simple, which is built from Ghostbottleneck, speeds up CPU detection, and even deep, which has been deepened, reaches a speed close to that of MobileNetV3s and YOLOv5s. Table 8 also shows that some loss of mAP from parameter reduction is unavoidable, but it can still be mitigated: as the comparisons between deep and deep+ and between simple and simple+ indicate, deepening the network improves the mAP; likewise, compared with deep, or comparing simple+ with deep+, using DWSConv improves the mAP.
List of parameters and Speed CPU (ms) of models
In Fig. 6, because GhostNet's bottleneck extracts feature maps relatively well, the confidence gap between Deep+ and YOLOv5s is ±0.02, taking Screen_Scratch as an example.

Comparison of three methods (Cite samples with scratch).
With such a small gap, both Deep+ and YOLOv5s can be considered to assign about 79% probability that these defects are Screen_Scratch, whereas the confidence of MobileNetV3s is 0.26 for the left defect and 0.68 for the right defect, lower than that obtained by Deep+ and YOLOv5s. As shown in Figs. 7 and 8, similar conclusions are obtained when the defects are replaced by jag and oil.

Comparison of three methods (Cite samples with oil).

Comparison of three methods (Cite samples with jag).
The attention module helps the network find the regions worth focusing on, which is of great value for improving network performance, but few attention modules have been applied to lightweight networks. Therefore, the SE module and Coordinate Attention are tested here and applied to defect detection. In this experiment, GhostBackbone-simple is used as the experimental object. Its advantages are a small variable adjustment range and fast training speed.
The experimental data in Table 10 are also consistent with the experimental results of CoordAtt [23]. After adding an attention mechanism, the network parameters increase slightly and the inference time increases, and the parameter count of CoordAtt is higher than that of the SE module. As the reduction decreases, the number of network parameters increases and the inference time increases slightly.
A performance list of GhostBackbone-Simple being added attention
List of Detection Speed and accuracy in various areas
Table 11 shows that, under the same dimensionality reduction coefficient, the mAP of GhostBackbone simple with CoordAtt is higher than that with the SE module, while the increase in parameters and inference time is small, so CoordAtt is suitable for improving GhostBackbone.
A mAP list of GhostBackbone-Simple being added attention
In this experiment, GhostBackbone deep, which has the best performance in Table 9, is used and CoordAtt is added to it, with the reduction of CoordAtt set to 4.
A list of all kinds of map of the model
As shown in Table 12, adding CoordAtt to Deep does not change the number of parameters much, but the speed becomes slower, even slower than that of YOLOv5s: the SpeedCPU of Deep with CoordAtt and Deep+ with CoordAtt decreased by about 8% to 10%. As shown in Table 13, Deep with CoordAtt obtains better mAP, improving by about 4% over MobileNetV3s and by about 2% over deep. After lightweighting, the mAP is still not as good as that of YOLOv5s, but Deep+ with CoordAtt remains worth exploring further.
List of parameters and Speed CPU (ms) of models
A list of all kinds of map of the model
As the 3 experiments in chapter IV.B show, Ghostbottleneck has a clear advantage in reducing parameters. To maintain accuracy, additional Ghostbottlenecks are added and DWSConv is used instead of DWConv, so the detection time of Deep becomes longer. In addition, Deep with CoordAtt performs better than Deep with the SE module.
The DCGAN+LBP approach is well suited to low-case defect detection tasks, and although that work comes from the field of semantic segmentation, its detection performance and speed also suit industrial production lines. The Dec-YOLOv5 approach is also strong: it uses a good model training method and improves the network so that it maintains the accuracy of YOLOv5 while further improving detection speed. Although it maintains accuracy relative to the original model, the speedup in this paper is larger than in that work, and the number of model parameters is compressed significantly while comparable accuracy is still achieved.
This paper proposes Ghostbackbone, which adopts the Ghostbottleneck of GhostNet and uses Coordinate Attention to improve its original attention module. Experiments show that GhostBackbone offers better cost performance than MobileNetV3s and YOLOv5s in recognition accuracy, defect detection speed, and parameter reduction, which makes it feasible to deploy Ghostbackbone on mobile phone production lines.
The Ghostbackbone proposed in this paper achieves good performance, but there is still room for future work: when it is difficult to collect enough defect samples for an object detection training dataset, unsupervised and semi-supervised algorithms, as well as GAN-based defect dataset generation and detection from positive samples [1], are possible next research directions.
Acknowledgments
This research was financially supported by the National Key R&D Program of China (2018YFB1308600, 2018YFB1308602); the Major Scientific Research Project for Universities of Guangdong Province (2019KZDXM015, 2020ZDZX3058); the Guangdong Provincial Special Funds Project for Discipline Construction (No. 2013WYXM0122); and the Key Laboratory of Intelligent Multimedia Technology (201762005).
