Abstract
In recent years, research on object detection has intensified. Many object detection results are now applied in daily life, where they greatly facilitate work and everyday activities. In this paper, we propose a more effective object detection neural network model, ENHANCE_YOLOV4. We studied the effects of several attention mechanisms on YOLOV4 and concluded that the spatial attention mechanism had the best effect. Building on previous studies, this paper therefore introduces dilated convolution and 1×1 convolution into the spatial attention mechanism to expand the receptive field and combine channel information. Compared with CBAM and BAM, which are composed of spatial attention and channel attention, this improved spatial attention module reduces model parameters and improves detection capability. We built a new network model by embedding the improved spatial attention module at an appropriate place in YOLOV4. Experiments show that the detection accuracy of this network structure increases by 0.8% on the VOC data set and by 7% on the COCO data set, at the cost of a small increase in computation.
Introduction
Object detection is currently one of the most significant deep learning tasks. It is widely applicable in many fields and is an important research direction in computer vision. Object detection uses algorithms to determine which category each object in an image belongs to and encloses the target objects with bounding boxes. With the rapid development of Internet technology, artificial intelligence, and computer hardware, large amounts of image and video data have appeared in daily life, which makes computer vision technology increasingly important and research on it increasingly active. At present, object detection is widely used in practice, for example in target tracking, medical image analysis, autonomous driving, and national defense systems.
Convolutional neural networks (CNNs) have achieved great success in visual understanding, which has led to a wide range of applications in wearable devices, security systems, mobile phones, automobiles, and other fields. As an early CNN model, AlexNet [1] (2012) is a milestone in the history of deep learning. Using deep learning for object detection is currently the most common and effective approach in the field. Deep learning has powerful expressive ability: given sufficient and accurate training data, a model can learn the features of the target object autonomously and thereby extract the target from an image or video. Since 2012, researchers have continuously studied and improved CNN models, proposing R-CNN [2], Fast R-CNN [3], SPPNet [4], Faster R-CNN [5], and other network models. However, these network models are computationally heavy and memory intensive. The YOLO [6] network alleviates this problem to a certain extent; the tiny-YOLOv2 and tiny-YOLOv3 models are only 60.5 MB and 33.7 MB in size. Because client devices usually have limited computing resources but need real-time inference, these applications require CNNs to provide high accuracy under limited computing power. Therefore, simply piling up more trainable parameters and complex connections to enhance the model is not realistic. To address the problem that the most accurate modern neural networks do not operate in real time, Lu et al. [7] proposed the YOLO-compact network model for single-category real-time object detection.
In recent years, great progress has been made toward more accurate object detection. At the same time, the proposed object detection models consume more and more computation. For example, the AmoebaNet-based NAS-FPN [8] detector requires 167 million parameters and 3045B FLOPs (30 times more than RetinaNet [9]) to achieve state-of-the-art accuracy. Such large model sizes and high computational costs prevent deployment in many real-world applications, such as robotics and autonomous vehicles, where model size and latency are constrained. Given real-world resource constraints, especially in real-time detection systems, the efficiency of detection models becomes an increasingly important consideration. Bochkovskiy et al. [10] proposed YOLOv4, the latest model of the YOLO series, and designed a fast object detector. Starting from the structural design of the model, Hu et al. [11] introduced a new building unit based on channel relationships, the Squeeze-and-Excitation (SE) block, which improves the expressive ability of the network by explicitly modeling the interdependence between the channels of its convolutional features. Takahashi et al. [12] built an object detector on the basis of the YOLOv3 model that takes depth information and color information as input, which greatly reduced the weight of the model.
Inspired by Bochkovskiy et al., this paper combines an improved spatial attention (Spa) mechanism with the YOLO network, constructs the Enhance_YOLOV4 network model, and designs a lightweight object detector that can meet real-time detection requirements in practice. The model is easy to train and use, and it quickly produces real-time, high-accuracy detection results. Our contributions are summarized as follows:
(1) Since the YOLO model was proposed, the series has balanced accuracy and efficiency. By introducing an attention mechanism into the well-performing YOLOV4 detection network, the detection ability and quality of the model can be effectively improved with a relatively small increase in computation.
(2) This paper improves the spatial attention mechanism and verifies it on the YOLOv4 and ResNet50 network models. The results show that Spa&dila improves model performance and also demonstrate the versatility of the improved attention mechanism.
(3) A variety of ablation experiments are conducted to study the impact of several main attention mechanisms on the performance of multi-layer convolutional networks. The final experimental results verify that the proposed model achieves state-of-the-art performance.

ENHANCE_YOLOV4 has a good improvement in detection accuracy.
Object detection
An object detector is usually composed of two parts: a backbone, which extracts image information and target features from the input through training, and a detection head, which predicts object categories and bounding boxes.
Backbone networks come in multiple architectures. The VGG network [16] shows that increasing network depth can improve final performance to a certain extent. VGG uses three convolutional layers with 3×3 kernels to replace one convolutional layer with a 7×7 kernel, and two 3×3 convolutional layers to replace one 5×5 convolutional layer. This increases the depth of the network while preserving the same receptive field, and as a result VGG improves the performance of the neural network. The ResNet network [17] adds a direct (shortcut) connection and proposes the idea of residual learning, which alleviates the problem of information loss to a certain extent. The ResNeXt network [18] proposes a strategy (grouped convolution) that lies between ordinary convolution and depthwise separable convolution and balances the two by controlling the number of groups. The DenseNet network [19] focuses on feature reuse: the input of each layer comes from the outputs of all previous layers, which achieves better results with fewer parameters.
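The receptive-field argument for VGG can be checked with a few lines of arithmetic. The sketch below assumes stride-1 convolutions without bias and a hypothetical channel width `C = 64` (not a value from the paper); it shows that the stacked 3×3 layers cover the same window as the larger kernel while using fewer weights.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each stride-1 layer widens the window by k - 1
    return rf


def conv_weights(kernel_size, channels):
    """Weights of one conv layer with `channels` inputs and outputs (no bias)."""
    return kernel_size * kernel_size * channels * channels


C = 64  # hypothetical channel width, chosen only for illustration

print(stacked_receptive_field([3, 3, 3]))  # 7, same as one 7x7 layer
print(stacked_receptive_field([3, 3]))     # 5, same as one 5x5 layer
print(3 * conv_weights(3, C))              # 110592 weights for three 3x3 layers
print(conv_weights(7, C))                  # 200704 weights for one 7x7 layer
```
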
Researchers have also developed many efficient detection heads. Zeming Li et al. [20] proposed Light-Head R-CNN, which reduces computation by using "thin" feature maps. Xiyang Dai et al. [21] proposed Dynamic Head, which combines multiple self-attention mechanisms and significantly improves the representation ability of the detection head without any computational overhead.
Currently popular object detection algorithms can be divided into two categories. The first consists of R-CNN-style algorithms based on region proposals (R-CNN, Fast R-CNN, Faster R-CNN, etc.). They are two-stage: they first generate candidate boxes and then classify and regress them. The second consists of one-stage algorithms such as YOLO and SSD [22], which use a single CNN to directly predict the categories and positions of different targets. The first type is more accurate but slower; the second type is faster but less accurate.
A one-stage detector does not require a region proposal network (RPN) to generate candidate regions; it uses a single network to solve both object classification and bounding box regression. Redmon et al. [6] proposed the YOLO model, which treats object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. Because the model is end-to-end, it can be optimized directly for detection performance. Subsequently, researchers continued to optimize the YOLO model and proposed a series of new YOLO models. The YOLOv3 model uses a deeper convolutional network and, drawing on the Feature Pyramid Network (FPN), predicts on feature maps at three scales, which improves accuracy [13]. Through a series of ablation experiments, Lu et al. [7] explored methods for transforming a large, deep network into a compact and efficient one, applied them to the YOLOv3 network, and proposed the efficient YOLO-compact model for single-class real-time object detection. Takahashi et al. [12] explored object detection in three-dimensional space. They extended the YOLOv3 architecture to 3D, used Intersection over Union (IoU) in three-dimensional space to confirm the accuracy of the region extraction results, and proposed a lightweight object detector, "Expandable YOLO", that takes depth and color images from a stereo camera as input. The YOLOv4 model [10] achieved state-of-the-art results in real-time detection: Bochkovskiy et al. created a CNN that runs in real time on a GPU and greatly improved the efficiency and accuracy of model training. In addition, Zhou et al. [14] proposed an anchor-free object detector based on center points. The detector uses keypoint estimation to find the center point and regresses to other object attributes, such as size, 3D position, orientation, and even pose.
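As a reminder of what IoU in three-dimensional space measures, the following is a minimal sketch for axis-aligned boxes. The box layout `(x1, y1, z1, x2, y2, z2)` and the function name `iou_3d` are assumptions made for illustration; the exact formulation used in Expandable YOLO [12] may differ.

```python
def iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (x1, y1, z1, x2, y2, z2)."""
    inter = 1.0
    for axis in range(3):
        lo = max(box_a[axis], box_b[axis])
        hi = min(box_a[axis + 3], box_b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis, so no overlap at all
        inter *= hi - lo

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    union = volume(box_a) + volume(box_b) - inter
    return inter / union


# Two 2x2x2 cubes that share a 1x1x1 corner region.
print(iou_3d((0, 0, 0, 2, 2, 2), (1, 1, 1, 3, 3, 3)))  # 1/15
```
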
A two-stage detector uses an RPN to generate multiple proposal boxes in the first stage and then performs classification and location prediction on those proposals in the second stage. The R-CNN network [2] uses candidate boxes for feature extraction and classification and adopts the idea of transfer learning, pre-training on a large data set and then fine-tuning to extract region features. The Fast R-CNN network [3] combines the advantages of the R-CNN and SPPnet algorithms, further improves the convolutional layer structure, adds an ROI pooling layer, and adopts a multi-task loss function. The Faster R-CNN network proposes the RPN structure, which is trained to obtain proposal boxes from the convolutional feature map produced by the preceding layers. The Mask R-CNN network [15] adds a branch for predicting segmentation masks on top of Faster R-CNN and thus performs three tasks: classification, regression, and segmentation.
These models build on the powerful local feature extraction capabilities of CNNs. Studies aiming to improve CNN performance point out three factors that affect network capability: width (the number of channels in each layer), depth (the number of layers), and resolution (the size of the input image and of the feature maps) [23]. EfficientNet [25], proposed later, also confirmed this point.
YOLO introduced a single-stage detection network in which object classification and localization are completed in one step. YOLO divides the image into grids of 13×13, 26×26, and 52×52 cells through the backbone (Darknet or CSPDarknet), and each cell is responsible for detecting the targets whose center points fall within it. Finally, YOLO directly regresses the position and category of each bounding box in the output layer, realizing single-step detection. In this way, YOLOv4 achieves a detection speed of more than 65 frames per second on a Tesla V100. The YOLO detection network has also achieved good results in fire identification and in recognizing whether personal protective equipment is worn [26].
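The grid assignment can be illustrated with a short sketch. It assumes the 416×416 input size used in this paper's experiments, so the three grids correspond to strides of 32, 16, and 8 pixels; the function name `responsible_cell` and the example center point are hypothetical.

```python
INPUT_SIZE = 416            # input resolution used in the experiments
GRID_SIZES = (13, 26, 52)   # the three YOLO output grids


def responsible_cell(cx, cy, grid_size, input_size=INPUT_SIZE):
    """Return (row, col, stride) of the grid cell containing the center (cx, cy)."""
    stride = input_size // grid_size
    # Clamp so a center lying exactly on the right or bottom edge stays in range.
    col = min(int(cx // stride), grid_size - 1)
    row = min(int(cy // stride), grid_size - 1)
    return row, col, stride


# A hypothetical object centered at pixel (x=200, y=120).
for g in GRID_SIZES:
    print(g, responsible_cell(200, 120, g))
# 13 (3, 6, 32)
# 26 (7, 12, 16)
# 52 (15, 25, 8)
```
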
Attention mechanism
Attention mechanisms are widely used in deep learning [27, 28]; they are inspired by the way humans perceive their environment. For example, the human visual system does not process an entire image at once but selects its prominent parts for focused analysis and ignores irrelevant information. For neural networks, the core of an attention mechanism is to compute a weighted combination of the input features and pass it to the next layer. The attention mechanism strengthens the accumulation of effective features in the feature map. It can also align input and output while making use of more effective contextual information in the original data.
Mnih et al. [29] proposed using attention in image classification tasks while reducing the computational complexity of the CNN. Visual attention also brings great improvements to object detection. Ba et al. [30] applied attention to multi-object detection: images are processed sequentially and the model learns to predict one object at a time, generating a sequence of labels until there are no more objects it can recognize. Gregor et al. [31] developed the Deep Recurrent Attentive Writer, which applies attention to image generation; it uses an encoder-decoder framework that compresses and regenerates images step by step during training. The self-attention generative adversarial network proposed by Yu et al. [32] computes the response at a given location as the weighted sum of the features at all locations. They applied the self-attention mechanism to convolutional GANs, where the self-attention module complements convolution by modeling long-range, multi-level dependencies across image regions.
Hu et al. [11] proposed the 'Squeeze-and-Excitation' module, which exploits the relationships between channels and can essentially be regarded as an attention mechanism along the channel axis. The Convolutional Block Attention Module (CBAM) enhances the performance of a convolutional block by combining spatial attention and channel attention [23, 24]. Given an intermediate feature map, CBAM sequentially infers attention maps along two independent dimensions (channel and spatial) and then multiplies the attention maps with the input feature map for adaptive feature refinement. In recent years, channel attention alone has also yielded large improvements for convolutional blocks. ECA-Net [33] proposed an efficient channel attention (ECA) module that involves only k (k ≤ 9) parameters but brings a significant performance gain. In this work, we combine YOLOV4, which performs well in object detection, with a variety of attention mechanisms to find the most suitable network structure. After extensive experiments, we found that the performance of YOLO improves when it is combined with the spatial attention mechanism.
Proposed method
The new network model based on YOLOv4 in this paper is called Enhance_YOLOv4. The structure of our proposed model is shown in Fig. 2. The residual block in the figure is named Enhance_Resblock_body.

⨂ denotes element-wise multiplication, ⊕ denotes element-wise addition, and ø denotes the merging of two one-dimensional tensors into one two-dimensional tensor.
The complete network structure is shown in Fig. 3. The input feature map is passed through the deep convolutional neural network (DCNN) [34] and then into the spatial attention module, which weights the feature map before output. We now describe the network modules in turn.

The convolution block consists of a convolution layer, a batch normalization operation, and a Mish activation function.
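For reference, the Mish activation is commonly defined as x · tanh(softplus(x)), where softplus(x) = ln(1 + e^x). The scalar sketch below illustrates that standard definition only; it is not the authors' implementation, and the helper names are chosen for illustration.

```python
import math


def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


print(mish(0.0))   # 0.0
print(mish(10.0))  # close to 10: large positive inputs pass almost unchanged
print(mish(-1.0))  # small negative value: Mish is not clipped at zero like ReLU
```
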
CSPDarknet-53 is the feature extraction network proposed in YOLOv4. It is used because its feature extraction speed and accuracy are higher than those of Darknet-53 in YOLOv3 [10]. After the original image is input, feature maps with sizes of 52×52, 26×26, and 13×13 are extracted from the third, fourth, and fifth residual blocks, respectively. In this paper, the spatial attention module is added after the convolution block: we apply Spa&dila weighting to the feature map obtained by convolution. Finally, the output feature map of this stage is obtained through a single convolution layer.
The input feature map in Fig. 2 is the feature map obtained after neural network processing. Because the Spa&dila module is plug-and-play, the feature map can be of any scale.
As the number of network layers increases, a neural network can suffer from degradation: in general, the training loss first decreases and then stabilizes as layers are added, but if the depth is increased further, the training loss increases instead. With residual blocks, the network can be made very deep and its performance improves accordingly. To use the deep convolutional network effectively, YOLOv4 adopts a residual block design. The five residual blocks contain one, two, eight, eight, and four small residual blocks, respectively, to extract high-dimensional information from the image. Each small residual block has two convolutional blocks.
Part of the formula of Stage 0 is as follows:
Spatial attention applies a spatial transformation based on the spatial-domain information in the image, and we explore an effective method for computing it. To generate a two-dimensional spatial attention map, a 1×1 convolution is first applied to the input feature map, and the dimension of the output is adjusted to 1. A max pooling operation is then carried out [35]; we believe that max pooling effectively retains spatial information, especially the texture information of the image. At the same time, average pooling is performed on the input feature map to reduce dimensionality, extract the effective features of different channels, and strengthen the background information of the feature map. The two tensors obtained by max pooling and average pooling are combined into a two-dimensional tensor, and a convolution is applied to this tensor to obtain the raw attention map. Finally, a sigmoid is applied to generate the weight map, which is multiplied back onto the original input feature map, thereby enhancing the target regions of interest and suppressing irrelevant background regions.
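The steps above can be sketched in plain Python on nested lists. This is an illustration under several assumptions, not the authors' implementation: pooling is taken along the channel axis (as in CBAM), the 1×1 convolution is modeled as a channel-mixing matrix `pw_weights` that keeps more than one channel so that pooling remains meaningful, the dilated convolution uses zero padding that preserves the spatial size, and all weights are placeholders rather than trained values.

```python
import math


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def pointwise_conv(feat, weights):
    """1x1 convolution: weights[o][i] mixes input channel i into output channel o."""
    h, w = len(feat[0]), len(feat[0][0])
    return [[[sum(wo[i] * feat[i][y][x] for i in range(len(feat)))
              for x in range(w)] for y in range(h)] for wo in weights]


def channel_pool(feat, reduce_fn):
    """Collapse the channel axis of a C x H x W tensor into one H x W map."""
    h, w = len(feat[0]), len(feat[0][0])
    return [[reduce_fn([ch[y][x] for ch in feat]) for x in range(w)]
            for y in range(h)]


def dilated_conv2d(maps, kernels, dilation):
    """Zero-padded, size-preserving dilated convolution from len(maps) channels to one."""
    h, w = len(maps[0]), len(maps[0][0])
    k = len(kernels[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for c, kern in enumerate(kernels):
                for i in range(k):
                    for j in range(k):
                        yy = y + (i - k // 2) * dilation
                        xx = x + (j - k // 2) * dilation
                        if 0 <= yy < h and 0 <= xx < w:  # taps outside count as zero
                            acc += kern[i][j] * maps[c][yy][xx]
            out[y][x] = acc
    return out


def spatial_attention(feat, pw_weights, kernels, dilation=2):
    """Weight a C x H x W feature map by a single H x W attention map."""
    mixed = pointwise_conv(feat, pw_weights)             # integrate channel information
    max_map = channel_pool(mixed, max)                   # max pooling over channels
    avg_map = channel_pool(mixed, lambda v: sum(v) / len(v))  # average pooling
    logits = dilated_conv2d([max_map, avg_map], kernels, dilation)
    attn = [[sigmoid(v) for v in row] for row in logits]
    return [[[ch[y][x] * attn[y][x] for x in range(len(attn[0]))]
             for y in range(len(attn))] for ch in feat]


# Toy example: 2 channels of 4x4 ones, identity 1x1 weights, all-zero 7x7 kernels.
feat = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_kernels = [[[0.0] * 7 for _ in range(7)] for _ in range(2)]
out = spatial_attention(feat, identity, zero_kernels, dilation=2)
print(out[0][0])  # every weight is sigmoid(0) = 0.5, so each value becomes 0.5
```

The attention map has a single channel, so the same spatial weight is applied to every channel of the input, which is what makes the module independent of the input scale and channel count.
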
For the input feature map $F_4 \in \mathbb{R}^{C \times H \times W}$ in Fig. 2, the spatial attention module can derive:
The calculation formula of the spatial attention structure is as follows:
Here, $\sigma$ is the sigmoid activation function, and $df^{7 \times 7}$ represents a dilated convolution with a 7×7 kernel.
So far, the final output feature map F of the stage 0 part in Fig. 3 is:
In the experiments, we found that expanding the receptive field greatly improves the network's detection of targets. For the baseline with the spatial attention mechanism in Table 1, we use a convolutional layer with a 7×7 kernel.
Object detection mAP (%) on the VOC 2012 test set. We use the CSPDarknet detection framework to apply each attention mechanism to the backbone
Inspired by BAM [24], we found that replacing the ordinary convolution in the spatial attention module with dilated convolution improves the results further, with almost no increase in parameters or computation compared with the ordinary spatial attention module, as shown in Table 1. In addition, because a 1×1 convolution can effectively integrate information across channels, we add a 1×1 convolution layer before the max and average pooling to integrate channel information while keeping the parameters and computation of the model low. We aim to improve the performance of spatial attention with as few parameters as possible.
To verify the effect of dilated convolution more clearly, we conduct further experiments with the CenterNet detection model [39], using ResNet50 as the backbone network. As shown in Table 3, we take ResNet50 as the baseline and compare the original attention mechanism with the improved one. To understand the influence of the Spa&dila hyperparameters on the experimental results in detail, we test convolution kernel sizes f of 3 and 7 and dilation rates d of 2 and 4. We find that the best results are obtained with f=7 and d=2, which we adopt as the hyperparameter setting for Spa&dila. We believe that dilated convolution enlarges the receptive field on the feature maps, thereby strengthening the ability of the network and the attention module to recognize objects.
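The trade-off between kernel size f and dilation rate d can be made concrete with the standard relation for a dilated kernel: along one axis it spans d·(f − 1) + 1 positions, while its number of weights depends only on f. The sketch below tabulates both for the tested settings; the two-input, one-output channel layout is an assumption based on the pooled max and average maps feeding a single attention map.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel along one axis: d * (k - 1) + 1."""
    return dilation * (kernel_size - 1) + 1


def conv_weight_count(kernel_size, in_channels=2, out_channels=1):
    """Weights (no bias) of a k x k convolution; the dilation rate adds none."""
    return kernel_size * kernel_size * in_channels * out_channels


for f in (3, 7):
    for d in (1, 2, 4):
        span = effective_kernel_size(f, d)
        print(f"f={f} d={d}: span={span}x{span}, weights={conv_weight_count(f)}")
# f=7 with d=2 covers a 13x13 window with the same 98 weights as an ordinary 7x7 kernel.
```
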
VOC data set, using ResNet50 as the backbone and CenterNet as the detector
Datasets
In the ablation experiments, we use the Pascal VOC2012 and MS COCO data sets. The VOC data set has 20 categories; 13,870 images are used as the training set, 1,542 as the validation set, and 1,713 as the test set.
MS COCO stands for Microsoft Common Objects in Context. It originated from the Microsoft COCO data set that Microsoft funded and annotated in 2014, and the associated competition is regarded as one of the most watched and authoritative in computer vision. For the object detection task, the COCO data set has 118,287 training images (19.3 GB) and 5,000 validation images (1814.7 MB), for a total of 123,287 images.
Parameter settings
In this experiment, each model is trained to convergence, generally for more than 200 epochs. The weight decay is 0.0005. During the first 50 epochs (the frozen-training stage), the learning rate is 0.001; afterwards, the learning rate decays from an initial value of 0.0001. The Adam optimizer is used for parameter optimization, and the input image size of the model is 416×416.
Experimental setup
CSPDarknet (Baseline): We use the original model of YOLOv4 as the baseline experiment in Table 1 and Table 2.
MS COCO data set, using CSPDarknet detection framework
Experimental groups for comparison: we add the following attention modules to the YOLOv4 network model.
Channel: Channel attention module.
Spa (Original): Original spatial attention module.
CBAM: Convolutional Block Attention Module, which combines spatial and channel attention.
ECA: Efficient channel attention module.
BAM: Bottleneck Attention Module.
Spa&dila (Ours): Our proposed improved spatial attention module.
The experiments are carried out in the PyTorch framework on a machine with an AMD Ryzen 3600 CPU and an NVIDIA GeForce RTX 2070 Super graphics card.
This paper tests spatial attention, channel attention, ECA attention, CBAM attention, and BAM attention on the VOC2012 and MS COCO data sets. The experimental results show that our proposed spatial attention outperforms the other network structures in object detection tasks.
All models use data augmentation and training techniques to improve recognition results, such as Mosaic data augmentation, CmBN, and self-adversarial training (SAT). To better evaluate the accuracy and computational cost of the models, we report mAP, FLOPs, F1, recall, and other indicators. The calculation formulas of some of these indicators are as follows:
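The formulas themselves are not reproduced in this text. As a hedged reference, the sketch below uses the standard definitions of precision, recall, and F1 in terms of true positives (TP), false positives (FP), and false negatives (FN); mAP is then the mean over classes of the average precision, which is not computed here. The counts in the example are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard detection metrics from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only: 80 correct detections, 20 false alarms, 40 missed objects.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
# precision=0.800 recall=0.667 F1=0.727
```
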
The experimental results are as follows:
Table 1 shows the experimental results. The proposed Spa&dila spatial attention module produces better accuracy: the mAP of the CSPDarknet + Spa&dila model reaches 79.90%, which is 0.57% higher than that of the baseline model. This shows that 1×1 convolution and dilated convolution can extract effective information from the feature maps and produce good inference performance. We can also see that replacing the ordinary convolution in the spatial attention module with dilated convolution improves the results further: the mAP of CSPDarknet + Spa&dila exceeds that of the original Spa model and is the best among all models in Table 1. Moreover, the Spa&dila model contains only slightly more parameters and computation than the original spatial attention module, and fewer than the BAM and CBAM attention modules, while achieving better performance.
From Table 1, we can see that the various combinations of spatial attention perform well. Table 2 therefore focuses on further experiments on the more widely applicable COCO data set. Our proposed method still has the best performance. Compared with the original YOLOv4 network model, we improve accuracy by 7%. Compared with the BAM attention module, our module improves mAP by 0.89% while obtaining the best F1. A small number of test results are shown in Fig. 4.

Partial test results on the COCO data set.
Fig. 5 shows that the model converges faster after adding Spa&dila, and the loss curve of Spa&dila is smoother after convergence. This also indicates that Spa&dila extracts effective features better than the other models.

Object detection on the VOC 2012 test set. Loss curves of the five models before and after convergence (note: the number of iterations needed to reach convergence differs between models).
To illustrate the generality of the proposed Spa&dila, we add a further set of experiments. In Table 3, we use the CenterNet detection network instead of YOLOv4.
In addition, we examine the influence of the convolution kernel size on the experimental results. As shown in Table 3, regardless of the kernel size, adding the Spa module to the model gives better results than the baseline. Using f=7 produces better accuracy in both cases, which means that a wide field of view (i.e., a large receptive field) is needed to determine the important spatial regions. With this in mind, we paired the kernel sizes with different dilation rates d to compute spatial attention and found that the model achieves the best results when the kernel size is 7 and the dilation rate d is 2. As Table 3 shows, the mAP and F1 values of our model exceed the baseline by 2.19% and 13%, respectively, reaching a state-of-the-art level. This set of experiments also shows the effectiveness and generality of the proposed Spa&dila for CNNs.
We propose an improved spatial attention mechanism as a method for improving the representation ability of CNNs. We combined attention-based feature refinement with a parameter-reducing 1×1 convolution and achieved considerable performance improvements while keeping the overhead small.
After observing in our experiments how the receptive field affects the detection network, we replaced the convolution in the original attention mechanism with a dilated convolution, which strengthens the effect of the spatial attention mechanism on the network model.
We further improve performance by making full use of spatial attention. Our final module learns to emphasize or suppress intermediate features effectively, refining both their content and their location. To verify its effectiveness, we conducted extensive experiments with various recent models and confirmed that the Enhance_YOLOv4 proposed in this paper outperforms all baselines on two benchmark data sets, MS COCO and VOC 2012. In addition, we added the spatial attention module to ResNet50 to verify the versatility of the module. We observe that our module helps the model correctly locate and classify target objects. We hope that the improved Spa module can achieve better results in various network models.
Acknowledgments
This research is partially supported by the National Natural Science Foundation of China Key Project (U2003208), the National Natural Science Foundation of China (62162058), and the Autonomous Region Major Science and Technology Project (2020A03004-4). We would also like to thank our tutor for the careful guidance and all the participants for their insightful comments.
