Small object detection based on attention mechanism and enhanced network

Abstract

Small object detection has broad application prospects in image processing for unmanned aerial vehicles, autonomous driving and remote sensing. However, small object detection remains challenging because of aggregation, occlusion and insufficient feature extraction. In this paper, we propose an improved algorithm for small object detection to address these issues. By using the spatial pyramid to extract multi-scale spatial features and applying multi-scale channel attention to capture global and local semantic features, the spatial pooling pyramid and multi-scale channel attention module (SPP-MSCAM) is constructed. More importantly, the fusion of a shallower layer with higher resolution and a deeper layer with more semantic information is introduced into the neck structure to improve the sensitivity to small object features. Extensive experiments on the VisDrone2019 dataset and the NWPU VHR-10 dataset show that the proposed method significantly improves the Precision, mAP and mAP50 compared with the YOLOv5 method, while preserving considerable real-time performance. The improved network proposed in this paper can effectively alleviate the difficulties of aggregation, occlusion and insufficient feature extraction in small object detection, which would be helpful for its potential applications in the future.

Keywords

Convolutional neural network, small object detection, attention mechanism, feature fusion

1. Introduction

A small object is commonly characterized by a bounding box with a ratio less than 0.1 [7] or a resolution less than 32 $\times$ 32 pixels [34], which makes it very challenging to detect accurately and rapidly in practical applications such as complex scene analysis, autonomous driving, remote sensing, and other areas of image semantic analysis and interpretation [3, 20, 23]. Thus, small object detection has gradually become a research hotspot in computer vision. To date, small object detection is usually realized by convolutional neural networks (CNNs) and their variants. These CNN-based algorithms can generally be divided into one-stage and two-stage algorithms. Two-stage detection algorithms, such as the region-based convolutional neural network (R-CNN) series [27, 28, 30], focus on high-precision detection. In contrast, one-stage algorithms, such as the you only look once (YOLO) series [1, 12, 13, 14] and the single shot multibox detector (SSD) series [6, 38], need to balance high precision with real-time performance. For instance, Gu et al. [41] used the YOLOv5 network to detect the posture of caged ducks. The experimental results demonstrated that the YOLOv5 network performed the task with 20.7 frames per second (FPS) and a mean average precision (mAP) of 98.37% over all postures. Compared with the 3.2 FPS and 79.76% mAP of Faster R-CNN, it has better real-time performance and detection accuracy. However, it is worth noting that, because of the repeated down-sampling operations in the backbone network of most one-stage algorithms, many features of small objects are lost, which makes it difficult to capture sufficient information for small object detection.

To address this issue, researchers have started to obtain more information from these classical networks through different methods. For instance, Liu et al. [24] considered that, for small object images captured by unmanned aerial vehicles (UAVs), the inadequate feature extraction is mainly caused by the small perceptual field and limited spatial information of traditional algorithms. Therefore, they applied two different ways to broaden the perceptual field and enrich spatial information. The first way is to optimize the residual blocks in the darknet by concatenating two residual network (ResNet) [16] units with the same width and height. The second way is to enrich the entire darknet structure by adding convolution operations to earlier layers. In contrast to Liu et al. [24], Ku et al. [4] argued that high-resolution features and feature fusion play the key role in solving the problem of inadequate feature extraction. By extracting the content and texture of the input image to capture local details of small objects, and by combining the spatial pooling pyramid (SPP) [17] and the path aggregation network (PAN) [29] in the neck network, they enhanced the ability to utilize high-resolution features of small objects. Similar to [4], Oghaz et al. [25] also performed feature fusion on two feature layers with higher resolution. Moreover, the deconvolution operation played a key role in achieving better detection performance. Owing to high-resolution features and their fusion, as well as more spatial information and a larger perceptual field, the methods in [4, 24, 25] achieve excellent performance in detecting small objects.
However, in practical use, especially when detecting multiple small objects in complex scenes, high-resolution features and rich spatial information become increasingly difficult to obtain because of the aggregation and occlusion of small objects, resulting in more false detections and missed detections. Thus, researchers have begun to study how to detect aggregated and occluded small objects in complex scenes. For example, Guo et al. [32] and Li et al. [42] both combined the intersection over union (IOU) with non-maximum suppression (NMS) to decrease the ratio of false detections and missed detections. The only difference is that Li et al. [42] used the complete IOU, while Guo et al. [32] used the distance IOU. Different from the methods proposed by Guo et al. and Li et al., Guan et al. [19] enhanced the detection accuracy of partially occluded small objects by applying multiple pooling layers to mix contextual information into the feature map. It is worth noting that Guo et al. and Li et al. only address occlusion among the small objects to be detected; they cannot resolve the occlusion between the objects to be detected and other irrelevant objects. Although these methods [19, 32, 42] can alleviate the problem of small objects occluding each other, their detection performance, including accuracy and speed, is somewhat reduced in complex scenes.

From the above discussion of small object detection, the detection performance for small objects in complex scenes is mainly determined by the extraction of small object features and by feature fusion. If richer spatial and semantic information and higher-resolution features of small objects can be captured with a larger perceptual field and feature fusion, relatively good performance can be achieved, which helps address aggregation and occlusion when detecting small objects in complex scenes.

In this paper, we design a spatial-channel attention module composed of SPP module and multi-scale channel attention (MSCAM) module, named SPP-MSCAM. The SPP module is used to increase the perceptual field and obtain more spatial information. The MSCAM module is applied to capture enough semantic features on both global and local scales. By combining the SPP module and MSCAM module, the relation between the channel and spatial feature layers is strengthened and the richer features of small objects are obtained. Moreover, by fusing shallower feature information with higher resolution, and extending the feature pyramid network (FPN) and PAN parts to match the size of the shallower features, more features of small objects with different scales are fused, leading to a better detecting performance.

2. Related work

YOLOv5 has been widely used in small object detection [2, 37, 39] due to its high detection accuracy and fast detection speed simultaneously. Therefore, YOLOv5 is used as a typical benchmark network to study the effects of attention mechanism and enhanced network on its performance for detecting small objects.

2.1 YOLOv5

YOLOv5, a one-stage object detection model based on a convolutional neural network, has four structural components: Focus, CBS (a convolutional layer, a batch normalization layer, and a sigmoid-weighted linear unit), cross-stage partial (CSP) structures, and the SPP module.

Focus, a component of the backbone network, uses four parallel slice operations to transfer the spatial information of the input image to the channel dimension without losing any of the finer details. CBS is a foundational module that includes a convolutional layer, a batch normalization layer, and a sigmoid-weighted linear unit (SiLU) activation function. The CSP structure divides the original input into two branches for convolution and bottleneck operations. Both the backbone and the neck use CSP structures; the difference is that the CSP structure in the backbone has several residual units, while the CSP structure in the neck replaces the residual units with CBS modules. The SPP module consists of three maximum pooling operations, through which the input feature map is converted into a fixed-size feature vector. This further enhances the expressiveness of small object features without slowing down inference.
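The slice operation of Focus can be illustrated on a single-channel toy map. The following is a minimal pure-Python sketch and not the YOLOv5 implementation: the function name `focus_slice` and the order in which the four slices are emitted are assumptions made for illustration. It shows that an H $\times$ W map is rearranged into four (H/2) $\times$ (W/2) maps and that every input value is kept.

```python
def focus_slice(image):
    """Rearrange an H x W map into four (H/2) x (W/2) maps by strided slicing."""
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    slices = []
    # (row offset, column offset) of the four interleaved sub-grids;
    # this ordering is an assumption, any fixed ordering keeps all pixels.
    for row_off, col_off in [(0, 0), (1, 0), (0, 1), (1, 1)]:
        slices.append([row[col_off::2] for row in image[row_off::2]])
    return slices


image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
channels = focus_slice(image)
print(len(channels), len(channels[0]), len(channels[0][0]))  # 4 2 2
print(channels[0])  # [[0, 2], [8, 10]]
```

In the toy case the spatial size halves in each direction while the channel count grows by a factor of four, so no value is discarded before the first convolution.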

2.2 Attention mechanism

Attention mechanisms in machine vision can be regarded as a dynamic weighting process based on the information of the input image. To date, attention mechanisms have been widely used in computer vision tasks such as image classification [15, 22] and image denoising [5]. Initially, they were used to imitate the human ability to quickly find the region of interest in complex images [21]. By concentrating on important areas of the input image, attention mechanisms in small object detection can gather more crucial feature information. Channel attention, spatial attention and their combination are often used to detect small objects. Squeeze-and-excitation networks (SE-Net) [10] and efficient channel attention (ECA) [26] are typical channel attention mechanisms. Spatial transformer networks (STN) [23] are a typical spatial attention mechanism. The convolutional block attention module (CBAM) [31] and the bottleneck attention module (BAM) [11] are typical attention mechanisms that combine channel and spatial attention.

The channel attention mechanism SE-Net was first proposed in [10]. It consists of a squeeze block and an excitation block, as indicated by the dashed boxes in Fig. 1a. This structure can gather global information, capture the relationship between channels and enhance the representational capability of the network. However, the global average pooling in the squeeze block has difficulty capturing complex global information. Moreover, the fully connected layers in the excitation block increase the complexity of the model. To avoid channel degradation and reduce model complexity, ECA [26] was introduced. As illustrated in Fig. 1b, it employs convolutional layers rather than the fully connected layers of the SE-Net excitation block. Although channel attention can enhance the performance of a detection model, it often ignores the location information generated by the spatial attention feature layers. Woo et al. [31] proposed the convolutional block attention module, depicted in Fig. 1c, which consists of a max-pooling operation, an average pooling operation, a multi-layer perceptron (MLP), a convolution operation, a sigmoid operation and batch normalization. It unites the channel attention module and the spatial attention module, and it introduces two global pooling layers to generate global spatial information. However, the spatial sub-module may be limited by its constrained perceptual field because it only uses convolution to handle the spatial attention information. Thus, in this paper, we propose an improved attention mechanism that uses a pyramid pooling layer to expand the receptive field and combines multi-scale channel convolution to improve feature extraction. The corresponding description is given in Section 3.1.

Figure 1.

The algorithm flow charts of the SE-Net, ECA and CBAM.

2.3 Feature enhancement network

Fusing features from different scales is an essential way to improve the accuracy of small object detection. Shallow layers contain rich location and detail information, but abundant semantic information is difficult to capture from them. As down-sampling proceeds, more semantic information is extracted while less location and detail information is retained. To detect small objects accurately, enough semantic, location and detail information must be obtained simultaneously, so a suitable fusion of these kinds of information at different scales is needed. For instance, Lin et al. [35] proposed FPN to enhance detection accuracy by fusing the features from various feature layers in the backbone. Moreover, Liu et al. [29] introduced PAN to strengthen the entire feature hierarchy by enhancing the bottom-up path for obtaining accurate location information from shallow layers. As a result, the information path between shallow-layer and deep-layer features was shortened, the computational cost was decreased and the detection accuracy was improved. Furthermore, Liang et al. [40] constructed a structure composed of two feature pyramids for prediction. The two feature pyramids were produced by the deconvolution module and the feature fusion module, and further modified branches for feature fusion were added to learn deeper features. Therefore, a large amount of feature information, including semantic, location and detail information, can be obtained simultaneously, and small object detection performs better.

3. Method

3.1 SPP-MSCAM

The attention mechanism can focus on the key regions of the input image by assigning weights to the channels and spatial positions. Based on these weights, the important spatial information and location information of different channels can be obtained. By collecting these fine-grained features, the attention mechanism can alleviate the issues of aggregation and occlusion for small objects in complex scenes and help improve detection accuracy. Motivated by the spatial pooling pyramid and the multi-scale channel attention [43], we propose the SPP-MSCAM, as shown in Fig. 2. Benefitting from a larger perceptual field, rich spatial information and multi-scale channel information, SPP-MSCAM not only enhances the accuracy of small object detection in complex scenes but also effectively alleviates the aggregation and occlusion of small objects.

Figure 2.

The detailed structure of SPP-MSCAM. H, W and C represent the height, the width and the number of channels of the feature maps. The scale of the input feature $X$ is H $\times$ W $\times$ C. $X_{s}$ is the output of the SPP module. $X_{g}$ and $X_{l}$ are the outputs of the global and local channel attention in the MSCAM module, respectively.

As shown in Fig. 2, the SPP module is indicated by the dashed box. First, the input feature passes through a 1 $\times$ 1 convolutional layer, which reduces the number of channels to 1/4 of that of the input. Subsequently, the output of the 1 $\times$ 1 convolutional layer enters three maximum pooling units, whose pooling windows are 5 $\times$ 5, 9 $\times$ 9 and 13 $\times$ 13, respectively. After that, the three max-pooling outputs and their common input are concatenated along the channel dimension. The output $X_{s}$ contains adequate spatial feature information at multiple scales, and its scale returns to that of the input feature. Therefore, the SPP module can broaden the perceptual field and enrich the representation of the feature map.
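The pooling and concatenation steps of the SPP module can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the pooling is assumed to use a stride of 1 with border padding so that H $\times$ W is preserved (which the concatenation requires), the 1 $\times$ 1 convolution is omitted, and the names `max_pool_same` and `spp` are hypothetical.

```python
def max_pool_same(feature, window):
    """Stride-1 max pooling that ignores out-of-range cells, so H x W is preserved."""
    height, width = len(feature), len(feature[0])
    half = window // 2
    pooled = []
    for r in range(height):
        row = []
        for c in range(width):
            rows = range(max(0, r - half), min(height, r + half + 1))
            cols = range(max(0, c - half), min(width, c + half + 1))
            row.append(max(feature[i][j] for i in rows for j in cols))
        pooled.append(row)
    return pooled


def spp(reduced_channels, windows=(5, 9, 13)):
    """Concatenate the input channels with their pooled versions (channel axis)."""
    out = list(reduced_channels)
    for window in windows:
        out.extend(max_pool_same(channel, window) for channel in reduced_channels)
    return out


# Toy input: C = 8 channels are assumed to be already reduced to C/4 = 2 channels.
peak = [[0.0] * 6 for _ in range(6)]
peak[0][0] = 1.0
reduced = [peak, [row[:] for row in peak]]

x_s = spp(reduced)
print(len(x_s))                       # 8 channels again: 2 inputs + 3 * 2 pooled
print(x_s[2][2][2], x_s[2][3][3])     # 5 x 5 window reaches (0, 0) from (2, 2) only
```

With the channel count first reduced to C/4, concatenating the input with three pooled copies gives 4 $\times$ C/4 = C channels, which is why the output scale matches the input scale.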

In Fig. 2, the MSCAM module is composed of two parallel channel attention units, as indicated by the dashed box. On the left of the dashed box is the global channel attention module, which employs one-dimensional convolution to realize local cross-channel interaction without dimensionality reduction. Moreover, the kernel size of the one-dimensional convolution is chosen adaptively. The adaptive function for the convolution kernel size [26] is as follows:

$\displaystyle k=\psi(C)=\left|\frac{\log_{2}(C)}{\gamma}+\frac{b}{\gamma}\right|_{\textit{odd}}$ (1)

where $k$ represents the size of the convolution kernel, $C$ represents the number of channels, and $|\cdot|_{\textit{odd}}$ denotes the nearest odd number, so that $k$ is always odd. Moreover, $\gamma$ and $b$ control the mapping between $C$ and $k$ and are set to 2 and 1, respectively, in this paper. The global channel attention module can be formulated as follows:

$\displaystyle{y}=\frac{1}{W\times H}\sum\limits_{i=1,j=1}^{W,H}{X_{ij}}$ (2)

where $W$ and $H$ represent the width and the height of the feature map output by the SPP module, $i$ and $j$ index the horizontal and vertical positions in the feature map, and $X_{ij}$ represents the value of the feature map at position $(i, j)$.

$\displaystyle X_{g}=\sigma(\textit{Conv1d}(y))$ (3)

where Conv1d represents the one-dimensional convolution, $y$ represents the output of the global average pooling, $\sigma$ represents the sigmoid activation function, and $X_{g}$ represents the output of the global channel attention. On the right of the dashed box is the local channel attention module, which contains two convolutional operations with a 1 $\times$ 1 kernel and a rectified linear unit (ReLU) activation function. These two 1 $\times$ 1 convolutional operations are used to estimate inter-channel attention. The channels of the input features are first contracted and then expanded with a contraction factor of $r$. The contracted feature has a scale of H $\times$ W $\times$ C/r, and the expanded feature has the same scale as the input feature. After the two channel attention branches with different scales, their outputs $X_{l}$ and $X_{g}$ are added together, and the final attention weights are calculated with the sigmoid function. Finally, the input feature $X$ is multiplied by the final weights to produce finer features.
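Eqs (1) to (3) and the final weighting step can be traced with a small pure-Python sketch. It is a sketch under assumptions, not the authors' code: the odd-number rule truncates and then moves even values to the next odd number (one common reading of $|\cdot|_{\textit{odd}}$), the one-dimensional kernel weights are placeholders for learned values, and the local branch output $X_{l}$ is passed in as a given tensor instead of being computed by the two 1 $\times$ 1 convolutions. All function names are hypothetical.

```python
import math


def adaptive_kernel_size(channels, gamma=2, b=1):
    """Eq. (1): map the channel count C to an odd 1-D kernel size k."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1  # assumed rule: even values move to next odd


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def global_average_pool(x):
    """Eq. (2): average each H x W channel of a C x H x W nested list."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]


def conv1d_same(signal, kernel):
    """Zero-padded 1-D cross-correlation along the channel axis."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(w * padded[i + j] for j, w in enumerate(kernel))
            for i in range(len(signal))]


def global_channel_attention(x, kernel):
    """Eq. (3): X_g = sigmoid(Conv1d(y)), one value per channel."""
    return [sigmoid(v) for v in conv1d_same(global_average_pool(x), kernel)]


def apply_attention(x, x_l, x_g):
    """Add X_l (C x H x W) and X_g (per channel), apply sigmoid, rescale X."""
    return [[[x[c][i][j] * sigmoid(x_l[c][i][j] + x_g[c])
              for j in range(len(x[c][i]))]
             for i in range(len(x[c]))]
            for c in range(len(x))]


for channels in (64, 128, 256, 512, 1024):
    print(channels, adaptive_kernel_size(channels))  # 3, 5, 5, 5, 5

# Toy feature with C = 4 channels of size 2 x 2 (so k = 1).
x = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(4)]
k = adaptive_kernel_size(len(x))
x_g = global_channel_attention(x, kernel=[1.0] * k)   # placeholder weights
x_l = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(4)]    # stand-in local branch output
refined = apply_attention(x, x_l, x_g)
```

With $\gamma = 2$ and $b = 1$, the kernel size stays at 3 or 5 over the channel counts commonly used in a detection backbone, so the global branch adds very few parameters.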

3.2 Network enhancement based on feature fusion

Shallow layers contain a large number of small object features and much location information, while deep layers contain more semantic features. Fusing shallow and deep features allows the deep layers to obtain richer location information and semantic features, which helps improve the accuracy of small object detection. To obtain more semantic features and location information of small objects, we design an enhanced sub-network, named efficient small object detection (ESOD), as indicated by the dashed box in Fig. 3. The input of the ESOD, P2, is the output of CSP1_1, which has undergone fewer down-sampling operations in the backbone network. It has a relatively high resolution of 160 $\times$ 160, and thus more location information and features of small objects are preserved. Like the classical FPN and PAN units, this enhanced network uses the CSP2_1 layer, the CBS layer, the up-sampling layer and the concatenation operation. To match the size of P2, the 80 $\times$ 80 features produced by the second up-sampling in FPN are up-sampled again to 160 $\times$ 160 in this network. By fusing P2 with these up-sampled features, richer location features and semantic information are transferred into CSP2_1. To extract more feature information from the fused features and detect small objects at multiple scales, the fused features are further down-sampled by a convolutional operation with a kernel size of 3 and a stride of 2. Meanwhile, the corresponding prediction branch is established, which is more sensitive to small objects.
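The resolutions quoted for the ESOD branch can be checked with simple size arithmetic. The sketch below rests on assumptions not stated in the text: an input resolution of 640 $\times$ 640, nearest-neighbour 2 $\times$ up-sampling, and a padding of 1 for the 3 $\times$ 3 stride-2 convolution. The helper names are hypothetical.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial size after a convolution, using the standard floor formula."""
    return (size + 2 * padding - kernel) // stride + 1


def upsample_nearest_2x(feature):
    """Nearest-neighbour 2x up-sampling of a 2-D nested list."""
    out = []
    for row in feature:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


INPUT_SIZE = 640               # assumed network input resolution
p2_size = INPUT_SIZE // 4      # 160: resolution of P2 quoted in the text
fpn_size = INPUT_SIZE // 8     # 80: FPN features after the second up-sampling
esod_size = fpn_size * 2       # extra up-sampling so the map matches P2

# Down-sampling of the fused features: kernel 3, stride 2 (padding 1 assumed).
pan_size = conv_output_size(esod_size, kernel=3, stride=2, padding=1)

print(p2_size, esod_size, pan_size)  # 160 160 80
print(upsample_nearest_2x([[1, 2], [3, 4]]))
```

Under these assumptions the up-sampled map and P2 share the same 160 $\times$ 160 grid, so they can be concatenated along the channel axis, and the stride-2 convolution returns the fused features to the 80 $\times$ 80 scale of the PAN path.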

Figure 3.

The architecture of enhanced network. P2 is the output of the CSP1_1. The dash boxes indicate the FPN, the enhanced network (ESOD) and the PAN, respectively.

4. Experiments and results

4.1 Dataset

The performance of the enhanced small object detection network is assessed on the VisDrone2019 dataset, which was collected by the AISKYEYE team at the Lab of Machine Learning and Data Mining, Tianjin University, China. It contains 10209 static images, consisting of 6471 images for training, 548 images for validation, and 3190 images for testing. The dataset covers ten object categories: pedestrian, people, bicycle, vehicle, van, truck, tricycle, awning-tricycle, bus and motor. In particular, the images in the dataset are more comprehensive and complex than those in other classical datasets and contain a large number of small objects. Moreover, the labels of these small objects are strict and fine-grained. For example, only people who are walking or standing are labeled as pedestrian objects, while others (such as people squatting or riding) are labeled as people objects. Because of complex environments and varying weather, light intensity and shooting angle, it is relatively difficult to detect these small objects accurately and rapidly in this dataset.

To further verify the effectiveness of the proposed method, the NWPU VHR-10 dataset [8, 9] is also used to evaluate the network. It is a publicly available 10-class geospatial object detection dataset with 650 high-resolution images. Its ten object categories are airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, and vehicle. In this paper, the images of the NWPU VHR-10 dataset are randomly divided into a training set and a validation set at the typical ratio of 8:2, giving 520 training images and 130 validation images.
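The 8:2 random allocation can be reproduced in a few lines. This is only an illustrative sketch: the seed, the integer image identifiers and the function name `split_dataset` are assumptions, and the actual split used in the experiments is not specified beyond the ratio.

```python
import random


def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle a copy of the items and cut it at the given training ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed only for repeatability
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


image_ids = list(range(650))  # stand-ins for the 650 NWPU VHR-10 images
train_ids, val_ids = split_dataset(image_ids)
print(len(train_ids), len(val_ids))  # 520 130
```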

4.2 Evaluating metrics

The network is evaluated by these metrics: $\textit{Precision}(P)$ , $\textit{Recall}(R)$ , mAP, FPS and the number of parameters. The formulas of Precision and Recall are described as follows by Eqs (4) and (5):

$\displaystyle\textit{Precision}=\frac{TP}{TP+FP}$ (4)

$\displaystyle\textit{Recall}=\frac{TP}{TP+FN}$ (5)

True positive (TP) represents the number of positive samples that are correctly predicted as positive. False positive (FP) represents the number of negative samples that are wrongly predicted as positive. False negative (FN) represents the number of positive samples that are wrongly predicted as negative. Precision refers to the percentage of correctly predicted positive samples among all the samples the network predicted as positive, while Recall refers to the percentage of correctly predicted positive samples among all the actual positive samples. Moreover, the area under the Precision-Recall curve is defined as the average precision (AP), which describes the accuracy and is given, together with the mAP, by Eqs (6) and (7):

$\displaystyle AP=\int_{0}^{1}{P(R)dR}$ (6)

$\displaystyle\textit{mAP}=\frac{\sum\nolimits_{q=1}^{Q}{AP(q)}}{Q}$ (7)

where $Q$ is the total number of object categories and $AP(q)$ is the $AP$ of a single category. The mAP is the mean of the average precision over all categories. Moreover, mAP50 denotes the mean average precision at an IOU threshold of 0.5, while mAP denotes the mean average precision averaged over 10 IOU thresholds from 0.5 to 0.95 in steps of 0.05. In this study, based on the object categories in the datasets, $Q$ is set to 10. In addition, FPS characterizes real-time performance and is defined as the number of images processed per second by the network. Thus, the larger the mAP, the higher the accuracy of the detection network, and the higher the FPS, the faster the detection speed.
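Eqs (4) to (7) can be illustrated numerically as follows. The sketch integrates the Precision-Recall curve with the trapezoidal rule, which is an assumption made for simplicity; evaluation toolkits typically interpolate the curve before integrating, so their values can differ slightly. The toy counts and curves are invented for illustration only and are not results from the paper.

```python
def precision(tp, fp):
    """Eq. (4): correct positive predictions over all positive predictions."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Eq. (5): correct positive predictions over all actual positives."""
    return tp / (tp + fn) if tp + fn else 0.0


def average_precision(recalls, precisions):
    """Eq. (6): area under P(R), trapezoidal rule, recalls in ascending order."""
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2.0
    return area


def mean_average_precision(ap_per_class):
    """Eq. (7): mean of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# The 10 IOU thresholds used for mAP: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

# Toy example with two categories (illustrative numbers only).
ap_a = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
ap_b = average_precision([0.0, 0.5, 1.0], [1.0, 0.5, 0.5])
print(precision(tp=8, fp=2), recall(tp=8, fn=4))
print(ap_a, ap_b, mean_average_precision([ap_a, ap_b]))  # 0.875 0.625 0.75
```

mAP50 corresponds to evaluating Eq. (7) with matches decided at the first threshold in `IOU_THRESHOLDS`, while mAP averages Eq. (7) over all ten thresholds.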

4.3 Implementation details

The experiments were performed under Ubuntu 18.04.6 LTS with CUDA 11.2 and an NVIDIA GeForce RTX 3080 GPU, using the PyTorch 1.8.0 deep learning framework in Python. YOLOv5 was employed as the benchmark model for comparison with our improved method. To avoid overfitting, the initial learning rate and the weight decay rate were set to 0.01 and 0.0005, respectively. Additionally, because of the high resolution of the images in the VisDrone2019 and NWPU VHR-10 datasets, the training batch size was set to 4. After repeated experiments, the number of training epochs and the momentum on both datasets were set to 300 and 0.937, respectively. The other parameters were consistent with the default settings of YOLOv5.
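For reference, the reported settings can be collected in a small configuration fragment. The dictionary keys and the helper `iterations_per_epoch` are hypothetical names chosen for this sketch and do not claim to match any particular training script; only the numeric values come from the text.

```python
import math

# Values reported in Section 4.3; key names are illustrative assumptions.
TRAIN_SETTINGS = {
    "initial_learning_rate": 0.01,
    "weight_decay": 0.0005,
    "momentum": 0.937,
    "batch_size": 4,
    "epochs": 300,
}


def iterations_per_epoch(num_images, batch_size):
    """Number of optimizer steps needed to see every training image once."""
    return math.ceil(num_images / batch_size)


visdrone_steps = iterations_per_epoch(6471, TRAIN_SETTINGS["batch_size"])
nwpu_steps = iterations_per_epoch(520, TRAIN_SETTINGS["batch_size"])
print(visdrone_steps, nwpu_steps)  # 1618 130
```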

4.4 Results and discussions

4.4.1 Comparative research on the mechanisms of attention

To verify the SPP-MSCAM module, a series of contrastive experiments with various attention mechanisms were performed on VisDrone2019 dataset, including ECA [26], MSCAM and SPP-MSCAM. The corresponding results are shown in Table 1. The networks labeled with bottleneck and backbone subscripts indicate the corresponding location of the attention mechanism. Moreover, the bolded number in each column indicates the best result in contrastive experiments.

Table 1
Comparison of the detecting networks with different attention mechanisms

Model	P/%	mAP/%	mAP50/%	Params	FPS
YOLOv5	49.6	22.0	38.8	46.649M	80
YOLOv5 $+$ ECA ${}_{\text{bottleneck}}$	53.4	22.0	39.0	42.899M	71
YOLOv5 $+$ ECA ${}_{\text{backbone}}$	51.1	22.2	39.0	46.649M	76
YOLOv5 $+$ MSCAM ${}_{\text{backbone}}$	48.8	22.3	39.1	46.693M	76
YOLOv5 $+$ SPP-MSCAM ${}_{\text{backbone}}$	50.6	22.5	39.4	47.040M	64

Compared with YOLOv5, the Precision and mAP50 of YOLOv5 $+$ ECA ${}_{\text{bottleneck}}$ increase to 53.4% and 39.0%, while the number of parameters and the FPS decrease to 42.899M and 71, respectively. When ECA is added to the backbone network, the mAP increases to 22.2% and the Precision increases to 51.1%. In addition, the FPS (76) remains close to that of YOLOv5 (80). From the comprehensive comparison, it is found that the ECA module is more effective in the backbone than in the bottleneck. Therefore, in the following contrastive experiments, all attention mechanisms are placed in the backbone network.

When MSCAM is applied in the backbone network, the mAP and mAP50 increase to 22.3% and 39.1%, respectively, while the number of parameters and the FPS are almost unchanged. This suggests that extracting small object features at both global and local scales obtains more feature information than extracting them at a global scale only. Importantly, these attention mechanisms only use channel information to capture the features of small objects. To further take spatial information into account, MSCAM is replaced by SPP-MSCAM in the backbone network. The Precision, mAP and mAP50 with SPP-MSCAM further increase to 50.6%, 22.5% and 39.4%, respectively, which are superior to those with MSCAM. The number of parameters is higher than that of YOLOv5 because both the pooling pyramid and the global average pooling are used in SPP-MSCAM, which increases the overall computation and reduces the FPS. Although the FPS decreases by 16 compared with YOLOv5, the comprehensive performance of the detection network with SPP-MSCAM is still improved.

4.4.2 Contrastive experiments on network enhancement

The introduction of the attention module SPP-MSCAM enables the network to extract richer spatial and channel features. However, because of the repeated down-sampling operations, these richer features and the high-resolution information generated by shallow layers are gradually lost. Moreover, once the high-resolution information is lost, some important features can no longer be utilized. This may be the reason why using only an attention module such as SPP-MSCAM cannot improve the detection performance dramatically.

To address this issue, one way is to capture richer channel and spatial features with high resolution from shallower layers and then fuse them with the more semantic features from deeper layers. Thus, by up-sampling the semantically rich features from the deeper layer to match the size of the high-resolution features from the shallower layer, the features from the two layers can be fused, and the corresponding prediction branch can then be established. The corresponding structure is called ESOD, as indicated in Fig. 3.

To demonstrate the effectiveness of ESOD, comparative experiments are also performed on the VisDrone2019 dataset. The experimental results are listed in Table 2. YOLOv5 $+$ SPP-MSCAM represents the network in which SPP-MSCAM has been introduced. The models on the third and fifth rows (counting the header row) represent two different enhanced networks based on YOLOv5 $+$ SPP-MSCAM. Different from ESOD, ESOD ${}_{512}$ denotes that the channel number of every CSP2_1 unit in the neck network is 512. The fourth row of Table 2 is the detection result of YOLOv5 $+$ ESOD ${}_{512}$ . Moreover, the bolded number in Table 2 indicates the best result.

Table 2
Contrastive experimental results of different enhanced networks

Model	P/%	mAP/%	mAP50/%	Params	FPS
YOLOv5 $+$ SPP-MSCAM	50.6	22.5	39.4	47.040M	64
YOLOv5 $+$ SPP-MSCAM $+$ ESOD	58.2	29.0	48.6	51.316M	50
YOLOv5 $+$ ESOD ${}_{512}$	59.1	28.7	48.5	48.861M	52
YOLOv5 $+$ SPP-MSCAM $+$ ESOD ${}_{512}$	60.6	30.9	51.2	49.252M	40

Compared with YOLOv5 $+$ SPP-MSCAM, the enhanced network YOLOv5 $+$ SPP-MSCAM $+$ ESOD raises the Precision, mAP and mAP50 to 58.2%, 29.0% and 48.6%, respectively. The number of parameters increases to 51.316 M and the FPS drops to 50. Although ESOD adds parameters and lowers the FPS, it remains a worthwhile choice overall because it improves the detection of small objects. To improve this performance further, we adjust the channel number of every CSP2_1 unit in the neck network. After structural optimization and testing, ESOD512 is selected to verify the detecting network. Compared with YOLOv5, YOLOv5 $+$ ESOD512 raises the Precision, mAP and mAP50 to 59.1%, 28.7% and 48.5%, respectively, which shows that the enhanced network clearly improves the performance of the network. Given the individual effectiveness of ESOD512 and SPP-MSCAM, we finally combine the two. With YOLOv5 $+$ SPP-MSCAM $+$ ESOD512, the Precision, mAP and mAP50 rise further to 60.6%, 30.9% and 51.2%, respectively, while the number of parameters decreases by 2.064 M relative to YOLOv5 $+$ SPP-MSCAM $+$ ESOD. This suggests that suitable parameters for the enhanced network can further improve the detecting performance. In particular, in comparison to the network with SPP-MSCAM, the Precision, mAP and mAP50 of the detecting network with ESOD512 are increased by 10.0%, 8.4% and 11.8%, respectively. These results indicate that YOLOv5 $+$ SPP-MSCAM $+$ ESOD512 exhibits the best performance for detecting small objects among the tested models, in good agreement with our analysis and prediction.
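The differences between the ESOD and ESOD512 variants quoted above can be re-derived from the reported figures. The short Python sketch below performs only that arithmetic; the dictionary layout and the names `with_esod`, `with_esod512` and `deltas` are illustrative assumptions, while the values are copied from the text and Table 2.

```python
# Values copied from the text and Table 2
# (Precision, mAP, mAP50 in %, parameters in millions, FPS).
with_esod = {"precision": 58.2, "mAP": 29.0, "mAP50": 48.6, "params_M": 51.316, "fps": 50}
with_esod512 = {"precision": 60.6, "mAP": 30.9, "mAP50": 51.2, "params_M": 49.252, "fps": 40}


def deltas(before, after):
    """Return after - before for every key, rounded to 3 decimals."""
    return {key: round(after[key] - before[key], 3) for key in before}


change = deltas(with_esod, with_esod512)
for key, value in change.items():
    print(f"{key}: {value:+}")
```

Read this way, moving from ESOD to ESOD512 trades 10 FPS for higher accuracy with 2.064 M fewer parameters.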

For small object detection, mAP50 and mAP are the two most important metrics. To compare them comprehensively, the mAP50 and mAP curves of the different models are shown in Fig. 4. As in the comparison of the models with attention mechanisms only, the network with SPP-MSCAM somewhat outperforms the other networks. When ESOD and ESOD512 are applied, the network performance improves noticeably. In short, SPP-MSCAM and ESOD512 are very helpful for detecting small objects in complex scenes.
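To make the difference between the two metrics concrete, the following sketch assumes the common COCO-style convention, which is an assumption here rather than something stated in this section: mAP50 accepts a detection when its IoU with the ground truth is at least 0.5, whereas mAP averages over IoU thresholds from 0.5 to 0.95 in steps of 0.05. The function names and the example boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


# Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95.
THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred, truth):
    """Number of IoU thresholds at which the prediction counts as a match."""
    overlap = iou(pred, truth)
    return sum(overlap >= t for t in THRESHOLDS)


# Hypothetical 10 x 10 ground-truth box and a prediction shifted by one pixel.
truth = (0, 0, 10, 10)
pred = (1, 1, 11, 11)
print(iou(pred, truth), thresholds_passed(pred, truth))
```

For this tiny box, a one-pixel shift already lowers the IoU to 81/119 (about 0.68), so the detection counts at the 0.5 threshold but fails the stricter ones. This is one reason why the two metrics can rank models differently on small objects.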

Figure 4.

Quantitative results of mAP50 and mAP curves under different models.

4.4.3 Ablation experiments on the NWPU VHR-10 dataset

In addition, a series of ablation experiments were carried out on the NWPU VHR-10 dataset to demonstrate the effectiveness of the proposed method. As shown in Table 3, where bold numbers indicate the best results, the mAP of YOLOv5 $+$ CBAM improves from 56.3% (YOLOv5) to 57.6%, while the mAP of YOLOv5 $+$ SPP-MSCAM improves from 56.3% to 57.4%. The difference in mAP between YOLOv5 $+$ CBAM and YOLOv5 $+$ SPP-MSCAM is therefore very small. However, the mAP50 of YOLOv5 $+$ SPP-MSCAM (91.5%) is clearly higher than that of YOLOv5 $+$ CBAM (89.7%), and its FPS is also better (61 versus 49). After enhancing the network with ESOD ${}_{512}$ , YOLOv5 $+$ ESOD ${}_{512}$ outperforms YOLOv5 in both mAP and mAP50. Further combining ESOD ${}_{512}$ with SPP-MSCAM, YOLOv5 $+$ SPP-MSCAM $+$ ESOD ${}_{512}$ achieves the highest mAP and mAP50 while keeping good real-time performance.

Table 3
Results of the ablation experiments for different models on the NWPU VHR-10 dataset

Model	mAP/%	mAP50/%	FPS
YOLOv5	56.3	89.0	65
YOLOv5 $+$ CBAM	57.6	89.7	49
YOLOv5 $+$ SPP-MSCAM	57.4	91.5	61
YOLOv5 $+$ ESOD ${}_{512}$	57.1	89.6	43
YOLOv5 $+$ SPP-MSCAM $+$ ESOD ${}_{512}$	58.0	91.6	46
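The gains discussed for Table 3 can be tabulated against the YOLOv5 baseline with a few lines of Python. This is only a bookkeeping sketch: the rows are copied from Table 3, while the dictionary layout and the name `gains` are illustrative assumptions.

```python
# Rows copied from Table 3: model -> (mAP %, mAP50 %, FPS).
table3 = {
    "YOLOv5": (56.3, 89.0, 65),
    "YOLOv5 + CBAM": (57.6, 89.7, 49),
    "YOLOv5 + SPP-MSCAM": (57.4, 91.5, 61),
    "YOLOv5 + ESOD512": (57.1, 89.6, 43),
    "YOLOv5 + SPP-MSCAM + ESOD512": (58.0, 91.6, 46),
}

base_map, base_map50, base_fps = table3["YOLOv5"]

# Difference of every variant with respect to the YOLOv5 baseline.
gains = {}
for model, (m_ap, m_ap50, fps) in table3.items():
    if model == "YOLOv5":
        continue
    gains[model] = (
        round(m_ap - base_map, 1),
        round(m_ap50 - base_map50, 1),
        fps - base_fps,
    )

for model, (d_map, d_map50, d_fps) in gains.items():
    print(f"{model}: mAP {d_map:+.1f}, mAP50 {d_map50:+.1f}, FPS {d_fps:+d}")
```

The combined model shows the largest accuracy gain (+1.7 mAP, +2.6 mAP50) at a cost of 19 FPS, whereas SPP-MSCAM alone gives nearly the same mAP50 gain for a loss of only 4 FPS.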

4.4.4 Comparative experiments with different algorithms

To verify the performance of the improved network for detecting small objects, several typical algorithms are tested on the VisDrone2019 dataset. The results are shown in Table 4, where bold numbers indicate the best results. Resnet50 [16] is used as the backbone network in Faster R-CNN [30] and Retina-Net [36], while VGG16 [18] is the backbone network in SSD300 [38]. The detection accuracy of the proposed method is higher than that of the other algorithms, while its FPS remains relatively good. These results show that the proposed model and its corresponding algorithm effectively improve the detection accuracy of small objects while preserving good real-time performance.

Table 4
Comparison of the mAP50, mAP and FPS under different models on the VisDrone2019 dataset

Model	Backbone	mAP50/%	mAP/%	FPS
Faster R-CNN [30]	Resnet50	16.9	8.1	22
YOLOXl [44]	CSPDarknet53	35.7	20.0	50
SSD300 [38]	VGG16	15.0	8.0	49
Retina-Net [36]	Resnet50	12.2	7.8	46
Ours	CSPDarknet53	51.2	30.9	40

To further demonstrate the effectiveness of the proposed method on a different dataset, the NWPU VHR-10 dataset is also used. The experimental results are given in Table 5, where bold numbers indicate the best results. They indicate that the proposed method performs well on both datasets.

Table 5
Comparison of the mAP50, mAP and FPS under different models on the NWPU VHR-10 dataset

Model	Backbone	mAP50/%	mAP/%	FPS
Faster R-CNN [30]	Resnet50	84.9	40.8	26
YOLOXl [44]	CSPDarknet53	86.1	48.9	48
SSD300 [38]	VGG16	90.9	54.4	58
Retina-Net [36]	Resnet50	83.9	51.5	54
Ours	CSPDarknet53	91.6	58.0	46
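The margins behind Tables 4 and 5 can be made explicit by comparing the proposed method with the strongest competing baseline for each metric. The sketch below is an illustrative calculation only: the rows are copied from the two tables, and the helper name `margin_over_best_baseline` is a hypothetical choice.

```python
# Rows copied from Tables 4 and 5: model -> (mAP50 %, mAP %, FPS).
visdrone = {
    "Faster R-CNN": (16.9, 8.1, 22),
    "YOLOXl": (35.7, 20.0, 50),
    "SSD300": (15.0, 8.0, 49),
    "Retina-Net": (12.2, 7.8, 46),
    "Ours": (51.2, 30.9, 40),
}
nwpu = {
    "Faster R-CNN": (84.9, 40.8, 26),
    "YOLOXl": (86.1, 48.9, 48),
    "SSD300": (90.9, 54.4, 58),
    "Retina-Net": (83.9, 51.5, 54),
    "Ours": (91.6, 58.0, 46),
}


def margin_over_best_baseline(table, column):
    """Return (best baseline name, Ours minus that baseline) for one metric column."""
    baselines = {name: row for name, row in table.items() if name != "Ours"}
    best = max(baselines, key=lambda name: baselines[name][column])
    return best, round(table["Ours"][column] - baselines[best][column], 1)


for label, table in (("VisDrone2019", visdrone), ("NWPU VHR-10", nwpu)):
    for column, metric in enumerate(("mAP50", "mAP", "FPS")):
        best, margin = margin_over_best_baseline(table, column)
        print(f"{label} {metric}: best baseline {best}, margin {margin:+}")
```

On VisDrone2019 the accuracy margin over the best baseline is large (15.5 points of mAP50 over YOLOXl), while on NWPU VHR-10 it is smaller (0.7 points of mAP50 over SSD300). In both tables the fastest baseline has a higher FPS than the proposed method, which matches the description of its speed as relatively good rather than best.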

4.4.5 Visualization of small object detection

To further compare the proposed method with the YOLOv5 network for detecting small objects in complex scenes, three images containing many small objects are selected: one with tiny objects, one with aggregated objects in low light, and one with occluded objects. The YOLOv5 results are shown in Fig. 5a, c and e, respectively. In Fig. 5a, many tiny objects are undetected, as can be seen in the region marked by the frame. Compared with the YOLOv5 network in Fig. 5a, the YOLOv5 $+$ SPP-MSCAM $+$ ESOD512 network is more sensitive to tiny objects and achieves a better detecting performance: many of the tiny objects missed in that region are successfully captured in Fig. 5b, and some small objects missed outside the frame in Fig. 5a are also detected. This is because the multiple-attention mechanism and the fusion of features at different scales allow the YOLOv5 $+$ SPP-MSCAM $+$ ESOD512 network to capture more feature information of small objects. For aggregated small objects in low light, some detailed features are difficult to obtain, so some small objects are undetected, as seen in the regions marked by the two frames in Fig. 5c. In contrast, most of the missed small objects in these two regions are captured in Fig. 5d. Moreover, for occluded objects in a complex scene, as shown in Fig. 5e, it is difficult for the YOLOv5 network to detect them: some small objects are undetected because they are occluded by the tree or by other small objects, as seen in the two regions marked by frames. Through the larger receptive field provided by the attention mechanism, the improved network obtains more spatial information. In addition, the higher-resolution information from the shallower layer helps to distinguish occluded objects. Thus, the improved network effectively enhances the detecting performance. As shown in Fig. 5f, most of the occluded objects missed in the two framed regions in Fig. 5e are successfully detected.
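The idea of combining higher-resolution information from a shallower layer with a deeper feature map can be illustrated with a generic fusion step: upsample the deep map and concatenate it with the shallow map along the channel axis. The pure-Python sketch below is a simplified assumption for illustration, not the exact ESOD512 layout; the functions `upsample_nearest` and `fuse` and the toy feature maps are hypothetical.

```python
def upsample_nearest(feature, factor=2):
    """Nearest-neighbour upsampling of a [C][H][W] nested-list feature map."""
    return [
        [
            [row[x // factor] for x in range(len(row) * factor)]
            for row in channel
            for _ in range(factor)  # repeat each row `factor` times
        ]
        for channel in feature
    ]


def fuse(shallow, deep, factor=2):
    """Upsample the deep map to the shallow resolution and concatenate channels."""
    upsampled = upsample_nearest(deep, factor)
    same_height = len(upsampled[0]) == len(shallow[0])
    same_width = len(upsampled[0][0]) == len(shallow[0][0])
    if not (same_height and same_width):
        raise ValueError("spatial sizes do not match after upsampling")
    return shallow + upsampled


# Toy example: a 1 x 4 x 4 shallow map and a 2 x 2 x 2 deep map.
shallow = [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]]
deep = [[[0.1, 0.2], [0.3, 0.4]], [[1.0, 2.0], [3.0, 4.0]]]

fused = fuse(shallow, deep)
print(len(fused), len(fused[0]), len(fused[0][0]))
```

The fused map keeps the 4 x 4 resolution of the shallow layer while carrying the channels of both inputs, which is what lets later layers use fine spatial detail and deeper semantic features at the same location.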

Figure 5.

Visualization results of small object detection on the VisDrone2019 dataset. (a), (c) and (e) show the detection results of the YOLOv5 network; (b), (d) and (f) show the detection results of the YOLOv5 $+$ SPP-MSCAM $+$ ESOD512 network. The frames mark the comparison regions.

5. Conclusion

In summary, we propose an enhanced network for detecting small objects in complex scenes. To address the aggregation and occlusion of small objects in such scenes, we design the SPP-MSCAM module to obtain a larger receptive field, rich spatial information and multi-scale channel information, so that more features of small objects can be captured. More importantly, ESOD512 adequately fuses the feature information with the higher-resolution information from the shallower layer. The experimental results on the VisDrone2019 dataset and the NWPU VHR-10 dataset demonstrate the effectiveness of the improved network. Moreover, compared with other classical algorithms, the enhanced network achieves competitive performance for detecting small objects, which should be helpful for potential applications in various fields.

Acknowledgments

This work was supported by the Jiangxi Provincial Natural Science Foundation (Grant No. 20224ACB201010) and the Doctoral Start-up Fund of Jiangxi Normal University (No. 9569).

References

1. Bochkovskiy, Wang, C.Y. and Liao, H.Y., Yolov4: Optimal speed and accuracy of object detection, arXiv preprint arXiv:2004.10934, 2020.
2. Benjumea, Teeti, Cuzzolin and Bradley, YOLO-Z: Improving small object detection in YOLOv5 for autonomous vehicles, arXiv preprint arXiv:2112.11798, 2021.
3. Rekavandi, A.M., Boussaid, Seghouane, A.K., Hoefs and Bennamoun, A Guide to Image and Video based Small Object Detection using Deep Learning: Case Study of Maritime Surveillance, arXiv preprint arXiv:2207.12926, 2022.
4. Kim and Jeong, Real-Time ISR-YOLOv4 Based Small Object Detection for Safe Shop Floor in Smart Factories, Electronics 11 (2022), 1–18. doi: 10.3390/electronics11152348.
5. Tian, C.W., Z.Y., Zuo, W.M., Fei, L.K. and Liu, Attention-guided CNN for image denoising, Neural Networks 124 (2020), 117–129. doi: 10.1016/j.neunet.2019.12.024.
6. C.Y., Liu, Ranga, Tyagi and Berg, A.C., Dssd: Deconvolutional single shot detector, arXiv preprint arXiv:1701.06659, 2017.
7. Chen, C.Y., Liu, M.Y., Tuzel and Xiao, J.X., R-CNN for small object detection, in: Asian Conference on Computer Vision, Springer, Cham, 2016, pp. 214–230.
8. Cheng and Han, A survey on object detection in optical remote sensing images, ISPRS Journal of Photogrammetry and Remote Sensing 117 (2016), 11–28. doi: 10.1016/j.isprsjprs.2016.03.014.
9. Cheng, Zhou and Han, Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images, IEEE Transactions on Geoscience and Remote Sensing 54 (2016), 7405–7415. doi: 10.1109/TGRS.2016.2601622.
10. Shen and Sun, Squeeze-and-excitation networks, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE, New York, 2018, pp. 7132–7141.
11. Park, Woo, Lee, J.Y. and Kweon, I.S., Bam: Bottleneck attention module, arXiv preprint arXiv:1807.06514, 2018.
12. Redmon, Divvala, Girshick and Farhadi, You only look once: Unified, real-time object detection, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE, New York, 2016, pp. 779–788.
13. Redmon and Farhadi, YOLO9000: better, faster, stronger, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE, New York, 2017, pp. 7263–7271.
14. Redmon and Farhadi, Yolov3: An incremental improvement, arXiv preprint arXiv:1804.02767, 2018.
15. Xiao, J.Y., Tian, C.W., Han, P.Y., You and Zhang, S.C., A serial attention frame for multi-label waste bottle classification, Applied Sciences 12 (2022), 1742–1754. doi: 10.3390/app12031742.
16. K.M., Zhang, X.Y., Ren, S.Q. and Sun, Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
17. K.M., Zhang, X.Y., Ren, S.Q. and Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (2015), 1904–1916. doi: 10.1109/TPAMI.2015.2389824.
18. Simonyan and Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, 2014.
19. Guan, L.T. and Zhao, J.Q., Scan: Semantic context aware network for accurate small object detection, International Journal of Computational Intelligence Systems 11 (2018), 951–961. doi: 10.2991/ijcis.11.1.72.
20. Yang, Sun, Wergeles and Shang, A survey and performance evaluation of deep learning methods for small object detection, Expert Systems with Applications 172 (2021), 1–14. doi: 10.1016/j.eswa.2021.114602.
21. Guo, M.H., T.X., Liu, J.J., Liu, Z.N., Jiang, P.T., T.J., Zhang, S.H., Martin, R.R., Cheng, M.M. and Hu, S.M., Attention mechanisms in computer vision: A survey, Computational Visual Media 8 (2022), 331–368.
22. Zheng, M.H., J.Y., Shen, Y.J., Tian, C.W., Fei, L.K., Zong and Liu, X.Y., Attention-based CNNs for image classification: a survey, Journal of Physics: Conference Series 2171 (2022), 1–6. doi: 10.1088/1742-6596/2171/1/012068.
23. Jaderberg, Simonyan, Zisserman and Kavukcuoglu, Spatial transformer network, in: Advances in Neural Information Processing Systems 28, MIT Press, Cambridge, 2015.
24. Liu, M.J., Wang, X.H., Zhou, A.J., X.Y., Y.W. and Piao, C.H., UAV-YOLO: Small object detection on unmanned aerial vehicle perspective, Sensors 20 (2020), 1–12. doi: 10.3390/s20082238.
25. Oghaz, M.M.D., Razaak and Remagnino, Enhanced single shot small object detector for aerial imagery using super-resolution, feature fusion and deconvolution, Sensors 22 (2022), 1–22. doi: 10.3390/s22124339.
26. Wang, Q.L., B.G., Zhu, P.F., P.H., Zuo, W.M. and Hu, Q.H., ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks, arXiv preprint arXiv:1910.03151, 2019.
27. Girshick, Donahue, Darrell and Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, New York, 2014, pp. 580–587.
28. Girshick, Fast r-cnn, in: IEEE International Conference on Computer Vision, IEEE, New York, 2015, pp. 1440–1448.
29. Liu, Qin, H.F., Shi, J.P. and Jia, J.Y., Path aggregation network for instance segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE, New York, 2018, pp. 8759–8768.
30. Ren, S.Q., K.M., Girshick and Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems 28, MIT Press, Cambridge, 2015.
31. Woo, Park, Lee, J.Y. and Kweon, I.S., Cbam: Convolutional block attention module, in: European Conference on Computer Vision, Springer, Cham, 2018, pp. 3–19.
32. Guo, S.Y., L.L., Guo, T.Y., Cao, Y.Y. and Li, Y.L., Research on Mask-Wearing Detection Algorithm Based on Improved YOLOv5, Sensors 22 (2022), 1–16. doi: 10.3390/s22134933.
33. Kang, Y.Q. and Zhou, Recent advances in small object detection based on deep learning: A review, Image and Vision Computing 97 (2020), 1–14. doi: 10.1016/j.imavis.2020.103910.
34. Lin, T.Y., Maire, Belongie, Hays, Perona, Ramanan, Dollár and Lawrence Zitnick, Microsoft coco: Common objects in context, in: European Conference on Computer Vision, Springer, Cham, 2014, pp. 740–755.
35. Lin, T.Y., Dollár, Girshick, K.M., Hariharan and Belongie, Feature pyramid networks for object detection, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE, New York, 2017, pp. 2117–2125.
36. Lin, T.Y., Goyal, Girshick, K.M. and Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
37. Zhang, T.Y., Chai, Zhao, Z.Q. and Tian, W.D., Improved YOLOv5 Network with Attention and Context for Small Object Detection, in: Intelligent Computing Methodologies: 18th International Conference, Springer, Cham, 2022, pp. 341–352.
38. Liu, Anguelov, Erhan, Szegedy, Reed, C.Y. and Berg, A.C., SSD: Single shot multibox detector, in: European Conference on Computer Vision, Springer, Cham, 2016, pp. 21–37.
39. W.T., Liu, L.L., Long, Y.L., Wang, X.W., Wang, Z.H., J.L. and Chang, Application of local fully Convolutional Neural Network combined with YOLO v5 algorithm in small target detection of remote sensing image, PloS One 16 (2021), 1–15. doi: 10.1371/journal.pone.0259283.
40. Liang, Zhang, Zhuo, Y.Z. and Tian, Small object detection in unmanned aerial vehicle images using feature fusion and scaling-based single shot detector with spatial context analysis, IEEE Transactions on Circuits and Systems for Video Technology 30 (2020), 1758–1770. doi: 10.1109/TCSVT.2019.2905881.
41. Gu, Wang, S.C., Yan, Tang, S.J. and Zhao, S.D., Identification and Analysis of Emergency Behavior of Cage-Reared Laying Ducks Based on YoloV5, Agriculture 12 (2022), 1–16. doi: 10.3390/agriculture12040485.
42. Chen, Zhang and Li, YOLO-ACN: Focusing on small object and occluded object detection, IEEE Access 8 (2022), 227288–227303. doi: 10.1109/ACCESS.2020.3046515.
43. Dai, Y.M., Gieseke, Oehmcke, Y.Q. and Barnard, Attentional feature fusion, in: IEEE Winter Conference on Applications of Computer Vision, IEEE, New York, 2021, pp. 3560–3569.
44. Liu, S.T., Wang, Z.M. and Sun, Yolox: Exceeding yolo series in 2021, arXiv preprint arXiv:2107.08430, 2021.
