Small object detection combining attention mechanism and a novel FPN

Abstract

Small objects occupy few pixels in an image and are difficult to recognize, so small object detection has long been a challenging research problem in computer vision. To address the low sensitivity and poor detection performance of YOLOv3 on small objects, this paper proposes AFYOLO, a detector that is more sensitive to small objects. First, a DenseNet module is introduced into the low-level layers of the backbone to strengthen the propagation of object information. At the same time, a new mechanism combining channel attention and spatial attention is introduced to improve the feature extraction ability of the backbone. Second, a new feature pyramid network (FPN) is proposed to better capture the features of small objects. Finally, ablation studies on the ImageNet classification task and the MS COCO object detection task verify the effectiveness of the proposed attention module and FPN. Results on the Wider Face dataset show that the AP of the proposed method is 11.89% higher than that of YOLOv3 and 8.59% higher than that of YOLOv4. All results indicate that AFYOLO has a better ability to detect small objects.

Keywords

Small object detection; YOLOv3; DenseNet; attention mechanism; FPN

1 Introduction

Object detection is a fundamental problem in computer vision and has been an active research topic in recent years. It is widely used in applications such as face detection [1], mask detection [2], and autonomous driving [3]. The main task is to accurately locate and identify objects in images or video. With the rapid development of deep learning, convolutional neural networks (CNNs) have been increasingly applied in computer vision with great success, which has also injected new vitality into object detection. Deep-learning-based object detectors fall mainly into two categories. The first is the two-stage detector. A representative example is R-CNN [4], proposed by Girshick et al., which first extracts regions of interest (RoIs) and then performs classification and regression on them. Fast R-CNN [5] and Faster R-CNN [6] also follow this pipeline. Two-stage detectors are relatively accurate, but their inference is slow. The second is the one-stage detector. One of the more influential one-stage detectors is YOLOv3 [7], an end-to-end detector that directly predicts the locations and categories of objects in an image or video. One-stage detectors offer fast inference and good real-time performance, but their accuracy is usually slightly lower than that of two-stage detectors.

YOLO-series methods, such as YOLOv2 [8] and YOLOv3, are known for their detection speed. Among them, the more influential YOLOv3 adopts a new backbone, DarkNet-53, which borrows the idea of ResNet [9] to improve the accuracy and efficiency of the detector. YOLOv3 takes fixed-size images as input and uses regression to directly predict the locations and categories of the bounding boxes. Although YOLOv3 has a fast inference speed, its detection performance has not been fully explored, especially for small objects. The proposed method addresses the problem that YOLOv3 is insensitive to small objects and prone to missed detections.

This paper proposes AFYOLO, a YOLOv3-based detector. First, some of the residual blocks in DarkNet-53 are replaced by DenseNet [10] blocks, and a new attention module is added to the backbone. Then, to further enhance the features of small objects, a novel feature pyramid network (FPN) is introduced. In the name AFYOLO, A stands for the attention mechanism and F for the FPN. Compared with YOLOv3, AFYOLO achieves better performance on the Wider Face [11] and MS COCO [12] benchmarks. Before that, we use ImageNet [13] to verify the effectiveness of the attention module. Moreover, other baseline models are evaluated on the MS COCO val set and compared with ours. All baselines equipped with our modules achieve better results.

This paper makes three contributions. (1) An attention module that effectively combines ECA channel attention with spatial attention is proposed. It can be conveniently embedded into other CNNs. (2) To address the poor detection performance on small objects, a Feature Enhancement Layer (FEL) is proposed and embedded in the FPN. Better performance is achieved on the MS COCO val set with a small number of additional parameters. (3) The proposed modules (attention module and FPN) are effective on the ImageNet and MS COCO benchmarks. The improved YOLOv3-based detector, AFYOLO, also achieves competitive results on the Wider Face benchmark.

2 Related works

2.1 Small objects detection

In object detection, large objects are usually easier to detect because of their large area and rich features. Small object detection tasks, such as UAV remote crowd detection [14] and satellite remote sensing image detection [15], remain difficult, and small object detection has long been a challenging and active research topic in computer vision.

Small objects contain fewer pixels and carry less information, which often leads to missed and false detections. The detection accuracy of some methods on small objects is far lower than on large objects [6, 16]. To address these problems, many researchers have carried out a variety of studies. Rabbi et al. [17] used an edge-enhanced super-resolution GAN [18] (EESRGAN) to improve the quality of remote sensing images and achieved good performance on satellite datasets. BYLA et al. [19] designed a multi-block SSD [16], which divides the original image into several patches, feeds them into SSD separately, and finally merges the results. Compared with the traditional SSD, accuracy is improved by 9.2%. Zhang et al. [20] used deconvolution to recover the information lost in convolution and achieved good performance on several datasets. Meng et al. [21] also fed image patches into CNNs and achieved good performance on the relevant datasets with a data augmentation strategy. Wei et al. [22] proposed a CNN composed of a multi-scale object proposal network and a multi-scale object detection network; its mean average precision (mAP) on aviation and remote sensing datasets reaches 89.6%. Hu et al. [23] proposed a scale-insensitive CNN (SINET), a two-stage detector, to address the scale sensitivity of existing CNNs. It uses context-aware RoI pooling to generate fixed-size feature vectors for each proposal and then feeds them to a multi-branch decision network for classification and regression. The method achieves state-of-the-art performance on KITTI [24]. Liu et al. [25] introduced MobileNetV2 [26] as the backbone of YOLOv3 and designed a new feature fusion method to improve the performance of small object detection.

In our work, to make better use of the features extracted from the backbone to improve the detection performance, a feature enhancement layer is added to the backbone to enhance the ability of feature extraction, and connected to FPN through residual connections.

2.2 Attention mechanism

Many recent works use channel attention [27], spatial attention, or both [28] to improve the feature extraction performance of CNNs. By explicitly modeling the dependencies between channels or spatial positions, the feature representation generated by the convolutional layers can be improved. The intuition behind the attention mechanism is to enable the network to learn where to focus and what is important.

One of the most prominent methods is the Squeeze-and-Excitation Network (SENet [27]). It first applies adaptive average pooling to the input features to generate the channel statistic Cs, whose dimension equals the number of channels. Two fully connected layers are then applied in turn to generate the channel attention weights. To reduce the number of parameters and the model complexity, the dimension of Cs is reduced in the first fully connected layer and then restored to its original size by the second.

Recently, Wang et al. [29] proposed ECA-Net, which was inspired by SENet. Because SENet uses channel dimension reduction to decrease model complexity, the correspondence between channels and their weights is broken. In other words, SENet successively reduces and restores the channel dimension, so the generated channel attention weights no longer correspond one-to-one with the channels. Like SENet, ECA-Net uses adaptive average pooling to generate the channel statistics. The difference is that ECA-Net does not reduce the channel dimension but adopts a simple one-dimensional convolution to obtain the channel attention weights. In this way, the number of parameters in ECA-Net equals the size of the convolution kernel. Moreover, because ECA-Net uses a one-dimensional convolution, it also supports cross-channel interaction: it preserves the dependency between neighboring channels rather than computing each channel weight independently.
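The ECA computation described above can be illustrated with a minimal pure-Python sketch. It is not the reference implementation of [29]: the toy feature map, the kernel values, and the helper names (`global_avg_pool`, `conv1d_same`, `eca_weights`) are assumptions made for illustration, and in a real network the kernel would be learned.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def global_avg_pool(feature):
    """feature is a list of C channels, each an H x W nested list; returns C means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]


def conv1d_same(signal, kernel):
    """1-D cross-correlation along the channel axis with zero padding (same length)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
        for i in range(len(signal))
    ]


def eca_weights(feature, kernel):
    """Channel weights: sigmoid(conv1d(global average pooling))."""
    return [sigmoid(v) for v in conv1d_same(global_avg_pool(feature), kernel)]


def apply_channel_weights(feature, weights):
    """Scale every channel by its attention weight."""
    return [[[w * v for v in row] for row in ch] for ch, w in zip(feature, weights)]


# Toy input: 4 channels of size 2 x 2 (hypothetical values).
feature = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[4.0, 4.0], [4.0, 4.0]],
]
kernel = [0.5, 1.0, 0.5]  # hypothetical values; the only parameters of the module
weights = eca_weights(feature, kernel)
scaled = apply_channel_weights(feature, weights)
```

Each weight depends on a channel and its immediate neighbors only, and the module has as many parameters as the kernel has entries, which matches the two properties highlighted in the paragraph above.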

In our work, a cascade ECA module is proposed and used in series with spatial attention. In this way, CNNs obtain stronger feature extraction capability at a small memory cost.

3 Proposed methods

3.1 Proposed attention module

To capture contextual information, we propose a combination of a cascade ECA module and a spatial attention module. Here, cascade means that the ECA module has two processing paths, which capture more contextual features. Inspired by SKNet [30], the first path uses a dilated group convolution to transform the input feature X, as shown in Fig. 1. This expands the receptive field at little additional memory cost. The transformed feature is then fed to the ECA module to extract the channel attention weights $W_{C_{1}}$, as expressed in Equation (1).

Fig. 1

Channel attention module.

$W_{C_{1}} = σ (1 D_{k = 3} (P_{avg} (C_{dg} (X))))$ (1)

where 1D_k=3 indicates a one-dimensional convolution with a kernel size of 3. The dilated group convolution mainly enlarges the receptive field and does not significantly change the information content. Therefore, one-dimensional convolutions with different kernel sizes are adopted in the two paths so that they do not produce similar results. Detailed ablations are conducted for verification in Section 4.3.3. P_avg is the global average pooling operator. C_dg is the dilated group convolution layer, whose number of groups follows the SKNet setting, i.e., 32. σ is the Sigmoid function. The other path is the standard ECA module, given by:

$W_{C_{2}} = σ (1 D_{k = 5} (P_{avg} (X)))$ (2)

In this way, the channel attention feature map X_C can be calculated by:

$X_{C} = X \cdot (W_{C_{1}} + W_{C_{2}})$ (3)

The channel attention is followed in series by a spatial attention module, as shown in Fig. 2. Similar to CBAM, max pooling and average pooling are applied to X_C to generate two spatial statistics. A convolution layer then produces the spatial attention weight W_S:

Fig. 2

Proposed attention module.

$W_{S} = σ (C_{7 \times 7}^{2 d} (f_{M} (X_{C}), f_{A} (X_{C})))$ (4)

where f_M and f_A indicate max pooling and average pooling, respectively. A two-dimensional convolution ( $C_{7 \times 7}^{2 d}$ ) with a kernel size of 7×7 generates the spatial attention weights W_S, and a channel-wise multiplication is then applied to generate the final feature Y. In this way, the network learns the channel weights as well as the spatial position weights, which further improves its ability to focus on the more important features.
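Equation (4) can be sketched in pure Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the pooling is taken across channels as in CBAM, the final feature is assumed to be Y = X_C scaled at each position by W_S, and the uniform 7×7 kernels are placeholders for learned weights. The function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_pool(feature):
    """Per-position max and mean over channels: two H x W maps (f_M and f_A)."""
    C, H, W = len(feature), len(feature[0]), len(feature[0][0])
    max_map = [[max(feature[c][h][w] for c in range(C)) for w in range(W)] for h in range(H)]
    avg_map = [[sum(feature[c][h][w] for c in range(C)) / C for w in range(W)] for h in range(H)]
    return max_map, avg_map


def conv2d_same(maps, kernels):
    """Zero-padded 2-D convolution with one k x k kernel slice per input map, one output map."""
    k = len(kernels[0])
    pad = k // 2
    H, W = len(maps[0]), len(maps[0][0])
    out = [[0.0] * W for _ in range(H)]
    for m, ker in zip(maps, kernels):
        for h in range(H):
            for w in range(W):
                for i in range(k):
                    for j in range(k):
                        y, x = h + i - pad, w + j - pad
                        if 0 <= y < H and 0 <= x < W:
                            out[h][w] += ker[i][j] * m[y][x]
    return out


def spatial_attention(feature, kernels):
    """Return (W_S, Y): the spatial weights and the re-weighted feature."""
    max_map, avg_map = channel_pool(feature)
    logits = conv2d_same([max_map, avg_map], kernels)
    weights = [[sigmoid(v) for v in row] for row in logits]
    out = [
        [[ch[h][w] * weights[h][w] for w in range(len(ch[0]))] for h in range(len(ch))]
        for ch in feature
    ]
    return weights, out


# Toy input X_C: 2 channels of size 3 x 3 (hypothetical values).
x_c = [
    [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0], [1.0, 0.0, 1.0]],
    [[0.0, 2.0, 0.0], [1.0, 0.0, 1.0], [0.0, 4.0, 0.0]],
]
# Placeholder 7 x 7 kernels, one slice per pooled map; learned in practice.
kernels = [[[1.0 / 49.0] * 7 for _ in range(7)] for _ in range(2)]
w_s, y = spatial_attention(x_c, kernels)
```

A single weight map of size H×W is shared by all channels, so the spatial branch adds only the 2×7×7 kernel entries as parameters in this sketch.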

The proposed attention module can be easily embedded into classical CNNs. Fig. 3(a) shows a residual unit of DarkNet-53, the backbone of YOLOv3, which contains a 1×1 convolution and a 3×3 convolution. Each layer is followed by Batch Normalization and ReLU activation. Fig. 3(b) shows the architecture of the residual unit with the proposed attention module. The feature map output by the convolutions is sent successively to channel attention (CA) and spatial attention (SA), followed by ReLU activation.

Fig. 3

(a) Original convolution block. (b) Convolution block with proposed attention module

3.2 Proposed feature pyramid networks

3.2.1 Feature Enhancement Layer

Recently, Chen et al. [31] showed that C5, the output feature of the last residual block in the backbone, contains the strongest information: the mAP obtained with C5 alone (SiMo, single input and multiple output, Fig. 4(b)) was only 0.9% lower than that of the traditional method (MiMo, multiple input and multiple output, Fig. 4(a)). Inspired by their work, we design a feature enhancement layer (FEL) to better utilize the features extracted from the backbone. Moreover, we also use the outputs of the other stages, namely {C2, C3, C4, C5}, to obtain better results. FEL consists of three parts: a pyramid pooling module (PPM), an adaptive spatial fusion module (ASF) and a double-ECA module (DEM).

Fig. 4

MiMo(a) and SiMo(b) detection.

PPM. The structure of PPM is shown in Fig. 5. Similar to PSPNet [32], it obtains features at different scales. The difference is that kernels of different sizes are used for feature maps of different sizes, namely {C5 : 1,2,3,4}, {C4 : 1,2,3,5}, {C3 : 1,3,5,7}, {C2 : 1,5,7,9}. One factor considered is the size of the feature map: feature maps in the high-level layers of the network are smaller. For example, C5 in YOLOv3 is 13×13 for an input size of 416×416.

Fig. 5

PPM, Pyramid pooling module.

On the other hand, using a larger kernel on a smaller feature map results in a larger receptive field, so that a single pooled cell covers a large number of small objects once it is mapped back to the original image, which reduces the performance of the detector. In addition, after successive transformations in the CNN, the resolution of the feature map becomes very small, and small objects, edges, textures, and other low-level features gradually fade or even disappear. Therefore, we use sub-pixel convolution [33] as the up-sampling operator to obtain four new feature maps:

$\{r_{1}, r_{2}, r_{3}, r_{4}\} = C_{sp} (C_{1 \times 1} (P_{k_{1}, k_{2}, k_{3}, k_{4}} (X)))$ (5)

where X is the input tensor with the shape (H,W,C), P indicates the pyramid pooling with kernels k₁, k₂, k₃, k₄, and C_1×1 and C_sp are the 1×1 convolution and the sub-pixel convolution, respectively. {r₁, r₂, r₃, r₄} is the set of output feature maps, each with the shape (H,W,d), where C is four times d. In this way, high-resolution information of the image is obtained, and the output contains more low-level object information.
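The rearrangement step of sub-pixel convolution can be sketched as below. This is a minimal illustration, not the authors' implementation: it shows only the channel-to-space shuffle that follows the convolution, the up-scaling factor r = 2 is an assumption chosen because it trades four channels for a 2× larger map (in line with C = 4d), and the channel ordering follows the convention commonly used by deep learning frameworks. The function name `pixel_shuffle` and the toy values are hypothetical.

```python
def pixel_shuffle(feature, r):
    """Rearrange C*r*r channels of size H x W into C channels of size (H*r) x (W*r)."""
    c_in, H, W = len(feature), len(feature[0]), len(feature[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0.0] * (W * r) for _ in range(H * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                # Each group of r*r input channels fills one r x r sub-pixel grid.
                src = feature[c * r * r + i * r + j]
                for h in range(H):
                    for w in range(W):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out


# Toy input: 4 channels of size 1 x 2 become 1 channel of size 2 x 4.
low_res = [[[1, 2]], [[3, 4]], [[5, 6]], [[7, 8]]]
high_res = pixel_shuffle(low_res, 2)
```

No values are interpolated: every output pixel is copied from exactly one input channel, which is why this operator avoids the smoothing introduced by interpolation-based up-sampling.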

ASF. Considering the aliasing effect caused by interpolation, Guo et al. [34] proposed an adaptive spatial fusion module to fuse features instead of simple point-wise addition. Similarly, we adopt a multi-group operation to reduce parameters and memory. A 1×1 convolution is also used to squeeze and excite the number of channels. In our work, the number of groups is set to 2, that is, the input features are split into two groups. As shown in Fig. 6, the ASF module first concatenates the outputs of PPM into $X \in ℝ^{H \times W \times C}$ :

Fig. 6

ASF, Diagram of adaptive spatial fusion module.

$X = Concat [r_{1}, r_{2}, r_{3}, r_{4}]$ (6)

Then X is split into two sub-tensors with the shape (H, W, C/2). A 1×1 convolution reduces the channel dimension to r (r < C) with a ratio of 0.5, a 3×3 convolution refines the features, and a second 1×1 convolution expands them again. The Sigmoid function normalizes the output of the convolutions. Accordingly, the feature maps are computed as:

$G_{1, 2} = σ (C_{1} (C_{3} (C_{1} (X_{Split}))))$ (7)

where G_1,2 denotes the feature maps of the two groups, X_Split is a sub-tensor of the input X, C₁ and C₃ indicate the 1×1 and 3×3 convolutions, respectively, and σ is the Sigmoid function. Note that each convolution is followed by batch normalization and LeakyReLU activation. Finally, X, G₁ and G₂ are concatenated into a tensor of shape H×W×2C, which is then split and summed to obtain the final context feature:

$ASF (X) = \sum_{h, w, c}^{H, W, C} split (Concat (G_{1}, G_{2}, X))$ (8)

Where, ASF (X) is the final output, and will be fed to the FPN through the residual connection. $\sum_{h, w, c}^{H, W, C}$ is the point-wise addition operator.

DEM. PPM and ASF fully process the feature map and enhance its spatial context information. The DEM then uses channel attention to extract further information along the channel dimension. It cascades two branches of the ECA module (hence double-ECA), as shown in Fig. 7, and obtains the channel weight W by:

Fig. 7

DEM, Double-ECA module.

$W = \sum_{i = 0}^{C} (σ (C_{k = 3}^{1 d} (P_{avg} (X))), σ (C_{k = 3}^{1 d} (P_{max} (X))))$ (9)

where P_max is the global max pooling operator and $C_{k = 3}^{1 d}$ indicates the one-dimensional convolution. The output is then computed as Y = W · X. In DEM, global max pooling and global average pooling are both used to obtain channel statistics, which are more robust to the location and category of the object.
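A possible reading of Equation (9) is sketched below in pure Python. The way the two branches are combined is an assumption: the sketch adds the two sigmoid outputs element-wise, which is one interpretation of the summation in the equation and is not confirmed by the text. The kernel values, the toy feature map, and the names `channel_stats` and `dem` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv1d_same(signal, kernel):
    """1-D cross-correlation along the channel axis with zero padding."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
        for i in range(len(signal))
    ]


def channel_stats(feature):
    """Global average and global max of every channel (P_avg and P_max)."""
    avg_stat, max_stat = [], []
    for ch in feature:
        values = [v for row in ch for v in row]
        avg_stat.append(sum(values) / len(values))
        max_stat.append(max(values))
    return avg_stat, max_stat


def dem(feature, kernel_avg, kernel_max):
    """Return (W, Y) with W the summed branch weights (assumption) and Y = W * X."""
    avg_stat, max_stat = channel_stats(feature)
    w_avg = [sigmoid(v) for v in conv1d_same(avg_stat, kernel_avg)]
    w_max = [sigmoid(v) for v in conv1d_same(max_stat, kernel_max)]
    weights = [a + b for a, b in zip(w_avg, w_max)]
    output = [[[w * v for v in row] for row in ch] for ch, w in zip(feature, weights)]
    return weights, output


# Toy input: 3 channels of size 2 x 2 (hypothetical values).
x = [
    [[0.0, 1.0], [2.0, 1.0]],
    [[4.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
k_avg = [0.2, 0.6, 0.2]  # hypothetical kernel (k = 3) for the average branch
k_max = [0.1, 0.8, 0.1]  # hypothetical kernel (k = 3) for the max branch
w, y = dem(x, k_avg, k_max)
```

The second channel illustrates why both statistics are kept: its mean is low but its maximum is high, so the max branch preserves a strong local response that average pooling alone would dilute. Under the summation assumption each weight lies between 0 and 2.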

3.2.2 Multi-scale feature fusion

FPN [35] effectively combines the spatial location information, texture, and other features of the low-level layers with the semantic features of the high-level layers. PANet [36] adds a bottom-up path next to the original FPN to further improve the feature representation. Similar to PANet, a bottom-up path is added next to the original top-down path. The features from the backbone are up-sampled and pooled by the two paths, respectively. The new feature pyramid structure is shown in Fig. 8. In addition, our work also includes a residual connection from the backbone to the FPN, that is, the feature enhancement layer. It is fused with the feature map of FPN through element-wise addition, and then sent to the detection head.

Fig. 8

Overall architecture of AFYOLO.

4 Experiments and results

In this section, we first apply the proposed attention module to image classification on the ImageNet benchmark to verify that it classifies better than other modules such as ECA. Second, we use the proposed attention module and FPN to modify Faster R-CNN and other detectors and perform object detection on the MS COCO benchmark to demonstrate the superiority of our method. Finally, after the related ablation studies, we use the Wider Face val set to verify the detection performance of AFYOLO on small objects.

4.1 Image classification on ImageNet

We compare our module with SE, CBAM and ECA using several CNNs, i.e., ResNet-50 (R50), ResNet-101 (R101) and MobileNetV2 (MV2), on ImageNet with the same computing resources (Linux OS, 4 GeForce RTX 2080Ti GPUs). For training ResNet-50/-101 with our attention module, we adopt the same hyper-parameter settings as [29] and directly copy the results of the compared methods. Specifically, the input images are randomly cropped and randomly flipped horizontally.

Stochastic gradient descent (SGD) is used as the optimizer, with a weight decay of 1e-4, momentum of 0.9 and batch size of 256 (32 images per GPU). Both models are trained for 100 epochs with an initial learning rate of 0.1, which is decreased by a factor of 10 at epochs 30, 60 and 90. MobileNetV2 with our attention module is trained for 400 epochs using the SGD optimizer with a weight decay of 4e-5, momentum of 0.9 and batch size of 96 (12 images per GPU). The initial learning rate is 0.045, which is adjusted by a linear schedule with a decay rate of 0.98.
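The step schedule used for the ResNet models can be written as a short function. This is a minimal sketch that assumes 0-based epoch indices and a drop applied at the start of each milestone epoch; the function name `step_lr` is hypothetical, and the MobileNetV2 schedule is not covered because its exact form is not specified in the text.

```python
def step_lr(epoch, base_lr=0.1, milestones=(30, 60, 90), gamma=0.1):
    """Learning rate at a given epoch: base_lr times gamma per milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


# Learning rate at a few representative epochs of the 100-epoch run.
schedule = {epoch: step_lr(epoch) for epoch in (0, 29, 30, 60, 90, 99)}
```

The same function describes the detection schedule in Section 4.2 when called with `base_lr=0.01` and `milestones=(8, 11)`.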

The results in Table 1 show that the proposed module improves Top-1 accuracy by 2.39% and 1.93% over the original ResNet-50 and ResNet-101, respectively. Our method is also effective in a lightweight CNN: MobileNetV2 combined with our attention module outperforms the SE variant and is on par with the ECA variant (72.55% vs. 72.56% Top-1 and 90.82% vs. 90.81% Top-5). On the ResNet backbones, our method also obtains competitive results compared with SENet and ECA-Net. In summary, the image classification results on ImageNet verify the effectiveness of our proposed attention module.

Table 1
Image classification results of different methods on ImageNet

Methods	Backbone	Params	FLOPs	Top-1 (%)	Top-5 (%)
R50	R50	24.37M	3.83G	75.20	92.52
SE	R50	26.77M	3.84G	76.71	93.38
CBAM	R50	26.77M	3.84G	77.34	93.69
ECA	R50	24.37M	3.83G	77.48	93.68
Ours	R50	27.12M	4.29G	77.59	93.69
R101	R101	42.49M	7.30G	76.83	93.48
SE	R101	47.01M	7.31G	77.62	93.93
CBAM	R101	47.01M	7.31G	78.49	94.31
ECA	R101	42.49M	7.30G	78.65	94.34
Ours	R101	47.67M	8.23G	78.76	94.36
MV2	MV2	3.34M	300.0M	71.64	90.20
SE	MV2	3.40M	301.1M	72.42	90.67
ECA	MV2	3.34M	300.1M	72.56	90.81
Ours	MV2	3.59M	333.0M	72.55	90.82

4.2 Object detection on MS COCO

To further evaluate our attention module and FPN on the object detection task, we use Faster R-CNN, Mask R-CNN [37] and RetinaNet [38] as base detectors. We compare the performance of our methods with ResNet, SENet and ECA-Net, and also report the detection results of each detector with the proposed FPN. All backbones are pretrained on ImageNet and transferred to MS COCO through fine-tuning. All detectors are implemented with the MMDetection toolkit [39] and run on Linux with 4 GeForce RTX 2080Ti GPUs. For training, the input images are resized to 1333×800. We use SGD as the optimizer with a weight decay of 1e-4, momentum of 0.9 and batch size of 8 (2 images per GPU). All detectors are trained with the 1x schedule, i.e., 12 epochs. The learning rate is initialized to 0.01 and decreased by a factor of 10 at epochs 8 and 11. The evaluation results are shown in Table 2.

Table 2
Object detection results of different methods on MS COCO

Method	Detector	Params	FLOPs	AP	AP₅₀	AP₇₅	AP_S	AP_M	AP_L
R50	Faster R-CNN	41.53M	207.04G	36.4	58.2	39.2	21.8	40.0	46.2
R50 + SE	Faster R-CNN	44.02M	207.14G	37.7	60.1	40.9	22.9	41.9	48.2
R50 + ECA	Faster R-CNN	41.53M	207.14G	38.0	60.6	40.9	23.4	42.1	48.0
R50 + A(ours)	Faster R-CNN	44.32M	207.50G	38.2	60.8	41.0	23.9	42.3	48.1
R50 + AF(ours)	Faster R-CNN	52.19M	211.56G	38.3	61.1	41.1	24.1	42.4	48.1
R101	Faster R-CNN	60.52M	283.11G	38.7	60.6	41.9	22.7	43.2	50.4
R101 + SE	Faster R-CNN	65.24M	283.30G	39.6	62.0	43.1	23.7	44.0	51.4
R101 + ECA	Faster R-CNN	60.52M	283.30G	40.3	62.9	44.0	24.5	44.7	51.3
R101 + A(ours)	Faster R-CNN	65.79M	284.03G	40.5	63.4	44.4	24.8	45.0	51.4
R101 + AF(ours)	Faster R-CNN	73.66M	288.09G	40.8	64.0	44.9	25.2	45.3	51.6
R50	Mask R-CNN	44.18M	275.58G	37.2	58.9	40.3	22.2	40.7	48.0
R50 + SE	Mask R-CNN	46.67M	275.69G	38.7	60.9	42.1	23.4	42.7	50.0
R50 + ECA	Mask R-CNN	44.18M	275.69G	39.0	61.3	42.1	24.2	42.8	49.9
R50 + A(ours)	Mask R-CNN	47.98M	276.01G	39.2	61.7	42.3	24.4	42.8	50.1
R50 + AF(ours)	Mask R-CNN	55.85M	280.07G	39.3	62.0	42.4	24.7	43.0	50.2
R50	RetinaNet	37.74M	239.32G	35.6	55.5	38.2	20.0	39.6	46.8
R50 + SE	RetinaNet	40.23M	239.43G	37.1	57.2	39.9	21.2	40.7	49.3
R50 + ECA	RetinaNet	37.74M	239.43G	37.3	57.7	39.6	21.9	41.3	48.9
R50 + A(ours)	RetinaNet	40.49M	239.75G	37.6	58.1	40.0	22.2	41.6	49.2
R50 + AF(ours)	RetinaNet	48.36M	243.81G	37.9	58.6	40.5	22.5	41.8	49.4

Faster R-CNN. ResNet-50 and ResNet-101 are used as backbones in Faster R-CNN. As shown in Table 2, combining the SE module, the ECA module, or our proposed module (AF) with the classic Faster R-CNN improves performance in every case. Our proposed module performs slightly better than the other methods, improving AP by 1.9% and 2.1% over the ResNet-50 and ResNet-101 baselines, respectively.

Mask R-CNN. We further verify the performance of the proposed modules on MS COCO using Mask R-CNN as the detector. As shown in Table 2, our proposed module (AF) outperforms the ResNet baseline by 2.1% in AP with the 50-layer backbone. Meanwhile, the attention module (A) achieves a 0.2% gain over the ECA block with ResNet-50 as the backbone.

RetinaNet. RetinaNet is also used to verify the performance of our proposed method. As shown in Table 2, our method outperforms ResNet-50 by 2.3% in AP. Meanwhile, our attention module (A) improves AP by 0.3% over ECA-Net with ResNet-50.

In general, the performance of our method is better than that of other methods mentioned above. This shows that our attention module and FPN are conducive to the improvement of object detection performance.
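The gains quoted above can be recomputed directly from Table 2. The short script below is a worked check, not part of the original experiments: it copies the AP and AP_S values of the baseline and AF rows and reports the differences. The dictionary layout and the names `table2` and `gains` are illustrative choices.

```python
# (AP, AP_S) for the plain backbone and for the AF variant, copied from Table 2.
table2 = {
    "Faster R-CNN / R50": {"base": (36.4, 21.8), "AF": (38.3, 24.1)},
    "Faster R-CNN / R101": {"base": (38.7, 22.7), "AF": (40.8, 25.2)},
    "Mask R-CNN / R50": {"base": (37.2, 22.2), "AF": (39.3, 24.7)},
    "RetinaNet / R50": {"base": (35.6, 20.0), "AF": (37.9, 22.5)},
}

gains = {}
for name, rows in table2.items():
    base_ap, base_aps = rows["base"]
    af_ap, af_aps = rows["AF"]
    # Round to one decimal, the precision reported in the table.
    gains[name] = (round(af_ap - base_ap, 1), round(af_aps - base_aps, 1))

for name, (ap_gain, aps_gain) in gains.items():
    print(f"{name}: AP +{ap_gain}, AP_S +{aps_gain}")
```

In every row the gain on small objects (AP_S) is at least as large as the overall AP gain, which is consistent with the focus of the proposed modules on small object detection.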

4.3 Ablation studies

In order to verify the performance of AFYOLO on small objects, Wider Face is used for training and testing. The training set contains a total of 12,877 images, belonging to 61 activity categories, such as parades, conferences, festivals, etc. It contains 159,424 faces, and 95,818 faces are seriously blurred, most of which are small objects. There are 1,845 exaggerated faces, 8,134 highly exposed faces, and 2,399 partially blurred faces. This data set has many dense small faces. We have selected several example pictures to show in Fig. 9.

Fig. 9

The sample images of Wider Face.

Before training AFYOLO, the training set is augmented by random horizontal mirroring and contrast changes. The data augmentation parameters are saturation = 1.5 and exposure = 1.5. In all experiments, the detectors are trained for 100 epochs. A pretrained model is used to accelerate convergence, and some parameters are frozen during the first 50 epochs to protect the classification performance obtained by pretraining. In the last 50 epochs, all parameters are trained together. During training, the subdivision is set to 16 and the batch size to 8. SGD is used to optimize the network parameters with a weight decay of 0.0005, momentum of 0.9, and an initial learning rate of 0.001. At the 50th epoch, the learning rate is reduced to 0.1 times its original value, i.e., 0.0001.

4.3.1 The proposed attention module

We use the ECA module and the proposed attention module on YOLOv3 for comparison. Table 3 shows the results. The same training hyper-parameters were set in two methods for a fair comparison.

Table 3
Comparison on Wider Face Val Set

Method	Params (M)	FLOPs (G)	Inference (ms)	AP₅₀ (%)
YOLOv3 + ECA	72.05	38.41	41	55.41
YOLOv3 + Ours	76.04	38.87	51	58.12

The IoU threshold was set at 0.5 during the test. It can be seen from the results that the performance of the detector is obviously improved by adding the spatial attention module, at the cost of slightly reduced speed and a few more parameters.
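For reference, the IoU criterion used in these tests can be sketched as follows. This is the standard intersection-over-union computation for axis-aligned boxes; the corner format (x1, y1, x2, y2) and the function names are assumptions for illustration, since the paper does not state its box representation.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """A prediction counts as a match when its IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold


# Hypothetical boxes: the prediction covers half of the ground-truth face.
ground_truth = (0.0, 0.0, 10.0, 10.0)
prediction = (0.0, 0.0, 10.0, 5.0)
matched = is_match(prediction, ground_truth)
```

For small faces the criterion is demanding: a shift of only a few pixels removes a large fraction of the overlap, which is one reason small objects are harder to score as correct detections.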

4.3.2 The number of scales of detection head

To show the difference between three-scale and four-scale prediction, we conducted an ablation study using YOLOv3 as the baseline. The results are shown in Table 4. Both models were trained for 50 epochs; during testing, the confidence threshold was set to 0.3 and the IoU threshold to 0.5. Although the recall of four-scale prediction is higher, the AP and precision of three-scale prediction are better, and its inference is also faster.

Table 4
Comparison of three-scale and four-scale prediction

Metrics	three-scale	four-scale
AP(%)	33.96	33.24
Recall(%)	29.04	33.56
Precision(%)	86.29	61.04
Inference Time(ms)	50	52
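Because the two settings trade precision against recall, a single summary number can help when reading Table 4. The sketch below computes the F1 score (the harmonic mean of precision and recall) from the table values. F1 is not reported in the paper; it is added here only as a supplementary calculation, and the variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Precision and recall in percent, copied from Table 4.
f1_three_scale = f1_score(86.29, 29.04)
f1_four_scale = f1_score(61.04, 33.56)
print(f"three-scale F1: {f1_three_scale:.2f}, four-scale F1: {f1_four_scale:.2f}")
```

Under this summary the two settings come out close, at roughly 43.5 for three-scale and 43.3 for four-scale prediction, so the choice between them rests mainly on the AP and inference time reported in the table.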

4.3.3 Ablation methods in this paper

We build a channel attention with a cascade structure on top of the ECA module in the backbone, using dilated group convolution in one of the branches. This part verifies its validity. In Table 5, Single denotes a single original ECA module, Cascade denotes two original ECA modules cascaded directly with no other operation, and Cascade^† denotes the cascade with the dilated convolution mapping proposed in this paper. All models were trained for 50 epochs on AFYOLO, and the IoU threshold was 0.5 during testing.

Table 5
Ablations of proposed attention module

Model	Method	AP₅₀ (%)
AFYOLO	Single	54.2
AFYOLO	Cascade	54.4
AFYOLO	Cascade^†	55.9

The gain from the direct cascade is very low, only 0.2%, which suggests that the two directly cascaded branches extract overlapping features and the cascade is redundant. With the dilated convolution mapping, which enlarges the receptive field, our method obtains a gain of 1.7% over the single ECA module.

This paper proposes a new attention module, a new FPN structure, and other optimization tricks for small object detection. These methods are added to the baseline one by one for comparative study. Table 6 shows the results of them. For a fair comparison, the IoU threshold was set to 0.5 in all experiments.

Table 6

Results of the proposed modules and methods. √ means that the corresponding method is adopted. 104 denotes the larger scale. D indicates whether the Res block is replaced with the Dense block. R is the connection mode from the backbone to the FPN, where Iden. and FEL denote identity mapping and the use of FEL, respectively. A denotes our proposed attention module.

104	D	R	A	Params (M)	FLOPs (G)	Infer (ms)	AP₅₀(%)
				61.52	32.76	30	45.54
√				69.00	34.71	34	51.90
√	√			71.28	37.22	35	53.70
√	√	Iden.		72.05	38.41	36	54.81
√	√	Iden.	√	76.04	38.87	51	58.12
√	√	FEL	√	83.91	42.93	55	58.86

The original YOLOv3 (first data row of Table 6) has the lowest AP and the shortest inference time. As the proposed methods are added to YOLOv3 one by one, the accuracy increases. AFYOLO (last row of Table 6) achieves the highest AP, which demonstrates the effectiveness of the proposed method. Although the number of parameters increases, the performance is greatly improved.

4.4 Tiny face detection on Wider Face

AFYOLO and YOLOv3 were evaluated on tiny face detection on Wider Face, and the predicted results are shown in Fig. 10. The images on the left (a, c, e) are the results of AFYOLO, and the others are those of YOLOv3.

Fig. 10

Detection results of YOLOv3 and AFYOLO.

False. The detection results of YOLOv3 contain many false detections (yellow padding). In contrast, AFYOLO produces significantly fewer. This reduction in false detections is one of the main reasons for the improvement in accuracy.

Missing. All of these images contain some missed objects (green bounding boxes). YOLOv3 misses more objects, while AFYOLO misses far fewer.

To observe the performance of the proposed method more clearly, especially for small objects, we plotted the Precision-Recall (P-R) curves of the baseline with the proposed modules, as shown in Fig. 11. Small faces and occluded objects are the most difficult to detect, and the corresponding precision is lower. Nevertheless, Fig. 11 shows that AFYOLO improves precision on every task, especially the hard task. AFYOLO achieves the highest precision, exceeding that of the classic YOLOv3. The results indicate that AFYOLO is effective for detecting small and occluded objects.

Fig. 11

Visualization of P-R curves for various difficult tasks.
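For readers unfamiliar with how such curves are obtained, the following sketch builds P-R points from confidence-ranked detections and integrates them into an AP value. The paper does not describe its evaluation code, so the all-point interpolation used here (common in PASCAL VOC-style evaluation) and both function names are assumptions.

```python
def precision_recall_points(detections, num_gt):
    """Return (recall, precision) after each detection, ranked by confidence.

    detections: list of (confidence, is_true_positive) pairs.
    num_gt: number of ground-truth objects.
    """
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    points = []
    for _, is_tp in ordered:
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_gt, tp / (tp + fp)))
    return points


def average_precision(points):
    """Area under the monotone precision envelope of the P-R points."""
    recalls = [0.0] + [r for r, _ in points] + [1.0]
    precisions = [0.0] + [p for _, p in points] + [0.0]
    # Make precision non-increasing when read from left to right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    return sum((recalls[i + 1] - recalls[i]) * precisions[i + 1]
               for i in range(len(recalls) - 1))


# Toy data for illustration only, not taken from the paper.
toy = [(0.9, True), (0.8, False), (0.7, True), (0.6, False)]
toy_points = precision_recall_points(toy, num_gt=2)
toy_ap = average_precision(toy_points)
```

Both failure modes discussed for Fig. 10 show up here: false detections lower the precision at a given recall, while missed objects cap the recall the curve can reach.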

We also trained and tested YOLOv4 [40] on Wider Face. Table 7 shows the results on the validation set. The proposed method achieves the highest AP₅₀ (58.86%), compared with 48.84% for YOLOv4 and 45.54% for YOLOv3.

Table 7

Comparison of YOLO series methods on Wider Face Val Set

Model	Params (M)	FLOPs (G)	FPS	AP₅₀ (%)
YOLOv3	61.52	32.76	33	45.54
YOLOv4	63.94	29.88	47	48.84
AFYOLO	83.91	42.93	18	58.86
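Table 7 reports speed in FPS, whereas Table 6 reports inference time in milliseconds. Assuming the FPS values are simply the reciprocal of the per-image inference time (the paper does not state how they were measured), the two can be related as follows; the helper names are hypothetical.

```python
def fps_from_latency_ms(latency_ms):
    """Frames per second for a given per-image inference time."""
    return 1000.0 / latency_ms


def latency_ms_from_fps(fps):
    """Per-image inference time in milliseconds for a given FPS."""
    return 1000.0 / fps


# Inference times from Table 6: YOLOv3 baseline 30 ms, AFYOLO 55 ms.
yolov3_fps = fps_from_latency_ms(30)
afyolo_fps = fps_from_latency_ms(55)

# FPS from Table 7 for YOLOv4, converted to an approximate per-image time.
yolov4_ms = latency_ms_from_fps(47)
```

Under this assumption, the 30 ms and 55 ms entries of Table 6 correspond to the 33 and 18 FPS entries of Table 7 after dropping the fractional part.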

5 Conclusion

To address the difficulty and low accuracy of small object detection, this paper proposes AFYOLO, an improved method based on YOLOv3. AFYOLO combines the classic YOLOv3 with our proposed attention module, a new FPN, and other tricks for small object detection. Ablation experiments and object detection experiments were carried out on different datasets. The results show that adding the proposed attention module or FPN to some popular networks improves their performance on the ImageNet and MS-COCO benchmarks. On the small object detection task on the Wider Face validation set, the AP of AFYOLO is higher than that of YOLOv3 and YOLOv4. These results verify the effectiveness of the proposed attention module and FPN, and of the proposed detection network for small object detection.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (51209167), the Natural Science Foundation of Shaanxi Province (2019JM-474), the Science and Technology Project of Xi’an (2020KJRC0055), and the Funding of Shaanxi Key Laboratory of Geotechnical and Underground Space Engineering (YT202004).

References

1. Liu, Video Face Detection Based on Deep Learning, Wireless Personal Communications, (2018).

2. Li, Xia, Jiang, et al., 3D Face Mask Presentation Attack Detection Based on Intrinsic Image Analysis, IET Biometrics 9(3) (2020), 100–108.

3. Dai, Hybridnet: A fast vehicle detection system for autonomous driving, Signal Processing: Image Communication 70 (2019), 79–88.

4. Girshick, Donahue, Darrell and Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587. doi: 10.1109/CVPR.2014.81.

5. Girshick, Fast R-CNN, 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448. doi: 10.1109/ICCV.2015.169.

6. Ren, He, Girshick and Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 39(6) (2017), 1137–1149. doi: 10.1109/TPAMI.2016.2577031.

7. Redmon and Farhadi, YOLOv3: An Incremental Improvement, arXiv preprint arXiv:1804.02767, 2018.

8. Redmon and Farhadi, YOLO9000: Better, Faster, Stronger, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6517–6525. doi: 10.1109/CVPR.2017.690.

9. He, Zhang, Ren and Sun, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778. doi: 10.1109/CVPR.2016.90.

10. Huang, Liu, Van Der Maaten and Weinberger K.Q., Densely Connected Convolutional Networks, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2261–2269. doi: 10.1109/CVPR.2017.243.

11. Yang, Luo, Loy C.C. and Tang, WIDER FACE: A Face Detection Benchmark, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5525–5533. doi: 10.1109/CVPR.2016.596.

12. Lin T.Y., et al., Microsoft COCO: Common Objects in Context. In: Fleet D., Pajdla T., Schiele B., Tuytelaars T. (eds) Computer Vision – ECCV 2014. ECCV 2014, Lecture Notes in Computer Science, vol 8693, (2014), Springer, Cham. https://doi.org/10.1007/978-3-319-10602-1_48.

13. Deng, Dong, Socher, Li, Li and Fei-Fei, ImageNet: A large-scale hierarchical image database, 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255. doi: 10.1109/CVPR.2009.5206848.

14. Ren, et al., Target Detection of Rural Buildings in UAV Remote Sensing Images Based on Convolutional Neural Network, Journal of Nanjing Normal University (Engineering and Technology Edition), (2019).

15. Ding, et al., A Comparison: Different DCNN Models for Intelligent Object Detection in Remote Sensing Images, Neural Processing Letters 49(3) (2019), 1369–1379.

16. Liu, SSD: Single Shot MultiBox Detector. In: Leibe B., Matas J., Sebe N., Welling M. (eds) Computer Vision – ECCV 2016. ECCV 2016, Lecture Notes in Computer Science, vol 9905, (2016), Springer, Cham. https://doi.org/10.1007/978-3-319-46448-0_2.

17. Rabbi, Ray, Schubert, et al., Small-Object Detection in Remote Sensing Images with End-to-End Edge-Enhanced GAN and Object Detector Network, Remote Sensing 12(9) (2020), 1432.

18. Goodfellow I.J., Pouget-Abadie, Mirza, et al., Generative Adversarial Networks, Advances in Neural Information Processing Systems 3 (2014), 2672–2680.

19. Li, Dong, Li, et al., Multi-block SSD based on small object detection for UAV railway scene surveillance, Chinese Journal of Aeronautics 33(6) (2020), 1747–1755.

20. Zhang, Wu, Peng, et al., SODNet: Small Object Detection Using Deconvolutional Neural Network, IET Image Processing, 2020.

21. Meng, Song, Li, et al., A Block Object Detection Method Based on Feature Fusion Networks for Autonomous Vehicles, Complexity 2019 (2019), 1–14.

22. Wei, Wen, Haijian, et al., Geospatial Object Detection in High Resolution Satellite Images Based on Multi-Scale Convolutional Neural Network, Remote Sensing 10(1) (2018), 131.

23. Hu, et al., SINet: A Scale-Insensitive Convolutional Neural Network for Fast Vehicle Detection, IEEE Transactions on Intelligent Transportation Systems 20(3) (2019), 1010–1019. doi: 10.1109/TITS.2018.2838132.

24. Geiger, Lenz and Urtasun, Are we ready for autonomous driving? The KITTI vision benchmark suite, 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3354–3361. doi: 10.1109/CVPR.2012.6248074.

25. Liu, Wang and Bi, Research on Multi-target and Small-scale Vehicle Target Detection Method, Control and Decision 36(11) (2021), 2707–2712.

26. Sandler, Howard, Zhu, Zhmoginov and Chen, MobileNetV2: Inverted Residuals and Linear Bottlenecks, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520. doi: 10.1109/CVPR.2018.00474.

27. Hu, Shen, Albanie, Sun and Wu, Squeeze-and-Excitation Networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 42(8) (2020), 2011–2023. doi: 10.1109/TPAMI.2019.2913372.

28. Woo, Park, Lee J.Y. and Kweon I.S., CBAM: Convolutional Block Attention Module. In: Ferrari V., Hebert M., Sminchisescu C., Weiss Y. (eds) Computer Vision – ECCV 2018. ECCV 2018, Lecture Notes in Computer Science, vol 11211, Springer, Cham. https://doi.org/10.1007/978-3-030-01234-2_1.

29. Wang, Wu, Zhu, et al., ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

30. Li, Wang, Hu and Yang, Selective Kernel Networks, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 510–519. doi: 10.1109/CVPR.2019.00060.

31. Chen, Wang, Yang, et al., You Only Look One-level Feature, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

32. Zhao, Shi, Qi, Wang and Jia, Pyramid Scene Parsing Network, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6230–6239. doi: 10.1109/CVPR.2017.660.

33. Shi, et al., Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1874–1883. doi: 10.1109/CVPR.2016.207.

34. Guo, Fan, Zhang, Xiang and Pan, AugFPN: Improving Multi-Scale Feature Learning for Object Detection, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 12592–12601. doi: 10.1109/CVPR42600.2020.01261.

35. Lin, Dollár, Girshick, He, Hariharan and Belongie, Feature Pyramid Networks for Object Detection, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 936–944. doi: 10.1109/CVPR.2017.106.

36. Liu, Qi, Qin, Shi and Jia, Path Aggregation Network for Instance Segmentation, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759–8768. doi: 10.1109/CVPR.2018.00913.

37. He, Gkioxari, Dollár and Girshick, Mask R-CNN, 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2980–2988. doi: 10.1109/ICCV.2017.322.

38. Lin, Goyal, Girshick, He and Dollár, Focal Loss for Dense Object Detection, 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2999–3007. doi: 10.1109/ICCV.2017.324.

39. Chen, Wang, Pang, et al., MMDetection: Open MMLab Detection Toolbox and Benchmark, arXiv preprint arXiv:1906.07155, 2019.

40. Bochkovskiy, Wang and Liao, YOLOv4: Optimal speed and accuracy of object detection, arXiv preprint arXiv:2004.10934, 2020.
