FI-FPN: Feature-integration feature pyramid network for object detection

Abstract

The multi-layer feature pyramid structure, represented by FPN, is widely used in object detection. However, due to the aliasing effect brought by up-sampling, the current feature pyramid structure still has defects, such as loss of high-level feature information and weakening of low-level small object features. In this paper, we propose FI-FPN to solve these problems. It is mainly composed of a multi-receptive field fusion (MRF) module, a contextual information filtering (CIF) module, and an efficient semantic information fusion (ESF) module. Specifically, MRF stacks dilated convolutional layers and max-pooling layers to obtain receptive fields of different scales, reducing the information loss of high-level features; CIF introduces a channel attention mechanism to reassign the channel attention weights; ESF replaces element-wise operations with channel concatenation for bottom-up feature fusion and alleviates aliasing effects, facilitating efficient information flow. Experiments show that under the ResNet50 backbone, our method improves the performance of Faster RCNN and RetinaNet by 3.5 and 4.6 mAP, respectively. Our method has competitive performance compared to other advanced methods.

Keywords

Feature pyramid network; object detection; feature fusion; channel attention

1. Introduction

Object detection is a fundamental task in the field of computer vision and is widely applied in various domains, such as autonomous driving [5], face mask detection [35], unmanned aerial vehicle (UAV) scene analysis [37], and robot vision [34]. In recent years, with the rapid development of deep convolutional network model architectures for extracting image features [14] and advancing object recognition, detectors based on deep convolutional networks have brought higher accuracy to this essential task. This significant improvement also introduces new challenges, including the imbalance of feature levels [8] and inconsistent object sizes [36]. Feature Pyramid Network (FPN) [18] is a representative network structure proposed to address these issues.

FPN extracts multi-layer feature maps of different sizes and dimensions from different feature extraction stages of the backbone model designed for image classification and connects them horizontally into a pyramid network structure. FPN incorporates top-down connections between adjacent levels for incorporating rich semantic features at higher levels into the low-level feature maps to generate feature representations that combine high resolution and strong semantics. Although the FPN structure is simple and effective, there is still room for optimization and improvement. Some existing methods such as NAS-FPN [9], Aug-FPN [12], and Bi-FPN [32] can improve the detection accuracy of object detection methods using FPN structure to some extent at the expense of speed and training duration. However, there are still some problems: 1. Loss of the information in the highest-level feature maps. 2. Weak features of small objects in low-level feature maps. 3. Insufficient feature fusion at each level.

1. Loss of the information in the highest-level feature maps. In the FPN structure, the features of different stages in the backbone network are used, and the $1 \times 1$ convolutional layer is used to directly reduce the dimension to the same dimension for processing between the features of different stages. However, the feature map of the highest level often has a high dimension, but due to the large reduction in the number of channels (e.g., 2048 dimensions directly downscaled to 256 dimensions in ResNet50 [14]), the rich semantic features in the highest-level layer of the pyramid lose much information instead.

2. Weak features of small objects in low-level feature maps. In the FPN structure, the low-level feature map retains detailed information about the image, and the detection of low-level features is more advantageous for objects that occupy a small image area. High-level feature maps have rich semantic information, so for detecting objects with complex shapes and complex features, the high-level features are more effective. However, due to the direct fusion of high-level features into low-level features in the FPN structure, the high-level features with rich semantics occupy the main prominence in the shallow features used to detect small objects, weakening the small object features and thus hindering the detection of small objects [17].

3. Insufficient feature fusion at each level [21]. Feature fusion across levels has been the main focus of FPN structure improvements in recent years. Methods such as PA-FPN [22], NAS-FPN [9], and Bi-FPN [32] densely connect the feature maps of different layers and then stack multiple such network structures. Although this increases the final accuracy, it introduces a large number of parameters and slows down the inference speed of the network.

In response to the above problems, we designed a new feature pyramid network structure named FI-FPN in our work and proposed three modules to optimize the existing problems of the FPN structure. First, we propose the multi-receptive field fusion (MRF) module for addressing the information loss of the top-level feature map. Before the feature pyramid structure, the highest-level features are obtained by stacking dilated convolution and standard convolution to obtain a larger receptive field, covering objects of various sizes. The max-pooling layer is used simultaneously to extract features of different scales, and the features at each stage of the extraction process are concatenated for efficient use of the highest-level features. Second, we proposed the contextual information filtering (CIF) module, which uses two branches for different contextual information extraction and combines the channel attention mechanism [15] to enhance the feature extraction process. Finally, aiming at the problem of insufficient feature fusion of each layer in FPN and the problem of aliasing effect [11], we propose the efficient semantic information fusion module (ESF). This module introduces a new branch that fuses the features of each layer of the feature pyramid by progressive splicing and fusion from low-level to high-level, and maintains the detection performance of the object detector for small objects by subtracting high-level features from low-level features.

Fig. 1.

Visual comparison of detection results obtained with the same backbone. Green bounding boxes represent the results of FPN, while blue bounding boxes represent the results of FI-FPN.

We evaluated our model on the MS COCO dataset [20] by replacing the FPN structure in RetinaNet [19] and Faster R-CNN [29] with FI-FPN. Our method achieved an improvement of 4.6 points and 3.5 points in Average Precision (AP) on ResNet50 as the backbone, respectively, outperforming other advanced FPN-based detectors. As shown in Fig. 1, the improvements from our method can be visualized when using the same backbone network. It is evident that our method can locate the bounding box more accurately and demonstrates better performance for small objects.

2. Related work

2.1. CNN-based object detector

R-CNN [10] introduced CNNs into the field of object detection and became the first two-stage object detection algorithm based on deep learning, greatly improving the performance of object detection. Current mainstream CNN-based object detectors are generally divided into two-stage and one-stage object detectors; they identify and localize objects by learning scale-sensitive features [17].

Among two-stage detectors, SPPNet [25] proposed the Spatial Pyramid Pooling (SPP) strategy to solve the repeated operation problem in R-CNN: it converts any input into a fixed-length output through a pooling operation, thereby avoiding repeated computation of convolutional features. Faster RCNN proposed a Region Proposal Network (RPN) to generate candidate regions instead of relying on precomputed proposals. It uses nine shapes of anchor boxes as initial predictions and then performs regression adjustment on them, which greatly improved the performance of two-stage object detectors; almost all two-stage detectors after Faster RCNN are based on it. Based on Faster RCNN, R-FCN [6] removes each branch’s fully connected layer of independent computation and designs an architecture that shares computation on the entire image to reduce workload and improve detection accuracy. Mask RCNN [13] introduced RoIAlign instead of RoIPooling based on Faster RCNN to obtain a better localization effect and added a mask branch for instance segmentation.

On the other hand, the one-stage object detector uses a unified network to directly regress the location and category of each object and outputs the results directly. YOLO [26] predicted the classification confidence and bounding boxes on a single feature map. It divided the input image into a $7 \times 7$ grid; each grid cell is responsible for predicting the object whose center point falls in that cell, and regresses the position of the center point relative to the cell, the width and height of the object, and the category, enabling real-time object detection. SSD [23] proposed to use feature maps of multiple scales to extract bounding boxes of different sizes to detect objects of different sizes. RetinaNet [19] identified the extreme foreground-background class imbalance encountered during detector training as the main reason for the low accuracy of one-stage object detectors and introduced the focal loss function to address it. RetinaNet also used the FPN structure, used convolutional layers for downsampling, and added the P7 layer with a smaller feature map to speed up the network and greatly improve the accuracy of the one-stage detector, making the one-stage object detector not only faster but also comparable in accuracy to the two-stage object detector. YOLOv3 [28] replaced the softmax classifier with multiple logistic regression classifiers and also introduced the FPN architecture: it up-samples the high-level feature maps twice and fuses each result with the corresponding low-level features, and predicts objects of different scales by setting different anchors on three feature layers. In recent years, one-stage object detection models such as YOLOv3 and YOLOv4 [1] have also widely used the feature pyramid structure.

2.2. Multi-level features

For object detection tasks, resolving the semantic difference between multi-scale features [16] is crucial. FCN [24] and U-Net [30] used skip connections to fuse information from lower layers. FPN used horizontally connected multi-level features and a top-down structure to enhance the detection of objects of various scales, greatly improving the detection effect. PANet [22] was the first to propose a bottom-up path augmentation FPN model, which used accurate low-level localization signals to enhance the entire feature hierarchy, thereby shortening the information path between low-level and top-level features. Long-range feature correspondences are captured in a multi-scale feature pyramid. NAS-FPN introduced Neural Architecture Search (NAS) [38] into the field of object detection. Through NAS, the best FPN architecture was obtained, and multi-layer stacking was performed, which greatly improved the model’s accuracy.

Bi-FPN introduced learnable weights to learn the importance of different input features and increased cross-layer links. Compared with the previous feature pyramid structures, it further strengthens the feature fusion between layers using bidirectional connections. Aug-FPN analyzed the various information losses generated by FPN in the fusion process and proposed a series of solutions. Nevertheless, existing methods are still insufficient for detecting objects of different scales and addressing the problem of information loss in necks.

3. Methodology

Fig. 2.

Overall pipeline of the FI-FPN-based detector. (a) is the backbone network, such as ResNet50; (b) and (c) are the two main components of FI-FPN, where (c) is the ESF module; (d) is the prediction head of the object detector. The three modules MRF, CIF, and ESF are introduced in detail below.

In this section, we detail the proposed pyramid network FI-FPN for optimizing channel information loss and addressing the design flaws existing in the FPN structure. The usage conditions of FI-FPN are fundamentally similar to those of FPN. FI-FPN is a feature fusion technique that enables hierarchical feature fusion across different scales. For backbone networks that extract feature maps at multiple stages and scales (a characteristic possessed by most mainstream and well-performing backbone networks), FI-FPN can be combined with them to improve the performance of computer vision tasks such as object detection.

Its overall framework is shown in Fig. 2(b) and (c). The method shown in the figure takes the common Faster RCNN framework as an example. It takes an image as input and outputs multi-scale features {C2, C3, C4, C5} from the feature extraction stages of a CNN backbone (e.g., ResNet50), which correspond to feature maps with strides of {4, 8, 16, 32} pixels with respect to the input image. The highest layer, C5, extracts its rich semantic information through the MRF module. The remaining layers use the CIF module to reduce the dimension and introduce channel attention. Up-sampling and down-sampling operations are then used to generate the feature pyramid {M2, M3, M4, M5, M6}. Finally, the ESF module performs aliasing effect reduction and bottom-up concatenation along the channel dimension to achieve bottom-up feature fusion. FI-FPN consists of three main components, which we describe in detail below.
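To make the stride values concrete, the following minimal sketch computes the spatial size of each backbone output for a given input size. The function name `pyramid_shapes` and the ceiling-division rounding are illustrative assumptions; an actual backbone may differ by a pixel depending on its padding.

```python
import math

def pyramid_shapes(height, width, strides=(4, 8, 16, 32)):
    """Return {level: (h, w)} for the backbone outputs C2..C5.

    Assumption: each level rounds the spatial size up (ceil division).
    """
    shapes = {}
    for level, stride in enumerate(strides, start=2):
        shapes[f"C{level}"] = (math.ceil(height / stride),
                               math.ceil(width / stride))
    return shapes

# Example input size (800, 1333), chosen here only for illustration.
shapes = pyramid_shapes(800, 1333)
for name, (h, w) in shapes.items():
    print(name, h, w)
```

Under these assumptions, each level halves the resolution of the previous one, so C5 has roughly 1/64 of the spatial positions of C2.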

3.1. Multi-receptive field fusion (MRF)

Fig. 3.

MRF module. (a) shows the overall architecture of the module, and (b) shows the details of the receptive field extraction (RFE) component. $f_{s}$ represents channel reduction, which is implemented using a convolutional layer with a kernel size of 1.

In FPN, the network’s input is the output of different stages of the backbone network. For general backbone networks (VGG [31], ResNet, Darknet [27], EfficientNet), the height and width of the feature map gradually decrease during feature extraction, while the number of channels gradually increases. Taking ResNet50 as an example, the number of output channels in each feature extraction stage is {256, 512, 1024, 2048}, and the high-level features are rich in semantic information after many calculations. In previous work, a $1 \times 1$ convolutional layer is generally used to directly reduce the output of the backbone network to the dimension required by the FPN structure (generally 256). However, retaining the information in high-level features often requires a large number of channels, so directly applying such a large channel reduction leads to the loss of high-level information.
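The size of the channel reduction at each stage can be worked out directly from the numbers above. The sketch below is only arithmetic on the ResNet50 channel counts and the FPN width of 256 stated in the text; the helper names are hypothetical.

```python
# Output channels of the four ResNet50 stages, as listed in the text.
stage_channels = {"C2": 256, "C3": 512, "C4": 1024, "C5": 2048}
fpn_channels = 256  # typical FPN width, per the text

def reduction_ratio(in_channels, out_channels):
    """How many input channels are squeezed into each output channel."""
    return in_channels / out_channels

def conv1x1_weights(in_channels, out_channels):
    """Weight count of a 1x1 convolution, ignoring the bias term."""
    return in_channels * out_channels

ratios = {name: reduction_ratio(c, fpn_channels)
          for name, c in stage_channels.items()}
for name, ratio in ratios.items():
    print(name, ratio, conv1x1_weights(stage_channels[name], fpn_channels))
```

The highest level is compressed eightfold in a single step, whereas the lowest level is not compressed at all, which is the imbalance the MRF module is designed to soften.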

In the work of YOLOF [4], the author only uses the features of the highest level and uses Dilated Encoder and Uniform Matching to obtain information on objects of different scales and successfully achieves results similar to the method of Multiple-in-Multiple-out, which shows that high-level features still have the potential to be exploited.

Inspired by this, we designed the MRF to address the problem of information loss in the FPN. As shown in Fig. 3, we first reduce the highest-level feature map to a dimension higher than that of the other layers (512 dimensions are used in this experiment) and then feed it into the receptive field extraction (RFE) component shown in Fig. 3(b). MRF gradually expands the receptive field of the feature maps to cover objects of more scales by stacking dilated convolution [3] and max-pooling layers. The feature maps output at different stages, each with a different receptive field, are then concatenated to fully integrate the detection capabilities of different receptive fields for objects of different scales. Finally, two feature selection layers $f_{s}$ (i.e., $1 \times 1$ Conv layers for efficiency) are introduced to reduce the dimension to the default channel dimension of FPN.

Based on experience with tuning dilated convolution parameters [33], we designed a dilated convolution sequence with dilation rates {2, 5, 1, 2} (since a convolutional layer with dilation rate 1 is equivalent to an ordinary convolutional layer, we put the two dilated convolutions with rates {1, 2} in the same layer). Ablation experiments show that the MRF can more effectively retain high-level semantic information and improve the model’s accuracy.
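As a rough illustration of how stacking dilated convolutions enlarges the receptive field, the sketch below applies the standard formula for the effective kernel size of a dilated convolution, $k + (k - 1)(d - 1)$, to the sequence {2, 5, 1, 2} read as dilation rates. The $3 \times 3$ kernel size and stride 1 are assumptions not stated in the text, and the max-pooling layers are ignored.

```python
def effective_kernel(kernel_size, dilation):
    """Extent covered by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def receptive_field(dilations, kernel_size=3):
    """Receptive field of a sequential stack of stride-1 dilated convolutions.

    Assumption: every layer uses the same kernel size and stride 1.
    """
    rf = 1
    for d in dilations:
        rf += effective_kernel(kernel_size, d) - 1
    return rf

# Rates 2 and 5 in sequence, then rates 1 and 2 side by side in the last layer.
after_first = receptive_field([2])
after_second = receptive_field([2, 5])
last_layer = [receptive_field([2, 5, d]) for d in (1, 2)]
print(after_first, after_second, last_layer)
```

Under these assumptions the stages see regions of 5, 15, and 17 or 19 positions per side, so concatenating their outputs mixes several receptive field sizes in one feature map.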

3.2. Contextual information filtering (CIF)

In FPN, a $1 \times 1$ convolutional layer is used for channel reduction after the multi-scale features {C2, C3, C4, C5} of the backbone network to generate the {P2, P3, P4, P5} layers. Our method proposes a two-branch channel selection network to replace this process, as shown in Fig. 4. First, two feature selection layers $f_{s}$ are used to reduce the dimension of the input feature map, and then channel attention is extracted from the residual branch. A bottleneck structure is used for feature extraction in branch B of the CIF module. The default number of channels of the feature pyramid is set to the constant parameter C (generally 256), and the reduction ratio is manually set to the constant parameter R. The $f_{a}$ process is implemented using global average pooling and three fully connected (FC) layers, where the number of neurons in the hidden layer is reduced to $C / R$.

Fig. 4.

CIF module, where $f_{s}$ represents channel reduction and $f_{a}$ is the channel attention extraction operation, which is implemented using a global average pooling layer.

Fig. 5.

ESF module. The red line is a branch added for top-down feature fusion and low-level aliasing reduction, where U is the up-sampling operation and S is the element-wise subtraction.

After extracting the rich semantic information in the P5 layer and reducing its loss through the MRF module, the feature fusion from higher layers to lower layers is also an important part of the FPN. In FPN, the high-level features are directly added to the low-level features, and then the up-sampling operation is performed before the feature layer $P^{i}$ to fuse the high-level rich semantic information into the low-level features. The fusion process can be expressed by Eqs (1)–(2). $\begin{array}{ll} (1) & P_{up}^{i} = f_{up} (P^{i} + P_{up}^{i + 1}) \\ (2) & P^{i} = f_{c} (P_{up}^{i + 1} + P^{i}) \end{array}$

Where i is the index of the pyramid level, $P^{i}$ represents the output feature of the i-th layer, $f_{up}$ is the up-sampling operation, and $f_{c}$ is the convolution operation.
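The top-down fusion in Eqs (1)–(2) can be sketched on tiny single-channel maps. This is a simplified illustration, not the paper's implementation: the convolution $f_{c}$ is treated as the identity, $f_{up}$ is assumed to be nearest-neighbour $2\times$ up-sampling, and all function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x up-sampling of a 2D list (stand-in for f_up)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def add(a, b):
    """Element-wise sum of two equally sized 2D lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def top_down(laterals):
    """Fuse maps ordered from low level (large) to high level (small).

    Assumption: f_c is the identity, so each output is the lateral map
    plus the up-sampled output of the level above.
    """
    outputs = [None] * len(laterals)
    outputs[-1] = laterals[-1]
    for i in range(len(laterals) - 2, -1, -1):
        outputs[i] = add(laterals[i], upsample2x(outputs[i + 1]))
    return outputs

low = [[1] * 4 for _ in range(4)]   # 4x4 low-level map
high = [[10, 20], [30, 40]]         # 2x2 high-level map
fused = top_down([low, high])
print(fused[0])
```

Each high-level value is copied onto a $2 \times 2$ block of the low-level map, which shows how summation lets coarse high-level responses dominate fine low-level detail.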

In our work, we use the method shown in Fig. 4 to extract features and then perform up-sampling operations, and the residual path can largely retain high-level information. The red line in Fig. 5 shows the flow path of high-level information in the model. Given that the two branches obtain different semantic information, branch A retains more of the raw high-level information because it uses only $1 \times 1$ convolutional layers, which shortens the fusion path between the upper and lower layers. Furthermore, the channel attention of the feature map is extracted in branch A and used to filter the channels of the features extracted in branch B. The process can be expressed by Eqs (3)–(5). $\begin{array}{ll} (3) & P_{Att}^{i} = \sigma (f_{gp} (f_{c} (P_{up}^{i + 1} + P^{i}))) \\ (4) & P^{i} = f_{s} (\mathrm{concat} (f_{c} (P_{up}^{i + 1} + P^{i}), (P_{up}^{i + 1} + P^{i}) * P_{Att}^{i})) \\ (5) & P_{up}^{i} = f_{up} (P^{i}) \end{array}$

Where $\sigma$ is the sigmoid activation function, $f_{s}$ is a $1 \times 1$ convolutional layer to reduce the dimension, $f_{gp}$ is implemented using a global average pooling layer, and $\mathrm{concat}$ is the concatenation operation along the channel dimension. In the CIF, the up-sampling process not only retains the high-level information but also combines the feature extraction process of this layer, the result of which is used in the ESF for aliasing effect reduction.
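The data flow of Eqs (3)–(4) can be sketched as follows. This is a simplified illustration under loud assumptions: the learned layers ($f_{c}$, the fully connected layers, and $f_{s}$) are replaced by the identity, so only the pooling, sigmoid, channel reweighting, and concatenation steps are shown. All function names are hypothetical.

```python
import math

def global_avg_pool(fmap):
    """fmap is a list of channels, each a 2D list; returns one mean per channel."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]

def channel_attention(fmap):
    """Sigmoid of the pooled channel means (learned layers omitted)."""
    return [1.0 / (1.0 + math.exp(-m)) for m in global_avg_pool(fmap)]

def scale_channels(fmap, weights):
    """Multiply every value of a channel by that channel's attention weight."""
    return [[[v * w for v in row] for row in ch]
            for ch, w in zip(fmap, weights)]

def cif_sketch(fmap):
    """Concatenate the raw features with the attention-filtered features.

    Assumption: f_c and f_s are identities, so the output has twice the
    channels of the input instead of being reduced back to C.
    """
    att = channel_attention(fmap)
    return fmap + scale_channels(fmap, att)

x = [[[0.0, 0.0], [0.0, 0.0]],   # channel 0: all zeros
     [[2.0, 2.0], [2.0, 2.0]]]   # channel 1: all twos
att = channel_attention(x)
out = cif_sketch(x)
print(att, len(out))
```

In a full implementation, a $1 \times 1$ convolution would follow the concatenation to bring the channel count back to C.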

3.3. Efficient semantic information fusion (ESF)

The high-level feature map of the pyramid structure is rich in semantic information, and the low-level feature map is rich in detailed information. The key point in the previous pyramid structure design is how to better integrate the two factors. We propose the ESF, whose structure is shown in Fig. 5. The features of the high-level feature map are fused to the low-level feature map using multiple up-sampling. However, since the up-sampling process uses summation for feature fusion, it will cause aliasing effects in the low-level feature maps. We use an aliasing effect reduction module for the lowest two feature layers to mitigate the impact of aliasing effects on model localization information. Taking the implementation of FI-FPN on Faster RCNN as an example, the output of the P2 layer subtracts the aliasing information extracted from {M3, M4, M5} layers after up-sampling, and the output of the P3 layer subtracts the aliasing information extracted from {M4, M5} layers after up-sampling, to weaken the aliasing effect.

Inspired by PANet [22], we finally concatenate the feature maps of each level along the channel dimension from the bottom to the top, which achieves feature fusion from the lower layers to the upper layers and enhances the model’s ability to detect objects of different sizes.
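One way to see why concatenation can preserve more information than element-wise summation is to compare the two on the channel vector of a single pixel. The toy functions below are illustrative assumptions, not the ESF implementation.

```python
def fuse_sum(a, b):
    """Element-wise sum: output has the same number of channels as each input."""
    return [x + y for x, y in zip(a, b)]

def fuse_concat(a, b):
    """Channel concatenation: output keeps both inputs side by side."""
    return a + b

# Two different input pairs that an element-wise sum cannot tell apart.
sum_first = fuse_sum([1.0], [3.0])
sum_second = fuse_sum([2.0], [2.0])
cat_first = fuse_concat([1.0], [3.0])
cat_second = fuse_concat([2.0], [2.0])
print(sum_first == sum_second, cat_first == cat_second)

# Concatenating two 256-channel vectors doubles the width.
wide = fuse_concat([0.0] * 256, [0.0] * 256)
print(len(wide))
```

The cost of concatenation is the wider output, which is why a channel reduction layer typically follows it.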

4. Experiments

4.1. Dataset and evaluation metrics

We evaluate our model on the MS COCO benchmark. It contains 115k training images (train2017), 5k validation images (val2017), and 20k test images (test-dev) whose labels are not publicly released, with a total of 80 object categories. We train all models on train2017 and report the results of ablation experiments on val2017. The final results on test-dev are obtained from the evaluation server. All reported results follow the standard COCO-style average precision (AP) metric.

4.2. Implementation details

All our experiments are implemented based on MMDetection [2]. We train and validate the model on an NVIDIA Tesla V100 (32GB) GPU. All experiments use the default configuration of MMDetection, and the short side of the input image is resized to 800 pixels. No additional image augmentation is used, and the batch size is 8. In Table 1, 1× represents training for 12 epochs and decreasing the learning rate tenfold after the 8th and 11th epochs; 2× represents training for 24 epochs and decreasing the learning rate tenfold after the 16th and 22nd epochs; 50e represents training for 50 epochs and decreasing the learning rate tenfold after the 30th and 40th epochs.

We use RetinaNet and Faster RCNN as baselines for one-stage and two-stage detectors, with initial learning rates set to 0.01 and 0.005, respectively, and stochastic gradient descent (SGD) optimizer. All other hyperparameters in this paper follow the MMDetection default parameters if not otherwise specified.
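The step schedules described above can be written down as a small function. This sketch is plain Python, not MMDetection configuration; the names `SCHEDULES` and `lr_for_epoch` are hypothetical, epochs are counted from 1, and any warm-up phase is ignored.

```python
# Schedule definitions taken from the text: total epochs and decay points.
SCHEDULES = {
    "1x": {"epochs": 12, "milestones": (8, 11)},
    "2x": {"epochs": 24, "milestones": (16, 22)},
    "50e": {"epochs": 50, "milestones": (30, 40)},
}

def lr_for_epoch(base_lr, epoch, milestones, gamma=0.1):
    """Learning rate used during a 1-indexed epoch.

    The rate is multiplied by gamma once for every milestone epoch
    that has already finished.
    """
    drops = sum(1 for m in milestones if epoch > m)
    return base_lr * gamma ** drops

# RetinaNet starts at 0.01 and Faster RCNN at 0.005, per the text.
retinanet_1x = [lr_for_epoch(0.01, e, SCHEDULES["1x"]["milestones"])
                for e in range(1, SCHEDULES["1x"]["epochs"] + 1)]
print(retinanet_1x)
```

With the 1× schedule, epochs 1 to 8 run at the base rate, epochs 9 to 11 at one tenth of it, and epoch 12 at one hundredth.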

4.3. Main results

Table 1
Comparisons with state-of-the-art methods on COCO test-dev. Top: baselines for one-stage and two-stage detectors; Middle: state-of-the-art FPN-based methods; Bottom: our implementations. ‘$^{*}$’ indicates that the method uses additional data augmentation.

Method	Backbone	Size	Schedule	AP	${AP}_{50}$	${AP}_{75}$	${AP}_{S}$	${AP}_{M}$	${AP}_{L}$
RetinaNet	MobileNetV2	800	1×	27.1	43.6	28.1	13.4	29.1	35.9
RetinaNet	ResNet-50	800	1×	36.3	55.5	38.7	20.5	40.1	47.5
RetinaNet	ResNet-101	800	1×	39.1	59.1	42.3	21.8	42.7	50.2
Faster RCNN	ResNet-50	800	1×	37.4	58.2	40.4	21.2	40.8	48.1
Faster RCNN	ResNet-101	800	1×	39.4	60.2	43.0	22.3	43.3	49.9
Faster RCNN	ResNet-101	800	2×	39.9	60.3	43.2	23.0	43.8	52.9
Mask RCNN	ResNet-50	800	1×	38.2	58.8	41.4	21.9	40.9	49.5
Mask RCNN	ResNet-101	800	1×	40.4	61.2	44.1	23.1	43.5	50.8
RetinaNet w Aug-FPN	ResNet-50	800	1×	37.5	58.4	40.1	21.3	40.5	47.3
RetinaNet w NAS-FPN	ResNet-50	$640^{*}$	50e	40.5	58.4	43.2	20.0	46.6	56.3
Faster RCNN w Aug-FPN	ResNet-50	800	1×	38.8	61.5	42.0	23.3	42.1	47.7
Faster RCNN w Aug-FPN	ResNet-101	800	1×	40.6	63.2	44.0	24.0	44.1	51.0
Libra RetinaNet w Balanced FPN	ResNet-50	800	1×	37.8	57.5	40.5	21.5	40.8	47.4
Libra Rcnn w Balanced FPN	ResNet-50	800	1×	38.6	60.0	42.0	22.4	41.3	47.7
Libra Rcnn w Balanced FPN	ResNet-101	800	1×	40.2	61.2	44.1	22.7	43.6	52.1
RetinaNet w FI-FPN (ours)	MobileNetV2	800	1×	29.7	46.3	31.2	15.8	32.2	38.6
RetinaNet w FI-FPN (ours)	ResNet-50	800	1×	40.9	60.5	43.8	24.3	44.2	49.3
RetinaNet w FI-FPN (ours)	ResNet-101	800	1×	42.0	61.6	45.0	25.3	45.4	51.4
Faster RCNN w FI-FPN (ours)	ResNet-50	800	1×	40.9	61.8	44.6	24.0	44.4	50.2
Faster RCNN w FI-FPN (ours)	ResNet-101	800	1×	42.2	63.0	46.0	24.9	45.5	52.2

To verify the effectiveness of our work, we evaluate the most popular one-stage and two-stage detectors with FI-FPN on the COCO test-dev set and compare them with other state-of-the-art FPN-based object detection methods in Table 1.

In the one-stage object detection model, by replacing the FPN of the RetinaNet model with our FI-FPN, an improvement of 4.6 AP and 2.9 AP was achieved with ResNet50 and ResNet101 as the backbone network, respectively, and 40.9 AP and 42.0 AP were achieved without using additional data augmentation and without changing the loss function. In the typical two-stage model Faster RCNN, when ResNet50 and ResNet101 are used as backbone networks, it increases by 3.5 AP and 2.8 AP, from 37.4 AP and 39.4 AP to 40.9 AP and 42.2 AP, respectively.
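The reported gains can be checked directly against the AP column of Table 1. The sketch below only subtracts the tabulated values; the dictionary layout is an illustrative choice.

```python
# AP values copied from Table 1 (COCO test-dev, 1x schedule).
baseline_ap = {
    ("RetinaNet", "MobileNetV2"): 27.1,
    ("RetinaNet", "ResNet-50"): 36.3,
    ("RetinaNet", "ResNet-101"): 39.1,
    ("Faster RCNN", "ResNet-50"): 37.4,
    ("Faster RCNN", "ResNet-101"): 39.4,
}
fi_fpn_ap = {
    ("RetinaNet", "MobileNetV2"): 29.7,
    ("RetinaNet", "ResNet-50"): 40.9,
    ("RetinaNet", "ResNet-101"): 42.0,
    ("Faster RCNN", "ResNet-50"): 40.9,
    ("Faster RCNN", "ResNet-101"): 42.2,
}

# Gain in AP points from swapping FPN for FI-FPN, rounded to one decimal.
gains = {key: round(fi_fpn_ap[key] - baseline_ap[key], 1)
         for key in baseline_ap}
for (detector, backbone), gain in gains.items():
    print(f"{detector} / {backbone}: +{gain} AP")
```

The gains shrink as the backbone gets stronger for both detectors, from 4.6 to 2.9 AP for RetinaNet and from 3.5 to 2.8 AP for Faster RCNN.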

The results of recent years’ state-of-the-art FPN-based methods for the same experimental settings in MMDetection are also presented in Table 1. The experiments demonstrate the improvement of our method compared to the state-of-the-art FPN-based method in different models.

Fig. 6.

Comparison of PR curves. IOU refers to the intersection-over-union. The results are reported based on the COCO val2017 dataset.

Fig. 7.

Example comparison of qualitative results after replacing the neck of the object detection model. When FPN is the neck, the detection result is represented by a green bounding box; when FI-FPN is the neck, the detection result is represented by a blue bounding box. Both methods are implemented in RetinaNet with ResNet50 as the backbone, and the training settings and training duration are the same.

Figure 6 compares the precision-recall (PR) curves before and after replacing the FPN structure with FI-FPN. Experiments show that FI-FPN improves both precision and recall at different IoU thresholds, which helps the detection model detect more positive objects.
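For readers unfamiliar with the quantities behind a PR curve, the sketch below computes the IoU of two axis-aligned boxes and the precision and recall for given detection counts. The `(x1, y1, x2, y2)` box format and the function names are assumptions made for illustration; this is not the COCO evaluation code.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(true_pos, false_pos, false_neg):
    """Precision and recall from detection counts at one IoU threshold."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
disjoint = iou((0, 0, 1, 1), (2, 2, 3, 3))
print(overlap, disjoint, precision_recall(8, 2, 2))
```

A detection counts as a true positive only when its IoU with a ground-truth box reaches the chosen threshold, so raising the threshold makes the same set of detections score lower on both axes.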

We also visualize the comparison of our FI-FPN and FPN performance in Fig. 7. It can be seen that the FPN will miss or incorrectly locate some objects that are too small or occluded, which may be caused by insufficient feature fusion or insufficient acquisition of receptive field information. FI-FPN can effectively optimize these problems and produce satisfactory results. These images are output by RetinaNet with FPN and RetinaNet with FI-FPN based on ResNet50 backbone with 1× training duration, respectively, marked in green and blue, with a threshold of 0.5. The input images are selected from the COCO val2017 dataset.

4.4. Ablation study

4.4.1. Validating the effects of each module component

To quickly and effectively validate the effects of each component and verify the performance of our method on various datasets, we use the PASCAL VOC [7] dataset and ResNet50 backbone network when validating the components of each submodule. The PASCAL VOC challenge is a world-class image classification and object detection competition, containing a total of 16k images. We select the training and validation sets from PASCAL VOC 2007 and PASCAL VOC 2012 as the training sets for this experiment. Since the annotations for the PASCAL VOC 2012 test set have not been released, we use the PASCAL VOC 2007 test set for this experiment. Apart from using the PASCAL VOC dataset, the experimental settings remain unchanged.

To validate the optimal configuration of each component in the MRF module, we add different improvement strategies to the original FPN structure; the experimental results are shown in Table 2. Conv refers to using an additional convolutional branch, Dila Conv refers to adding dilated convolution in that branch, and Channel concat and Channel sum represent two feature map fusion methods. Based on the ablation study results in Table 2, we ultimately chose the combination of dilated convolution and Channel concat, which can effectively improve the model’s detection accuracy.

The CIF and ESF modules are typically used in combination. To validate the optimal configuration of each component in these modules, we add different improvement strategies to the original FPN structure; the experimental results are shown in Table 3.

In Table 3, channel attention refers to using an additional channel attention branch; sub path refers to the path that reduces aliasing effects through feature map up-sampling; and down-up path refers to the bottom-up feature fusion path. Based on the ablation study results in Table 3, we ultimately chose the combination of the channel attention mechanism, sub path, and down-up path, which can effectively improve the model’s detection accuracy.

Table 2
Ablation study results of different components in the MRF module on the VOC dataset

RFE: Conv	RFE: Dila Conv	Fusion: Channel concat	Fusion: Channel sum	mAP(%)
				76.5
		✓		76.8
✓		✓		78.0
✓	✓	✓		78.5
✓	✓		✓	78.1

Table 3

Ablation study results of different components in the CIF and ESF modules on the VOC dataset

Channel attention	Sub path	Down-up path	mAP(%)
			76.5
✓	✓		77.4
		✓	78.2
✓		✓	78.6
	✓	✓	78.6
✓	✓	✓	79.2

4.4.2. Validating the effects of each module in FI-FPN

Meanwhile, we also validate the impact of each module in FI-FPN on the COCO dataset. The ablation experiment is performed on the RetinaNet with ResNet50 as the backbone network, with input images reshaped to a resolution of $(640, 640)$ and a training duration of 1×. For comparison, all experiments use the same hyperparameters and training settings. We progressively add MRF, CIF, and ESF modules to the standard RetinaNet network. By experimenting with the improvements brought by the combination of CIF and ESF modules, we aim to demonstrate the effectiveness of each module. The overall ablation study is reported in Table 4.

Table 4
The effect of each component; results are reported on COCO val2017

MRF	CIF	ESF	AP	${AP}_{50}$	${AP}_{75}$	${AP}_{S}$	${AP}_{M}$	${AP}_{L}$
			35.6	53.8	37.9	16.2	40.5	50.8
✓			37.1	55.3	39.5	17.0	42.3	53.2
	✓		36.3	54.4	38.9	17.4	40.8	50.6
		✓	36.5	54.2	39.0	16.5	41.4	53.0
	✓	✓	37.3	54.7	39.4	18.0	42.5	52.5
✓	✓	✓	38.0	55.8	40.5	17.7	43.6	53.3

The experimental results show that the MRF module alone improves AP by 1.5 over the FPN baseline, which is in line with the original intention of the design. The combination of CIF and ESF modules improves AP by 1.7, more than either module alone (0.7 and 0.9, respectively), showing the advantage of combining the two modules. With all three modules, our method improves AP by 2.4 overall.
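
The gains discussed for Table 4 can be re-derived directly from the AP column. The short script below copies the AP values from Table 4 and computes each configuration's improvement over the baseline; the dictionary keys are labels chosen here for readability.

```python
# AP values copied from Table 4 (RetinaNet, ResNet50, COCO val2017).
table4 = {
    "baseline": 35.6,
    "MRF": 37.1,
    "CIF": 36.3,
    "ESF": 36.5,
    "CIF+ESF": 37.3,
    "MRF+CIF+ESF": 38.0,
}

baseline = table4["baseline"]
gains = {
    name: round(ap - baseline, 1)
    for name, ap in table4.items()
    if name != "baseline"
}
for name, gain in gains.items():
    print(f"{name}: +{gain} AP")
```

All differences are absolute AP points rather than relative percentages.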

4.5. Runtime analysis

We also measure the inference time of FI-FPN. The average inference time is calculated by sequentially feeding all images of COCO val2017 into the model, with the input image size unified to (800, 1333). We run the test on an NVIDIA Tesla V100 (32 GB) GPU. RetinaNet with FPN runs at 16.5 fps; for comparison, after replacing the neck of RetinaNet with FI-FPN, the model runs at 15.3 fps. Inference speed drops by only 1.2 fps, while accuracy increases by 4.6 mAP. The results show that our method offers a favorable trade-off between accuracy and inference speed.
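
To put the frame rates in terms of per-image latency, the following sketch converts the two reported fps values into milliseconds per image and computes the relative throughput drop. Only the two fps figures come from the measurements above; the helper name is an assumption.

```python
def fps_to_ms(fps):
    """Mean per-image latency in milliseconds for a given throughput."""
    return 1000.0 / fps

fpn_fps, fi_fpn_fps = 16.5, 15.3   # values reported in Section 4.5

fpn_ms = fps_to_ms(fpn_fps)
fi_fpn_ms = fps_to_ms(fi_fpn_fps)
extra_ms = fi_fpn_ms - fpn_ms
relative_drop = (fpn_fps - fi_fpn_fps) / fpn_fps

print(f"FPN: {fpn_ms:.1f} ms, FI-FPN: {fi_fpn_ms:.1f} ms")
print(f"extra latency: {extra_ms:.1f} ms, throughput drop: {relative_drop:.1%}")
```

Under these figures, FI-FPN adds roughly 4.8 ms per image, which corresponds to a throughput drop of about 7%.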

5. Conclusion

In this paper, we proposed FI-FPN, a new feature pyramid network for object detection that addresses the defects of FPN. To reduce the loss of high-level feature information, we proposed the MRF module, which uses dilated convolutions to obtain receptive fields at different scales. We also proposed the CIF module, which uses a dual-branch structure with a channel attention mechanism to redistribute channel attention weights and enhance features. Finally, we proposed the ESF module to optimize and fuse feature information from different levels. Experiments show that FI-FPN improves the performance of mainstream object detection frameworks on the challenging MS COCO dataset.

References

1. Bochkovskiy, C.Y. Wang and H.Y.M. Liao, Yolov4: Optimal speed and accuracy of object detection, arXiv preprint, 2020. arXiv:2004.10934.

2. Chen, Wang, Pang et al., MMDetection: Open mmlab detection toolbox and benchmark, arXiv preprint, 2019. arXiv:1906.07155.

3. L.C. Chen, Papandreou, Schroff et al., Rethinking atrous convolution for semantic image segmentation, arXiv preprint, 2017. arXiv:1706.05587.

4. Chen, Wang, Yang et al., You only look one-level feature, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13039–13048.

5. Feng, Haase-Schütz, Rosenbaum et al., Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges, IEEE Transactions on Intelligent Transportation Systems 22(3) (2020), 1341–1360.

6. Dai, Li, He et al., R-fcn: Object detection via region-based fully convolutional networks, Advances in Neural Information Processing Systems 29 (2016).

7. Everingham, Van Gool, C.K.I. Williams et al., The Pascal visual object classes (VOC) challenge, International Journal of Computer Vision 88 (2010), 303–338. doi:10.1007/s11263-009-0275-4.

8. Ge, Jie, Huang et al., Delving deep into the imbalance of positive proposals in two-stage object detection, Neurocomputing 425 (2021), 107–116. doi:10.1016/j.neucom.2020.10.098.

9. Ghiasi, T.Y. Lin and Q.V. Le, Nas-fpn: Learning scalable feature pyramid architecture for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7036–7045.

10. Girshick, Donahue, Darrell et al., Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.

11. Gong, Yu, Ding et al., Effective fusion factor in FPN for tiny object detection, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 1160–1168.

12. Guo, Fan, Zhang et al., Augfpn: Improving multi-scale feature learning for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12595–12604.

13. He, Gkioxari, Dollár et al., Mask r-cnn, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.

14. He, Zhang, Ren et al., Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

15. Hu, Shen and Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.

16. Li, Chen, Wang et al., Scale-aware trident networks for object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6054–6063.

17. Li, Pang, Shen et al., NETNet: Neighbor erasing and transferring network for better single shot object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13349–13358.

18. T.Y. Lin, Dollár, Girshick et al., Feature pyramid networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.

19. T.Y. Lin, Goyal, Girshick et al., Focal loss for dense object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.

20. T.Y. Lin, Maire, Belongie et al., Microsoft coco: Common objects in context, in: European Conference on Computer Vision, 2014, pp. 740–755.

21. Liu, Huang and Wang, Learning spatial fusion for single-shot object detection, arXiv preprint, 2019. arXiv:1911.09516.

22. Liu, Qi, Qin et al., Path aggregation network for instance segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759–8768.

23. Liu, Anguelov, Erhan et al., Ssd: Single shot multibox detector, in: European Conference on Computer Vision, 2016, pp. 21–37.

24. Long, Shelhamer and Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.

25. Purkait, Zhao and Zach, SPP-Net: Deep absolute pose regression with synthetic views, arXiv preprint, 2017. arXiv:1712.03452.

26. Redmon, Divvala, Girshick et al., You only look once: Unified, real-time object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.

27. Redmon and Farhadi, YOLO9000: Better, faster, stronger, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.

28. Redmon and Farhadi, Yolov3: An incremental improvement, arXiv preprint, 2018. arXiv:1804.02767.

29. Ren, He, Girshick et al., Faster r-cnn: Towards real-time object detection with region proposal networks, Advances in Neural Information Processing Systems 28 (2015).

30. Ronneberger, Fischer and Brox, U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.

31. Simonyan and Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint, 2014. arXiv:1409.1556.

32. Tan, Pang and Q.V. Le, Efficientdet: Scalable and efficient object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10781–10790.

33. Wang, Chen, Yuan et al., Understanding convolution for semantic segmentation, in: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 1451–1460. doi:10.1109/WACV.2018.00163.

34. H.A.M. Williams, M.H. Jones, Nejati et al., Robotic kiwifruit harvesting using machine vision, convolutional neural networks, and robotic arms, Biosystems Engineering 181 (2019), 140–156. doi:10.1016/j.biosystemseng.2019.03.007.

35. Wu, Li, Zeng et al., FMD-Yolo: An efficient face mask detection method for COVID-19 prevention and control in public, Image and Vision Computing 117 (2022), 104341. doi:10.1016/j.imavis.2021.104341.

36. Zhao, Shi, Qi et al., Pyramid scene parsing network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2881–2890.

37. Zhu, Lyu, Wang et al., TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2778–2788.

38. Zoph and Q.V. Le, Neural architecture search with reinforcement learning, in: International Conference on Learning Representations, 2018, pp. 8697–8710.
