Abstract
Object localization has been the focus of research in Fine-Grained Visual Categorization (FGVC). To improve the accuracy and precision of object localization in multi-branch networks, as well as the robustness and generality of localization methods, our study focuses on how to combine coordinate attention with feature activation maps for target localization. The model in this paper has three branches: a raw branch, an object branch and a part branch. Images are fed directly into the raw branch. The Coordinate Attention Object Localization Module (CAOLM) localizes and crops the object in the image to generate the input for the object branch. The Attention Partial Proposal Module (APPM) proposes part regions at different scales. The three kinds of input images undergo end-to-end weakly supervised learning through the different branches of the network. The model expands the receptive field and captures multi-scale features with the Selective Branch Atrous Spatial Pooling Pyramid (SB-ASPP). The Selective Branch Block (SBBlock) fuses the feature maps obtained from the raw branch and the object branch, so that the complete features of the raw branch supplement the information missing from the object branch. Extensive experiments on the CUB-200-2011, FGVC-Aircraft and Stanford Cars datasets show that our method achieves the best classification performance on FGVC-Aircraft and competitive performance on the other datasets. Our model also has few parameters and fast inference.
Introduction
The purpose of Fine-Grained Visual Categorization (FGVC) is to distinguish fine-grained object categories through the subtle visual differences between subordinate categories of the same kind of object. CUB-200-2011 [21], Stanford Cars [13] and FGVC-Aircraft [15] are all commonly used datasets in the FGVC field. Different sub-categories usually differ only in barely distinguishable visual details, and influencing factors such as deformation, occlusion, care, and shooting angle make FGVC a challenging task. Convolutional neural networks (CNNs) still dominate image classification, but a traditional single CNN does not perform well in FGVC. State-of-the-art (SOTA) methods usually choose a CNN with strong feature extraction capability as the backbone network and add improvements that enable the network to capture the fine-grained differences in FGVC tasks, which improves classification performance.
[17] proposed an object-part attention method, which exploits object and part attention to extract subtle local differences and achieves excellent results in distinguishing subordinate categories. It demonstrates that methods which use more than one deep learning model to focus on different part regions of the object are effective in fine-grained visual classification. Recurrent Attention Convolutional Neural Network (RA-CNN) [5] uses three branches to iteratively generate regional attention from coarse-grained to fine-grained, and recursively learns the discriminative regional attention and the region-based feature representation in a mutually reinforcing way. It concatenates the feature vectors of multiple region parts for classification. However, RA-CNN can only amplify the most important features and pays insufficient attention to other finer-grained features, which limits its effectiveness. MMAL-Net [29] designs the Attention Partial Proposal Module (APPM) around a sliding window and uses non-maximum suppression (NMS) to crop out multiple feature regions of different scales and different parts. Like RA-CNN, it focuses on the main feature regions, but it can also attend to other finer-grained features. The three branches of MMAL-Net share the parameters of the fully connected layer and the convolutional layers, so the model classifies objects of different scales and parts well. However, the attention object localization module (AOLM) used in MMAL-Net needs to intercept the output features of two different layers of the convolutional neural network to achieve good localization performance. Our experimental results show that even if the backbone network of MMAL-Net is changed from the original ResNet-50 [8] to ResNeXt-50 [28], which has the same architecture but stronger classification ability, the classification accuracy of the model drops significantly.
In addition, the localization accuracy of AOLM will drop significantly as the network focuses more and more on discriminative regions. MMAL-Net’s reliance on the classification results output by a single branch also leads to its underutilization of the performance of multi-branch models.

The structure of multi-branch selection fusion fine-grained classification algorithm based on coordinate attention localization.
Based on the advantages and weaknesses of MMAL-Net, we propose our fine-grained classification algorithm. Our model structure is shown in Fig. 1. In the training phase, our model consists of three branches. The raw branch is trained directly on the original image and learns the overall features from the complete information of the object. The output of the last convolutional layer in the raw branch is sent to CAOLM as input to obtain the bounding box information. After the fusion of multi-scale information and the weight redistribution of the coordinate attention module, the object image contains all the structural features of the object as well as more fine-grained features, and background that does not help classification is cropped away. APPM crops out of the object image the parts that have the greatest degree of differentiation and the least degree of mutual inclusion; these part images enter the part branch for training. In this way, the network can learn the features of different parts and scales of the target. Inspired by Selective Kernel Net (SKNet) [14], we propose a selective branch block (SBBlock), which selects and fuses the output features of different branches so that the cropped image does not miss discriminative information present in the raw image. The Selective Branch Atrous Spatial Pooling Pyramid (SB-ASPP) reduces the impact that the network's increasing focus on more discriminative regions has on the localization ability of the Coordinate Attention Object Localization Module (CAOLM). It expands the receptive field of the network and captures multi-scale features. The three branches of our model share the parameters of the fully connected layer and the convolutional layers. In this way, the model learns to classify objects of different scales and parts well.
Our contributions are summarized as follows. (1) CAOLM is proposed to overcome the limitation that AOLM localizes well only under certain conditions; it improves localization by attending to coordinate positions. (2) SB-ASPP is proposed to alleviate both the decline in AOLM localization accuracy after a certain period of network training and the grid effect problem of ASPP. (3) SBBlock is proposed to selectively fuse features from different branches for better classification. (4) On all datasets used in the experiments, our method is competitive with the best algorithms and consistently outperforms MMAL-Net, while the number of parameters of our entire model is only comparable to that of ResNet-101.
Attention mechanism
In the field of image classification, attention mechanisms are special structures embedded in models that mimic the bias of human visual observation towards dominant features. They automatically learn how much each part of the input data contributes and reweight it accordingly. SENet [10] adaptively redistributes weights in the channel dimension through squeeze and excitation operations, enabling the network to focus on more effective channels. SKNet [14] applies group convolution and branch convolution so that the network adaptively adjusts its receptive field size. Spatial attention is used in CBAM [26], which calculates attention maps in the spatial and channel dimensions in turn and multiplies them by the input feature map to refine the feature map weights. The recent method [7] combines vertical max-pooling with horizontal max-pooling and stitches the results together to form an attention map. In this way, important location information can be obtained more effectively. Inspired by these methods [7,14,26], we design the coordinate attention part in CAOLM.
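The directional pooling idea described above can be illustrated with a minimal sketch. It is not the implementation of [7] or of CAOLM: the function name `directional_attention` is hypothetical, the input is assumed to be a single non-negative (post-ReLU) feature map, and combining the two pooled vectors with an outer product is an assumption made only for illustration.

```python
def directional_attention(feature_map):
    """Build a 2D attention map from row-wise and column-wise max-pooling."""
    # Horizontal pooling: the strongest response in each row.
    rows = [max(row) for row in feature_map]
    # Vertical pooling: the strongest response in each column.
    cols = [max(col) for col in zip(*feature_map)]
    # Assumption: combine both directions with an outer product,
    # then rescale by the largest entry so the map lies in [0, 1].
    attn = [[r * c for c in cols] for r in rows]
    peak = max(max(row) for row in attn)
    if peak <= 0:
        return attn
    return [[v / peak for v in row] for row in attn]


fmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.8, 0.0],
    [0.0, 0.7, 1.0, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]
attention = directional_attention(fmap)
```

Positions whose row and column both contain strong responses receive the largest weights, which is how pooling along each axis separately preserves location information.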
Localization-classification subnetworks
A model using this approach usually consists of a classification sub-network and an object localization sub-network. The accuracy of the object localization sub-network is the focus of research in this direction. The classification sub-network classifies by learning the finer-grained object region features generated by the object localization sub-network. Earlier methods such as Mask-CNN [25] required part-level annotations to help train the classification sub-network. Mask-CNN generates multiple informative part images and classifies by concatenating the part-level feature vectors to represent the entire object. Generating and maintaining additional annotations is labor-intensive, which limits practical applications. Therefore, methods such as NTS-Net [2], TASN [3], RA-CNN [5] and [7] use attention mechanisms to locate the object region and avoid this additional overhead. MMAL-Net [29] localizes the target simply by finding the regions of the feature map where the response value is greater than the average value.
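The mean-threshold localization attributed to MMAL-Net above can be sketched as follows. This is a simplified illustration under stated assumptions, not the AOLM code: `aggregate_channels` and `locate_object` are hypothetical names, the feature tensor is a plain list of channel maps, and the box is taken over all above-mean cells without any connected-component filtering.

```python
def aggregate_channels(features):
    """Sum a list of C feature maps (each H x W) into one activation map."""
    height, width = len(features[0]), len(features[0][0])
    return [
        [sum(channel[r][c] for channel in features) for c in range(width)]
        for r in range(height)
    ]


def locate_object(activation_map):
    """Return (row_min, col_min, row_max, col_max) of cells above the mean."""
    values = [v for row in activation_map for v in row]
    mean = sum(values) / len(values)
    hits = [
        (r, c)
        for r, row in enumerate(activation_map)
        for c, v in enumerate(row)
        if v > mean
    ]
    if not hits:
        return None  # no cell responds above the average
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return (min(rows), min(cols), max(rows), max(cols))


channel_1 = [[0, 0, 0, 0], [0, 2, 1, 0], [0, 1, 3, 0], [0, 0, 0, 0]]
channel_2 = [[0, 0, 0, 0], [0, 1, 2, 0], [0, 2, 1, 0], [0, 0, 0, 0]]
activation = aggregate_channels([channel_1, channel_2])
box = locate_object(activation)
```

The box is expressed in feature-map cells; cropping the input image would additionally require scaling by the stride of the network, which is omitted here.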
Atrous convolution
Atrous convolution [2] is widely used in tasks such as semantic segmentation and object detection. It samples the input at regular intervals during convolution, which expands the receptive field and captures multi-scale context information without reducing the resolution. Atrous Spatial Pyramid Pooling (ASPP) [3] uses different sampling intervals in the same layer, which gives precise control over the receptive field of the network and captures multi-scale information. However, because atrous convolution samples the input signal sparsely, it loses local information and weakens the correlation between the sampled information. Hybrid Dilated Convolution (HDC) [22] alleviates this problem by using dilated convolutions with different intervals in different convolutional layers. The method in [23] periodically samples the original feature map to form 4 sets of feature maps with decreasing resolution, convolves each set with the original atrous convolution parameters, and then upsamples and combines the convolution results to collect inconsistent local information. By introducing an attention method, SB-ASPP selectively fuses the outputs of convolutions at different intervals and adds shortcuts, which allows our model to capture multi-scale features while avoiding the loss of local information.
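As a small illustration of interval sampling, the sketch below applies the same three-tap kernel to a one-dimensional signal at two sampling rates. The function name `atrous_conv1d` and the example values are assumptions for illustration; the sketch uses no padding, so the output is shorter than the input, whereas practical layers pad the input to keep the resolution.

```python
def atrous_conv1d(signal, kernel, rate):
    """1D atrous convolution without padding.

    Computes y[i] = sum_k signal[i + rate * k] * kernel[k].
    """
    # Number of input samples spanned by one output sample.
    span = rate * (len(kernel) - 1) + 1
    return [
        sum(signal[i + rate * k] * w for k, w in enumerate(kernel))
        for i in range(len(signal) - span + 1)
    ]


signal = [1, 2, 3, 4, 5, 6, 7, 8, 9]
kernel = [1, 0, -1]
dense = atrous_conv1d(signal, kernel, rate=1)    # spans 3 input samples
dilated = atrous_conv1d(signal, kernel, rate=3)  # spans 7 input samples
```

With rate 3 the kernel reads samples i, i + 3 and i + 6 and never touches the samples in between, which is the sparse sampling behind the local information loss discussed above.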
Method
Selective branch block

The structure of selective kernel convolution.
SKNet [14] proposes a nonlinear method that adjusts the receptive field size of neurons by adaptively aggregating information from multiple convolution kernels. Selective kernel (SK) convolution consists of three operators: split, fusion, and selection. The split operator generates branches with convolution kernels of different sizes, corresponding to receptive fields of different sizes. By combining and aggregating information from multiple paths, the fusion operator obtains a global representation from which the selection weights are computed. According to the selection weights, the selection operator aggregates the feature maps of the different paths and redistributes the weights. SK convolution is computationally lightweight and only slightly increases the parameters and computational cost. Figure 2 shows the structure of the SK convolution. We want to combine and aggregate information from different branches of our model in the same way. We therefore retain the fusion and selection parts of SK convolution and call the result the selective branch block (SBBlock); its structure is shown in Fig. 3.
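The fusion and selection steps can be sketched for two branches as below. This is only an illustration under simplifying assumptions, not the SBBlock implementation: the name `select_and_fuse` is hypothetical, each channel is a flat list of spatial activations, and one hypothetical C x C matrix per branch stands in for the learned fully connected layers of SK convolution.

```python
import math


def select_and_fuse(branch_a, branch_b, weights_a, weights_b):
    """Fuse two branches with per-channel softmax selection weights.

    branch_a, branch_b: lists of C channels, each a flat list of activations.
    weights_a, weights_b: hypothetical C x C matrices replacing learned layers.
    """
    # Fusion: element-wise sum, then global average pooling per channel.
    pooled = [
        sum(a + b for a, b in zip(ch_a, ch_b)) / len(ch_a)
        for ch_a, ch_b in zip(branch_a, branch_b)
    ]
    # One logit per channel and per branch from the pooled descriptor.
    logits_a = [sum(w * s for w, s in zip(row, pooled)) for row in weights_a]
    logits_b = [sum(w * s for w, s in zip(row, pooled)) for row in weights_b]
    fused = []
    for ch_a, ch_b, la, lb in zip(branch_a, branch_b, logits_a, logits_b):
        # Selection: softmax across the two branches (weights sum to 1).
        top = max(la, lb)
        ea, eb = math.exp(la - top), math.exp(lb - top)
        wa, wb = ea / (ea + eb), eb / (ea + eb)
        fused.append([wa * a + wb * b for a, b in zip(ch_a, ch_b)])
    return fused


raw_features = [[1.0, 1.0], [0.0, 2.0]]     # 2 channels, 2 positions each
object_features = [[3.0, 3.0], [4.0, 0.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
averaged = select_and_fuse(raw_features, object_features, zeros, zeros)
```

With all-zero weights both branches receive a selection weight of 0.5, so the block reduces to a plain average; trained weights would instead shift each channel towards whichever branch is more informative.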

The structure of selective branch block.
SB-ASPP is inspired by ASPP [3], which samples the input signal in parallel with multiple different sampling intervals in the same layer. This method can accurately control the spatial resolution and capture features at different scales. We hope that the model can capture features of different scales in different parts of the object before CAOLM localizes the object. Our proposed SB-ASPP is still based on atrous convolution, which allows us to adjust the response in one layer at any resolution desired. When we consider a one-dimensional input

Selective branch atrous spatial pooling pyramid.
By aggregating the feature maps in the channel dimension, the AOLM in [29] obtains the activation map. The activation map is used to generate the bounding box to obtain the object image. We optimized the acquisition method of the activation map, and the specific process is as follows.
Inspired by AttNet [7], CBAM [26] and Coordatt [9], we design the coordinate attention part in CAOLM. The purpose of the attention module is to give greater spatial weight to the region where the recognized object is located.

The structure of split coordinate attention module.
Taking a single branch as an example, we send
So that the model can sufficiently learn the image features obtained by CAOLM and APPM, we adopt the three-branch structure shown in Fig. 1, which consists of a raw branch, an object branch and a part branch. After the CNN of the raw branch extracts features from the raw image, the resulting feature map follows three paths. In the first path, the fully connected layer directly outputs the classification result of the raw branch. In the second path, the feature map is sent to SBBlock together with the output of the last convolutional layer of the object branch, and the fully connected layer then outputs the classification result of the object branch. In the third path, the feature map passes through the SB-ASPP module and is then sent to CAOLM to generate a mask map for cropping the raw image. The cropped image is the input of the object branch, and the feature map produced by the CNN of the object branch follows two paths. In one path, the feature map is sent to SBBlock; in the other, it is sent to APPM, which crops the part regions used as the input of the part branch. Cross-entropy loss is used as the classification loss of all three of our branches, as shown in Eqs (18), (19) and (20)
Experiments
Implementation details
In all the experiments we conducted, we fixed the image size to
Datasets

Dataset samples.
Samples of the datasets used in the experiments (CUB-200-2011 [21], Stanford Cars [13], FGVC-Aircraft [15]) are shown in Fig. 6. The CUB-200-2011 dataset contains a total of 11,788 annotated images, of which 5,994 are used for training and 5,794 for testing. Its images cover 200 different categories of birds. The Stanford Cars dataset contains a total of 16,185 annotated images, of which 8,144 are used for training and 8,041 for testing. Its images cover 196 cars of different brands and models. The FGVC-Aircraft dataset contains a total of 10,000 annotated images, of which 6,667 are used for training and 3,333 for testing. In fine-grained visual classification, these images are typically grouped into 100 different types of aircraft. Classification tasks on these datasets are still challenging.
Comparison results with state-of-the-art algorithms
As shown in Table 1, our method is used to compare with state-of-the-art algorithms (this SOTA information comes from “
We perform ablation studies on the Stanford Cars dataset.
The effectiveness of the three-branch structure. The effectiveness of the recurrent multi-branch model has been demonstrated in [5,29]. In our model (using ResNeXt-50 as the backbone network), if only the raw branch is used, the classification accuracy is 85.3%. Adding the object branch raises the classification accuracy to 93.5%. Our model with all three branches achieves a classification accuracy of 95.7%. The experimental results also show that if the part branch is not used in the inference stage, the inference speed doubles while the classification results remain excellent: the classification accuracy of our model in this case is 95.6%.
The effectiveness of CAOLM. MMAL-Net achieves 94.7% accuracy when using ResNet-50 as the backbone. But when the backbone is replaced with ResNeXt-50, the accuracy of MMAL-Net drops to 93.3% instead. The performance of AOLM degrades when the convolutional neural network focuses on discriminative features very efficiently. Also using ResNet-50 as the backbone network, we replaced the AOLM in MMAL-Net with our CAOLM, and the accuracy increased to 95.1%. When using SBBlock, the accuracy of MMAL-Net improves to 95.3%. These results show that our method effectively improves localization and classification performance.
The effectiveness of changing the backbone. The backbone used by our method is ResNeXt-50, with which it achieves 95.6% classification accuracy on Stanford Cars. This is better than the classification accuracy (95.3%) when our method uses ResNet-50 as the backbone. This indicates that CAOLM generalizes better across backbone networks.
The effectiveness of SB-ASPP. If SB-ASPP is removed in our method, the accuracy drops from 95.6% to 95.0%. SB-ASPP can adjust the network receptive field and focus on global multi-scale information, which alleviates the disadvantage that the localization performance of AOLM will decrease when the convolutional neural network is very effective in focusing on discriminative features.
Object localization performance
We measure localization with PCP (Percentage of Correctly Localized Parts): the percentage of predicted boxes whose IoU (Intersection over Union) with the ground-truth bounding box exceeds 50%. As shown in Table 2, the localization ability of CAOLM is significantly improved compared with SCDA [18] and AOLM [29]. Although multi-layer integration significantly improves the performance of AOLM, we find in actual experiments that choosing which two layers to integrate in a given network requires many experiments. This reduces the generality of AOLM, and the effect of single-layer AOLM is not ideal in practical experiments. The localization accuracy of AOLM can reach a maximum of 85.1%, but it actually decreases to 71.1% as the network is trained. It is worth mentioning that if the Mixup image augmentation strategy is used, AOLM cannot find the intersection of the activation maps of the two layers, which shows that AOLM is not robust to noise in the input signal. The accuracy of our proposed CAOLM can reach up to 87.4%, and it drops to 77.1% after network training. This figure rises to 79.1% when using SB-ASPP. The part branch, trained on the images cropped by APPM, makes the model adaptable to the scale of the object, so it still achieves good classification performance.
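The localization metric used above can be computed as in the following sketch. The function names `iou` and `pcp`, the `(x_min, y_min, x_max, y_max)` box layout, and the example boxes are assumptions for illustration and are not taken from the evaluation code of this paper.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    overlap_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    overlap_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = overlap_w * overlap_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pcp(predicted, ground_truth, threshold=0.5):
    """Percentage of predicted boxes whose IoU with the truth exceeds threshold."""
    hits = sum(1 for p, g in zip(predicted, ground_truth) if iou(p, g) > threshold)
    return 100.0 * hits / len(predicted)


truth = [(0, 0, 10, 10)] * 4
predicted = [
    (0, 0, 10, 10),   # perfect match, IoU = 1.0
    (2, 2, 10, 10),   # IoU = 0.64, counted as correct
    (5, 5, 15, 15),   # IoU = 25 / 175, counted as wrong
    (0, 0, 5, 10),    # IoU = 0.5 exactly, not above the threshold
]
score = pcp(predicted, truth)
```

The strict comparison follows the "more than 50%" wording of the metric, so a box with an IoU of exactly 0.5 is not counted as correctly localized.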
Object localization accuracy on CUB-200-2011
In this paper, we propose an efficient fine-grained classification method that can be trained end-to-end without additional annotations. The multi-branch structure fully utilizes the images obtained by APPM and CAOLM to achieve excellent performance, and uses SBBlock so that the information in different branches complements each other. SB-ASPP captures multi-scale information to alleviate the decline in CAOLM localization ability. We propose a recurrent multi-branch model that uses multiple outputs for classification, and a new coordinate attention-based localization method. Experimental results on the CUB-200-2011, FGVC-Aircraft and Stanford Cars datasets show that our method is very competitive with the state-of-the-art algorithms. A limitation is that the way CAOLM combines attention mechanisms and feature activation maps cannot be corrected by the results. Future research can focus on trainable methods for locating objects through coordinate attention.
Acknowledgements
This work was supported by National Natural Science Foundation of China (No. 62062007, 62272198) and Guangdong Key Laboratory of Data Security and Privacy Preserving (No. 2017B030301004).
