Multi-feature fusion of convolutional neural networks for Fine-Grained ship classification

Abstract

Fine-grained ship classification is challenging because the visual differences between subcategories are small. Due to this high similarity between subcategories, it is very difficult to classify ship objects without bounding box or part annotations. In this paper, we propose a model that combines multiple deep CNN features and uses a fusion strategy to explore the relationships among multi-scale features. Because CNN features at different levels/depths have different properties, we combine multiple low-level local CNN features with a high-level global CNN feature for object classification. The model shows a good way of tailoring pre-trained CNN models to fine-grained ship classification: it has lower computation and storage costs than some state-of-the-art CNN methods and also achieves strong classification performance on the FGVC-Aircraft and Stanford Cars datasets.

Keywords

Convolutional neural networks, fine-grained classification, ship recognition

1 Introduction

Automatic ship recognition plays a key role in military and civilian fields such as precision guidance, maritime traffic management, anti-terrorism, and search and rescue. In recent years, the technical literature on ship target classification based on optical images has been growing. Christina Corbane et al. [1] used statistical methods and mathematical morphology, such as wavelet analysis and the Radon transform, to detect ships in high spatial resolution optical imagery. QingJiang Wang et al. [2] proposed an approach based on Bayesian networks and used a two-step procedure to classify ships. J. Antelo et al. [3] identified ships from Quickbird colour images using a Bayesian classifier based on image-extracted features. Haiyang Chen et al. [4] presented an improved FB algorithm to compute evidence with one or more states and used a fuzzy discrete dynamic Bayesian network model for ship recognition. Aleksey Vladimirovich Dolgopolov et al. [5] used a deep neural network autoencoder for automatic ship feature extraction. Jiexiong Tang et al. [6] proposed a compressed-domain ship detection framework using DNN and ELM for optical spaceborne images. Ying Liu et al. [7] proposed a novel ship detection and classification approach that used a deep convolutional neural network (CNN) as the ship classifier.

Previous works on ship recognition face several problems, summarized as follows: a) Because ship images are affected by weather, light, sea conditions, image sensors and other factors, extracting features of ship targets is often difficult. b) As the number of ship images increases, the traditional ship recognition pipeline of feature extraction followed by a classifier cannot meet the requirements of efficient processing. c) At present, it is difficult to distinguish ships that belong to the same class, because the differences among subordinate ship categories are subtle. Fine-grained classification has recently received wide attention, for example in bird classification [8], flower classification [9] and car classification [10]. This inspires us to use a fine-grained classification method to identify subordinate ship categories.

Within the same category, the visual differences are subtle and can easily be overwhelmed by factors such as pose, viewpoint, or location of the object in the image [11]. Therefore, fine-grained datasets often require strong annotations, e.g. bounding boxes for objects or even object parts. However, in practical applications, many fine-grained datasets do not have bounding boxes. Moreover, human-defined regions or the regions learned by existing unsupervised methods may not be optimal for classification [12]. Recently, many models based on convolutional neural networks (CNNs) have obtained outstanding performance on several fine-grained classification tasks. These models usually have tens of millions of weights, which cannot be properly trained on mid-scale datasets because they easily overfit. Previous studies use multiple CNNs to extract discriminative features from local object regions, and most of them simply concatenate these features prior to classification. However, such a direct combination neither explores nor exploits the relationships between these features.

To address these issues, we propose a novel multi-feature fusion convolutional neural network (DMCFN) for fine-grained ship recognition without bounding box/part annotations. The proposed DMCFN is a stacked network, which takes inputs ranging from full images to fine-grained local regions at multiple scales.

The contributions of this work are threefold:

We propose a novel multi-feature fusion convolutional neural network for fine-grained ship classification, which uses two kinds of deep network models to extract the global feature and the local features separately.

We use more detailed local features from lower layers and achieve superior performance over the state-of-the-art approaches based on the features extracted from the top layer.

We use structural regularization on the fusion layer to explore the correlations of multiple features.

The remainder of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 elaborates the proposed framework, including formulation and optimization. Extensive experimental results and comparisons with alternative methods are given in Section 4. Finally, Section 5 concludes the paper and discusses what we learned and future work.

2 Related work

2.1 Fine-grained image recognition

Fine-grained visual categorization has been widely studied. Some works depend on hand-crafted feature description and encoding methods for fine-grained classification. In [13], the Hierarchical Structure Learning (HSL) algorithm and the Geometric Phrase Pooling (GPP) algorithm are used to capture mid-level structures for fine-grained classification. The method of [14] detected volumetric part models based on Poselets and used Stacked Evidence Trees to aggregate information about part properties.

Due to the success of deep learning, convolutional neural networks (CNNs) have been widely used in fine-grained image recognition. Tsung-Yu Lin et al. [11] built a bilinear CNN model, which extracted two local features and multiplied them to obtain an image descriptor. Zhicheng Yan et al. [15] proposed a Hierarchical Deep Convolutional Neural Network (HD-CNN) that decomposed the fine-grained classification task into two steps.

Many recent works incorporate precise part information to improve fine-grained classification. Jonathan Krause et al. [16] proposed a method that generates parts using co-segmentation and alignment without part annotations. Michael Lam et al. [17] proposed a search-based architecture that searches for more informative parts and thus improves recognition. Chen Huang et al. [18] explored a task-driven approach for progressive part detection, which provided the most discriminative visual features for the subsequent object classification. Jonathan Krause et al. [19] proposed an object representation that detects important parts in a fully unsupervised manner. Jianlong Fu et al. [20] considered the relationship between region detection and fine-grained feature learning, and recursively learned discriminative region attention and region-based feature representation at multiple scales in a mutually reinforced way.

Although CNNs bring a significant performance improvement, the above methods show that global CNN features extracted from full images are too spatially rigid to be optimal for fine-grained classification. Results from these works also show that local features are more competitive than CNN features based on the full image and can provide important complementary information. Therefore, we classify images using local information combined with global information.

2.2 Multi-feature fusion

Recently, several feature fusion methods have been proposed. Jiang Liu et al. [21] adopted a weighted fusion scheme to combine content and context features without the need for human interaction. Anran Wang et al. [22] proposed a modality and component aware feature fusion framework to combine regressors for FV features and global CNN features. Heechul Jung et al. [23] used two models to extract temporal appearance features and temporal facial landmark points, and then adopted a new integration method to boost the performance of facial expression recognition. Zuxuan Wu et al. [24] proposed a regularized DNN to learn feature relationships and class relationships jointly. Xin Lu et al. [25] incorporated heterogeneous inputs generated from the image, which included a global view and a local view, and unified feature learning and classifier training in a double-column deep convolutional neural network.

3 Approach

In this section, we introduce the proposed multi-feature fusion convolutional neural network (DMCFN) for fine-grained image recognition. The DMCFN comprises two kinds of deep networks: the deep coarse component network (DCCN) and the deep fine component network (DFCN). The DCCN is based on a CNN and is used to extract the high-level semantic features necessary for ship recognition. The DFCN captures detailed local features that help to separate the target from distracters. In our work, we combine one DCCN and three DFCNs to improve ship recognition performance. Notably, this structure can be trained end to end. The architecture of our deep convolutional neural network is shown in Fig. 1.

Fig. 1

The framework of the multi-feature fusion convolutional neural network (DMCFN), which is composed of one deep coarse component network (DCCN) and three deep fine component networks (DFCN). These networks receive an image as input without any bounding box or part annotation. A fusion layer is adopted to impose regularization on the network parameters. Finally, a joint fine-tuning method is used on these networks.

3.1 Deep coarse component network

The basic idea behind the deep coarse component network is to extract global high-level features. Many existing deep networks designed for large-scale image datasets have achieved outstanding performance on fine-grained classification. Their main disadvantage is that a large number of weights must be trained, so such a complex system cannot be learned properly on a mid-scale fine-grained ship dataset. To deal with this issue, we design the deep coarse component network with a moderate depth and a moderate number of parameters to avoid overfitting.

The deep coarse component network consists of three convolutional layers and two fully connected layers. The first convolutional layer is composed of 64 filters of size 7×7 and computes convolution at a global scale. The ReLU activation function is then applied to the convolved patch. Next, we apply 2×2 max-pooling with stride 2 to each response map, which yields position invariance over the patches. The second convolutional layer uses 5×5×64 kernels followed by the ReLU activation function, and max-pooling is performed in the same way as after the first convolutional layer. The last convolutional layer comprises 64 filters of size 5×5. Batch Normalization (BN) is applied to the activations of these convolutional layers, followed by the Rectified Linear Unit (ReLU) for non-linearity. Finally, the output values are passed through the two fully connected layers and classified using softmax, computed as follows:

$f_{softmax} (v)_{i} = \frac{\exp (v_{i})}{\sum_{j = 1}^{n} \exp (v_{j})}, \quad i = 1, \dots, n$ (1)

where $v_{i}$ is the score of unit i from the previous layer, n is the number of classes, and $f_{softmax} (v)$ is the corresponding output vector. This structure is summarized in Table 1. For training our network, stochastic gradient descent is used for optimization, and dropout [26] and weight decay are used for regularization.
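As a minimal sketch of Equation (1), the softmax can be computed as below. Subtracting the maximum score before exponentiation is a standard numerical-stability step that the paper does not mention; the function name `softmax` and the example scores are illustrative assumptions.

```python
import math

def softmax(scores):
    """Map class scores v to probabilities exp(v_i) / sum_j exp(v_j)."""
    # Shifting by the maximum leaves the result unchanged but avoids overflow.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores from the last fully connected layer for three classes.
probs = softmax([2.0, 1.0, 0.1])
```

The outputs are positive, sum to one, and preserve the ordering of the input scores, so the largest score still determines the predicted class.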

Table 1

DCCN structure

Block	Conv1	Conv2	Conv3	Fc4	Fc5
Configuration	64 filters, 7×7, st. 2	64 filters, 5×5, st. 1	64 filters, 5×5, st. 1	Output size 256	Output size: class number
Details	BN, ReLU	BN, ReLU	BN, ReLU	ReLU	softmax
Pooling	max pool [2 2], st. 2	max pool [2 2], st. 2	max pool [2 2], st. 2	-	-
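To make the "moderate number of parameters" concrete, the layer sizes in Table 1 can be traced with a few lines of arithmetic. This is a rough sketch under assumptions the paper does not state: a 224×224 RGB input, no padding, floor rounding of output sizes, and the 169 classes of the Vessels dataset. BN parameters are ignored.

```python
def out_size(n, kernel, stride):
    """Spatial size after a convolution or pooling without padding (assumed)."""
    return (n - kernel) // stride + 1

# (kernel, stride) of each convolution in Table 1.
# Each convolution is followed by 2x2 max pooling with stride 2.
CONVS = [(7, 2), (5, 1), (5, 1)]
FILTERS = 64
NUM_CLASSES = 169  # Vessels dataset

size, in_channels, n_params = 224, 3, 0
for kernel, stride in CONVS:
    n_params += kernel * kernel * in_channels * FILTERS + FILTERS  # weights + biases
    size = out_size(size, kernel, stride)  # convolution
    size = out_size(size, 2, 2)            # max pooling
    in_channels = FILTERS

flat = size * size * FILTERS                 # input width of Fc4
n_params += flat * 256 + 256                 # Fc4
n_params += 256 * NUM_CLASSES + NUM_CLASSES  # Fc5
print(size, flat, n_params)
```

Under these assumptions the final feature map is 10×10×64 and the network has roughly 1.9 million weights, most of them in Fc4, which is well below the tens of millions of weights of the large models discussed in Section 1.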

Table 2

The statistics of fine-grained datasets used in this paper

Datasets	#Category	#Training	#Testing
Vessels dataset	169	11900	5100
FGVC-Aircraft	100	6667	3333
Stanford Cars	196	8144	8041

3.2 Deep fine component network

The DCCN extracts global high-level features but may discard local details, such as fine-scale object parts, which are significant cues for discriminating fine-grained categories. We therefore propose the deep fine component network to capture local visual information.

In practical applications, many datasets do not have bounding box or part annotations; besides, human-defined regions may not be optimal for image classification. Bolei Zhou et al. [27] pointed out that convolutional units in convolutional neural networks can act as object detectors without supervision on object location, but that this localization ability is lost when fully connected layers are used for classification. They replaced the fully connected layer with a global average pooling (GAP) layer. With GAP, the image regions that are significant for classification can be identified in a single forward pass.

In the deep fine component network, we use class activation maps (CAM), which identify the discriminative regions of the image by projecting the weights of the output layer back onto the convolutional feature maps. We use a simple threshold technique to segment the heatmap and generate a bounding box. Lijun Wang et al. [28] showed that CNN features at different levels/depths have different properties: a top convolutional layer captures more abstract, high-level semantic features, whereas a lower layer provides more detailed local features that help to separate the target from distracters. We therefore map the box to a lower convolutional layer of the convolutional neural network to obtain local feature information.
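The localization step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the CAM is taken as the weighted sum of the last convolutional feature maps, the threshold is set to 20% of the CAM maximum, the box is the tight box around all cells above the threshold, and the CAM maximum is assumed to be positive. All function names are hypothetical.

```python
import math
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """feature_maps: (K, H, W) activations; class_weights: (K,) output-layer weights."""
    return np.tensordot(class_weights, feature_maps, axes=1)  # shape (H, W)

def threshold_box(cam, fraction=0.2):
    """Inclusive (top, left, bottom, right) box around cells >= fraction * max."""
    mask = cam >= fraction * cam.max()
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    return int(rows[0]), int(cols[0]), int(rows[-1]), int(cols[-1])

def scale_box(box, from_hw, to_hw):
    """Map an inclusive box from one feature-map resolution to another."""
    top, left, bottom, right = box
    sy, sx = to_hw[0] / from_hw[0], to_hw[1] / from_hw[1]
    return (math.floor(top * sy), math.floor(left * sx),
            math.ceil((bottom + 1) * sy) - 1, math.ceil((right + 1) * sx) - 1)

# Toy example: two 4x4 feature maps, one strong central response, one weak corner.
feature_maps = np.zeros((2, 4, 4))
feature_maps[0, 1:3, 1:3] = 1.0
feature_maps[1, 0, 0] = 1.0
cam = class_activation_map(feature_maps, np.array([1.0, 0.1]))
box = threshold_box(cam)                    # box on the 4x4 CAM grid
lower_box = scale_box(box, (4, 4), (8, 8))  # same region on an 8x8 lower layer
```

In the toy example the weak corner response falls below the threshold, so the box covers only the central 2×2 region, which becomes a 4×4 region on the lower layer with twice the resolution.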

Using a RoI pooling layer, we extract a fixed-size feature map from the feature maps for the corresponding box, as shown in Fig. 2. The features inside the box are converted into a small feature map with the fixed spatial extent of the RoI pooling layer. The small feature map is then fed into a fully connected layer, and a final fully connected layer computes the score of the input image for each class. In our framework, we train three deep fine component networks separately, using bounding boxes at three scales. To train the DFCN, we use transfer learning [29] to overcome the shortage of training samples for fine-grained categories: we take an ImageNet-trained CNN and fine-tune its parameters on the target datasets.
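The RoI pooling step can be illustrated with a small NumPy sketch. It assumes max pooling over a grid of bins whose boundaries are obtained by floor/ceil rounding, in the style of Fast R-CNN; the paper does not give the output size or the rounding rule, so `out_hw` and the function name are hypothetical.

```python
import math
import numpy as np

def roi_max_pool(feature_maps, box, out_hw):
    """feature_maps: (C, H, W); box: inclusive (top, left, bottom, right).

    Returns an array of shape (C, out_h, out_w) regardless of the box size.
    """
    top, left, bottom, right = box
    h, w = bottom - top + 1, right - left + 1
    out_h, out_w = out_hw
    pooled = np.empty((feature_maps.shape[0], out_h, out_w), dtype=feature_maps.dtype)
    for i in range(out_h):
        r0 = top + (i * h) // out_h
        r1 = top + math.ceil((i + 1) * h / out_h)  # exclusive end, always > r0
        for j in range(out_w):
            c0 = left + (j * w) // out_w
            c1 = left + math.ceil((j + 1) * w / out_w)
            pooled[:, i, j] = feature_maps[:, r0:r1, c0:c1].max(axis=(1, 2))
    return pooled

fmap = np.arange(16, dtype=float).reshape(1, 4, 4)
full = roi_max_pool(fmap, (0, 0, 3, 3), (2, 2))  # whole map
part = roi_max_pool(fmap, (1, 1, 3, 2), (2, 2))  # a 3x2 box, same output size
```

Both calls return a 1×2×2 array even though the boxes differ in size, which is what allows boxes at three different scales to feed fully connected layers of fixed width.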

Fig. 2

The framework of deep fine component network (DFCN).

3.3 Feature fusion layer

Many methods simply concatenate the multiple extracted features into one high-dimensional feature, which usually results in limited performance because the intrinsic relations among the features extracted from the multiple models are overlooked. Therefore, we use an integration method [24] to fuse the features from the DCCN and the DFCNs that fully leverages the complementary clues from the various features.

Below we use a regularized variant which is able to accommodate the deep fusion process of multiple features from these trained networks. We utilize one additional layer for the fusion of all the features, as shown in Fig. 3. Let $X = [x_{1}^{1}, \dots, x_{i}^{j}, \dots, x_{N}^{M}] \in R^{D \times N}$ denote the M features from N input images.

Fig. 3

The fusion layer.

Fig. 4

Examples from (left) Vessel dataset, (center) FGVC-Aircraft dataset, and (right) Stanford Cars dataset used in our experiments.

This fusion layer can be written as follows:

$Y_{F} = σ (\sum_{m = 1}^{M} W_{P}^{m} Y_{P}^{m} + B_{P})$ (2)

where $Y_{F}$ and $Y_{P}^{m}$ denote the output of the fusion layer and of the last layer of the m-th feature extraction network respectively, $W_{P}^{m}$ and $B_{P}$ are the trainable parameters of the last layer of the feature extraction network, and σ is the sigmoid function. As mentioned earlier, different features can also be complementary because they have distinct characteristics. Instead of simply concatenating multiple features, we formulate an objective function that regularizes the fusion process to explore such correlations among the multiple features simultaneously. We formulate the objective as follows:

$\begin{matrix} \min_{W, Ψ} L + \frac{λ_{1}}{2} (\sum_{l = 1}^{P} \sum_{m = 1}^{M} ∥ W_{l}^{m} ∥_{F}^{2} + \sum_{l = F}^{L - 1} ∥ W_{l} ∥_{F}^{2}) \\ + \frac{λ_{2}}{2} tr (W_{P} Ψ^{- 1} W_{P}^{T}) \\ \text{s.t. } Ψ \geq 0 \end{matrix}$ (3)

where $L = \sum_{i = 1}^{N} ℓ ({\hat{y}}_{i}, y_{i})$ is the traditional objective, to which we add one extra regularization term, the trace term. The matrix $W_{P}$ represents the coefficients over all the features, and the matrix $Ψ \in R^{M \times M}$ is used to model the feature correlation. Note that entries with large values in Ψ indicate strong feature correlations, while small-valued entries indicate that the corresponding features are less correlated. The coefficients λ₁ and λ₂ control the contributions of the different regularization terms.
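The forward pass of the fusion layer in Equation (2) can be sketched for a single image as follows. The feature dimensions, the fusion-layer width and the function names are hypothetical; the paper does not specify them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fusion_layer(features, weights, bias):
    """Y_F = sigmoid(sum_m W_P^m Y_P^m + B_P) for one image."""
    pre_activation = bias.copy()
    for W_m, y_m in zip(weights, features):
        pre_activation = pre_activation + W_m @ y_m
    return sigmoid(pre_activation)

rng = np.random.default_rng(0)
dims = [256, 128, 128, 128]  # hypothetical sizes: one DCCN and three DFCN features
d_out = 64                   # hypothetical fusion-layer width
features = [rng.normal(size=d) for d in dims]
weights = [0.01 * rng.normal(size=(d_out, d)) for d in dims]
bias = np.zeros(d_out)
y_f = fusion_layer(features, weights, bias)

# The same output from one weight matrix applied to the concatenated features.
y_concat = sigmoid(np.hstack(weights) @ np.concatenate(features) + bias)
```

The two results agree, which shows that the forward pass alone is equivalent to a fully connected layer on the concatenated features. What separates the method from plain concatenation is therefore the trace regularizer in Equation (3), which couples the per-feature weight blocks through Ψ during training.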

3.4 Optimization

The learning speed of the DCCN differs from that of the DFCN because the two networks have different layers. To address this issue, we develop a learning algorithm that optimizes the model parameters efficiently in two steps. First, the DCCN and DFCN are trained separately. Next, we retrain the entire network.

$W_{P}$ and Ψ are coupled with each other. Therefore, we adopt an alternating optimization method to iteratively minimize the objective with respect to $W_{l}^{m} (l = 1, \dots, L, m = 1, \dots, M)$ and Ψ.

By fixing Ψ, we first consider the minimization problem over $W_{l}^{m}$. Denoting by $G_{l}^{m}$ the gradient with respect to $W_{l}^{m}$, the weight matrix for the l-th layer and m-th feature is updated as: $W_{l}^{m} = W_{l}^{m} - η G_{l}^{m}$ (4) where η is the step length of the gradient descent.
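For the fusion-layer weights, the gradient in Equation (4) includes the contribution of both regularizers in Equation (3). The sketch below assumes that $W_{P}$ is stored as a d×M matrix with one column per feature and that Ψ is symmetric; under these assumptions the regularizer gradient is $λ_{1} W_{P} + λ_{2} W_{P} Ψ^{-1}$. The function names and the random stand-in for the back-propagated loss gradient are illustrative only.

```python
import numpy as np

def regularizer(W, psi_inv, lam1, lam2):
    """lam1/2 * ||W||_F^2 + lam2/2 * tr(W Psi^-1 W^T)."""
    return 0.5 * lam1 * np.sum(W * W) + 0.5 * lam2 * np.trace(W @ psi_inv @ W.T)

def regularizer_grad(W, psi_inv, lam1, lam2):
    """Gradient of the regularizer w.r.t. W, valid for symmetric Psi^-1."""
    return lam1 * W + lam2 * W @ psi_inv

def gradient_step(W, loss_grad, psi_inv, lam1, lam2, eta):
    """One update W <- W - eta * G, with G the gradient of loss plus regularizer."""
    return W - eta * (loss_grad + regularizer_grad(W, psi_inv, lam1, lam2))

rng = np.random.default_rng(0)
M = 4
W = rng.normal(size=(6, M))
psi_inv = np.linalg.inv(np.eye(M) / M)  # Psi initialised as I_M / M, as in Algorithm 1
loss_grad = rng.normal(size=W.shape)    # stand-in for the back-propagated gradient
W_new = gradient_step(W, loss_grad, psi_inv, lam1=0.001, lam2=0.1, eta=0.001)
```

The values of λ₁, λ₂ and η used here are those reported for the Vessels dataset and the learning rate in the experiments; any other positive values work the same way.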

We then minimize the objective function over Ψ with the other variables fixed. The problem in Equation 3 reduces to: $\min_{Ψ} tr (W_{P} Ψ^{- 1} W_{P}^{T}), \quad \text{s.t. } Ψ \geq 0, \; tr (Ψ) = 1.$ (5)

The optimization pipeline is summarized in Algorithm 1.

Algorithm 1

Training Procedure of DMCFN

Require: $x_{n}^{m}$: the representation of the m-th feature for the n-th image sample; $y_{n}$: the semantic label of the n-th image sample;

Step 1: Pretrain the components of DMCFN

1.1: Initialize DCCN

1.2: Pretrain DCCN

1.4: Fine-tune DFCN

Step 2: Fine-tune DMCFN

2.1: Initialize $W_{L}^{m}$ randomly, $Ψ = \frac{1}{M} I_{M}$ and $Ω = \frac{1}{C} I_{C}$, where $I_{M}$ and $I_{C}$ are identity matrices

2.2: for epoch = 1 to K do

2.3: Back propagate the prediction error from layer L to layer 1 by evaluating the gradient $G_{l}^{m}$ and update the weight matrix $W_{l}^{m}$ for each layer and each feature as: $W_{l}^{m} = W_{l}^{m} - η G_{l}^{m}$

2.4: Update the feature relationship matrix Ψ: $Ψ = \frac{(W_{P}^{T} W_{P})^{\frac{1}{2}}}{tr ((W_{P}^{T} W_{P})^{\frac{1}{2}})}$

2.5: end for
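The update of Ψ in step 2.4 of Algorithm 1 is the standard closed-form minimizer of the trace problem in Equation (5). The sketch below computes it with an eigendecomposition and compares the trace term against the initial choice $Ψ = \frac{1}{M} I_{M}$. It assumes that $W_{P}$ is stored as a d×M matrix with one column per feature and that $W_{P}^{T} W_{P}$ has full rank, so that Ψ is invertible; the function names are hypothetical.

```python
import numpy as np

def update_psi(W):
    """Psi = (W^T W)^(1/2) / tr((W^T W)^(1/2)) via an eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(W.T @ W)
    # Clip tiny negative eigenvalues caused by round-off before the square root.
    sqrt_vals = np.sqrt(np.clip(eigvals, 0.0, None))
    sqrt_gram = (eigvecs * sqrt_vals) @ eigvecs.T
    return sqrt_gram / np.trace(sqrt_gram)

def trace_penalty(W, psi):
    """tr(W Psi^-1 W^T), the term minimized over Psi in Equation (5)."""
    return float(np.trace(W @ np.linalg.inv(psi) @ W.T))

rng = np.random.default_rng(0)
M = 4                             # number of features: one DCCN and three DFCNs
W = rng.normal(size=(16, M))      # random stand-in for the fusion weights
psi_init = np.eye(M) / M          # initial value from step 2.1
psi_star = update_psi(W)
penalty_init = trace_penalty(W, psi_init)
penalty_star = trace_penalty(W, psi_star)
```

The updated Ψ is symmetric with unit trace, and its penalty is never larger than that of the initial Ψ. Its off-diagonal entries grow when the weight columns of two features are aligned, which is how the matrix encodes feature correlation.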

4 Experiments

Datasets We carry out experiments on three image datasets. The first dataset is constructed by gathering images from the internet, because there is no available database covering both warships and merchant vessels. The optical images of ships were captured by cameras mounted on ships and in harbors. In total we collected 17,000 ship images covering 169 different types of ships, such as aircraft carriers, destroyers and frigates. The second dataset is FGVC-Aircraft [30], which consists of 10,000 images of 100 aircraft variants (for example, the Boeing 737-300 must be distinguished from the Boeing 737-400) and was introduced as a part of the FGComp 2013 challenge. The third dataset is the Stanford Cars dataset [31], which was collected for fine-grained car classification. It contains 16,185 images of 196 classes of cars and was also part of the FGComp 2013 challenge.

Baselines In our framework, we adopt three different CNN architectures for the DFCN: the widely used architecture of Krizhevsky et al. [26] (AlexNet), that of Simonyan et al. [32] (VGG16) and the deep architecture of C. Szegedy et al. [33] (GoogLeNet). Details of these architectures can be found in the corresponding papers. It is important to note that the DFCN can be used with any CNN. We divide the compared approaches into two categories, based on whether they use human-defined bounding box (bbox) or part annotations.

4.1 Comparison to previous works

Implementation details In the DMCFN, all input images are resized to 224×224. We train the DCCN and DFCN separately. The architecture of the DCCN is described in Section 3.1. We initialize the DFCN from a CNN pre-trained on ImageNet, as mentioned in Section 3.2. We then replace the last fully connected layer with one whose output size equals the actual number of categories, and fine-tune the DFCN on our fine-grained datasets. The DCCN weights are initialized from a Gaussian distribution with mean 0 and standard deviation 0.01. The same learning rate of 0.001 is set for all layers. Momentum and weight decay are set to 0.9 and 0.0005 respectively. Dropout with a rate of 50% is used to reduce overfitting in the fully connected layers.
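The hyperparameters above can be read as the following update rule. This is a sketch under an assumption: the paper gives the values but not the exact formulation, so the common variant is used in which weight decay is added to the gradient and the momentum buffer accumulates the scaled gradient. The function names and the placeholder gradient are illustrative.

```python
import random

LEARNING_RATE = 0.001
MOMENTUM = 0.9
WEIGHT_DECAY = 0.0005
INIT_STD = 0.01

def init_weights(n, seed=0):
    """Draw n weights from a Gaussian with mean 0 and standard deviation 0.01."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, INIT_STD) for _ in range(n)]

def sgd_momentum_step(weights, grads, velocity):
    """v <- momentum * v - lr * (g + weight_decay * w);  w <- w + v."""
    new_velocity = [MOMENTUM * v - LEARNING_RATE * (g + WEIGHT_DECAY * w)
                    for w, g, v in zip(weights, grads, velocity)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

weights = init_weights(5)
velocity = [0.0] * 5
grads = [0.5] * 5  # placeholder gradient, for illustration only
weights, velocity = sgd_momentum_step(weights, grads, velocity)
```

With zero initial velocity the first step reduces to plain gradient descent with weight decay; the momentum term only takes effect from the second step onward.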

Vessels dataset The results of fine-grained recognition on the Vessels dataset are given in Table 3. Because the images of the Vessels dataset were collected from the internet, no existing work reports recognition results for comparison. We therefore fine-tune three CNN architectures, AlexNet, VGGnet-16 and GoogLeNet, as baselines. In our experiments, the fine-tuned weaker architecture of Krizhevsky et al. reaches 83.8% accuracy, fine-tuned VGGnet-16 reaches 92.4%, and the very deep GoogLeNet obtains 89.5%. In comparison, DMCFN with conv4-3 features from VGGnet-16 obtains 93.5% accuracy. With lower-layer CNN features, our approach outperforms each corresponding fine-tuned model by 1.1 to 1.9 percentage points.

Table 3
Recognition comparison of different methods on Vessels dataset

Train. Anno.	Test Anno.	Method	Accuracy
None	None	AlexNet	83.8%
None	None	VGGnet	92.4%
None	None	GoogLeNet	89.5%
None	None	DMCFN -Alexnet(conv3)	85.7%
None	None	DMCFN -VGGnet(conv4-3)	93.5%
None	None	DMCFN -GoogLeNet (icp2)	91.2%

FGVC-Aircraft The results of fine-grained recognition on FGVC-Aircraft are shown in Table 4. In the area of fine-grained recognition, many approaches rely on additional annotation such as ground-truth part locations or bounding boxes. The table distinguishes the annotation required in training from the annotation required at test time. As can be seen in Table 4, our approach with the conv4-3 layer features from VGG-16 improves on the work of Tsung-Yu Lin et al. [11] by 1.5 percentage points. Besides, our approach with the conv3 layer features from AlexNet reduces computational redundancy and achieves 83.3% accuracy with a smaller architecture. It is important to note that our approach needs only image labels for training, yet with VGG-16 features it outperforms the other methods in Table 4.

Table 4

Recognition comparison of different methods on FGVC-Aircraft dataset

Train. Anno.	Test Anno.	Method	Accuracy
None	None	Tsung-Yu Lin et al. [11]	91.3%
Parts	Parts	Yuning Chai et al. [34]	72.5%
None	None	Philippe-Henri Gosselin et al. [35]	80.7%
None	None	AlexNet	80.1%
None	None	VGGnet	90.4%
None	None	GoogLeNet	86.3%
None	None	DMCFN -Alexnet(conv3)	83.3%
None	None	DMCFN -VGGnet(conv4-3)	92.8%
None	None	DMCFN -GoogLeNet (icp2)	88.1%

Stanford Cars The classification accuracies on Stanford Cars are summarized in Table 5. The previous works [16] and [36] report remarkable results of 92.8% and 91.3% accuracy respectively when using annotations. Without object bounding boxes, Jianlong Fu et al. [20] and Tsung-Yu Lin et al. [11] achieve accuracies of 92.5% and 92.6% respectively. Our method DMCFN-VGGNet (conv5-1) obtains the highest recognition accuracy, 94.3%, by leveraging the power of feature fusion, which integrates global and local features. In contrast to the FGVC-Aircraft dataset, features from conv5-1 of VGGnet-16 perform better on this dataset than features from conv4-3. We examine the effect of CNN features at different levels/depths in the next experiment.

Table 5

Recognition comparison of different methods on Stanford Cars dataset

Train. Anno.	Test Anno.	Method	Accuracy
BBox	BBox	Jonathan Krause et al. [16]	92.8%
Parts	Bbox	FCAN [36]	91.3%
None	None	DVAN [12]	87.1%
None	None	FCAN [36]	89.1%
None	None	Jianlong Fu et al. [20]	92.5%
None	None	Tsung-Yu Lin et al. [11]	92.6%
None	None	AlexNet	82.4%
None	None	VGGNet	93.5%
None	None	GoogLeNet	90.1%
None	None	DMCFN-AlexNet(conv3)	86.2%
None	None	DMCFN- VGGNet (conv5-1)	94.3%
None	None	DMCFN- GoogLeNet (icp2)	92.3%

4.2 Effect of CNN features at different levels/depths

Previous works [28, 37] show that low-level CNN features are more sensitive to appearance variations than high-level CNN features, which balance feature representativeness and generalization ability. We therefore examine this on the three datasets, using feature maps ranging from low level to high level. The classification accuracies obtained with these feature maps are given in Table 6. In our experiments, we found that on the FGVC-Aircraft dataset, features from conv4-3 perform best, achieving 92.8% accuracy. On the Stanford Cars dataset, however, features from conv5-1 perform best, achieving 94.3% accuracy. A possible reason is that, compared with aircraft, cars are smaller and appear against more cluttered backgrounds. The feature maps of conv5-1 may better separate cars from non-car objects while preserving enough middle-level information to achieve more accurate recognition.

Table 6
Classification accuracy using different feature maps on three datasets

Feature map	Vessels dataset	FGVC-aircraft	Stanford cars
Vgg-16(conv3-3)	88.4%	82.9%	89.2%
Vgg-16(conv4-3)	93.5%	92.8%	91.6%
Vgg-16(conv5-1)	90.2%	91.7%	94.3%
AlexNet(Conv5)	79.2%	77.6%	81.5%
AlexNet(Conv4)	82.1%	80.4%	83.4%
AlexNet(Conv3)	85.7%	83.3%	86.2%
GoogLeNet (conv2)	79.4%	76.2%	81.6%
GoogLeNet (icp1)	86.1%	82.5%	88.6%
GoogLeNet (icp2)	91.2%	88.1%	92.3%

4.3 Effect of exploring feature relationships

Several methods have been proposed for fine-grained classification. Most of them directly concatenate multiple features prior to classification. Such a direct combination does not adequately explore the relationship between global features and local features. Therefore, we conduct an experiment to test the fusion performance.

The λ₁ and λ₂ parameters of DMCFN used on three datasets are given in Table 7.

Table 7
The parameters of DMCFN used on vessels dataset, FGVC-Aircraft and Stanford Cars

Parameters	Vessels dataset	FGVC-aircraft	Stanford cars
λ₁	0.001	0.01	0.001
λ₂	0.1	0.1	0.1

To test the ability of the fusion layer, we compare regularization on the fusion layer against plain concatenation of the multiple extracted features on the three datasets, based on Vgg-16. We plot the performance w.r.t. the level of features in Fig. 5. Using regularization on the fusion layer clearly achieves higher performance than directly concatenating the features. We also found that the improvement of our method is even more significant when low-level features are selected.

Fig. 5

Classification performance on the three datasets using feature maps from different levels. We plot the results of DMCFN without regularization (red), DMCFN with regularization (blue).

5 Conclusion

In this paper, we propose a multi-feature fusion convolutional neural network (DMCFN) for fine-grained ship recognition. The proposed architecture does not need bounding box/part annotations for training. We observe that convolutional features at different levels have different properties, and we fuse the representations of multiple features to exploit the correlation between global and local features, which clearly improves fine-grained classification performance. Extensive experiments also demonstrate superior performance on other fine-grained recognition tasks, namely the FGVC-Aircraft and Stanford Cars datasets. As part of future work, we intend to investigate the applicability of the DMCFN in more complex environments. Another direction is to reduce the redundancy among the multiple features as well as the computational and time complexity.

References

Corbane

, Najman

, Pecoul

, Demagistri

and Petit

, A complete processing chain for ship detection using optical satellite imagery, International Journal of Remote Sensing31 (2010), 5837–5854.

Wang

Q.J.

, Chen

D.Q.

and Chen

D.Q.

, Pattern recognition for ship based on Bayesian networks, International Conference on Fuzzy Systems and Knowledge Discovery4 (2007), 684–688.

Antelo

, Ambrosio

, Gonzalez

and Galindo

, Ship detection and recognition in high-resolution satellite images, Geoscience and Remote Sensing Symposium4 (2009), 514–517.

Chen

and Gao

, Ship recognition based on improved forwards-backwards algorithm, International Conference on Fuzzy Systems and Knowledge Discovery5 (2009), 509–513.

Dolgopolov

A.V.

, Kazantsev

P.A.

, Bezuhliy

N.N.

, Dolgopolov

A.V.

, Kazantsev

P.A.

and Bezuhliy

N.N.

, Ship detection in images obtained from the unmanned aerial vehicle (uav), Indian Journal of Science & Technology9 (2017), 1–7.

Tang

, Deng

, Huang

G.B.

and Zhao

, Compressed-domain ship detection on spaceborne optical image using deep neural network and extreme learning machine, IEEE Transactions on Geoscience & Remote Sensing53 (2014), 1174–1185.

Liu

, Cui

H.Y.

, Kuang

and Li

G.Q.

, Ship detection and classification on optical remote sensing images using deep learning, 12 (2017), 05012.

Branson

, Horn

G.V.

, Belongie

S.J.

and Perona

, Bird species categorization using pose normalized deep convolutional nets, In BMVC, 2014.

Reed

S.E.

, Akata

, Schiele

and Lee.

, Learning deep representations of fine-grained visual descriptions, In CVPR, 2016.

10.

Lin

Y.-L.

, Morariu

V.I.

, Hsu

, Davis

L.S.

, Jointly optimizing 3d model fitting and fine-grained classification, In European Conference on Computer Vision (ECCV), Springer, 2014, pp. 466–480.

11. Lin T.Y., Roychowdhury and Maji, Bilinear CNN models for fine-grained visual recognition, IEEE International Conference on Computer Vision (2016), 1449–1457.

12. Zhao, Wu, Feng, Peng and Yan, Diversified visual attention networks for fine-grained object classification, IEEE Transactions on Multimedia 19 (2017), 1245–1256.

13. Xie, Tian, Hong, Yan and Zhang, Hierarchical part matching for fine-grained visual categorization, IEEE International Conference on Computer Vision (2013), 1641–1648.

14. Farrell, Oza, Zhang, Morariu V.I., Darrell and Davis L.S., Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance, IEEE International Conference on Computer Vision 23 (2011), 161–168.

15. Yan, Zhang, Piramuthu, Jagadeesh, Decoste and Di, HD-CNN: Hierarchical deep convolutional neural networks for large scale visual recognition, IEEE International Conference on Computer Vision, 2014, pp. 2740–2748.

16. Krause, Jin, Yang and Li F.F., Fine-grained recognition without part annotations, IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5546–5555.

17. Lam, Mahasseni and Todorovic, Fine-grained recognition as HSnet search for informative image parts, IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6497–6506.

18. Huang, He, Cao and Cao, Task-driven progressive part localization for fine-grained object recognition, IEEE Transactions on Multimedia 18 (2016), 2372–2383.

19. Krause, Gebru, Deng, Li L.J. and Li F.F., Learning features and parts for fine-grained recognition, International Conference on Pattern Recognition, 2014, pp. 26–33.

20. , Zheng and Mei, Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition, IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4476–4484.

21. Liu, Gao, Meng and Zuo, Two-stream contextualized CNN for fine-grained image classification, Thirtieth AAAI Conference on Artificial Intelligence, 2016, pp. 4232–4233.

22. Wang, Cai, Lu and Cham T.J., Modality and component aware feature fusion for RGB-D scene classification, Computer Vision and Pattern Recognition, 2016, pp. 5995–6004.

23. Jung, Lee, Yim, Park and Kim, Joint fine-tuning in deep neural networks for facial expression recognition, IEEE International Conference on Computer Vision, 2016, pp. 2983–2991.

24. , Jiang Y.G., Wang, Pu and Xue, Exploring inter-feature and inter-class relationships with deep neural networks for video classification, Proceedings of the ACM International Conference on Multimedia, 2014, pp. 167–176.

25. Yang, Yang and Wang J.Z., RAPID: Rating pictorial aesthetics using deep learning, ACM International Conference on Multimedia 17 (2014), 457–466.

26. Krizhevsky, Sutskever and Hinton G.E., ImageNet classification with deep convolutional neural networks, International Conference on Neural Information Processing Systems 60 (2012), 1097–1105.

27. Zhou, Khosla, Lapedriza, Oliva and Torralba, Learning deep features for discriminative localization, IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.

28. Wang, Ouyang, Wang and Lu, Visual tracking with fully convolutional networks, IEEE International Conference on Computer Vision, 2016, pp. 3119–3127.

29. Oquab, Bottou, Laptev and Sivic, Learning and transferring mid-level image representations using convolutional neural networks, IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1717–1724.

30. Maji, Rahtu, Kannala, Blaschko and Vedaldi, Fine-grained visual classification of aircraft, arXiv preprint arXiv:1306.5151, 2013.

31. Jan Krause, Michael Stark, Jia Deng and Li Fei-Fei, 3D object representations for fine-grained categorization, in Computer Vision Workshops (ICCVW), 2013, pp. 554–561.

32. Simonyan and Zisserman, Very deep convolutional networks for large-scale image recognition, Computer Science, 2014.

33. Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke and Rabinovich, Going deeper with convolutions, arXiv preprint arXiv:1409.4842, 2014.

34. Chai, Lempitsky and Zisserman, Symbiotic segmentation and part localization for fine-grained categorization, IEEE International Conference on Computer Vision 163 (2013), 321–328.

35. Gosselin P.H., Murray, Jégou and Perronnin, Revisiting the Fisher vector for fine-grained classification, Pattern Recognition Letters 49 (2014), 92–98.

36. Liu, Xia, Wang and Lin, Fully convolutional attention localization networks: Efficient attention localization for fine-grained recognition, CoRR, abs/1603.06765, 2016.

37. Yang, Yan, Lei and Li S.Z., Convolutional channel features, IEEE International Conference on Computer Vision, 2015, pp. 82–90.

Table 7
The parameters of DMCFN used on vessels dataset, FGVC-Aircraft and Stanford Cars

Parameters    Datasets
              Vessels dataset    FGVC-aircraft    Stanford cars
λ₁            0.001              0.01             0.001
λ₂            0.1                0.1              0.1