Classification of vehicle types using fused deep convolutional neural networks

Abstract

Classification of vehicle types using surveillance images is a challenging task in Intelligent Transportation Systems (ITS). In this paper, Convolutional Neural Networks for vehicle type classification are comparatively studied. Firstly, GoogLeNet, ResNet50 and InceptionV4 are used as baselines for comparison. Secondly, we propose a new network architecture based on GoogLeNet, ResNet50 and InceptionV4, named Fused Deep Convolutional Neural Networks (FDCNN), which takes advantage of the ‘Inception’ module for parameter optimization and the ‘Residual’ module for avoiding vanishing gradients, and we apply the model to vehicle type classification. Thirdly, we create a vehicle dataset under complicated and varied weather and lighting conditions, and conduct comparative experiments using the SEU vehicle dataset. Experimental results show that the proposed FDCNN with the RMSprop optimizer performs much better at recognizing vehicle types. Specifically, the average classification accuracy over six vehicle types (truck, coach, sedan, minivan, pickup and SUV) is over 96.8%. Among the six classes, sedan is the most difficult to classify, and the proposed FDCNN achieves over 93.81% accuracy on it in the comparative experiments.

Keywords

Vehicle types; convolutional neural networks; fused deep convolutional neural networks; intelligent transportation systems

1 Introduction

The type of a vehicle (car, bus, truck, etc.) is of great significance for determining highway tolls, managing large parking lots, and monitoring and controlling highway traffic. Vehicle type classification based on images from surveillance cameras is an important aspect of intelligent transportation, and it can improve the efficiency of many related tasks, including toll collection, vehicle-related crime prevention, traffic monitoring, and traffic control and management [5, 29]. Much research has been carried out in recent years, and the advantages of image-based vehicle type classification over the traditional loop detector are generally recognized [25, 26].

In the field of image-based vehicle recognition, traditional machine learning methods are mainly based on the appearance features of vehicles. The commonly used methods for extracting vehicle frontal features, such as vehicle geometric features [6, 31], HOG [1, 28], fast Fourier transform [10] and SURF [7], share the disadvantages of low precision, complex calculation and low recognizability. Laopracha [18] proposed using Gabor features of images, combined with an SVM classifier and a neural network classifier, and demonstrated better performance. Wen et al. [27] exploited Haar-like features with an AdaBoost classifier and proposed a fast-training scheme. Tungkastan [4] proposed an approach that uses the Hausdorff distance between the Harris corner points of the vehicle to be identified and those of the ground-truth vehicle samples. The sample with the smallest distance is judged to be of the same type. The disadvantage of minimum distance matching is that it is greatly affected by random noise in the samples, especially when the samples overlap in the feature space. Zhou [32] applied the hidden Markov model (HMM) to classify vehicle types, training a class-specific HMM for each vehicle type. To sum up, traditional machine learning methods for vehicle recognition extract features with hand-designed algorithms and then apply an appropriate classifier. The extracted features, whether geometric or algebraic, are heuristic in nature and suboptimal in characterizing the image variations caused by illumination, scaling, translation and rotation.

Compared with traditional machine learning methods, deep learning performs excellently in computer vision. As pointed out by Hinton [2], deep learning is more powerful in feature extraction and discrimination. The Deep Convolutional Neural Network (DCNN) is the most popular deep model and has been extensively applied. Several works have applied CNNs to vehicle type classification. Sasongko et al. [3, 23] used pre-trained ResNet models, including ResNet34, ResNet50 and ResNet101, to classify toll station vehicle types, and reached an accuracy of over 95% with ResNet101. Lee et al. [24] successfully applied AlexNet to vehicle plate recognition, also reaching a high accuracy. Seng et al. [8] used both VGG-16 and AlexNet for vehicle type classification in traffic videos. Xue et al. [14] tested the performance of ZFNet and proposed a fused network designed especially for low-resolution images. Ali et al. [15] tested and compared several popular networks for vehicle recognition, such as ResNet, Inception-ResnetV2, InceptionV3 and NASNet, among which Inception-ResnetV2 showed the best performance. Rong et al. [11] proposed an automatic sparse encoder to obtain a deep network for vehicle type classification, using the convolution kernel to generate features. Dong et al. [30] proposed a semi-supervised convolutional neural network based on the front image of vehicles, and introduced a sparse Laplacian filter to learn from unlabeled data. In 2017, Wang et al. [13] used transfer learning and established a convolutional neural network to classify vehicle images.

Transfer learning, or semi-supervised learning that combines DNNs and filters, does not achieve much higher accuracy in vehicle type classification because of the similarity of vehicle front faces and windshields. Some networks adopt deeper architectures, but at the cost of more computational resources. Some researchers have studied feature fusion of different transfer learning networks and achieved better performance. Ali and Ragb [22] fused three approaches: Densenet201, Resnet50, and their own proposed model. The output of each network gives a vote in the classification process. Their experimental results show that the fused approach performs better than each of the networks used in the fusion acting individually. Banik et al. [20] proposed a fused convolutional neural network (CNN) model to classify images of white blood cells, fusing the feature maps of two convolutional layers through max-pooling to form the input of the fully connected layer. Their results show that the fused model trains faster than a CNN-RNN model. These prior studies show that fused models can offer advantages in performance or computing time.

In this paper, we aim to design a new Convolutional Neural Network scheme that meets the high reliability requirements of vehicle type classification. A Fused Deep Convolutional Neural Network for Vehicle Types Classification (FDCNN-VTC) is proposed. To obtain a one-dimensional feature matrix from every single network, the fully connected layer or global average pooling layer of each deep neural network is removed, so that each network in the FDCNN-VTC performs feature extraction only, without a classifier. Each separate network is then used to extract features from the vehicle frontal image. The feature matrices are fused using parallel fusion strategies, and then retrained and reclassified by a global average-pooling layer. Eventually an N-dimensional vector is generated, where N is the number of classes and each component represents the probability of the corresponding class.

The remainder of this paper is organized as follows. Section 2 briefly introduces the proposed structures of GoogLeNet, ResNet50 and InceptionV4 for vehicle type classification, and how they are adjusted to fit the input data and output categories. Section 3 introduces the overall structure of FDCNN-VTC and describes the process from the input vehicle front images through the single vehicle type classification models. Section 4 presents the experiments on the single vehicle type classification models, and compares the performance of the models, optimizers and classifiers. The performance of the proposed FDCNN-VTC is also compared with that of the top three single models to demonstrate the superiority of the fused model. Section 5 concludes the paper.

2 Convolutional neural networks for vehicle types classification

2.1 GoogLeNet for vehicle types classification

The proposed structure of GoogLeNet for Vehicle Type Classification (GoogLeNet-VTC), which uses the Inception structure as its core to facilitate addition and modification, is shown in Table 1. The Inception modules use 1×1, 3×3 and 5×5 convolutions to optimize parameters and operations, and the network adopts average pooling in place of the fully connected layer. Unlike the traditional GoogLeNet model, whose output size is (1, 1, 1024), GoogLeNet-VTC (Fig. 1) changes the block dimensions in the Inception modules, removes the two max pooling layers and doubles the convolution dimensions of the last Inception module. The final output of the model has size (1, 1, 2048). To avoid vanishing gradients, the network retains the two additional softmax blocks (auxiliary classifiers) of the traditional GoogLeNet, which help to propagate the gradient. The auxiliary classifiers take the outputs of middle layers (Inception4b and Inception4e) as temporary classification results and add them to the final classification result with a small weight (0.3), which is equivalent to a self-model-fusion and provides a controlling signal for the backpropagated gradient. Finally, dropout is not used because the fully connected layer has been removed. The input x is a 3D array of size s_i × s_j × p, and the parameters and operations of the first convolution of GoogLeNet-VTC can be calculated as:

Table 1
The proposed structure of GoogLeNet-VTC

Type	Stride	Output size	Depth	1×1	3×3 Reduce	3×3	5×5 Reduce	5×5	Pool	Params	Ops
Input shape		224×224×3
Convolution	7×7/2	112×112×64	1							9K	119M
Max pool	3×3/2	56×56×64	0						m3×3
Convolution	3×3/1	56×56×192	2		64	192				115K	360M
Max pool	3×3/2	28×28×192	0						m3×3
Inception(3a)		28×28×256	2	64	96	128	16	32	32	164K	128M
Inception(3b)		28×28×320	2	64	96	128	32	64	64	228K	179M
Inception(3c)	3×3/2	14×14×640	2	0	128	256	32	64	m3×3	398K	108M
Inception(4a)		14×14×640	2	256	96	192	32	64	128	545K	107M
Inception(4b)		14×14×640	2	224	112	224	32	64	128	595K	117M
Inception(4c)		14×14×640	2	192	128	256	32	64	128	654K	128M
Inception(4d)		14×14×640	2	160	144	288	32	64	128	722K	142M
Inception(4e)	3×3/2	7×7×1024	2	0	160	256	64	128	m3×3	717K	56M
Inception(5a)		7×7×1024	2	384	192	384	48	128	128	1.6M	78M
Inception(5b*)		7×7×2048	2	768	384	768	96	256	256	3.2M	156M
Avg pool	7×7/1	1×1×2048	0

Fig. 1

The adjusted module in Inception (5b*) in GoogLeNet-VTC

$Params = t_{i} \times t_{i} \cdot n \cdot p + bias$ (1)

$Ops = Params \cdot s_{1} \times s_{2} / S^{2}$ (2)

where s represents the size of the normalized 2D feature maps and p is the number of channels of the image. The convolution kernel is a multi-layer square tensor that can be denoted as (t_i × t_i × p, n, S), where the channel count p is the same as that of the input array, the kernel size is t_i × t_i, n is the number of kernels and S is the stride. The parameters and operations of the Inception module are calculated as follows:

$Params = p \cdot \sum_{i = 1} (t_{i} \times t_{i} \cdot n_{i} \cdot p + bias)$ (3)

$Ops = \sum_{i = 1} \sum_{j = 1} (t_{i} \times t_{i} \cdot n_{i} \cdot p + bias) \cdot (s_{i} \times s_{j}) / {S_{j}}^{2}$ (4)
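As a quick check, Equations (1) and (2) can be evaluated for the first convolution row of Table 1. The sketch below is illustrative only: the helper names are hypothetical, and it assumes one bias term per kernel (bias = n), which the text does not state explicitly.

```python
def conv_params(t, n, p, bias=None):
    """Equation (1): t x t kernel, n kernels, p input channels."""
    if bias is None:
        bias = n  # assumption: one bias term per kernel
    return t * t * n * p + bias


def conv_ops(params, s1, s2, stride):
    """Equation (2): parameters applied over the s1 x s2 input, reduced by stride^2."""
    return params * s1 * s2 / stride ** 2


# First convolution in Table 1: 7x7 kernel, 64 kernels, stride 2, 224x224x3 input.
params = conv_params(t=7, n=64, p=3)
ops = conv_ops(params, s1=224, s2=224, stride=2)
print(params)  # 9472, about 9K
print(ops)     # 118816768.0, about 119M
```

Under this assumption the values round to the 9K parameters and 119M operations listed for that layer in Table 1.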

2.2 ResNet50 for vehicle types classification

The proposed structure of ResNet50 for Vehicle Type Classification (ResNet50-VTC), the core of which is residual learning, is shown in Table 2. In a conventional network, each layer is trained to fit the optimal solution mapping H(x) directly. A residual network does not fit H(x) directly, but fits the residual mapping in Equation (5). The residual mapping process is shown in Fig. 2. The optimal solution of the original mapping can then be written as F(x) + x, and the gradient of the loss is given by Equation (6).

Table 2
The proposed structure of ResNet50-VTC

Layer name	ResNet50-VTC	Output size	Params	Ops
Input shape		224×224×3
Conv1	7×7, 64, stride 2	56×56×64	9K	118M
	3×3 max pool, stride 2
Conv2_x	$\begin{bmatrix} 1 \times 1, 64 \\ 3 \times 3, 64 \\ 1 \times 1, 256 \end{bmatrix} \times 3$	56×56×256	16K	17M
Conv3_x	$\begin{bmatrix} 1 \times 1, 128 \\ 3 \times 3, 128 \\ 1 \times 1, 512 \end{bmatrix} \times 4$	28×28×512	176K	137M
Conv4_x	$\begin{bmatrix} 1 \times 1, 256 \\ 3 \times 3, 256 \\ 1 \times 1, 1024 \end{bmatrix} \times 6$	14×14×1024	0.5M	102M
Conv5_x	$\begin{bmatrix} 1 \times 1, 512 \\ 3 \times 3, 512 \\ 1 \times 1, 2048 \end{bmatrix} \times 3$	7×7×2048	2M	102M
Avg pool	7×7 average pool, stride 1	1×1×2048

Fig. 2

The residual mapping processes

$F (x) = H (x) - x$ (5)

$\frac{\partial Loss}{\partial x_{l}} = \frac{\partial Loss}{\partial x_{L}} \cdot \frac{\partial x_{L}}{\partial x_{l}} = \frac{\partial Loss}{\partial x_{L}} \cdot (1 + \frac{\partial F (x_{l}, w_{i})}{\partial x_{l}})$ (6)

where x_l represents the matrix in the previous layer, and x_L represents the result after the residual module. When the residual contribution in $1 + \frac{\partial F (x_{l}, w_{i})}{\partial x_{l}}$ is very small, the mapping can be regarded as:

$H (x) = F (x) + x$ (7)
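The role of the constant 1 in Equation (6) can be illustrated numerically. The following is a minimal sketch on scalars, not the network itself: the residual branch F(x, w) = w · tanh(x) and the values of x and w are hypothetical choices made only for the demonstration.

```python
import math


def residual_branch(x, w):
    """Hypothetical residual branch F(x, w) = w * tanh(x)."""
    return w * math.tanh(x)


def block_output(x, w):
    """Residual block output H(x) = F(x) + x, as in Equation (7)."""
    return residual_branch(x, w) + x


def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference approximation of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


x, w = 0.5, 1e-3  # tiny weight, so dF/dx is close to zero
grad_branch = numerical_gradient(lambda v: residual_branch(v, w), x)
grad_block = numerical_gradient(lambda v: block_output(v, w), x)

# The block gradient equals 1 + dF/dx, so it stays near 1 even when dF/dx is tiny.
print(grad_branch, grad_block)
```

Even when the branch gradient is close to zero, the gradient through the whole block stays close to 1, which is the property used to argue that residual connections avoid vanishing gradients.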

For instance, as shown in Fig. 3, the maximum number of parameters of this module, where the input passes through three convolution kernels, is calculated as:

Fig. 3

The Structure of Conv2_x

$Max\ Params = 1^{2} \times 64^{2} + 3^{2} \times 64^{2} + 1^{2} \times 256 \times 64 = 57K$ (8)

In the residual process, where the input is convolved once by conv (1×1, 256), the number of parameters can be optimized and calculated as:

$Min\ Params = 1 \times 1 \times 256 \times 64 = 16K$ (9)
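The arithmetic behind Equations (8) and (9) can be reproduced in a few lines. This is only a worked check; the helper name `conv_weights` is hypothetical, and biases are ignored, as they are in the two equations.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Weight count of a kernel x kernel convolution, bias ignored as in Eq. (8)-(9)."""
    return kernel * kernel * in_channels * out_channels


# Main path of Conv2_x: 1x1,64 -> 3x3,64 -> 1x1,256 on a 64-channel input.
max_params = (
    conv_weights(1, 64, 64)
    + conv_weights(3, 64, 64)
    + conv_weights(1, 64, 256)
)

# Shortcut path: a single 1x1 convolution from 64 to 256 channels.
min_params = conv_weights(1, 64, 256)

print(max_params)  # 57344, about 57K
print(min_params)  # 16384, about 16K
```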

The ResNet50-VTC uses the main residual module to optimize parameters and operations, and replaces the last fully connected layer with a global average-pooling layer. Finally, the ResNet50-VTC outputs a 1×1×2048 array, which is the result of this no-top model.

2.3 InceptionV4 for vehicle types classification

The proposed structure of InceptionV4 for Vehicle Type Classification (InceptionV4-VTC), which combines the core of the Inception and Residual modules, is shown in Table 3, where 1×1 blocks are used for parameter optimization and residual blocks are used to avoid vanishing gradients. The Inception-resnet modules of InceptionV4-VTC contain 1×1, 3×3, 1×7 and 1×3 blocks, arranged as in Fig. 4: one channel for combined #1×1 linear extraction, one for 1×1 extraction and one for 1×n extraction. InceptionV4-VTC changes the linear block dimension in the final Inception-resnet-C module to 2048, and produces an output array of size (1, 1, 2048) after 8×8 average pooling.

Table 3
The proposed structure of InceptionV4-VTC

Layer name	Depth	Output size	#1×1	1×1	#3×3	3×3	1×n	n×1	1×1(Linear)	Max pool
Input		229×229×3
Stem		35×35×256	64		32	64	64	64		3×3 m,s=2
Inception-resnet-A	5	35×35×256	32	32,2	32,2	32			256
Reduction-A		17×17×896		192	384	192				3×3 m,s=2
Inception-resnet-B	10	17×17×896	128	128			128	128	896
Reduction-B		8×8×1792		256,3	384	256,2				3×3 m,s=2
Inception-resnet-C	5	8×8×2048	192	192			192	192	2048
Average pooling		1×1×2048

Fig. 4

The structure of Inception-Resnet-A module

3 Fused deep convolutional neural networks for vehicle types classification

Since the ‘Inception’ module performs well on parameter optimization and the ‘Residual’ module of convolutional neural networks greatly helps to avoid vanishing gradients, an innovative Fused Deep Convolutional Neural Network for Vehicle Types Classification (FDCNN-VTC), based on GoogLeNet-VTC, ResNet50-VTC and InceptionV4-VTC, is proposed in this paper to improve on the performance of a single convolutional neural network. The proposed structure of the FDCNN-VTC is shown in Fig. 5. The final module of each single vehicle type classification model (GoogLeNet-VTC, ResNet50-VTC and InceptionV4-VTC) has been adjusted to fit the output size (1, 1, 2048), and the three 2048-dimensional arrays are generated in parallel and combined into a (1, 3, 2048) array in the output layer. To evaluate the combination weights, a fully connected layer is then used with a dropout rate of 0.5, which is the usual setting for single deep convolutional neural networks.

Fig. 5

The Structure of Fused Deep Convolutional Neural Networks for Vehicle Types Classification (FDCNN-VTC)

For Resnet50-VTC in the structure of FDCNN-VTC, the input vehicle image is 224×224×3, and the network consists of 3 conv2_x units, 4 conv3_x units, 6 conv4_x units and 3 conv5_x units. The first layer of Resnet50-VTC is a 7×7 convolution, and the last layer is a fully connected layer. Each conv2_x, conv3_x, conv4_x and conv5_x unit includes 3 convolutional layers, with 1×1, 3×3 and 1×1 convolution operators, respectively. Finally, output features of 1×2048 dimensions are obtained.

For InceptionV4-VTC, the input vehicle image size is 229×229×3 (Padding = SAME), and the Stem module uses stacked convolution and pooling layers. After the Stem module, 5 Inception-resnet-A units, 1 Reduction-A unit, 10 Inception-resnet-B units, 1 Reduction-B unit, 5 Inception-resnet-C units and an average pooling layer are connected in sequence, and finally an array of 2048-dimensional features is obtained.

For GoogLeNet-VTC, the input vehicle image size is 224×224×3. The first stage performs two convolution steps: a 7×7 convolution (stride = 2) with 3×3 max pooling (stride = 2), followed by a 3×3 convolution (stride = 1) with 3×3 max pooling (stride = 2), which aims to improve the nonlinearity through the activation function. The depth separable convolution units are then connected, where each unit contains a #1×1 convolution, a 3×3 convolution after a 1×1 linear operation, a 5×5 convolution after a 1×1 linear operation, and a max pooling operation with ReLU activation. Two softmax auxiliary classifiers, whose results are weighted by a small factor (0.5), are attached between the Inception 3 and 4 modules and between the Inception 4 and 5 modules. After the average-pooling layer, a 2048-dimensional output is obtained.

In the structure of FDCNN-VTC, the middle Inception, ResNet and Inception-resnet modules consist of several depth separable convolution units, including a 3×n depth separable convolution with ReLU activation, and continuous convolution units. The continuous convolution units contain n repetitions of a 1×1 depth convolution with ReLU activation, a sub-convolution operation and max pooling, followed by m repetitions of a 3×3 depth convolution with ReLU activation (m and n are decided by back propagation). Each single pre-model produces 2048-dimensional output features. The three pre-models together output 3×2048-dimensional features, which are passed through a dropout layer with rate 0.5. Finally, a fully connected layer with the softmax function is connected to the output node and outputs a 6-dimensional vector. The vehicle type corresponding to the largest component of this vector is the final classification result.
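The fusion step described above (parallel concatenation of the three feature vectors, a fully connected layer, softmax and selection of the largest component) can be sketched in plain Python. This is a simplified illustration, not the authors' Keras implementation: the feature dimension is reduced from 2048 to 8, the features and weights are random placeholders, and dropout is omitted because it is only active during training.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(features, weights, biases):
    """Concatenate per-network features, apply a dense layer, then softmax."""
    fused = [value for vector in features for value in vector]
    scores = [
        sum(w * v for w, v in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return fused, softmax(scores)


rng = random.Random(0)
FEATURE_DIM = 8   # stand-in for 2048 to keep the demo small
NUM_CLASSES = 6   # the six vehicle types

# Placeholder outputs of the three no-top networks (random, not real features).
features = [[rng.random() for _ in range(FEATURE_DIM)] for _ in range(3)]
# Placeholder dense-layer parameters; in practice these are learned.
weights = [[rng.uniform(-0.1, 0.1) for _ in range(3 * FEATURE_DIM)]
           for _ in range(NUM_CLASSES)]
biases = [0.0] * NUM_CLASSES

fused, probs = fuse_and_classify(features, weights, biases)
predicted = max(range(NUM_CLASSES), key=lambda k: probs[k])
print(len(fused), predicted, round(sum(probs), 6))
```

The fused vector has three times the length of a single feature vector, and the index of the largest probability plays the role of the predicted vehicle type.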

4 Experiments

In this section, we evaluate the performance of Convolutional Neural Networks for vehicle type classification on the SEU vehicle dataset, which is provided by Jiangsu Transportation Science Research Institute. The SEU vehicle dataset contains 9,665 raw vehicle images (including blurred, dark and multi-vehicle congested images) with sizes of 1600×1200 and 1920×1080, captured by cameras at different times and places on the Jiangsu expressway. The images vary in illumination, scale, vehicle surface colour and viewpoint, and contain frontal views of a single vehicle captured from variable distances. The vehicle front-face dataset contains six types of vehicles: Truck, Coach, Sedan, Minivan, Pickup and SUV. Examples from the SEU vehicle dataset are shown in Fig. 6.

Fig. 6

Examples of SEU vehicle dataset. (a) Truck. (b) Coach. (c) Sedan. (d) Minivan. (e) Pickup. (f) SUV.

This section describes the measurement strategy for the Region of Interest (ROI) of the vehicle frontal image. Specifically, the license plate is first segmented by finding a rectangle using the Maximally Stable Extremal Regions (MSER) method [16, 21]. Based on the MSER method, the characters in an image which are under the length of 7 (the length of a Chinese number plate) are recognized as the license plate area, and are covered with a gray bounding rectangle. The lower left and right edges of the license plate area are taken as the edges of the vehicle frontal part area, and the upper edge is located 2×h above the license plate (h is the height of the license plate), as shown in Fig. 7. In addition, the SEU dataset of vehicle frontal parts includes images taken not only under different light conditions, such as strong light, normal light and low light, but also in different weather, such as sunny, foggy and cloudy days, as shown in Fig. 8.
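One literal reading of the ROI rule above (plate edges as lateral bounds, upper edge 2×h above the plate) can be sketched as follows. The function name, the coordinate convention and the example plate box are assumptions made for illustration; the source does not specify them, and the plate box itself would come from the MSER step.

```python
def frontal_roi(plate_x, plate_y, plate_w, plate_h):
    """Return (left, top, right, bottom) of the frontal ROI in pixel coordinates.

    Assumes the image origin is at the top-left corner with y growing downwards,
    and that (plate_x, plate_y) is the top-left corner of the plate box.
    """
    left = plate_x
    right = plate_x + plate_w
    bottom = plate_y + plate_h           # lower edge of the plate
    top = max(0, plate_y - 2 * plate_h)  # 2*h above the plate, clamped to the image
    return left, top, right, bottom


# Hypothetical plate detection: a 140x45 px box whose top-left corner is (730, 820).
print(frontal_roi(730, 820, 140, 45))  # (730, 730, 870, 865)
```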

Fig. 7

The ROI of vehicle frontal image. (a) The height of license plate. (b) The height of vehicle frontal part.

Fig. 8

Vehicle frontal part image under different weather and illumination conditions. (a) Strong light. (b) Foggy. (c) Normal light. (d) Sunny. (e) Low light. (f) Cloudy.

The following experiments are conducted in Jupyter Notebook (IPython) with Python 3.6 and the Keras deep-learning package (TensorFlow backend). The CPU of the experimental server is an Intel Core i7-10700 2.6 GHz. The GPU is an NVIDIA GeForce Titan 1080 s with 8 GB of memory, and the overall RAM is 32 GB. In the following experiments, 3500 images from the SEU dataset, each containing a single, clear vehicle, are adopted; 60% of them are used as training data, 20% as validation data and the remaining 20% as testing data. The 3500 images cover 6 types of vehicles, namely Coach, Minivan, Pickup, Sedan, SUV and Truck, with 500, 500, 500, 1000, 500 and 500 samples, respectively.
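The sizes of the resulting subsets follow directly from these numbers. The sketch below assumes the 60/20/20 split is applied per class (a stratified split), which the text does not state; the helper name is hypothetical.

```python
CLASS_COUNTS = {
    "Coach": 500, "Minivan": 500, "Pickup": 500,
    "Sedan": 1000, "SUV": 500, "Truck": 500,
}


def split_counts(total, train=0.6, val=0.2):
    """Split a class of `total` images into train/validation/test counts."""
    n_train = round(total * train)
    n_val = round(total * val)
    return n_train, n_val, total - n_train - n_val


per_class = {name: split_counts(n) for name, n in CLASS_COUNTS.items()}
totals = tuple(sum(parts[i] for parts in per_class.values()) for i in range(3))

print(per_class["Sedan"])  # (600, 200, 200)
print(totals)              # (2100, 700, 700)
```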

4.1 Performance evaluations of convolutional neural networks for vehicle types classification

In the following experiments, the Convolutional Neural Networks are tested with a fixed batch size of 64, determined by the computing capacity of the server GPU. Because the batch size does not affect the ACC and ROC values, nor the comparison between the Adam and RMSprop optimizers required in the subsequent fusion experiment, different batch sizes are not discussed in this section. The convergence curves of the Convolutional Neural Networks for vehicle type classification with the Adam optimizer are shown in Fig. 9, and those with the RMSprop optimizer are shown in Fig. 10. In Fig. 9(a) and Fig. 10(a), the vertical axis represents the cost function value of each single network and the horizontal axis represents the epochs. In Fig. 9(b) and Fig. 10(b), the vertical axis represents accuracy. After 20 epochs, the cost function value continues to decrease as the epochs grow, while the accuracy continues to increase, suggesting that the single networks basically fit the dataset. At about epoch 12, the RMSprop optimizer converges better than the Adam optimizer, showing more efficiency and adaptability on the SEU vehicle dataset.

Fig. 9

The training and validation convergence curve (optimizer = Adam). (a) The training and validation loss. (b) The training and validation accuracy

Fig. 10

The training and validation convergence curve (optimizer = RMSprop). (a) The training and validation loss. (b) The training and validation accuracy

Table 4 lists the accuracies of GoogLeNet for vehicle types classification (GoogLeNet-VTC), InceptionV4 (InceptionV4-VTC) and ResNet50 (ResNet50-VTC), compared with ZFNet (ZFNet-VTC), AlexNet (AlexNet-VTC), ResNet101 (ResNet101-VTC) and VGG16 (VGG16-VTC), under the two optimizers. In terms of the average training and validation accuracy, the Adam optimizer is ahead in training accuracy by about 1% but lags in validation accuracy by about 5%, suggesting a higher risk of overfitting. Compared to the Adam optimizer, the RMSprop optimizer has a smaller gap between training and validation accuracies (about 5% compared to 11%), and a greater average accuracy on the top 3 single networks (about 0.1% on training and 6% on validation). RMSprop leads to better performance and can be applied as the optimizer of the fused deep neural network.

Table 4

Accuracy of different single networks

	optimizer= RMSprop		optimizer= Adam
	train-acc	val-acc	train-acc	val-acc
VGG16-VTC	0.7031	0.7396	0.7302	0.6840
ResNet50-VTC	0.9434	0.8226	0.9533	0.7884
ResNet101-VTC	0.9637	0.7919	0.9732	0.7633
InceptionV4-VTC	0.9679	0.9129	0.9568	0.8386
AlexNet-VTC	0.7618	0.7278	0.7767	0.7017
ZFNet-VTC	0.8211	0.7825	0.8232	0.7410
GoogLeNet-VTC	0.9101	0.8944	0.9077	0.8223
Average	0.8673	0.8102	0.8744	0.7628
Average of Top3	0.9405	0.8766	0.9393	0.8164
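The "Average" row of Table 4 and the gap between training and validation accuracy can be recomputed from the per-network values. The snippet below is only a worked check of the table; the dictionary and function names are hypothetical.

```python
# (train, val) accuracies from Table 4, keyed by network, for each optimizer.
RMSPROP = {
    "VGG16": (0.7031, 0.7396), "ResNet50": (0.9434, 0.8226),
    "ResNet101": (0.9637, 0.7919), "InceptionV4": (0.9679, 0.9129),
    "AlexNet": (0.7618, 0.7278), "ZFNet": (0.8211, 0.7825),
    "GoogLeNet": (0.9101, 0.8944),
}
ADAM = {
    "VGG16": (0.7302, 0.6840), "ResNet50": (0.9533, 0.7884),
    "ResNet101": (0.9732, 0.7633), "InceptionV4": (0.9568, 0.8386),
    "AlexNet": (0.7767, 0.7017), "ZFNet": (0.8232, 0.7410),
    "GoogLeNet": (0.9077, 0.8223),
}


def mean_accuracies(table):
    """Return (mean training accuracy, mean validation accuracy)."""
    n = len(table)
    return (sum(t for t, _ in table.values()) / n,
            sum(v for _, v in table.values()) / n)


for name, table in (("RMSprop", RMSPROP), ("Adam", ADAM)):
    train, val = mean_accuracies(table)
    # Prints the two averages and the train-validation gap for each optimizer.
    print(name, round(train, 4), round(val, 4), round(train - val, 4))
```

The averages reproduce the "Average" row, and the train-validation gap comes out at roughly 0.06 for RMSprop against roughly 0.11 for Adam.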

To test the stability and robustness of the seven Convolutional Neural Networks above (GoogLeNet-VTC, InceptionV4-VTC, ResNet50-VTC, ZFNet-VTC, AlexNet-VTC, ResNet101-VTC and VGG16-VTC), 100 cross-validation test experiments are performed. Each experiment randomly extracts 1800 images from the test set and calculates the accuracies of the seven networks. Bar plots of the top classification accuracy and box plots of the classification rates are shown in Fig. 11(a). The graph shows that AlexNet-VTC, VGG16-VTC and ZFNet-VTC have the lowest accuracy and variance. As the number of epochs grows, ResNet101-VTC overfits: its training accuracy increases while its validation and testing accuracies decrease. The Area Under Receiver Operating Characteristic (AUROC) values for abnormality detection in the vehicle frontal part testing dataset are shown in Fig. 11(b). The AUROC is designed to comprehensively evaluate the True Positive Rate (TPR) and False Positive Rate (FPR) for abnormality detection. According to the AUROC values, ResNet50-VTC, InceptionV4-VTC and GoogLeNet-VTC perform better overall than ZFNet-VTC, AlexNet-VTC, ResNet101-VTC and VGG16-VTC on this testing dataset.

Fig. 11

The accuracies of the seven Convolutional Neural Networks. (a) The box plot of networks accuracy. (b) The bar plot of AUROC values.

4.2 Performance evaluations of fused deep convolutional neural networks for vehicle types classification

In this experiment, 3500 images from the SEU vehicle dataset are adopted, 60% of which are used as training data, 20% as validation data and the remaining 20% as testing data. The categories are Coach, Pickup, Minivan, Sedan, SUV and Truck. Because of the complex structure and the many layers of the fused network, a large batch size would be very demanding for the server's computing capacity. A smaller batch size (24), about one third of the previous one (64), is therefore adopted in the following experiment, which means that more epochs are needed for convergence. The training and validation rates are illustrated in Fig. 12. With the RMSprop optimizer, the accuracy convergence curves start to flatten at about epoch 50, which is about three times the number of epochs needed by the single networks. Table 5 is the confusion matrix of the FDCNN-VTC model. Each row of the confusion matrix represents the predicted class, and each column represents the true class. The diagonal indicates the rate of correctly classified samples. From Table 5, it can be seen that the correct classification rates of Coach, Truck and Pickup are the highest, and the correct classification rates of Sedan and SUV, with a large number of samples, are slightly worse. Because the frontal parts of SUVs and Sedans are highly similar, a small number of SUVs and Sedans are misclassified. The Recall and Precision rates in multi-class classification are computed by treating each class as a binary classification problem, that is, samples belonging to one specific class against samples belonging to all the other classes.

Fig. 12

The convergence curve of FDCNN-VTC. (a) train-loss. (b) train accuracy. (c) validation-loss. (d) validation accuracy.

Table 5

The confusion matrix of FDCNN-VTC (%)

Class	Coach	Pickup	Minivan	Sedan	SUV	Truck
Coach	99.09	0	0.91	0	0	0
Pickup	1.89	98.03	0.08	0	0	0
Minivan	0.19	3.02	96.79	0	0	0
Sedan	0	0	0.21	93.81	5.98	0
SUV	0	0	0.08	5.19	94.73	0
Truck	0	0	1.53	0	0	98.47
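The average of the diagonal of Table 5 and the one-vs-rest view of Recall and Precision can be sketched as follows. The percentages are copied from Table 5; the small count matrix is hypothetical and is used only to show the one-vs-rest computation, following the convention stated in the text that rows are predicted classes and columns are true classes.

```python
CLASSES = ["Coach", "Pickup", "Minivan", "Sedan", "SUV", "Truck"]

# Table 5, in percent; each row sums to 100.
TABLE_5 = [
    [99.09, 0.00, 0.91, 0.00, 0.00, 0.00],
    [1.89, 98.03, 0.08, 0.00, 0.00, 0.00],
    [0.19, 3.02, 96.79, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.21, 93.81, 5.98, 0.00],
    [0.00, 0.00, 0.08, 5.19, 94.73, 0.00],
    [0.00, 0.00, 1.53, 0.00, 0.00, 98.47],
]


def mean_diagonal(matrix):
    """Average of the diagonal entries (correct classification rates)."""
    return sum(matrix[k][k] for k in range(len(matrix))) / len(matrix)


def one_vs_rest(counts, k):
    """Precision and recall of class k; rows = predicted class, columns = true class."""
    tp = counts[k][k]
    predicted_k = sum(counts[k])              # everything predicted as class k
    actual_k = sum(row[k] for row in counts)  # everything that truly is class k
    return tp / predicted_k, tp / actual_k


# Hypothetical raw counts for a 3-class problem (not taken from the paper).
counts = [
    [90, 5, 0],
    [10, 80, 5],
    [0, 15, 95],
]
precision, recall = one_vs_rest(counts, 0)

print(round(mean_diagonal(TABLE_5), 2))  # 96.82
print(precision, recall)
```

The diagonal average of 96.82% agrees with the overall accuracy of over 96.8% reported in the abstract.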

Figure 13 shows the Recall and Precision rates of FDCNN-VTC, compared with the top-3 models ResNet50-VTC, InceptionV4-VTC and GoogLeNet-VTC, and with the recently proposed ResNet152 (no-top) architecture [33], on the SEU vehicle dataset. There are no clear improvements (about 2%) on the categories with the highest accuracy (Coach, Pickup and Truck), while SUV and Sedan show a large increase (about 8% on average) in Recall and Precision. In terms of AUROC (Fig. 14), the FDCNN-VTC achieves the largest area and performs better on the True Positive Rate (TPR).

Fig. 13

The recalls and precision rates of FDCNN-VTC, compared with ResNet50-VTC, InceptionV4-VTC, GoogLeNet-VTC and Butt et al. [33]. (a) Recalls. (b) Precisions.

Fig. 14

AUROC of FDCNN-VTC, compared with ResNet50-VTC, InceptionV4-VTC, GoogLeNet-VTC and Butt et al. [33].

5 Conclusions

For accurate and robust classification of vehicle types from surveillance images, a fused deep convolutional neural network is studied in this paper. Firstly, using transfer learning, several existing DNNs are studied and their performance is compared on a vehicle front face dataset. Secondly, by applying parallel fusion strategies to the output features, a new scheme of Fused Deep Convolutional Neural Networks (FDCNN), based on GoogLeNet, ResNet50 and InceptionV4 for vehicle type classification, is proposed. Finally, comparative experiments are carried out on the SEU vehicle dataset under complicated and changeable weather and lighting conditions. The experimental results show that the proposed FDCNN for vehicle type classification with the RMSprop optimizer is more capable of recognizing vehicle types and performs better, especially on classes that are hard to distinguish, with an increase of about 8% on similar samples (SUV and Sedan).

Acknowledgment

The authors would like to thank the Key Science and Technology Projects of Transportation Industry of Ministry of Communications (Grant No. 2020-MS5-134), the Transportation Science and Technology Plan Project of Shandong Province (Grant No. 2020B36) and the Science and Technology Project of Shandong High Speed Group Co., Ltd (Grant No. (2020) SDHS-GSGF-09). Their support is gratefully acknowledged.

References

1. Arunmozhi and Park, Comparison of HOG, LBP and Haar-Like Features for On-Road Vehicle Detection, 2018 IEEE International Conference on Electro/Information Technology (EIT) (2018), 0362–0367.

2. Krizhevsky, Sutskever and Hinton, ImageNet classification with deep convolutional neural networks, Communications of the ACM 60(6) (2017), 84–90.

3. Sasongko and Fanany M.I., Indonesia toll road vehicle classification using transfer learning with pre-trained resnet models, 2019 International Seminar on Research of Information Technology and Intelligent Systems (ISRITI) (2019), 373–378.

4. Tungkastan, Jongsawat and Premchaiswadi, A proposed framework for automated real-time detection of suspicious vehicles, 2014 Twelfth International Conference on ICT and Knowledge Engineering (2014), 90–93.

5. Hicham, Ahmed and Mohammed, Vehicle type classification using convolutional neural network, 2018 IEEE 5th International Congress on Information Science and Technology (CiSt) (2018), 313–316.

6. Chen, Cai, Zhao, Lv and Shu, Vehicle type recognition based on multi-branch and multi-layer features, 2017 IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC) (2017), 2038–2041.

7. Jang D.M. and Turk, Car-Rec: A real time car recognition system, 2011 IEEE Workshop on Applications of Computer Vision (WACV) (2011), 599–605.

8. Seng, Lin and Chen, Convolutional neural network and the recognition of vehicle types, NeuroQuantology 16(6), 720–727.

9. Armin E.U., Bejo and Hidayat, Vehicle type classification in surveillance image based on deep learning method, 2020 3rd International Conference on Information and Communications Technology (ICOIACT) (2020), 400–404.

10. Kazemi F.M., Samadi, Poorreza H.R. and Akbarzadeh-T, Vehicle recognition based on Fourier, Wavelet and curvelet transforms – a comparative study, Fourth International Conference on Information Technology (2017), 939–940.

11. Rong and Xia, A vehicle type recognition method based on sparse auto encoder, Proceedings of the International Conference on Computer Information Systems and Industrial Applications 1 (2015), 323–326.

12. Arróspide, Salgado and Camplani, Image-based on-road vehicle detection using cost-effective Histograms of Oriented Gradients, Journal of Visual Communication and Image Representation 24(7) (2013), 1182–1190.

13. Wang, Zheng, Huang and Ding, Vehicle type recognition in surveillance images from labeled web-nature data using deep transfer learning, IEEE Transactions on Intelligent Transportation Systems 19(9) (2018), 2913–2922.

14. Xue, Zhong, Wang and Hu, Low-resolution vehicle recognition based on deep feature fusion, Multimedia Tools and Applications 77(20) (2018), 27617–27639.

15. Ali and Hammad, A deep learning approach for vehicle detection, 2018 13th International Conference on Computer Engineering and Systems (ICCES) (2018), 98–102.

16. Donoser and Bischof, Efficient Maximally Stable Extremal Region (MSER) Tracking, 2016 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 1 (2016), 553–560.

17. Lin and Zhao, Application research of neural network in vehicle target recognition and classification, 2019 International Conference on Intelligent Transportation, Big Data & Smart City (ICITBS) (2019), 5–8.

18. Laopracha, Thongkrau, Sunat, Songrum and Chamchong, Improving vehicle detection by adapting parameters of HOG and kernel functions of SVM, 2014 International Computer Science and Engineering Conference (ICSEC) (2014), 372–377.

19. Mithun, Rashid and Rahman, Detection and Classification of Vehicles from Video Using Multiple Time-Spatial Images, IEEE Transactions on Intelligent Transportation Systems 13(3) (2012), 1215–1225.

20. Banik P.P., Saha and Kim, Fused Convolutional Neural Network for White Blood Cell Image Classification, 2019 International Conference on Artificial Intelligence in Information and Communication (ICAIIC) (2019), 238–240.

21. , Yang, Kong and Cui, Multi-scaled license plate detection based on the label-moveable maximal MSER clique, Optical Review 22(4) (2015), 669–678.

22. Ali and Ragb H.K., Fused Deep Convolutional Neural Networks Based on Voting Approach for Efficient Object Classification, 2019 IEEE National Aerospace and Electronics Conference (NAECON) (2019), 335–339.

23. Pan J.S. and Yang, A Survey on Transfer Learning, IEEE Transactions on Knowledge and Data Engineering 22(10), 1345–1359.

24. Lee, Son, Yoon and Park, Video Based License Plate Recognition of Moving Vehicles Using Convolutional Neural Network, 2018 18th International Conference on Control, Automation and Systems (2018), 1634–1636.

25. Awang S.M. and Nik A.N., Automated Toll Collection System based on Vehicle Type Classification using Sparse-Filtered Convolutional Neural Networks with Layer-Skipping Strategy, Journal of Physics: Conference Series 1061(1) (2018), 1–6.

26. and Guo, Vision-Based Method for Forward Vehicle Detection and Tracking, 2013 International Conference on Mechanical and Automation Engineering (2013), 128–131.

27. Wen, Shao, Fang and Xue, Efficient feature selection and classification for vehicle detection, IEEE Transactions on Circuits and Systems for Video Technology 25(3) (2015), 508–517.

28. Wang Y.C., Han C.C., Hsieh C.T. and Fan K.C., Vehicle type classification from surveillance videos on urban roads, 2014 IEEE 7th International Conference on Ubi-Media Computing and Workshops (2014), 266–270.

29. Chen, Zhu, Yao and Zhang, Vehicle type classification based on convolutional neural network, IEEE 2017 Chinese Automation Congress (CAS) (2017), 1898–1901.

30. Dong, Wu, Pei and Jia, Vehicle type classification using a semisupervised convolutional neural network, IEEE Transactions on Intelligent Transportation Systems 16(4) (2015), 2247–2256.

31. Luo and Chen, System design of real time vehicle type recognition based on video for windows (AVI) files, Intelligent Computing and Information Science: International Conference (ICICIS) (2011), 681–686.

32. Zhu-yu, Tian-min and Xian-yang, Study for vehicle recognition and classification based on Gabor wavelets transform & HMM, 2011 International Conference on Consumer Electronics, Communications and Networks (CECNet) (2011), 5272–5275.

33. Butt Z.M., Khattak, Shafique, Hayat, Abid, Kim, et al., Convolutional neural network-based vehicle classification in adverse illuminous conditions for intelligent transportation systems, Complexity (2021), 1–11.