A vehicle recognition algorithm based on fusion feature and improved binary normalized gradient feature

Abstract

Vehicle detection algorithms based on a single shape feature can produce false detections in expressway video monitoring, and detection with a support vector machine (SVM) over a sliding window is time-consuming. To address both problems, this paper proposes a vehicle detection method based on fast extraction of object candidate windows and a fused HOG-LBP feature. First, vehicle candidate windows are quickly extracted using the binary normalized gradient feature and background difference. Then the histograms of oriented gradients (HOG) feature and the local binary pattern (LBP) feature of each candidate window image are calculated and fused. Finally, vehicles are detected with the SVM classifier. The experimental results show that fusing shape and texture features effectively improves vehicle detection performance, and that fast extraction of candidate windows raises the SVM detection speed about 8 times, which can meet real-time engineering requirements.

Keywords

Vehicle detection, feature fusion, binary normalized gradient feature, histograms of oriented gradients

1. Introduction

Video monitoring is one of the main components of an expressway intelligent traffic detection system. It collects real-time traffic flow data to judge the traffic state of the road and identify traffic accidents intelligently, provides valuable auxiliary decision-making information for the operation and management of the road, and greatly improves management efficiency. Automatic vehicle recognition based on video sequences is the basis of traffic data acquisition and traffic incident detection in video surveillance. It is therefore of great significance to study a high-precision, real-time vehicle recognition technology that improves the reliability of detection.

Expressway vehicle detection algorithms were originally realized with simple object extraction methods such as the difference method and edge detection, but these methods perform poorly under complicated outdoor conditions such as frequent changes of light intensity and heavy noise. With the development of computer vision technology, more algorithms extract vehicle shape and texture features and combine them with machine learning methods for vehicle detection. The common approach is to calculate features of samples such as Haar-like [1], HOG and LBP, and to train a classifier with SVM or cascaded AdaBoost. Image frames are then scanned with a sliding window and each window is classified, which can effectively improve detection accuracy. Viola proposed using the integral image to calculate simple features of samples such as Haar-like features, which are used to train AdaBoost classifiers; several simple classifiers are then combined through a cascade into a complex classifier to detect the target. Hakki Can Karaimer et al. trained a classifier based on the KNN algorithm and a HOG $+$ SVM classifier, and formed the final classifier through fuzzy fusion at the decision layer to detect vehicles [2].

However, vehicle detection based on a single feature combined with machine learning may lead to false detections and missed detections in road environments with complex conditions and much interference, so it still struggles to meet the requirements of engineering practice [3]. To solve this problem, scholars have proposed vehicle detection algorithms based on higher-level features, such as convolutional neural network (CNN) [4] and Faster RCNN [5] vehicle recognition and tracking algorithms, which extract windows by deep learning. Although the detection accuracy of such algorithms is greatly improved in complex environments, their real-time performance is still low and their cost is quite high, which makes them unsuitable for engineering practice. Therefore, weighing algorithm performance against engineering practice, vehicle detection that combines shape features and texture features with an SVM classifier can effectively improve accuracy.

In this paper, a vehicle recognition algorithm is proposed that quickly extracts object candidate windows based on binary normalized gradient features and background difference. Its detection accuracy and real-time performance are significantly higher than those of the classic HOG $+$ SVM algorithm [6]. The algorithm first collects a large number of positive and negative samples (vehicle samples and background samples) and converts them into images of 64 $\times$ 64 pixels, extracts the HOG and LBP features of each sample, fuses the features and trains the SVM classifier. In the detection process, candidate windows are quickly extracted based on the binary normalized gradient features and the background difference method; the candidate windows are then resized to the sample size and the vehicle targets are detected by the trained SVM classifier.

The algorithm flowchart is shown in Fig. 1.

Figure 1.

Flow chart of the vehicle detection algorithm.

2. Feature extraction and training of SVM classifier

2.1 Features of HOG

The histogram of oriented gradients (HOG) is a feature descriptor composed of the gradient histograms of the local regions of a segmented image. It is well suited to recognizing vehicle targets because it is insensitive to light intensity and describes the shape features of objects strongly. HOG features of the vehicle samples are extracted as follows [7]: normalize the gamma space and color space of the image; compute the gradient magnitude and direction of each pixel by convolution with the gradient operators [ $-$ 1, 0, 1] and [ $-$ 1, 0, 1] ${}^{T}$ ; segment the image into cells of 8 $\times$ 8 pixels and compute a gradient histogram with 9 channels for each cell. For each block consisting of 2 $\times$ 2 adjacent cells, the HOG descriptor of the block is obtained from the gradient histograms of its cells, and the descriptors of all blocks are concatenated to obtain the 1764-dimensional HOG feature vector. $\{h_{i}|i=0,1,2\ldots l\}$ represents the HOG feature vectors of the sample set and is taken as the input of the SVM classifier, in which $l$ is the sample size and $h_{i}$ is the HOG feature vector of the $i$ th sample.
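The 1764-dimensional figure can be checked with a short calculation. The sketch below assumes that blocks slide by one cell (8 pixels) at a time, which the text does not state explicitly; the function name and parameters are illustrative only.

```python
def hog_dimension(image_size=64, cell_size=8, block_cells=2, bins=9,
                  block_stride_cells=1):
    """Length of a HOG vector for a square image.

    block_stride_cells=1 (blocks overlapping by one cell) is an assumption;
    the other defaults are the values given in the text.
    """
    cells_per_side = image_size // cell_size                       # 64 / 8 = 8
    blocks_per_side = (cells_per_side - block_cells) // block_stride_cells + 1
    values_per_block = block_cells * block_cells * bins            # 2 * 2 * 9 = 36
    return blocks_per_side ** 2 * values_per_block


print(hog_dimension())  # 7 * 7 blocks * 36 values = 1764
```

Under this assumption the 64 $\times$ 64 sample yields 7 $\times$ 7 blocks of 36 values each, which matches the dimension quoted above.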

2.2 Features of LBP

Local binary pattern (LBP) is an operator that represents local texture features in binary form. It is simple to compute, stable and discriminative, and is suitable for vehicle target detection. First, the LBP feature image of the sample is calculated, that is, the LBP value of each pixel is computed in turn from the 8 sampling points in its 3 $\times$ 3 neighborhood; then the LBP feature image is divided into 16 $\times$ 16 local blocks, and the histogram of each local block is taken as the feature vector of that block; finally, the feature vectors of all local blocks are concatenated to form the 2891-dimensional LBP feature vector of the sample. $\{l_{i}|i=0,1\ldots m\}$ denotes the LBP feature vectors of the samples, where $m$ is the number of samples and $l_{i}$ is the LBP feature vector of the $i$ th sample.
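As a minimal sketch of the basic 3 $\times$ 3 LBP operator, the code below thresholds the 8 neighbors against the center pixel and packs the results into one byte. The clockwise neighbor order, the `>=` comparison and the helper names are assumptions made for illustration. The text does not say how the histogram is binned; one configuration that happens to give 2891 values is 7 $\times$ 7 overlapping blocks with a 59-bin "uniform pattern" histogram each (58 uniform codes plus one bin for the rest), but that is a guess and not the authors' stated setup.

```python
def lbp_code(patch):
    """8-bit LBP code of the center of a 3x3 patch (list of 3 rows)."""
    center = patch[1][1]
    # Clockwise from the top-left neighbor (assumed ordering).
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for row, col in order:
        code = (code << 1) | (1 if patch[row][col] >= center else 0)
    return code


def lbp_image(img):
    """LBP codes for all interior pixels of a 2-D list of grey values."""
    height, width = len(img), len(img[0])
    return [[lbp_code([r[x - 1:x + 2] for r in img[y - 1:y + 2]])
             for x in range(1, width - 1)]
            for y in range(1, height - 1)]


def is_uniform(code):
    """True if the circular 8-bit pattern has at most two 0/1 transitions."""
    bits = [(code >> i) & 1 for i in range(8)]
    return sum(bits[i] != bits[(i + 1) % 8] for i in range(8)) <= 2


uniform_codes = sum(is_uniform(c) for c in range(256))
print(lbp_code([[9, 9, 9], [0, 5, 0], [0, 0, 0]]))  # top row brighter than center
print(uniform_codes, 49 * (uniform_codes + 1))
```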

2.3 Fusion of feature vector

Feature vector fusion means calculating the HOG and LBP feature vectors of each sample image, normalizing them and concatenating them into a single row vector, which is expressed as:

$\displaystyle X_{i}=[h_{i};l_{i}],\quad i=1,2,\ldots,m$ (1)

In Eq. (1), $X_{i}$ is the fusion feature vector of the $i$ th sample, which is used as the input for training SVM classifier; $h_{i}$ and $l_{i}$ are the corresponding feature vectors of HOG and LBP respectively.
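A minimal sketch of Eq. (1) follows. The text says the feature vectors are normalized before fusion but does not name the norm; L2 normalization is assumed here, and the function names are illustrative.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (the choice of L2 is an assumption)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def fuse(hog_vec, lbp_vec):
    """X_i = [h_i; l_i]: concatenate the two normalized descriptors."""
    return l2_normalize(hog_vec) + l2_normalize(lbp_vec)


x = fuse([3.0, 4.0], [1.0, 0.0, 0.0])
print(len(x), x[:2])

# With the dimensions quoted in Sections 2.1 and 2.2 the fused vector
# has 1764 + 2891 = 4655 entries.
print(len(fuse([1.0] * 1764, [1.0] * 2891)))
```

Normalizing each descriptor separately keeps either feature from dominating the fused vector only because of its scale or length.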

2.4 SVM classifier training

The principle of image classification with a support vector machine (SVM) is to map the input image to a high-dimensional space through a nonlinear transformation and to find a hyperplane there that separates the classes. The mapping is realized by a kernel function, and the hyperplane is the classifier between positive and negative samples [8]. Vehicle recognition can be seen as a binary classification problem, that is, judging whether the target image block is a vehicle or not, so the SVM classifier can be written as:

$\displaystyle y=\text{sign}\left[\sum_{i=1}^{N}y_{i}a_{i}K(x,x_{i})+b\right]$ (2)

In Eq. (2), $y$ is the class of the input sample image, where $y\in\{+1,-1\}$ denotes vehicle and non-vehicle samples respectively; $x$ is the feature vector of the input sample, and $x_{i}$ is the feature vector of the $i$ th training sample with label $y_{i}$ ; $K(x,x_{i})$ is the selected kernel function; $N$ is the total number of samples; $a=\{a_{1},a_{2},\ldots a_{N}\}$ and $b$ are the coefficients obtained from the sample training process.
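The decision rule of Eq. (2) can be sketched directly. The RBF kernel form $\exp(-\gamma\|x-x_{i}\|^{2})$, the value of `gamma`, the toy support vectors and the convention that a score of exactly zero maps to $+1$ are all assumptions for illustration; in practice the coefficients come from training.

```python
import math


def rbf_kernel(x, xi, gamma=0.5):
    """K(x, x_i) = exp(-gamma * ||x - x_i||^2); gamma is a hypothetical value."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(x, xi)))


def svm_decide(x, samples, labels, alphas, b, kernel=rbf_kernel):
    """y = sign(sum_i y_i * a_i * K(x, x_i) + b), with sign(0) taken as +1."""
    score = sum(y_i * a_i * kernel(x, x_i)
                for x_i, y_i, a_i in zip(samples, labels, alphas)) + b
    return 1 if score >= 0 else -1


# Toy coefficients, not trained values.
samples = [[0.0, 0.0], [2.0, 2.0]]
labels = [+1, -1]
alphas = [1.0, 1.0]

print(svm_decide([0.1, 0.1], samples, labels, alphas, b=0.0))  # near the +1 sample
print(svm_decide([1.9, 2.0], samples, labels, alphas, b=0.0))  # near the -1 sample
```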

3. Improving BING to extract alternative window

One of the main factors limiting the real-time performance of SVM detection is scanning the whole image with sliding windows. In this paper, we propose a fast method for extracting object candidate windows based on BING and background difference, which can solve the problem of the slow detection speed of SVM [9].

The normalized gradient (NG) feature is a normalized gradient in a local area; it is stable with respect to the position, length, width and scaling of the target image. Moreover, because NG features are compact, they are efficient to compute and verify. To further speed up the computation of NG features, the binary normalized gradient (BING) can be used as an approximation:

$\displaystyle g_{l}=\sum_{k=1}^{N_{g}}2^{8-k}b_{k,l}$ (3)

In Eq. (3), $g_{l}$ and $b_{k,l}$ are the NG value and the BING features respectively; $N_{g}$ is the number of leading binary digits of the normalized gradient that are kept. The BING feature is computed with simple operations such as bitwise OR and bit shift, and the process is shown in Algorithm 1.

Algorithm 1: The calculation of BING feature with image of $w\times h$
Input: Image of binary normalized gradient $b_{w\times h}$
Output: BING feature matrix $r_{w\times h}$
Initialization: $b_{w\times h}=$ 0; $r_{w\times h}=$ 0
For each point $(x,y)$ in scan-line order do
$r_{x,y}=(r_{x-1,y}\ll 1)\mid b_{x,y}$
$b_{x,y}=(b_{x,y-1}\ll 8)\mid r_{x,y}$
End for
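The two ideas above, keeping only the leading bits of an 8-bit NG value and accumulating an 8 $\times$ 8 window with shifts and bitwise OR, can be sketched in plain Python. This is an illustrative reading of Eq. (3) and Algorithm 1, not the authors' code: the input bit plane is kept in its own variable rather than reusing $b$, $x$ is taken as the column and $y$ as the row, $r$ is masked to 8 bits and the window value to 64 bits, and the 64-bit value is returned as the window feature.

```python
def top_bits_approx(g, n_g=4):
    """Approximate an 8-bit value g by its first n_g binary digits (Eq. 3)."""
    total = 0
    for k in range(1, n_g + 1):
        b_k = (g >> (8 - k)) & 1          # k = 1 is the most significant bit
        total += 2 ** (8 - k) * b_k
    return total


def bing_feature_map(bit_plane):
    """Shift-and-OR accumulation over one bit plane (2-D list of 0/1).

    r[y][x] holds the last 8 bits of row y ending at column x;
    b[y][x] packs the 8x8 window whose bottom-right corner is (x, y).
    """
    height, width = len(bit_plane), len(bit_plane[0])
    r = [[0] * width for _ in range(height)]
    b = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            left = r[y][x - 1] if x > 0 else 0
            up = b[y - 1][x] if y > 0 else 0
            r[y][x] = ((left << 1) | bit_plane[y][x]) & 0xFF
            b[y][x] = ((up << 8) | r[y][x]) & 0xFFFFFFFFFFFFFFFF
    return b


print(top_bits_approx(200, n_g=4))               # 0b11001000 -> 0b11000000
ones = [[1] * 8 for _ in range(8)]
print(hex(bing_feature_map(ones)[7][7]))          # all 64 bits set
```

Each position costs two shifts and two ORs regardless of window size, which is why no inner loop over the 64 window pixels is needed.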

As shown in Algorithm 1, the BING feature exploits the cumulative relationship between adjacent positions in each line, replaces the per-window loop with a few simple atomic operations, and greatly improves the real-time performance of the algorithm. A two-stage cascaded SVM is used to search for objects in the image. First, a linear SVM is trained on the BING features of positive and negative samples to obtain the linear template $w$ . Then the preset window slides across the image and the score of each window position is calculated with the linear template $w$ :

$\displaystyle s_{l}\approx\sum_{j=1}^{N_{W}}\beta_{j}\sum_{k=1}^{N_{g}}C_{j,k}$ (4)

$\displaystyle C_{j,k}=2^{8-k}\left(2\langle a_{j}^{+},b_{k,l}\rangle-|b_{k,l}|\right)$ (5)

In Eq. (5), $a_{j}=a_{j}^{+}-\overline{a_{j}^{+}}$ is a base vector, where $a_{j}^{+}\in\{0,1\}^{N_{W}}$ and $\overline{a_{j}^{+}}$ is its complement, so each entry of $a_{j}$ is $+1$ or $-1$ ; $N_{W}$ represents the number of base vectors. Because windows of different sizes differ in how likely they are to contain objects, the score must be corrected for each window size [10]:

$\displaystyle O_{l}=v_{i}s_{l}+t_{i}$ (6)

In Eq. (6), $v_{i}$ and $t_{i}$ are the learning coefficient and the error coefficient of the corresponding window respectively. The classifier can be obtained by calculating the coefficients $\alpha_{j}$ and $\beta_{j}$ . Its process is shown in Algorithm 2.

Algorithm 2: The calculation process of classifier
Input: $w,N_{W}$
Output: $\{\alpha_{j}\}_{j=1}^{N_{W}},\{\beta_{j}\}_{j=1}^{N_{W}}$
Initialization: $\varepsilon=w$
For $j=$ 1 to $N_{W}$ do
$\alpha_{j}=\text{sign}(\varepsilon),\ \beta_{j}=\frac{\langle\alpha_{j},\varepsilon\rangle}{\|\alpha_{j}\|^{2}}$
$\varepsilon\leftarrow\varepsilon-\beta_{j}\alpha_{j}$
End for
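Algorithm 2 can be sketched as a greedy residual fit: at each step the sign pattern of the residual becomes the next base vector and its projection coefficient becomes the weight. The code below is an illustration under stated assumptions: a zero entry of the residual is mapped to $+1$, the template `w` is a toy vector, and `approx_score` evaluates only one bit plane using the identity $\langle\alpha_{j},b\rangle=2\langle\alpha_{j}^{+},b\rangle-|b|$ for a binary vector $b$, where $\alpha_{j}^{+}$ marks the $+1$ entries of $\alpha_{j}$.

```python
def binary_approximation(w, n_w):
    """Approximate w by sum_j beta_j * alpha_j with alpha_j in {-1, +1}^d."""
    eps = list(w)                                   # residual, initialised to w
    alphas, betas = [], []
    for _ in range(n_w):
        alpha = [1 if e >= 0 else -1 for e in eps]  # sign(eps), sign(0) := +1
        beta = sum(a * e for a, e in zip(alpha, eps)) / sum(a * a for a in alpha)
        eps = [e - beta * a for e, a in zip(eps, alpha)]
        alphas.append(alpha)
        betas.append(beta)
    return alphas, betas


def approx_score(alphas, betas, b):
    """Score of a binary vector b against the approximated template."""
    ones = sum(b)                                   # |b|: number of set bits
    total = 0.0
    for alpha, beta in zip(alphas, betas):
        a_plus = [(a + 1) // 2 for a in alpha]      # 1 where alpha is +1, else 0
        total += beta * (2 * sum(p * q for p, q in zip(a_plus, b)) - ones)
    return total


w = [3.0, 1.0]                                      # toy template
alphas, betas = binary_approximation(w, 2)
print(alphas, betas)
print(approx_score(alphas, betas, [1, 0]), approx_score(alphas, betas, [0, 1]))
```

For this toy template two base vectors reproduce $w$ exactly, so the approximate scores equal the exact inner products; in general the residual shrinks with each added base vector but need not reach zero.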

Expressway video surveillance usually uses fixed cameras, so most of the background can be eliminated by the background difference method. After BING has extracted the suspected target windows, each window is compared with the foreground object image obtained by the background difference method, and a window screened by BING is used as a final alternative window if it has a large intersection with the foreground.

$\displaystyle\text{Proposal}=\left\{\begin{array}{ll}+1&T_{1}\geqslant\frac{S[\textit{pro}(i)\cap\textit{sub}(j)]}{\min\{S[\textit{pro}(i)],S[\textit{sub}(j)]\}}\geqslant T_{2}\\ -1&\text{else}\end{array}\right.$ (7)

In Eq. (7), Proposal $=+1$ denotes that the window extracted by BING is selected as a final alternative window; Proposal $=-1$ indicates that the window is filtered out and not passed to final detection; $\textit{pro}(i)$ and $\textit{sub}(j)$ are the object windows extracted by the BING feature and the background difference method respectively, whose sizes are $S[\textit{pro}(i)]$ and $S[\textit{sub}(j)]$ ; $T_{1}$ and $T_{2}$ are test thresholds.
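The screening rule of Eq. (7) can be sketched with axis-aligned boxes. The `(x, y, width, height)` box format, the threshold values `t1=1.0` and `t2=0.5`, and the choice to keep a BING window when the ratio falls in range for at least one foreground box are assumptions; the text gives no threshold values. The ratio is the intersection area divided by the smaller of the two window areas.

```python
def area(box):
    """Area of a box given as (x, y, width, height); assumed positive."""
    return box[2] * box[3]


def intersection_area(a, b):
    """Area of the overlap between two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    return max(0, x2 - x1) * max(0, y2 - y1)


def overlap_ratio(pro, sub):
    """S[pro ∩ sub] / min(S[pro], S[sub])."""
    return intersection_area(pro, sub) / min(area(pro), area(sub))


def filter_proposals(proposals, foreground, t1=1.0, t2=0.5):
    """Keep BING windows whose ratio with some foreground box lies in [t2, t1]."""
    return [pro for pro in proposals
            if any(t2 <= overlap_ratio(pro, sub) <= t1 for sub in foreground)]


bing_windows = [(0, 0, 10, 10), (100, 100, 10, 10)]   # hypothetical BING output
foreground_boxes = [(2, 2, 10, 10)]                    # hypothetical foreground box
print(filter_proposals(bing_windows, foreground_boxes))
```

Dividing by the smaller area means a small window fully inside a large foreground region scores 1, which is why an upper threshold can also be useful.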

4. Experiment results and analysis

To verify the effectiveness of the algorithm, samples are collected from the KITTI vehicle database, comprising 1360 positive samples and 3850 negative samples. The validation set contains 250 positive samples and 250 negative samples. We verify the accuracy and real-time performance of the algorithm by comparison with the classical HOG $+$ SVM vehicle detection algorithm, and train on vehicle images and road environment images captured under different conditions by the real-time video monitoring system of the expressway. We then run detection on videos under five different conditions (light change, rainy day, cloud platform (pan-tilt) movement, foggy day and normal state) to verify the robustness of the algorithm in different environments.

4.1 Analysis of SVM classifier training

The required data samples, including the training set and the validation set, are obtained from the KITTI vehicle database. All samples are resized to 64 $\times$ 64 pixels. The HOG and LBP features are calculated and fused. The RBF kernel function is selected and the parameters are adjusted to train the SVM classifier. The detailed process is as follows:

Step 1: Collection and preprocessing of positive and negative samples. $T=\{(x_{1},y_{1}),(x_{2},y_{2}),\ldots,(x_{n},y_{n})\}$ represents the set of samples; $x_{i}\in R^{n}$ represents the $i$ th sample; $y_{i}\in Y=\{+1,-1\}$ , where $y_{i}=+1$ denotes a positive sample and $y_{i}=-1$ denotes a negative sample.

Step 2: The HOG and LBP feature vectors of all samples are calculated, then normalized and fused to obtain the row vector $X_{i}=[h_{i};l_{i}]$ as the input of the SVM classifier, where $h_{i}$ and $l_{i}$ are the HOG and LBP feature vectors of the $i$ th sample respectively.

Step 3: Classifier training. The first task is the choice of kernel function; the most commonly used kernel functions are the linear kernel and the RBF kernel. By comparing test results, we found that the RBF kernel is about 0.45% higher than the linear kernel in detection accuracy, so the RBF kernel is selected as the kernel function for training the SVM classifier. The next task is the selection of the penalty factor $C$ , which controls the loss incurred by outlier samples: the greater its value, the greater the impact of outliers on the loss of the objective function. To prevent overfitting, $C$ is chosen between 0.01 and 10, where it gives better results. After the parameters are determined, the SVM classifier is trained, which is a long process.

Step 4: To test the detection performance of the trained SVM classifier, we compare and analyze the precision and recall of different algorithms on the same test set.

$\displaystyle\textit{precision}=\textit{TP}/(\textit{TP}+\textit{FP});\quad\textit{recall}=\textit{TP}/(\textit{TP}+\textit{FN})$ (8)

In Eq. (8), TP, FP, TN and FN denote the numbers of true positive, false positive, true negative and false negative samples respectively. Because precision and recall can conflict, the comprehensive evaluation index (measure) and the accuracy are used to indicate the detection performance of different algorithms [11].

$\displaystyle\textit{accuracy}=(\textit{TP}+\textit{TN})/(\textit{TP}+\textit{FP}+\textit{TN}+\textit{FN});\quad\textit{measure}=\textit{precision}\times\textit{recall}/[(1+\alpha)\textit{precision}+\alpha\times\textit{recall}]$ (9)

In Eq. (9), $\alpha$ is the adjustment factor, usually set as 0.5.
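As a quick check of the definitions, the sketch below computes precision, recall and accuracy from the TP, FP, TN and FN counts listed in Table 1, using the standard definition recall $=$ TP $/$ (TP $+$ FN). The function and variable names are illustrative; the measure index is not computed here.

```python
def metrics(tp, fp, tn, fn):
    """Precision, recall and accuracy from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# (TP, FP, TN, FN) rows as listed in Table 1.
table1 = {
    "LBP": (216, 21, 182, 20),
    "HOG": (186, 8, 196, 15),
    "Fast RCNN": (205, 3, 200, 3),
    "HOG + LBP": (197, 5, 195, 7),
}

for name, counts in table1.items():
    m = metrics(*counts)
    print(f"{name}: accuracy {100 * m['accuracy']:.2f}%, "
          f"precision {100 * m['precision']:.2f}%, recall {100 * m['recall']:.2f}%")
```

The accuracy values computed this way agree with the Accuracy column of Table 1 to two decimal places.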

Table 1

Comparison of classifiers with different features

Detection method	TP	FP	TN	FN	Accuracy (%)	Measure (%)
LBP	216	21	182	20	90.66	90.16
HOG	186	8	196	15	94.32	94.53
Fast RCNN	205	3	200	3	98.54	98.51
HOG $+$ LBP	197	5	195	7	97.03	97.01

Table 2

Comparison of classifiers in different conditions (accuracy, %)

Detection method	Light change	Rainy day	Cloud platform movement	Foggy day	Water droplet on camera lens	Normal state
LBP	46.52	40.25	65.42	45.22	34.27	72.34
HOG	65.82	51.24	71.35	57.17	36.26	81.67
HOG $+$ LBP	60.21	52.02	75.69	59.25	35.64	88.96
Fast RCNN	84.19	83.77	91.87	90.63	40.81	94.23
Our algorithm	88.57	78.68	89.24	87.48	37.58	92.68

Table 1 provides the training and validation results of the different algorithms on the KITTI vehicle database. The data show that Fast RCNN, which uses deep learning to extract high-level features, obtains the best performance at the cost of a sharp increase in computation. In addition, the detection algorithm based on the fused feature performs better than the single-feature algorithms: accuracy rises from 90.66% (LBP) and 94.32% (HOG) to 97.03%, and the comprehensive evaluation index from 90.16% and 94.53% to 97.01%. The result indicates that integrating HOG and LBP features enhances the separation between vehicle and non-vehicle samples and improves the detection performance of the algorithm.

4.2 Real time detection and analysis of vehicle targets

To verify the real-time detection performance of the proposed algorithm, we collect expressway videos under different conditions as a verification database, including light change, rainy day, cloud platform movement, foggy day, water droplets on the camera lens and normal state. Each video lasts 1 $\sim$ 2 min, that is, about 3000 frames.

We compared the different algorithms on the same database, and the results are shown in Table 2.

The results show that the detection rate with the fused feature is higher than that with a single feature in most conditions. However, the features of the target objects are seriously blurred under two conditions (rainy day and water droplets on the camera lens), so the algorithms above do not perform well there, and the detection rate is about 40% lower than in the normal state.

The proposed algorithm uses candidate windows instead of the sliding window, which improves detection under the three conditions of foggy day, cloud platform movement and light change. In particular, on foggy days the detection rate rises by about 28 percentage points (from 59.25% for HOG $+$ LBP to 87.48% in Table 2). Still, the deep learning algorithm using high-level features achieves the best level in most environments.

Detection rate and real-time performance are two basic requirements for expressway video surveillance. However, the deep learning network with the best detection performance is expensive and computationally heavy, so it cannot guarantee real-time operation [12]. In addition, target detection algorithms based on an SVM classifier usually traverse the image with a sliding window to extract features and detect, and the high-dimensional features lead to excessive computation, which seriously affects the real-time performance of the detection system.

Figure 2.

Detection time of different algorithms.

Because the surveillance cameras on the freeway are fixed, object windows are quickly extracted with the improved BING method, which solves the time-consuming problem of SVM sliding window detection quite well. First, the BING feature is used to quickly extract the alternative windows; the number of windows is less than 1% of that of the sliding window method while the target detection rate stays above 99% [13]. On this basis, background difference is used to narrow the region of interest, further reducing the number of alternative windows to 102 or even fewer. Finally, all the candidate windows are preprocessed, their features are extracted, and the trained SVM classifier determines whether each object is a vehicle target. The improved BING algorithm extracts object windows quickly: the average processing time per image is 0.003 $\sim$ 0.004 s.

The detection times of the different algorithms on the videos are compared in Fig. 2. The figure shows that, within the same detection method, increasing the feature dimension increases computation dramatically, mainly in feature extraction. The algorithm proposed in this paper uses fast object window extraction instead of sliding window detection, and its average detection time is reduced to 0.043 s/frame. Compared with the fused feature combined with the traditional sliding window SVM, whose detection time is 0.341 s/frame, the speed-up is about 8 times, which greatly reduces the time consumed and improves the real-time performance of the algorithm.
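The quoted speed-up follows directly from the two per-frame times; the short calculation below also converts the proposed method's time into frames per second. The variable names are illustrative, and the throughput figure is derived here rather than reported in the experiments.

```python
sliding_window_s = 0.341   # fused feature + sliding window SVM, s/frame
proposed_s = 0.043         # proposed candidate-window method, s/frame

speedup = sliding_window_s / proposed_s
frames_per_second = 1.0 / proposed_s

print(f"speed-up: {speedup:.2f}x, throughput: {frames_per_second:.1f} frames/s")
```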

5. Conclusions

In this paper, a vehicle recognition algorithm for expressways based on HOG $+$ LBP $+$ SVM is proposed. First, a sufficient number of positive and negative samples are collected and their HOG and LBP features are extracted, fused into a feature vector and used to train the SVM classifier. Then the object alternative windows are extracted based on the BING feature and the background difference approach, and the vehicle targets are identified by the trained SVM classifier, realizing vehicle recognition on the expressway. The experimental results show that combining the shape feature and the texture feature increases the discrimination between positive and negative samples and improves the detection rate of the algorithm in different environments. At the same time, fast extraction of object windows reduces the detection time of the SVM algorithm: the detection speed increases about 8 times, which meets the real-time requirement of the system. In follow-up work, a more advanced detection algorithm can be used to detect the target vehicles, further improving the detection rate of the expressway video surveillance system in complicated environments.

References

1. Qiu Q.J., Liu and Cai D.W., Vehicle detection based on LBP features of the haar-like characteristics, Intelligent Control and Automation, New York, IEEE, 2015, pp. 1225–1227.

2. Ding and Tao, Robust face recognition via multimodal deep face representation, IEEE Transactions on Multimedia 17(11) (2015), 2049–2058.

3. Karaimer H.C., Cinaroglu and Bastanlar, Combining shape-based and gradient-based classifiers for vehicle classification, in: International Conference on Intelligent Transportation Systems, New York, IEEE, 2015, pp. 800–805.

4. Song, Rui and Zha et al., The AdaBoost algorithm for vehicle detection based on CNN features, in: International Conference on Internet Multimedia Computing and Service, New York, ACM, 2015, pp. 1–5.

5. Ren and Girshick et al., Faster RCNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 39(6) (2015), 1137.

6. Farabet, Couprie and Najman et al., Learning hierarchical features for scene labeling, IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8) (2013), 1915–1929.

7. Lee S.H., Bang M.S. and Jung K.H. et al., An efficient selection of HOG features for SVM classification of vehicle, IEEE International Symposium on Consumer Electronics, New York, IEEE, 2015, pp. 1–2.

8. and Kiros et al., Show attend and tell: Neural image caption generation with visual attention, in: Proc International Conference on Learning Representations, Lille, France, 2015, pp. 2048–2057.

9. Cheng M.M., Zhang and Lin W.Y. et al., BING: Binarized normed gradients for objects estimation at 300fps, Computer Vision and Pattern Recognition, New York, IEEE, 2014, pp. 3286–3293.

10. Lalimi M.A., Ghofrani and Mclernon, A vehicle license plate detection method using region and edge based methods, Computers and Electrical Engineering 39(3) (2013), 834–845.

11. Sivaraman and Trivedi M.M., Active learning for on-road vehicle detection: A comparative study, Machine Vision and Applications 25(3) (2014), 599–611.

12. Taigman, Yang and Ranzato et al., Deepface: Closing the gap to human-level performance in face verification, in: IEEE Conference on Computer Vision and Pattern Recognition, Columbus, 2014, pp. 1701–1708.

13. Vinyals, Toshev and Bengio et al., Show and tell: A neural image caption generator, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, 2015, pp. 3156–3164.