Abstract
To overcome the missed detections of traditional stereo vision vehicle detection methods based on external features when multiple vehicles are on the road, this paper proposes a road vehicle detection method based on Mean Shift clustering and a semi-global disparity map. First, a semi-global optimization function is established to find the winner-takes-all (WTA) solution based on the principle of scan-line optimality, and a sub-pixel accurate semi-global disparity map is obtained by quadratic curve fitting. Then, the three-dimensional environment and the point cloud are reconstructed from the stereo camera calibration data and the disparity map. After that, a road plane extraction method based on lane detection is proposed, and the ground points are removed. Finally, based on an analysis of road vehicle features, the kernel function of the Mean Shift clustering algorithm is optimized to locate and detect multiple vehicles. Real road experiments indicate that the method is robust to the road environment and can effectively distinguish multiple vehicles even under occlusion.
Introduction
Intelligent driver assistance systems based on road environment perception are a topic of great interest to researchers. A variety of sensing modalities have become available for on-road vehicle detection, including radar, laser, and computer vision [1–3]. Computer vision captures abundant environmental information, but its data processing algorithms are usually complicated. Fortunately, imaging technology has progressed immensely in recent years. Cameras are cheaper, faster, and of higher quality than ever before. Concurrently, computing platforms are moving toward parallelization, such as multi-core processing and graphics processing units (GPUs). Such advances provide a good hardware foundation for computer vision and allow algorithms to pursue real-time implementation, especially for stereo vision environmental perception [4–6].
The difficulty of stereo vision vehicle detection algorithms lies in extracting valid vehicle pixels from complex environmental information. Generally, vehicle detection approaches can be divided into two broad categories: appearance-based and motion-based methods. Motion-based methods usually detect the motion characteristics of interest points in the image based on optical flow. Then, using a Kalman filter [7] and a particle filter [8], the position and motion features of these image points can be tracked. Finally, ground modeling [9], dynamic stixels [10], occupancy grids [11], and elevation maps [12] are used for scene segmentation to detect the movement of the vehicle. These approaches can detect the vehicle state and support vehicle tracking and behavior analysis, but they usually require accurate ego-motion estimation and also need some appearance-based techniques for the initial scene segmentation.
Appearance-based approaches to vehicle detection include three key steps: stereo matching, disparity data preprocessing, and vehicle detection. Among them, data preprocessing identifies and filters interfering points (for example, ground plane points) from the disparity data to improve the accuracy and computational efficiency of vehicle detection. In the V-disparity map (the profile projection of the stereo disparity map), vehicles and the road plane are represented by vertical segments and oblique straight lines, respectively. The Hough transform and RANdom SAmple Consensus (RANSAC) can be used to identify the ground plane from the disparity maps [13, 14]. Nevertheless, because of limited installation accuracy and changes in the camera pitch angle while the vehicle is running, it is hard to guarantee that the stereo vision rig stays horizontal and parallel to the road, so the projection of the ground plane in the V-disparity map is no longer a straight line. Therefore, it is necessary to improve the ground plane filtering methods for stereo vision rigs in a non-horizontal state.
After disparity data preprocessing, Bayesian models [15], histograms of depths [16], U_V-disparity, and other scene segmentation methods are used for vehicle detection based on appearance features of the vehicle, such as image intensity, color, corner points, size, and so on [17]. When there are occlusions, it is hard to differentiate vehicles and detect them accurately in the disparity data without combining motion detection. Thus, in order to detect multiple vehicles without motion state detection, it is necessary to assign each 3D point that has vehicle features directly to a cluster in the reconstructed point cloud. The Mean Shift algorithm has been applied to scene segmentation [18], monocular vision-based vehicle tracking [19, 20], and clustering because it requires little computation and is easy to implement. Hence, in this paper, the Mean Shift clustering algorithm is used for vehicle detection in the reconstructed 3D point cloud, and the kernel function of the algorithm is optimized for better accuracy.
Based on the above analysis, this paper proposes a stereo vision vehicle detection method based on Mean Shift clustering. First, a semi-global optimization function is established to find the WTA solution based on the principle of scan-line optimality, and a sub-pixel accurate semi-global disparity map is obtained by quadratic curve fitting. Then, the three-dimensional environment and the point cloud are reconstructed from the stereo camera calibration data and the disparity map. After that, a road plane extraction method based on lane detection is proposed, and the ground points are removed. Finally, based on an analysis of road vehicle features, the kernel function of the Mean Shift clustering algorithm is optimized to locate and detect multiple vehicles.
On-road environmental reconstruction
Stereo vision matching algorithms can be classified into three groups: local matching, semi-global matching, and global matching. Local matching algorithms are arithmetically simple, fast, and easy to implement in hardware, but the matching result is rough and vehicles sometimes go undetected. Global matching algorithms offer high matching precision, but their heavy computation increases the matching time, which makes it hard to meet the real-time requirements of vehicle detection. The semi-global matching algorithm lies somewhere in between. It combines high matching precision, simple arithmetic, and high computational efficiency, is less sensitive to illumination changes, is robust to noise, and maintains high matching precision even under occlusion. Therefore, considering the real-time requirement, robustness, and the need for a sufficient number of reconstructed points, this paper computes disparity with the semi-global matching algorithm. Semi-global stereo matching achieves real-time performance and high precision by using one-dimensional local optimization procedures along multiple scan lines in place of a two-dimensional local optimization procedure in the image plane.
The key to the semi-global matching algorithm is the aggregation of the matching costs: an energy function is built, and the matching costs are computed along scan lines from 8 different directions based on the principle of scan-line optimality. As Fig. 1 shows, 1D smoothness constraints merged from multiple directions are used to approximate the 2D smoothness constraint in the image plane. In other words, a constant penalty term is added to the matching costs according to the change of the disparity in the scanning direction. The computation of the matching costs is a recursive procedure along each scan path: the cost at the current point consists of its original matching cost plus the minimum matching cost of the previous pixel on the scan path under the disparity smoothness constraint.
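The recursion along one scan path can be sketched as follows. This is a minimal sketch that assumes the widely used two-penalty form of semi-global matching (a small penalty `p1` for a disparity change of one level and a larger penalty `p2` for bigger jumps). The function name, the penalty values, and the list-of-lists cost layout are illustrative assumptions, not the authors' implementation.

```python
def aggregate_path(costs, p1=1.0, p2=4.0):
    """Aggregate matching costs along one scan path.

    costs[i][d] is the raw matching cost of the i-th pixel on the path
    at disparity d. Returns path costs with the same shape.
    p1 and p2 are assumed penalty values for illustration only.
    """
    n_disp = len(costs[0])
    path = [list(costs[0])]  # the first pixel has no predecessor
    for i in range(1, len(costs)):
        prev = path[-1]
        prev_min = min(prev)
        row = []
        for d in range(n_disp):
            best = prev[d]  # same disparity as predecessor: no penalty
            if d > 0:
                best = min(best, prev[d - 1] + p1)  # change by one level
            if d < n_disp - 1:
                best = min(best, prev[d + 1] + p1)
            best = min(best, prev_min + p2)  # any larger jump
            # Subtracting prev_min keeps the path cost from growing unboundedly.
            row.append(costs[i][d] + best - prev_min)
        path.append(row)
    return path
```

Summing the path costs obtained from the 8 scan directions would then give the aggregated cost of each pixel at each disparity.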
The basic steps of the algorithm are as follows: (1) A window-based local algorithm is used to compute the gray-level matching cost of each pixel. (2) A global energy function is established by aggregating the matching costs under smoothness constraints along multi-directional scan lines. (3) The WTA (winner-takes-all) strategy selects the disparity that minimizes the energy function, and the sub-pixel disparity is estimated by quadratic curve fitting. (4) The right disparity map is compared with the left disparity map, and mismatches caused by occlusion are reduced by eliminating the abnormal points that violate the consistency constraint.
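The winner-takes-all selection and the quadratic sub-pixel refinement described above can be sketched as follows. This is a minimal sketch that assumes the refinement fits a parabola through the minimum cost and its two neighbors; the function name `wta_subpixel` and the flat list of costs per pixel are assumptions made for illustration.

```python
def wta_subpixel(agg_costs):
    """Return the sub-pixel disparity for one pixel.

    agg_costs[d] is the aggregated cost at integer disparity d.
    """
    # Winner takes all: integer disparity with the minimum cost.
    d = min(range(len(agg_costs)), key=agg_costs.__getitem__)
    if d == 0 or d == len(agg_costs) - 1:
        return float(d)  # no two neighbors available for the fit
    c0, c1, c2 = agg_costs[d - 1], agg_costs[d], agg_costs[d + 1]
    denom = c0 - 2.0 * c1 + c2
    if denom <= 0.0:
        return float(d)  # flat cost curve: keep the integer value
    # Vertex of the parabola through (d-1, c0), (d, c1), (d+1, c2).
    return d + 0.5 * (c0 - c2) / denom
```

When the costs are sampled from an exact parabola, the vertex formula recovers its minimum; for real cost curves it is only an approximation of the true sub-pixel position.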
The window-based local algorithm computes correlation scores using a window of size 5×5. The gray similarity is computed with the sum of absolute differences (SAD). In addition, if the minimum matching cost is larger than that of the other disparities in the second disparity window, the disparity is set to zero in order to improve the matching precision and remove mismatches. Fig. 2 shows the experimental result obtained with this matching method. Camera calibration gives the extrinsic and intrinsic parameters of the cameras, which are used to rectify the image pairs captured by the stereo camera so that they satisfy the epipolar geometry constraint. The first image is the rectified left image; the next is the corresponding disparity map, in which the black points are unmatched points. Unmatched areas usually contain barely noticeable texture, and their disparity values fail to satisfy the constraints. The larger the gray value, the larger the disparity value, which means that the lighter a point is, the closer it is to the camera. The disparity thus reflects the real depth cues of the image points and provides a solid foundation for estimating an accurate 3D point cloud in the 3D world coordinate system.
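The sum-of-absolute-differences window cost used for the gray similarity above can be sketched as follows. This is a minimal sketch on rectified grayscale images stored as nested lists; the function name `sad_cost`, the argument order, and the convention that the left pixel at column x matches the right pixel at column x - d are assumptions for illustration.

```python
def sad_cost(left, right, x, y, d, half=2):
    """SAD between the window centered at left[y][x] and the window
    centered at right[y][x - d].

    half=2 gives the 5x5 window mentioned in the text. The caller is
    responsible for keeping both windows inside the image bounds.
    """
    total = 0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            total += abs(left[y + dy][x + dx] - right[y + dy][x + dx - d])
    return total
```

Evaluating this cost for every candidate disparity of a pixel yields the raw cost vector that the semi-global aggregation then smooths.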
After the disparity map is obtained, Equation (1) makes it possible to estimate the coordinates (X, Y, Z) in the 3D world coordinate system of each point P (x, y, d), thanks to the projection relationship of the pinhole camera model. Here, the positive X direction points horizontally to the left of the car, the positive Y direction points vertically upward, and the positive Z direction points forward. (x, y) are the coordinates of the point P relative to the projection center of the camera in the image coordinate system of the disparity map, and d is the disparity computed by the matching method. f and b, which are given by camera calibration, denote the focal length of the camera and the baseline length between the two cameras. Points whose disparity value is 0 are assigned the 3D coordinates (0, 0, 0). Each pixel of the rectified left image then corresponds to coordinates in the 3D world coordinate system, whose origin is the imaging center of the left camera. Finally, a 3D point cloud with the same number of points as the pixels of the original image is obtained.
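The reconstruction of one 3D point from its disparity can be sketched as follows. This is a minimal sketch of standard pinhole triangulation, Z = f·b/d, under stated assumptions: f is expressed in pixels, the image x axis points right and the image y axis points down, so both are negated to match the left-positive X and up-positive Y convention described above. The signs and the function name `reproject` are assumptions and may differ from the authors' Equation (1).

```python
def reproject(x, y, d, f, b):
    """Map an image point (x, y) with disparity d to (X, Y, Z).

    x, y: pixel offsets from the principal point (assumed right/down).
    f: focal length in pixels (assumed unit); b: baseline in meters.
    """
    if d <= 0:
        # Unmatched pixels are assigned (0, 0, 0), as in the text.
        return (0.0, 0.0, 0.0)
    z = f * b / d
    # Negate x and y: X is positive to the left, Y is positive upward.
    return (-x * b / d, -y * b / d, z)
```

For example, with an assumed f of 800 pixels and the 0.4 m baseline, a disparity of 20 pixels corresponds to a depth of 16 m.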
In order to reduce the complexity of vehicle detection, it is necessary to reduce the size of the point cloud Ω and remove the points that carry no vehicle information. We assume that the road within the perception range of the stereo vision system is a plane and that the height of a vehicle is no more than 4 meters above the road plane. Under this hypothesis, only the points of Ω between these two planes are retained for vehicle detection. However, as the pitch angle of the vehicle changes, the position of the road plane in the image changes, and so do its 3D coordinates. So, the first step is to detect the lane markings in the rectified left view. Then, using the two lane markings as a reference, the corresponding points are found in the point cloud Ω to determine the equation of the ground plane, and the points on the ground and too high above the ground are filtered out. In this way, the vehicle position is more accurate.
First, a median filter is applied to the rectified left view to remove noise, which improves the image quality. Then, the Canny operator is used to extract the image edges, and a region of interest is set over the area where lane markings might be present. Finally, with the angle of the straight lines limited, the Hough transform is used to detect straight lines in the region of interest. In this way, the pixel coordinates (u, v) of the starting and ending points of the two lane markings are obtained in the rectified left image. The results for the image in Fig. 2 are shown in Fig. 3.
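The angle-limited Hough voting described above can be sketched as follows. This is a minimal sketch that takes edge pixel coordinates as input, so the median filtering and Canny edge extraction are assumed to have been done already. The function name `hough_lines`, the angle range, and the bin size are hypothetical values chosen for illustration.

```python
import math

def hough_lines(edge_points, theta_deg_range=(20, 70), rho_step=1.0):
    """Find the strongest line rho = u*cos(theta) + v*sin(theta).

    edge_points: iterable of (u, v) edge pixel coordinates.
    theta_deg_range: inclusive range of line-normal angles to search
    (an assumed limit, not a value from the paper).
    Returns (theta in degrees, rho, number of votes).
    """
    votes = {}
    for theta_deg in range(theta_deg_range[0], theta_deg_range[1] + 1):
        t = math.radians(theta_deg)
        c, s = math.cos(t), math.sin(t)
        for u, v in edge_points:
            rho_bin = round((u * c + v * s) / rho_step)
            key = (theta_deg, rho_bin)
            votes[key] = votes.get(key, 0) + 1
    (theta_deg, rho_bin), count = max(votes.items(), key=lambda kv: kv[1])
    return theta_deg, rho_bin * rho_step, count
```

Restricting the angle range both reduces the number of votes to compute and rejects lines, such as horizontal edges, that cannot be lane markings.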
From the starting and ending points of the two lane markings, the equation of each lane marking in the image coordinate system can be computed. The points of the point cloud Ω that correspond to the lane markings form the point cloud Ω_L, from which unmatched points are removed. The ground plane in the 3D world coordinate system is assumed to satisfy Equation (2), whose coefficients a0, a1, a2 can be solved by substituting the points P (Xi, Yi, Zi) of the point cloud into Equation (2).
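Solving for the three plane coefficients from the lane-marking points can be sketched as follows. This is a minimal sketch that assumes the plane is parameterized as Y = a0 + a1·X + a2·Z and fitted by least squares; this parameterization and the function name `fit_ground_plane` are assumptions, and the authors' Equation (2) may use a different form.

```python
def fit_ground_plane(points):
    """Least-squares fit of Y = a0 + a1*X + a2*Z to (X, Y, Z) points.

    Returns (a0, a1, a2). The plane form is an assumed parameterization.
    """
    # Build the augmented normal equations for rows [1, X, Z].
    m = [[0.0] * 4 for _ in range(3)]
    for X, Y, Z in points:
        row = (1.0, X, Z)
        for i in range(3):
            for j in range(3):
                m[i][j] += row[i] * row[j]
            m[i][3] += row[i] * Y
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point set: cannot fit a plane")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, 4):
                    m[r][c] -= factor * m[col][c]
    return tuple(m[i][3] / m[i][i] for i in range(3))
```

Points taken from two separate lane markings span both the lateral and the forward direction, which is what makes the three coefficients well determined; points from a single marking would lie on one line and leave the fit degenerate.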
The equation of the plane located 4 meters above the ground along the direction of the ground normal is given in Equation (3).
The point cloud Ω_v is obtained by filtering out the interfering points near the ground and more than 4 meters above the ground, which completes the preprocessing of the original point cloud. The depth information in the Z direction of the point cloud from Fig. 2 is shown in Fig. 4. Compared with the disparity map in Fig. 2, it is easy to see that the points on the ground and too high above the ground have been filtered out. Removing these distractions reduces the complexity of vehicle detection by point cloud clustering and improves the efficiency and accuracy of detection.
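Keeping only the points between the ground plane and the plane 4 meters above it can be sketched as follows. This is a minimal sketch that assumes the ground plane is given as Y = a0 + a1·X + a2·Z and measures height along the plane normal; the plane form, the function name `filter_between_planes`, and the 0.2 m ground margin are assumptions for illustration.

```python
import math

def filter_between_planes(points, plane, ground_margin=0.2, max_height=4.0):
    """Keep points strictly between the ground and the upper plane.

    plane: (a0, a1, a2) of the assumed form Y = a0 + a1*X + a2*Z.
    ground_margin: assumed thickness (m) of the layer treated as ground.
    max_height: 4 m vehicle height limit from the text.
    """
    a0, a1, a2 = plane
    norm = math.sqrt(1.0 + a1 * a1 + a2 * a2)
    kept = []
    for X, Y, Z in points:
        if Z <= 0:
            continue  # unmatched pixels were set to (0, 0, 0)
        # Signed distance from the ground plane along its normal.
        height = (Y - (a0 + a1 * X + a2 * Z)) / norm
        if ground_margin < height < max_height:
            kept.append((X, Y, Z))
    return kept
```

Because the height is measured relative to the fitted plane rather than the camera axes, the filter stays valid when the rig is pitched or tilted with respect to the road.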
In the preprocessed point cloud, each vehicle is represented by a set of sparse 3D points. The compactness of the sets depends on the appearance of the vehicle and is directly linked to the stereo matching method used. During clustering, we exploit the size features of vehicles: the clustering is adapted to vehicle detection by grouping the points of the 3D cloud that fall within a vehicle-sized template, so that different vehicles in the scene can be distinguished. After clustering, the position of each cluster center and its number of points are used to select the clusters that belong to vehicles, so that multiple vehicles can be detected at the same time.
The Mean Shift algorithm is a nonparametric iterative clustering technique. It requires little computation and is easy to implement, and it does not need to know the number of clusters, such as the number of vehicles in the scene. The method was introduced by Fukunaga [21] and later extended by Cheng [22], who introduced a kernel so that samples contribute differently to the Mean Shift vector depending on their distance from the shifted point, which widened the range of applications of Mean Shift. More recently, Comaniciu [23] adapted the Mean Shift algorithm to color image segmentation and tracking. The Mean Shift method considers that a d-dimensional feature space is represented by a probability density function in which each mode corresponds to a cluster. Starting from a randomly initialized data point in the feature space, it iteratively performs a gradient ascent with respect to the neighboring points until it converges to a mode. The data points associated with the same mode are finally assumed to belong to the same cluster.
Let x_i, i = 1, ⋯ , n, be the d-dimensional elements of our point cloud Ω_v. Then, choose a point x in the space; the basic form of the Mean Shift vector is:
Here c_{k,d}/(nh^2) represents the unit density, h represents the kernel bandwidth, and the kernel k(x) applies an additional weighting to the sample points in the spherical region of radius h. To detect vehicles well, this paper improves the probability density function by assuming that a vehicle can be modeled as a cuboid in the 3D world. According to the size of a vehicle, the length, width, and height of the cuboid model are assumed to be l, w, and h, respectively. The parameter of the kernel k() is adjusted so that the additional weighting is applied to the sample points within the cuboid model instead. The adjusted probability density function is as follows:
Usually, the Epanechnikov or the Gaussian kernel is used. The Gaussian kernel is defined by:
To find the maximum of the probability density function, x is computed iteratively along the gradient ascent direction. The Mean Shift vector m_{h,G}(x) is as follows:
where g(x) = -k′(x) is the negative of the first derivative of k(x). Convergence is reached when ∇f = 0. It is obtained by iteratively applying the following steps:
(1) Choose an initial point x0 and find all the points x_i within the vehicle cuboid model centered at x0.
(2) Compute the Mean Shift vector m_{h,G}(x) from the points satisfying this criterion. Stop iterating when ||m_{h,G}(x)|| < ɛ. Otherwise, set the initial point to m_{h,G}(x) + x0 and return to step (1).
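The two steps above can be sketched as follows. This is a minimal sketch of one plausible reading of the cuboid-adapted kernel: points outside a vehicle-sized box around the current center are ignored, and points inside are weighted by a Gaussian of their offset scaled by the box half-sizes. The function name `mean_shift_mode`, the box dimensions, and the exact weighting are assumptions and may differ from the authors' adjusted density function.

```python
import math

def mean_shift_mode(points, start, size=(2.0, 2.0, 5.0), eps=1e-3, max_iter=100):
    """Seek the density mode reached from `start`.

    size: assumed cuboid extents (width along X, height along Y,
    length along Z) in meters. eps: convergence threshold on the shift.
    """
    half = [s / 2.0 for s in size]
    center = list(start)
    for _ in range(max_iter):
        wsum = 0.0
        acc = [0.0, 0.0, 0.0]
        for p in points:
            # Offset scaled by the cuboid half-sizes.
            u = [(p[k] - center[k]) / half[k] for k in range(3)]
            if any(abs(c) > 1.0 for c in u):
                continue  # step (1): keep only points inside the cuboid
            w = math.exp(-0.5 * sum(c * c for c in u))  # Gaussian weight
            wsum += w
            for k in range(3):
                acc[k] += w * p[k]
        if wsum == 0.0:
            break  # no points in the window: nothing to shift toward
        new_center = [acc[k] / wsum for k in range(3)]
        shift = math.dist(new_center, center)  # norm of the Mean Shift vector
        center = new_center  # step (2): move to x0 + m(x)
        if shift < eps:
            break
    return tuple(center)
```

Running the procedure from several start points and merging the modes that land close together would give one cluster per vehicle; an anisotropic window keeps two vehicles in adjacent lanes from merging while still covering the full length of one vehicle.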
After clustering, the interfering points in the road scene are filtered out according to the position of each cluster center relative to the lane markings and the number and density of its points, which yields the clusters of points belonging to vehicles. For the point cloud of the first frame of a continuous image sequence, Mean Shift is applied with each mode initialized randomly. For the following images, the position of each cluster in the 3D world coordinate system is predicted, and the modes are initialized at the predicted positions. This reduces the complexity of the process and makes it faster. In order to detect new objects more effectively in a continuous image sequence, the Mean Shift process is also launched with a random initialization in specific areas of the image in which new vehicles could appear. For overtaking vehicles, we define areas to the left and right of the 3D world coordinate system, which is centered on the stereo camera. For approaching vehicles, we define areas farthest away from the camera-equipped vehicle. (The perception distance is determined by the stereo vision system parameters, such as focal length, baseline distance, and so on.)
This paper presents vehicle detection results in real road scenes to demonstrate the effectiveness of the method. The stereo vision system is shown in Fig. 5. The focal length of the cameras is 12 mm, the baseline distance is 40 cm, and the image resolution is 640×480. Before the experiment, the two cameras were calibrated with Zhang's calibration method [24] to obtain the intrinsic parameters. The chessboard calibration board consists of 9×7 black and white squares, each 150 mm×150 mm in size. The extrinsic parameters of the cameras are computed by the Bouguet method [25], which has a simple calibration process and high precision and is suitable for field calibration on a vehicle. The calibration results are shown in Table 1. The average projection error of the checkerboard corners is 0.123 mm. Epipolar rectification is performed on the original images using the intrinsic and extrinsic parameters of the cameras. After rectification, the stereo vision system can be treated as an ideal system with parallel optical axes. The 3D coordinates of an object can then be reconstructed from the disparity using the principle of triangulation.
After stereo matching, lane marking detection, and point cloud preprocessing on the rectified images in Fig. 2, the Mean Shift clustering method is used to detect vehicles and obtain the point cloud belonging to each vehicle. The points of each vehicle are then drawn in one color; the result is shown in Fig. 6.
The detection results in a road environment with multiple vehicles and occlusion are shown in Fig. 7. The rectified left image is on the left, the disparity map computed by the semi-global stereo matching method is in the middle, and the image on the right is the vehicle detection result, in which the points on the ground and too high above the ground have been filtered out. Given good matching results, the Mean Shift vehicle detection method can effectively distinguish multiple vehicles even under occlusion. Figure 8 shows the detection result on the point cloud without preprocessing for the third image in Fig. 7; the ground points are mistakenly clustered with the vehicle, which has a great impact on vehicle detection.
In the V-disparity map, the lane markings appear as two straight lines that do not overlap. This is because the installation error of the stereo vision system and the left-right tilt of the vehicle make the baseline of the stereo camera not parallel to the ground. The projection of the ground in the V-disparity map is therefore not a straight line, which makes ground filtering and lane marking detection based on V-disparity maps difficult to carry out. Fitting the ground plane with the two lane markings tolerates a stereo camera baseline that is not parallel to the ground, so the method is robust. At the same time, the V-disparity map shows that the vehicle with small disparity values on the left-hand side of the image cannot be easily distinguished from the background, because disparity is inversely proportional to actual distance. Moreover, it is difficult to detect vehicles only by the Hough transform and other geometric shape detection methods when there is occlusion. Therefore, when there is occlusion, vehicle detection based on V-disparity maps is harder to accomplish than with the method proposed in this paper.
Finally, the vehicle detection results from an image sequence of 100 consecutive frames are used to estimate the distance between the host vehicle and the vehicle ahead; the results are shown in Fig. 9. The statistics indicate that the missed detection rate is lower than 5% and the false detection rate is lower than 0.5%, even though occlusion occurs in the image sequence.
Conclusion
To overcome the poor detection results of traditional feature-based stereo vision vehicle detection methods when multiple vehicles are on the road, this paper proposes an on-road vehicle detection method based on Mean Shift clustering. A semi-global optimization function is established to find the WTA solution based on the principle of scan-line optimality, and a sub-pixel accurate semi-global disparity map is obtained by quadratic curve fitting. Then, the three-dimensional environment and the point cloud are reconstructed from the stereo camera calibration data and the disparity map. After that, a road plane extraction method based on lane detection is proposed, and the ground points are removed. Finally, based on an analysis of road vehicle features, the kernel function of the Mean Shift clustering algorithm is optimized to locate and detect multiple vehicles. The experimental results indicate that, compared with the traditional V-disparity-based detection method, this method filters the ground more effectively, distinguishes multiple vehicles at the same time, and can also be used for vehicle tracking. The method is also robust.
Acknowledgments
This project is supported by the National Natural Science Foundation of China (Grant Nos. 61203171, 61473057) and the China Fundamental Research Funds for the Central Universities (Grant Nos. DUT15LK13).
