Abstract
Object detection and tracking are critical and challenging problems in vehicle environment perception systems, and have received broad attention in recent years. This paper proposes a novel detection and tracking algorithm that takes both accuracy and real-time performance into account. First, we employ a fusion algorithm based on stereo vision and deep learning for object detection, which achieves high accuracy by combining two complementary algorithms. Then, a prediction-association algorithm that uses a Kalman filter and Hungarian assignment is employed for multiple object tracking. In addition, a detection and tracking framework based on stereo vision improves the robustness of the environment perception system. Experimental results demonstrate that the proposed algorithm has high accuracy and can meet the real-time performance requirement.
Introduction
Environment perception is a critical capability for intelligent vehicles, and it mainly includes object detection and tracking [1, 2]. As a key technology, object detection and tracking also has important value in computer vision research. Thus, improving the accuracy and real-time performance of object detection and tracking has become a key problem in environment perception for intelligent vehicles.
Object detection algorithms based on deep learning have become mainstream because of their high accuracy [3–6], and they can overcome detection problems caused by illumination variation or occlusion. However, deep learning algorithms also have limitations. Although many scholars are working on methods to interpret deep learning [7–9], it still suffers from poor interpretability, especially in deeper networks, and feasibility analyses are currently based on tests and experience. In addition, deep learning requires large amounts of computation [10] and depends heavily on the computing capacity of modern computers, while applications of deep learning in intelligent vehicles mainly depend on costly embedded platforms. This severely inhibits the adoption of deep learning algorithms in vehicles. Many researchers have simplified the network structure to cut the computational cost of deep learning algorithms, resulting in solutions such as Tiny-Yolo and Tiny-SSD [11]. However, this simplification also reduces accuracy.
Stereo vision-based object detection refers to a class of simple and fast methods that extract target features from disparity images [12, 13]. These methods perform well in complex background scenes, especially in complex traffic scenes, and are also applied to moving object detection. However, they suffer from a serious limitation, namely the ill-posed nature of stereo matching caused by its complicated non-linear characteristics, which means these algorithms have poor robustness.
Fusion detection algorithms combining deep learning and traditional algorithms are becoming more advanced. These algorithms overcome the drawbacks of deep learning [14], which is mainly reflected in improving the interpretability of deep learning, and have high object detection accuracy. For example, a feature pyramid network, a novel network structure fusing deep learning with a feature pyramid, is proposed in [15], which can improve object detection accuracy.
In this paper, a novel vehicle object detection algorithm is proposed, which can improve the real-time performance and accuracy of object detection by fusing stereo vision with deep learning. Since Tiny-Yolo is a typical deep learning algorithm for object detection, which has a simplified network structure with high real-time performance, in this paper, we fuse Tiny-Yolo with stereo vision to detect objects. In addition, missing detection is a potential issue for object detection algorithms, which becomes more apparent in videos with continuous sequences. However, this issue can be solved well through fusing detection with tracking, which will be described in Section 4.
Continuous object tracking is a challenging task in computer vision, especially in situations with multiple moving objects. The appearance of shadow, illumination variation, challenging weather, motion blur and dynamic background are difficult challenges in multiple object tracking. The main tracking algorithms are split into three groups: early object tracking algorithms, kernel correlation filter (KCF) algorithms, and deep learning algorithms. Early object tracking algorithms mainly include optical flow and MeanShift, which first extract the target features and search for similar features in the next frame image to determine target position. However, these algorithms do not adapt to situations where illumination changes or the target is deformed, and, due to the lack of predictive models, it is difficult for these algorithms to deal with occlusion. Thus, early object tracking algorithms have limitations both in accuracy and real-time performance.
Compared with these algorithms, the tracking speed of KCF-based algorithms is faster and their accuracy is greatly improved [16, 17]. Tracking algorithms based on KCF use visual features to formulate a correspondence between object instances across frames, and can thus handle effectively partial or full occlusions [18]. However, feature extraction is required for each frame image, which causes poor real-time performance. In addition, the complexity of KCF-based tracking is high, because it is related to the number of pixels processed to extract the visual feature and to match them, which comes at a high computational cost [19].
Tracking algorithms based on deep learning have higher accuracy, but object tracking is different from object detection: in object tracking, only the initial image is available for training. Thus, it is difficult for deep learning algorithms to perform high-performance classification when there is not enough training data; their real-time performance is poor [20], and they cannot track fast-moving objects in real time.
In order to further improve real-time performance, we employ a tracking algorithm based on a Kalman filter [21, 22] and Hungarian assignment [23] to achieve faster multiple object tracking. This process does not involve feature extraction, which means the tracking algorithm has low computational complexity and can improve the real-time performance of object tracking. Stereo vision is also utilised to improve the Hungarian assignment algorithm, solving potential missing detection issues. Hence, the detection accuracy and the system robustness are also improved, i.e. the vehicle environment perception method proposed in this paper can achieve a substantial performance improvement through the fusion of detection and tracking.
This paper is organized as follows. Section 2 presents the integrated detection and tracking framework and the flow of environment perception. Section 3 presents the fusion detection algorithm, while Section 4 presents the tracking algorithm, which takes accuracy and real-time performance into account and is effective for online multiple object tracking. Section 5 shows, through corresponding experiments, how the new algorithm improves detection accuracy and achieves favorable tracking accuracy and real-time performance. Section 6 provides a summary and discusses future improvements.
Integrated detection and tracking system framework
Our integrated environment perception system framework is shown in Fig. 1. It mainly includes two modules, the object detection module and the tracking module, and each module includes several sub-modules. The object detection module employs a fusion detection algorithm based on Tiny-Yolo and stereo vision. Compared with detection based on Tiny-Yolo alone, the detection result is clearly improved by using stereo vision to make up for the limitations of Tiny-Yolo. The stereo vision part mainly consists of the UV-disparity algorithm, which extracts object features from the disparity image and filters out redundant information such as the ground and the sky. The fusion of stereo vision and Tiny-Yolo improves the detection accuracy. The output of the object detection module is the detection image, which includes the class name, position and size of the targets and is also the input of the tracking module. The tracking algorithm mainly includes a Kalman filter and a Hungarian assignment algorithm. In order to associate the interframe data, we build a motion state model of the bounding boxes generated by object detection. The Kalman tracker prediction sub-module predicts the state of each currently detected target in the next frame. In the assignment sub-module, associations between the currently detected bounding boxes and the bounding boxes predicted by the Kalman tracker prediction sub-module from the previous frame are created by calculating the similarity between corresponding boxes. The motion states of the predicted bounding boxes are updated with the matched detection bounding boxes. To handle potential missing detections and occlusions, we introduce stereo vision to determine when object tracking should terminate and thus manage the tracking lifespan. In order to improve the accuracy of subsequent tracking, the parameters of the Kalman filter are updated in the Kalman tracker update module.

Fig. 1. Integrated detection and tracking system framework.
A detailed analysis of the integrated framework is provided in the following sections.
In this paper, we employ Tiny-Yolo as the deep learning algorithm [24]. Compared with the Yolo algorithm, Tiny-Yolo has a superior detection speed because it has a simplified network and fewer parameters than traditional Yolo. In addition, Tiny-Yolo consumes less storage and has a lower computational cost on embedded platforms. However, Tiny-Yolo also has some limitations in terms of accuracy. In order to compensate for these limitations, we introduce stereo vision into object detection.
Limitations of deep learning detection algorithm
Fig. 2 shows typical examples of the main limitations of deep learning-based object detection. In Fig. 2(a), the car has two different bounding boxes, which demonstrates the duplicate detection phenomenon, i.e., the same object has several different detection results. There are many algorithms to solve this issue. Among them, non-maximum suppression (NMS) [25] is a simple and effective method. The aim of NMS is to suppress the candidates that do not have the maximum confidence. In the process of object detection, there are many candidate bounding boxes for the same object. Based on the position information of the candidate bounding boxes, the bounding box with the highest confidence is selected as the final detection result and the other, overlapping bounding boxes are filtered out. The effect is shown in Fig. 3. Compared with Fig. 2(a), there is only one detection bounding box, while the other box has been filtered out through NMS.
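The NMS step described above can be sketched in a few lines of Python. This is a minimal greedy variant, not necessarily the one used in [25]; the box format (x1, y1, x2, y2), the helper names `iou` and `nms`, and the overlap threshold of 0.5 are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    # Visit candidates from highest to lowest confidence.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap strongly with a kept one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


# Two overlapping candidates for one car, plus a separate object.
boxes = [(100, 100, 200, 200), (105, 98, 205, 195), (300, 120, 360, 180)]
scores = [0.9, 0.6, 0.8]
kept = nms(boxes, scores)
```

In this toy input the second box overlaps the first one heavily and has lower confidence, so only the first and third boxes survive.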

Fig. 2. Limitations of deep learning detection algorithms.

Fig. 3. Improved detection with NMS.
In Fig. 2(b), the car in the current lane is not detected. This phenomenon is called missing detection and is a serious limitation; the method used to solve this issue is described in Section 4. In Fig. 2(c), the car in the left lane is detected and marked by a rectangle, and the center of the rectangle shows the position of the object; however, the boundary of the rectangle does not correspond to the outline of the object. To compensate for this drawback, we introduce stereo vision into target detection.
In stereo vision, we need to extract the outline of an object accurately from a complex image. There are many algorithms used to achieve feature extraction, such as HOG features, optical flow, frame difference, background subtraction, etc. These traditional algorithms, however, do not work well in the presence of background noise and target occlusion. Thus, we employ the UV-disparity algorithm to extract the feature of objects from the disparity image, as shown in Fig. 4.

Fig. 4. Stereovision imaging model.
The stereo vision imaging model employed in this paper is shown in Fig. 4, where P(X, Y, Z) is a point in the world coordinate system; f is the focal length of the binocular camera; O_l and O_r are the centers of the left and right images, respectively; and b is the baseline distance of the binocular camera. Suppose the reference image I(u, v) is the right image, where u is the column coordinate and v is the row coordinate [26].
According to the principle of stereovision imaging, the disparity d can be calculated using expression (1):
An example disparity image generated is shown in Fig. 5(b). The disparity image gives depth of each pixel through expression (2):
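The expressions referenced here are not reproduced in this text. Under the standard rectified pinhole stereo model, with the symbols f, b, Z and d defined above, disparity and depth are related by d = f·b / Z, and hence Z = f·b / d. The following Python sketch illustrates that textbook relation only; the function names and the convention that a non-positive disparity means "no match" are assumptions.

```python
def disparity_from_depth(z, f, b):
    """Disparity (pixels) of a point at depth z, for focal length f (pixels)
    and baseline b (same length unit as z)."""
    return f * b / z


def depth_from_disparity(d, f, b):
    """Depth of a pixel with disparity d; None when d carries no valid match."""
    if d <= 0:
        return None  # assumed convention: zero or negative disparity is invalid
    return f * b / d


# Hypothetical camera: f = 800 px, b = 0.5 m.
depth = depth_from_disparity(40, 800.0, 0.5)
```

With these hypothetical camera parameters a disparity of 40 pixels corresponds to a depth of 10 m; halving the disparity doubles the depth, which is why distant objects are resolved less precisely.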

Fig. 5. The process of stereo vision.
The disparity image Δ (U, V) can be generated using expression (3).
The rows of the V-disparity image correspond to the rows of Δ(U, V); its number of columns is the maximum disparity value of Δ(U, V); and the value of each pixel in the V-disparity image is the number of pixels with the corresponding disparity in the same row of the disparity image [27]. The generation of a V-disparity image is shown in Fig. 5(c). The process of forming the U-disparity image is similar to that of V-disparity generation; we only use the information of the V-disparity image in this paper.
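As an illustration of this construction, the sketch below builds a V-disparity histogram from a small integer disparity map stored as nested lists. It is a minimal sketch under assumptions: the function name `v_disparity` is hypothetical, and zero disparity is treated as "no match" and not counted.

```python
def v_disparity(disparity, max_disparity):
    """Row v of the result counts, for each disparity value 0..max_disparity,
    how many pixels in row v of the disparity image have that disparity."""
    hist = []
    for row in disparity:
        counts = [0] * (max_disparity + 1)
        for d in row:
            if 0 < d <= max_disparity:  # assumption: 0 means no valid disparity
                counts[d] += 1
        hist.append(counts)
    return hist


disp = [
    [0, 0, 0, 0],  # sky: no disparity
    [1, 1, 3, 1],  # far ground with one obstacle pixel
    [2, 2, 3, 2],  # nearer ground with the same obstacle
]
vd = v_disparity(disp, 3)
```

In the result, the ground moves from column 1 to column 2 as the row index grows (an oblique line), while the obstacle stays in column 3 (a vertical line).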
The advantage of the UV-disparity algorithm is that it makes it easier and faster to extract the ground pixels from the V-disparity image [25]. As shown in Fig. 6, the ground appears as a red oblique line in the V-disparity image, and other obstacles appear as vertical lines. The analytic expression of this oblique line can be obtained using the Hough transform, and it helps us filter out the ground and objects at certain heights such as the sky. Given that our focus is on the obstacles within a certain range and on the accuracy of the binocular camera, pixels without disparity are also filtered out. After the image is processed using the UV-disparity algorithm, the ground and sky are filtered out as shown in Fig. 5(d).
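Once the oblique ground line is known, removing the ground reduces to a per-row comparison. The sketch below assumes the line has already been fitted (for example with a Hough transform, which is not shown) and is given as d = slope · v + intercept; the function name `remove_ground`, the tolerance parameter and the use of 0 to mark removed pixels are all illustrative assumptions.

```python
def remove_ground(disparity, slope, intercept, tol=1.0):
    """Zero out pixels whose disparity lies within tol of the ground line
    d = slope * v + intercept, where v is the row index."""
    filtered = []
    for v, row in enumerate(disparity):
        ground_d = slope * v + intercept
        filtered.append([0 if abs(d - ground_d) <= tol else d for d in row])
    return filtered


disp = [
    [0, 0, 0],
    [2, 5, 2],
    [4, 5, 4],
    [6, 6, 6],
]
# Hypothetical fitted ground line: disparity grows by 2 per image row.
objects_only = remove_ground(disp, slope=2.0, intercept=0.0, tol=0.5)
```

Only the obstacle pixels with disparity 5 remain; everything lying on the ground line is cleared.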

Fig. 6. Fusion algorithm detection result.
Although the objects extracted from the disparity image by the UV-disparity algorithm can appear as disconnected regions, as shown in Fig. 5(b) and Fig. 5(d), the aim of stereo vision in this paper is to extract the boundary of the objects to make up for the limitation of Tiny-Yolo rather than to achieve object detection independently.
There are two critical pieces of information contained in an image after the application of Tiny-Yolo and the UV-disparity algorithm; these are the rectangles from the Tiny-Yolo detection and the disparity regions from the UV-disparity algorithm. Next, we fuse these two pieces of information. Considering the advantages and disadvantages of Tiny-Yolo detection, i.e., that the center position of each object can be obtained accurately but the boundary position has a lower accuracy, the rectangles from the Tiny-Yolo algorithm are regarded as an initial region of interest (ROI) with a fixed center. The ROI is an area selected from image, which is the focus of image processing. The establishment of the ROI can reduce the computational complexity of image processing and increase the accuracy, so it is important to determine the suitable size of the ROI. In this paper, we employ a traversal algorithm to determine a suitable boundary of the fixed-center ROI.
In order to obtain the optimal ROI, a parameter I_con is set as the confidence threshold to estimate whether the current ROI is suitable or not. I_con is the minimum acceptable ratio between the number of pixels with disparity and the total number of pixels in the ROI. We then employ a traversal algorithm to find a suitable boundary of the ROI. Starting from the middle pixel of the right boundary, we traverse the pixels to the right until we find a pixel that has no disparity or whose depth, calculated from its disparity through expression (1), far exceeds the depth of the center pixel. The abscissa of the pixel before this one is then set as the new right boundary of the ROI. The left, top and bottom boundaries are obtained in the same way, which gives a new candidate ROI. Next, we calculate the ratio I, i.e., the number of pixels with disparity divided by the number of all pixels in the current ROI. If I is greater than I_con, we take the ROI as the final result; otherwise, we reduce the size of the ROI until the ratio exceeds the confidence threshold.
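The traversal described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: boxes are inclusive pixel coordinates (x1, y1, x2, y2), "far exceeds" is modelled by a hypothetical factor `depth_factor`, the product f·b is passed as `fb`, the ROI is shrunk by one pixel per side per step, and the default thresholds are arbitrary.

```python
def refine_roi(disp, box, fb=1.0, depth_factor=1.5, i_con=0.6):
    """Refine a detector box against a disparity map (nested lists)."""
    h, w = len(disp), len(disp[0])
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
    if disp[cy][cx] <= 0:
        return box  # no depth at the center: keep the detector box
    max_depth = depth_factor * fb / disp[cy][cx]

    def on_object(x, y):
        d = disp[y][x]
        return d > 0 and fb / d <= max_depth

    def walk(x, y, dx, dy):
        # Step outward until the image edge or the first pixel off the object.
        while 0 <= x < w and 0 <= y < h and on_object(x, y):
            x, y = x + dx, y + dy
        return x - dx, y - dy  # the pixel before the failing one

    x2 = walk(x2, cy, 1, 0)[0]
    x1 = walk(x1, cy, -1, 0)[0]
    y2 = walk(cx, y2, 0, 1)[1]
    y1 = walk(cx, y1, 0, -1)[1]

    def ratio(x1, y1, x2, y2):
        total = (x2 - x1 + 1) * (y2 - y1 + 1)
        valid = sum(1 for y in range(y1, y2 + 1)
                    for x in range(x1, x2 + 1) if disp[y][x] > 0)
        return valid / total

    # Shrink until the share of pixels with disparity exceeds i_con.
    while x2 - x1 > 1 and y2 - y1 > 1 and ratio(x1, y1, x2, y2) <= i_con:
        x1, y1, x2, y2 = x1 + 1, y1 + 1, x2 - 1, y2 - 1
    return (x1, y1, x2, y2)


# A 5 x 5 object (disparity 10) inside a 7 x 7 image; the detector box is too small.
disp = [[10 if 1 <= x <= 5 and 1 <= y <= 5 else 0 for x in range(7)] for y in range(7)]
refined = refine_roi(disp, (2, 2, 4, 4))
```

Here the undersized detector box grows until it reaches the first pixels without disparity, so the refined ROI matches the object extent.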
These constraint conditions can be described as:
Through the above fusion detection algorithm, the boundary position obtained is more accurate than that of Tiny-Yolo. As shown in Fig. 6, the car in the left lane is marked accurately. Compared with Fig. 2(c), the car in the left lane has a more accurate boundary position, which means that the fusion detection algorithm can compensate for the drawbacks of the deep learning algorithm and improve the detection accuracy.
The multiple object tracking (MOT) problem can be viewed as a data association problem, where the aim is to associate detections across frames in a video sequence [28, 29]. A tracking-by-detection framework is an efficient approach for online and real-time multiple object tracking applications; it requires that objects are detected in each frame and represented by their bounding boxes. In this paper, we employ a method based on a Kalman filter and Hungarian assignment [30] to track multiple objects. This algorithm consists of propagating object states into future frames, associating current detections with existing objects, and managing the lifespan of tracked objects. Most tracking-by-detection algorithms employ a unidirectional framework, i.e. the detection result is the only information brought forward, and the subsequent tracking operations assume that it is absolutely reliable. This requires high detection accuracy and reduces the system robustness. Thus, in this paper, we propose a framework that fuses stereo vision with tracking and then uses the tracking outcome as feedback to correct detection, which can improve the system robustness and the detection accuracy.
As shown in Fig. 1, the tracking framework mainly includes four modules, i.e. the Kalman tracker prediction, the assignment of detections to trackers, the Kalman tracker update and the management of the tracking lifespan. The aim of the Kalman tracker prediction is to predict the position of objects in the next frame using the Kalman filter, while the assignment module associates the detections with the objects predicted by the previous module. The Kalman tracker update achieves more accurate tracking by updating the state of objects using the detection bounding box, and the state of the Kalman filter is also updated to achieve more accurate prediction and tracking. Management of the tracking lifespan is achieved by choosing a suitable time to commence or cease tracking.
Motion state model
In order to propagate object states into the next frame, the motion state model of an object is described as:
The location of each object’s bounding box will be updated by using the Kalman filter for prediction in current frame. When an object is associated with an object detected in the next frame, the detected bounding box is used to update the current object’s state. If no detection is associated to the current object, its state will be only predicted using the Kalman filter.
The Kalman filter is a well-known minimum mean square error estimator with both stationary and non-stationary processing capabilities, and is widely used in object tracking [31]. The Kalman filter is defined by the following observation and state equations:
The prediction and updating processes are as follows:
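The paper's equations are not reproduced in this text. As a hedged illustration of the standard predict/update cycle, the sketch below tracks a single coordinate with a constant-velocity model. The state layout (position, velocity), the noise values `q` and `r`, and the class name `Kalman1D` are assumptions; the paper's own state vector describes a bounding box and may differ.

```python
class Kalman1D:
    """Constant-velocity Kalman filter for one coordinate, with
    F = [[1, dt], [0, 1]], H = [1, 0], Q = q * I and R = r."""

    def __init__(self, p0, q=1e-2, r=1.0):
        self.x = [p0, 0.0]                  # state: position, velocity
        self.P = [[1.0, 0.0], [0.0, 1.0]]   # state covariance
        self.q, self.r = q, r

    def predict(self, dt=1.0):
        p, v = self.x
        self.x = [p + dt * v, v]            # x = F x
        (a, b), (c, d) = self.P
        # P = F P F^T + Q, written out for the 2 x 2 case
        self.P = [
            [a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
            [c + dt * d, d + self.q],
        ]
        return self.x[0]

    def update(self, z):
        (a, b), (c, d) = self.P
        s = a + self.r                      # innovation covariance H P H^T + R
        k0, k1 = a / s, c / s               # Kalman gain K = P H^T / s
        y = z - self.x[0]                   # innovation z - H x
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        # P = (I - K H) P
        self.P = [
            [(1 - k0) * a, (1 - k0) * b],
            [c - k1 * a, d - k1 * b],
        ]


# An object moving 2 pixels per frame, observed for 30 frames.
kf = Kalman1D(0.0)
for k in range(1, 31):
    kf.predict()
    kf.update(2.0 * k)
next_position = kf.predict()  # prediction for frame 31
```

When no detection is associated with a track, only `predict` is called for that frame, which matches the prediction-only behavior described in the motion state model above.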
The key to tracking is to assign detections to existing objects. We employ the Hungarian algorithm to solve this assignment problem. In the assignment process, the intersection-over-union (IOU) between each detection and all bounding boxes predicted using the Kalman filter is used as the evaluation criterion of the assignment, which can be described as:
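The IOU of two boxes is the area of their intersection divided by the area of their union. The sketch below computes it and then finds the assignment that maximises the total IOU. To stay self-contained it searches all permutations, which yields the same optimum as the Hungarian algorithm but is only practical for a handful of boxes; in practice a routine such as SciPy's `scipy.optimize.linear_sum_assignment` would normally be used instead. The function names and the value of `iou_min` are assumptions.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign(detections, predictions, iou_min=0.3):
    """Return (matches, unmatched_detections, unmatched_predictions)."""
    n_d, n_p = len(detections), len(predictions)
    size = max(n_d, n_p)

    def score(i, j):
        # Padded (dummy) rows and columns contribute nothing.
        return iou(detections[i], predictions[j]) if i < n_d and j < n_p else 0.0

    # Brute-force stand-in for the Hungarian algorithm (small inputs only).
    best = max(permutations(range(size)),
               key=lambda perm: sum(score(i, j) for i, j in enumerate(perm)))
    # Reject pairs whose overlap is below the minimum IOU.
    matches = [(i, j) for i, j in enumerate(best)
               if i < n_d and j < n_p and score(i, j) >= iou_min]
    matched_d = {i for i, _ in matches}
    matched_p = {j for _, j in matches}
    unmatched_d = [i for i in range(n_d) if i not in matched_d]
    unmatched_p = [j for j in range(n_p) if j not in matched_p]
    return matches, unmatched_d, unmatched_p


predictions = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 21, 31, 31), (1, 0, 11, 10), (50, 50, 60, 60)]
matches, new_objects, lost_tracks = assign(detections, predictions)
```

The third detection overlaps no predicted box, so it is returned as unmatched and would start a new track.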
In addition, occlusion and shadows are always difficult problems in object tracking, because the object cannot be detected when there is occlusion or shadow. The IOU by itself is not sufficient to judge whether there is occlusion or shadow. Thus, in order to ensure the tracking quality, the Bhattacharyya distance (BD) between each detection and all bounding boxes predicted by the Kalman filter is used to measure the degree of occlusion. BD is defined as:
Efficient tracking algorithms ensure a good trade-off between tracking quality and computational complexity. In order to ensure tracking quality, visual features such as the color histogram mentioned above and HOG features should be introduced to improve the matching effect. To reduce the computational complexity, visual features are extracted only in the candidate areas rather than in the whole image. The candidate areas include the detection bounding boxes and the bounding boxes predicted by the Kalman filter. In order to improve the robustness to deformation, the Bhattacharyya distance (BD) between the color histograms of the two regions (detected and candidate bounding boxes) [32] is used as an evaluation criterion of the assignment. Furthermore, in order to improve the robustness to illumination variation, the Euclidean distance (ED) between the HOG descriptors [33] of the detected and candidate bounding boxes is also used to evaluate the assignment. ED and BD can be described as:
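The paper's formulas are not reproduced in this text, and several definitions of the Bhattacharyya distance are in use. The sketch below uses the textbook form BD = −ln Σ √(p_i · q_i) over normalised histograms, together with the ordinary Euclidean distance between two descriptor vectors. Which variant the authors used is an assumption, as are the function names.

```python
import math


def bhattacharyya_distance(h1, h2):
    """Textbook Bhattacharyya distance between two histograms of equal length.
    Histograms are normalised to sum to 1 before comparison."""
    s1, s2 = sum(h1), sum(h2)
    if s1 == 0 or s2 == 0:
        return math.inf  # an empty histogram cannot be compared
    bc = sum(math.sqrt((a / s1) * (b / s2)) for a, b in zip(h1, h2))
    bc = min(bc, 1.0)    # guard against rounding slightly above 1
    return math.inf if bc == 0 else -math.log(bc)


def euclidean_distance(f1, f2):
    """Euclidean distance between two feature vectors, e.g. HOG descriptors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))


same_shape = bhattacharyya_distance([1, 2, 1], [2, 4, 2])  # same distribution
disjoint = bhattacharyya_distance([1, 0], [0, 1])          # no overlap
hog_gap = euclidean_distance([0.0, 0.0], [3.0, 4.0])
```

Because the histograms are normalised first, two regions with the same color distribution but different sizes give a distance of zero.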
Similarly, EDmin and BDmax are set to reject assignment. From the three conditions above, the optimal candidate bounding boxes (BBoxopt) are described as:
It is also important to determine the time when tracked objects are created or destroyed. When there is a candidate bounding box whose BBoxopt is greater than others, this target is treated as a tracked object. When there is a detection with IOU less than IOUmin, it is considered as a new object, and the tracker is initialised using the object state information. Tracking will be terminated if an object is not detected for more than Tmax.
In addition, Tiny-Yolo suffers from the sudden missing detection issue mentioned in Section 3, which makes the environment perception of the system have poor robustness since tracking is based on accurate detection. Fig. 7(a) and Fig. 7(b) are two continuous frame images. In Fig. 7(a), there are three detected cars, but in Fig. 7(b), which is the next frame, there are only two detected cars, and the car in the left lane has been missed even though the car is still visible.

Fig. 7. Missing detection issue encountered with Tiny-Yolo.
There are many methods to overcome this limitation; improving detection by deepening the network structure is a common one. However, it increases the computational complexity at the expense of real-time performance. Thus, in order to overcome the limitation of sudden missing detection and improve the system robustness, we propose a fusion algorithm for tracking and detection, and introduce the stereo vision feature generated during detection into tracking. When a tracked object is not detected, the ratio I between the number of pixels with disparity and the total number of pixels in its tracked bounding box is calculated. If the ratio is greater than I_con, a missing detection has occurred, and we use the tracking state predicted by the Kalman filter to update the detection state. Thus, the missed object can be redetected by updating the detection state using the Kalman filter, and can subsequently be tracked. In Fig. 8, the car in the left lane that is not detected by the object detection algorithm is redetected through the fusion of tracking and detection.
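The recovery rule can be sketched as a small decision function that combines the disparity ratio with the termination rule based on T_max. The track representation (a dictionary with hypothetical keys `predicted_box`, `box` and `misses`), the status strings, the default thresholds and the choice to reset the miss counter after a recovery are assumptions made for illustration.

```python
def disparity_ratio(disp, box):
    """Share of pixels with valid disparity inside box (x1, y1, x2, y2), inclusive."""
    x1, y1, x2, y2 = box
    total = (x2 - x1 + 1) * (y2 - y1 + 1)
    valid = sum(1 for y in range(y1, y2 + 1)
                for x in range(x1, x2 + 1) if disp[y][x] > 0)
    return valid / total


def resolve_unmatched_track(track, disp, i_con=0.6, t_max=3):
    """Decide what to do with a track that received no detection in this frame."""
    if disparity_ratio(disp, track["predicted_box"]) > i_con:
        # Stereo still sees an object here: treat it as a missed detection and
        # fall back on the Kalman-predicted box.
        track["box"] = track["predicted_box"]
        track["misses"] = 0  # assumption: a recovery counts as a detection
        return "redetected"
    track["misses"] += 1
    return "terminated" if track["misses"] > t_max else "coasting"


disp = [[8 if 1 <= x <= 4 and 1 <= y <= 4 else 0 for x in range(6)] for y in range(6)]
track = {"predicted_box": (1, 1, 4, 4), "box": None, "misses": 1}
status = resolve_unmatched_track(track, disp)
```

With a disparity map that is empty inside the predicted box, the same call would instead increment the miss counter and eventually terminate the track.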

Fig. 8. Improvement of detection issue achieved through the proposed fusion algorithm.
In addition, we counted the number of cars detected in 100 continuous image frames, as shown in Fig. 9. The total number of cars detected in 100 frames was 248 using only the detection algorithm, but using the proposed tracking and detection fusion algorithm, the number was 277, which means that detection can be improved by fusing it with tracking.

Fig. 9. Number of cars detected by traditional and fusion tracking algorithms in 100 frames.
In this section, we show the advantages of our environment perception algorithm. We employed the VOC2012 dataset with an image resolution of 486 × 500, the KITTI dataset with an image resolution of 1224 × 370 and the MOT2015 dataset with the associated ground truth to evaluate the performance of detection and tracking in terms of accuracy and real-time performance. The VOC2012 dataset includes 17125 images; the KITTI dataset includes 7481 images, while the MOT2015 dataset includes 5503 images with a sequence length of 6′29″. In addition, we also applied our algorithm to a dataset collected by our team with an image resolution of 1280 × 720.
We employed the VOC2012 dataset to retrain the Tiny-Yolo network and obtained high recognition accuracy by updating the network parameters. The results are shown in Table 1. The mean average precision (mAP) of Tiny-Yolo without retraining was 50.71%; after retraining it increased to 66.53%, i.e. an increase of 15.82 percentage points.
Object detection performance
In order to obtain accurate evaluation results, we follow the standards of the MOT challenge and of object detection. In this paper, the evaluation metrics we selected in terms of accuracy are as follows [34]:
MOTA (Multi-object tracking accuracy): A summary of overall tracking accuracy in terms of false positives, false negatives and identity switches.
MT (Mostly tracked): Percentage of ground-truth tracking sequences that were given the same label for at least 80% of their life span.
ML (Mostly lost): Percentage of ground-truth tracking sequences that were correctly tracked for at most 20% of their life span.
FN: number of missed detections.
FP: number of false detections.
IDs: number of times an object ID was switched to a different previously tracked object.
mIOU: mean intersection over union of the detection bounding boxes and the ground-truth bounding boxes.
The first six metrics were used to evaluate the tracking accuracy; the mIOU was used to evaluate the detection accuracy, and is defined as:
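The paper's formula is not reproduced in this text. As a hedged illustration, the sketch below computes mIOU as the plain average of the IOU over matched detection and ground-truth box pairs, and MOTA with the usual MOT challenge definition, 1 − (FN + FP + IDs) / GT, where GT is the total number of ground-truth objects over all frames. The pairing of boxes is assumed to be given, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pairs):
    """Average IOU over (detection_box, ground_truth_box) pairs."""
    return sum(iou(det, gt) for det, gt in pairs) / len(pairs)


def mota(fn, fp, ids, num_gt):
    """Multi-object tracking accuracy from error counts and ground-truth count."""
    return 1.0 - (fn + fp + ids) / num_gt


pairs = [
    ((0, 0, 10, 10), (0, 0, 10, 10)),  # perfect match
    ((0, 0, 10, 5), (0, 0, 10, 10)),   # detection covers half of the object
]
miou = mean_iou(pairs)
score = mota(fn=10, fp=5, ids=5, num_gt=100)
```

Note that MOTA can become negative when the number of errors exceeds the number of ground-truth objects.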
In order to better demonstrate the benefits of our environment perception algorithm, we compared it with other advanced tracking methods. NOMT [35], TDAM [36], STRN [37], and SMOT [38] were compared with our algorithm in terms of tracking performance. The results are shown in Table 2.
Performance of the proposed method on MOT challenge
For MOTA and MT, higher scores indicate better performance, while for ML, FN, FP and IDs, lower scores indicate better performance. From the results in Table 2, we conclude that our algorithm outperforms the other algorithms on MOTA. In particular, the difference with SMOT is larger than 15.4%. In addition, our tracking algorithm performs better on MT, and the difference in MT value compared to TDAM is more than 11.5%. Our tracking algorithm also has fewer missed detections and false detections than the other algorithms, and the number of ID switches is also lower, owing to the fusion of the color histogram and HOG features.
Regarding the evaluation of object detection, we compared Tiny-Yolo fused with stereo vision against traditional Tiny-Yolo, using mIOU as the evaluation standard. For mIOU, higher scores denote better performance. The experimental results are shown in Table 3.
Object detection performance
From Table 3, we see that the mIOU of traditional Tiny-Yolo was 56.5%, while the mIOU of Tiny-Yolo fused with stereo vision was 85.9%, an improvement of about 29.4 percentage points. This means that the detection accuracy is clearly improved by fusing stereo vision.
Real-time performance is critical for online environment perception technology. Improving the performance of the embedded platform can improve real-time performance, but it increases the cost of development and application. Thus, the environment perception algorithms with simultaneous high real-time performance and accuracy can better meet the application requirements of embedded vehicle platforms. In this paper, the experimental platform was a 1.74 GHz GPU machine with 8GB memory and an image resolution of 1280 × 720. Experiments show that the performance achieved using our environment perception algorithm was 30.1 FPS and can meet the real-time performance requirements of vehicle environment perception systems.
Conclusion
In this paper, a novel environment perception algorithm is proposed for online detection and tracking. The proposed algorithm includes two stages: object detection and tracking. We employ a fusion algorithm for object detection based on a deep network and stereo vision, and obtain highly accurate detection results, as stereo vision compensates for the main limitations of deep learning. For object tracking, we employ a tracking-by-detection framework with stereo vision to realize interframe data association through Kalman filter and Hungarian assignment algorithms. Besides IOU, HOG and color-histogram features are also used as matching criteria during the association process. In addition, the fusion of detection and tracking improves the detection accuracy by effectively avoiding sudden missing detections. Experimental results demonstrate that the proposed algorithm achieves superior accuracy and real-time performance. We employ the original Tiny-Yolo as a typical deep learning algorithm for object detection and do not further simplify its network structure to improve the real-time performance of object detection. Further work will be carried out to lighten the network structure using stereo vision to further improve real-time performance.
