Abstract
Real-time vehicle detection is one of the challenging problems for automotive and autonomous driving applications. Object detection using Deformable Parts Model (DPM) proved to be a promising approach providing higher detection accuracy. But the baseline DPM scheme spends 98% of its execution time in loop processing thus highlighting the drawback of higher computational cost for real time applications. In this paper, we have proposed a real time vehicle detection scheme for a low-powered embedded Graphics Processing Unit (GPU). The proposed scheme is based upon DPM approach using CUDA programming with different parallelization and loop unrolling schemes to reduce computational cost of DPM. Three loop unrolling schemes i.e. loosely unrolled, tightly unrolled and hybrid unrolled is proposed and implemented on two different datasets. Finally, we provided an optimal solution for vehicle detection with minimum execution time without having any impact on vehicle detection accuracy. We achieved a speedup of 3x to 5x as compared to state-of-the-art GPU implementation and 30x as compared to baseline CPU implementation of DPM on a low-powered automotive-grade embedded computing platform which features a Tegra K1 System on Chip (SOC), thus getting advantage of improved efficiency through parallel computation of CUDA.
INTRODUCTION
Autonomous driving is one of the major automotive innovations [1]. Autonomous vehicles can move around without any human input by sensing the environment intelligently. Such vehicles come up with multiple techniques and sensors to perceive their surroundings and navigate accordingly to avoid any obstacles. The transition from current-day driving to autonomous driving has already started with Advance Driving Assistance Systems (ADAS) [2] such as adaptive cruise control [3] and lane keeping assistance [4]. The benefits of autonomous driving can be achieved by making the detection systems more intelligent, efficient and perceptive to other vehicles and objects around it. Autonomous driving can be employed for cooperative driving where autonomous vehicles can navigate in coordination with each other by detecting other and maintaining a particular distance. The key benefit of this technology is that vehicles jointly optimize their movements to improve traffic efficiency and safety [5].
Computer vision based object detection is one of the most computationally intensive tasks [1] for vehicle detection systems. Usually, a lot of computational time is involved in real-time object detection also having the constraint of low power consumption. Detection of a single object from a standard 640×480 image requires billions of floating point operations [6]. Previous works on object detection show that the cascaded weak classifier method uses Haar-like features to compare image appearances over small local regions [7, 8]. To improve the performance, robustness and generality of the algorithm strip features [9] and edge lets [10] were developed as an extension of the Haar-like feature. Boosting algorithms like AdaBoost [11] and GentleBoost [12] were used later to select the optimal feature set in a cascade structure to produce efficient and accurate detectors. Boosting algorithms are still used in real time object detections due to their efficiency. The Haar-like feature uses the appearance of the target objects. Therefore, it limits its generality and robustness to intra-class variations [13]. The drawback of Haar-like feature and boosting based method is that it depends on the appearance of the target object which reduces its detection accuracy.
To remove the dependency on target object appearance, sliding window methods were used to create a feature based on the local image gradient information. Histogram of Oriented Gradients (HOG) features [14] based methods are the well-known sliding window method. To calculate HOG features, gradients of magnitude and orientation of the image is computed and the histogram of the oriented gradients is generated to represent the visual information. The HOG feature based method provides a reliable representation of the image providing many states of the art algorithms [15–17]. The HOG descriptor has a few key advantages over other descriptors. Since it operates on local cells, it is invariant to geometric and photometric transformations, except for object orientation. Such changes would only appear in larger spatial regions. The feature of HOG is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy. DPM approach [18] is based on HOG including also the local parts information of the object. DPM further improves the flexibility of HOG by splitting an overall object template into multiple smaller component templates.
DPM is used recently by many researchers for real time vehicle detection [19, 20]. It became popular due to its high detection quality. However, the computational cost of DPM-based object detection remains a major concern. They provided an in depth analysis of DPM and its implementation techniques on different GPU. They obtained a speed up of 10.8x on GTX TITAN implementation. The method provided by [19] is another latest implementation of DPM on GPU. They have achieved 38fps for an image size of 640×480.
Many efforts have been made to improve the computational cost of DPM with different approaches. CascadeDPM [21] method eliminates unpromising hypothesis of DPM efficiently and achieves a speedup of 10x over original DPM implementation. CoarseDPM [22] method uses a hierarchical part based method with coarse to the fine procedure to minimize the cost of the dubious hypothesis. FastDPM method [23] accelerated main steps of DPM and achieved 5x speedup over CascadeDPM having comparable accuracy. The DPM detector provided by [23] uses hierarchical vector quantization, hash table methods, and priority queue to improve the execution speed of DPM. Vanilla DPM [24] method uses GPU based implementation of DPM to detect all 20 categories of PASCAL [29]. GPU DPM [25] method accelerate the computation of DPM using GPU and achieves a speed up of 10.6x over single core DPM implementation. The method provided in [26] presents a fast DPM-GPU implementation using FFT, look up tables and early classification techniques. In real time applications of autonomous driving, the objects need to be detected in order of milliseconds. While DPM is one of the good object detection approaches. The computational cost of the algorithm limits its application to deploy in the real world. The existing approaches tried to improve the computational efficiency in different ways but none of them focused on iterative behavior of DPM which is taking almost 98% of execution time. This give us a motivation to improve the computational efficiency of DPM on embedded GPU by minimizing the iterative behavior of DPM.
In this paper, we presented an optimized GPU based implementation of DPM. We begin with an analysis of traditional CPU implementations to find fundamental performance bottlenecks of DPM-based vehicle detection and provide an optimized implementation to speed up its execution time. We did not change the logic of the actual algorithm hence providing same detection accuracy as the original CPU implementation. The rest of this work is organized as follows. Section II provides an explanation of proposed setup and algorithm framework, section III provides experimental result and discussion, and section IV concludes this paper.
Proposed setup and methodology
The proposed setup consists of a Jetson TK1 which is NVIDIA’s automotive grade embedded Linux development platform featuring a Jetson Tegra K1 SOC, enclosing also a GPU. Jetson TK1 comes pre-installed with Linux4Tegra operating system that is basically Ubuntu 14.04 with pre-configured drivers [27]. The following subsections provide an overview of graphics processing unit, Jetson Terga K1 SOC, deformable parts model and loop unrolling and parallelization schemes proposed in this paper.
Jetson Tegra K1 system on chip
Jetson TKI SOC is a low-power embedded GPU development board. The main reason to choose this embedded platform is that it provides the same architecture and advanced features as a modern desktop GPU, while still using the low power draw of a mobile chip e.g. 1 to 5 watts, which is a requirement in autonomous systems in the automotive industry. The specification of Tegra board is shown in Table 1 [27].
Specification of Jetson Tegra K1 SOC
Specification of Jetson Tegra K1 SOC
Most of the iterative portions of the deformable parts model (DPM) are data independent. Therefore, the execution of the algorithm is highly parallelizable. Graphics Processing Unit (GPU) computation is much recommended for data independent and computationally intensive tasks. Because GPU computation provides maximum performance for data parallel and data intensive tasks by executing the algorithm with a lot of active threads in parallel. Due to this high computational and parallel computing capability of GPU, its importance in scientific applications for computation of highly intensive tasks has attracted a lot of attention. The use of GPU for general purpose compute-intensive task is referred to as General Purpose Graphics Processing Unit (GPGPU) [28]. In this research, we used an embedded GPU platform provided by NVIDIA to implement our proposed scheme for vehicle detection. The default programming language of our proposed embedded development board is CUDA.
CUDA programming and threads
CUDA is a C-based integrated programming language used in GPU accelerated computing environment. This programming language is developed by NVIDIA and is one of the most popular programming languages for GPGPU. Our implementation of DPM on Jetson Tegra K1 also uses CUDA as a programming language. During massively parallel computing for vehicle detection using DPM, thousands of threads are executed in parallel. CUDA employs a hierarchical structure to control a large number of threads simultaneously.
A Grid is the collection of multiple blocks and a block is a collection of multiple threads. Whenever the CPU issues a command to use GPU resources a grid is generated, while blocks and threads in the grid are mapped three-dimensionally. A 3-D representation of a CUDA grid is shown in Fig. 1 for visual illustration. Moreover, under the CUDA programming model, a block with a lot of idle threads may degrade the performance of GPUs. Therefore, the key to maximizing the performance of GPU is to keep a lot of active threads in each block. For this purpose, programmers need to map the computation to CUDA in such a manner that the idle threads on the GPU can be hidden as much as possible [23]. To efficiently utilize the performance of GPU we have tested DPM using different parallelization schemes. We will discuss these parallelization schemes in detail later in this section.

3-D representation of a CUDA Grid.
As mentioned earlier that we have employed Deformable Parts Model (DPM) as the baseline algorithm for vehicle detection from the images. The CPU implementation of DPM is presented by [3]. The algorithm is rewritten for GPU implementation and further optimization. Flow diagram of the DPM algorithm is shown in Fig. 2.
The Fig. 2 shows all the significant steps of DPM. First of all the HOG feature pyramid of the image is calculated. Feature map with twice the resolution is used for part score calculation. Root score is obtained by convolving HOG feature with root filter and the part score is obtained by convolving twice the resolution of HOG feature with each part filter. Then, final score summation is computed for object detection. The implementation method for each significant step is briefly described in the following.

Flow chart of deformable parts model (DPM).
Constructing the HOG pyramid is one of the most computationally expensive steps in the computation of DPM. It is composed of four stages: (1) First of all the input image is resized; (2) secondly, the histogram of oriented gradients of the resized image is computed; (3) In the third step the normalization of the histogram is computed; and (4) In the last feature vector of the image is calculated. All steps of (2), (3), and (4) is applied to each resized image for HOG pyramid generation. The computation is parallelized on GPU by using a massive number of threads.
Root and part score calculation
The root and part scores computation is one of the other time-consuming steps of DPM as schematically depicted in Fig. 2. The procedure for computing root score and the part score is same but part score calculation is more time consuming due to the double resolution of feature pyramid in part score calculation and five number of part filters. Figure 3 shows the pseudo code for computing execution time for a portion of code. The algorithm measures the time of the occasion when control of the compiler starts executing the portion of the code and again it measures the time at the end of the portion of the code. Finally, it subtracts both measured time in milliseconds to compute the total amount of time spent to execute that part of the algorithm.
Distance transforms and scores summation
Distance transform and score summation steps in the DPM also exhibits a high level of parallelism [29]. Therefore, all the loops of the algorithm need to be unrolled and all the iterations should be parallelized on CUDA thread. A structural overview of the pseudo-code for parallelization of unrolled code structure of DPM is shown in Fig. 4.
Parallelization schemes
We proposed the parallelization schemes based on loop unrolling. Loop unrolling is a loop transformation technique that attempts to optimize a program’s execution speed at the expense of its binary size. Loops in the algorithm are rewritten as a repeated sequence of similar independent statements using three type of unrolling schemes. The independent statements in the loop are executed in parallel using three type of parallelization schemes i.e. Loosely Unrolled Parallelization, Tightly Unrolled Parallelization and Hybrid Parallelization. A demonstration of the loop unrolling method we used is shown in Fig. 4. The algorithm is shown in Fig. 4 describes the loosely unrolled parallelization scheme where all the inner loops of the algorithm are unrolled. Each block of input HOG pyramid in each iteration is mapped with one CUDA thread on the GPU. In a single iteration, N number of threads is computed. For loosely unrolled parallelization N is the size of the CUDA grid generated in one CPU iteration.
Loosely unrolled parallelization
Loosely loop unrolling scheme unrolls all possible inner loops of each significant step of the algorithm. The number of inner loops to unroll may depend upon the number of iterations in the nested loop. In loosely unrolled parallelization each block in the HOG pyramid is mapped with one CUDA thread. Due to variation in width and height of the HOG pyramid, there will be many idle threads in each block as shown in Fig. 5. These idle threads limit the maximum capability of GPU reducing the throughput of the algorithm. The variation in the size of the width and height of HOG pyramid and implementation of all CUDA thread in one GPU kernel produces many idle threads on the GPU. For example, for an input image having a resolution of 640×480, we produce a HOG features having 12 pyramid levels. The GPU will produce the entire grid based upon the size of highest resolution level i.e. 640×480. In the first level of HOG pyramid, all the CUDA threads are active. The next HOG pyramid level will be half the size than the original resolution. But the block size on the grid of CUDA is same. Therefore, the 50% of CUDA threads for the second level will be idle and the remaining 50% will be active. For the third level of HOG pyramid, 75% of the CUDA threads will be idle. For the levels beyond this, the number of idle threads will increase. This limits the utilization of GPU resources for efficient computation. This is basically the drawback of loosely unrolled parallelization scheme.
Tightly unrolled parallelization
To incorporate the downside of loosely unrolled parallelization scheme, CPU threads are utilized with CUDA streams. A practical demonstration of this method is shown in Fig. 6. In the pictorial diagram, the left side of the image represents the HOG pyramid of the input image while the right side of the image represents grid on the GPU created by CPU in each iteration.
This scheme allows only cubes of the same size of HOG pyramid to be unrolled together to map on CUDA grid so that idle threads can be removed within a block. For example, for an input image having 12 HOG pyramid levels, each level of the HOG pyramid will execute on different GPU grid. In this parallelization scheme, for each input image pyramid level, CPU executes grid of the same size as that of HOG pyramid level. i.e. the size of the CUDA grid will be same as the size the HOG pyramid in each level. Therefore, none of the CUDA thread in a grid will be idle. All the threads will be active as shown in Fig. 6. We can see in Fig. 6 that due to this unrolling scheme, the utilization of GPU resources is optimized. But the problem in this scheme is that it imposes a large number of CPU threads on the host processor producing an overhead to the CPU. Due to this overhead on CPU, the resulting DPM computation speed degrades and the performance of this implementation scheme is lowered.

Computation of execution time for a segment of code.

Algorithm for unrolling and parallelization algorithm with loosely unrolled parallelization scheme.

Loosely unrolled parallelization scheme.

Tightly unrolled parallelization scheme.
The hybrid parallelization approach uses the performance of both loosely unrolled and tightly unrolled parallelization scheme to balance the trade-off between CPU and GPU workload. This scheme unrolls multiple inner loops relaxing the number of threads on CPU while avoiding many idle threads on the GPU. In this method we generated CUDA blocks in groups of 2n (where n denotes HOG Pyramid level) in a grid according to the depth of HOG Pyramid. Maximum number of idle thread in each grid will not exceed more than 2n in this approach. This concept of mapping minimizes the downsides of both of the other parallelization methods. The method of hybrid parallelization scheme of deformable parts model for vehicle detection is shown in Fig. 7. The left side of the block is describing the HOG pyramid of the input image. While the right side of the figure is showing the grids on the GPU in groups created by a CPU after each iteration. Each group of the grid contains some idle thread also. The transparent threads shown in the figure is representing the idle threads while the solid color shows the active threads in the grid. This method of parallelization does not contain many numbers of idle threads as loosely unrolled parallelization scheme has. The CPU is also busy in iterating the external loops and assigning tasks to GPU. This shows the efficiency of this parallelization scheme as compared to loosely unrolled and tightly unrolled parallelization scheme.
Experimental results and discussion

Hybrid parallelization scheme.
We have implemented DPM for vehicle detection on a low-powered automotive-grade embedded computing platform which features a Tegra K1 System on Chip (SOC), enclosing also a GPU. We have compared the overall results in four possible ways i.e. (1) Multi-Core CPU (2) NVIDIA GTX TITAN [20] (3) Jetson Tegra K1 SOC and (4) Jetson Tegra K1 SOC with optimization. The multi-core CPU implementation is obtained by executing the original implementation of DPM on MATLAB using a core i5 machine. The result of NVIDIA GTX TITAN implementation is given in [20]. While the Jetson TK1 implementation is obtained by implementing DPM using CUDA programming on Tegra TK1 development board and the optimized result is acquired using different parallelization schemes discussed in the proposed methodology section of this paper. We evaluated and compared our implementation with [20] using the same set of images used by [20] as input data and their average execution time is considered as a major performance metric. These images are extracted from PASCAL VOC 2007 [30] challenge dataset. We also evaluated our result using KITTI vision benchmark suit [31] to show the generality of our implementation. Figures 8 and 9 shows empirical detection result of Jetson Tegra K1 SOC for the dataset used by [20] and KITTI dataset respectively. The detected vehicle objects are at different viewing angle and each image shows a different scene environment. Accurate and multiple detections in each image with a different viewing angle of the target object shows the robustness of our implementation. While the detection results in Fig. 8 shows that our detector is not much good for partially occluded vehicles. The possible reason for not detecting the occluded vehicle is that our detector is not trained for the object of occlusion more than 20%. [20]. One of the approaches for improving the detection results for occluded objects is to train a set of occlusion-specific classifiers each with a specific occlusion-level.

Dataset [20] detection results.

KITTI dataset detection results.
The average accuracy and execution time of the algorithm is our key evaluation matrices for comparison. Our contribution in this paper is the reduced execution time of the algorithm without altering the average accuracy and PR curve behavior. The average accuracy for both type of dataset we have used while experimentation is shown in Table 2. The average PR curve behavior of the algorithm before and after optimization is shown in Figs. 10 and 11. The behavior of average accuracy and PR curve for both type of dataset before and after optimization shows that during optimization we did not change the original logic of the algorithm. We only optimized its performance using loop optimization and parallel programming techniques.
Average accuracy of DPM for three types of implementation
Average accuracy of DPM for three types of implementation

Average PR curve using KITTI dataset.

Average PR curve using dataset [20].
First, we make analyses of DPM to find out all significant steps of the algorithm that is the most time-consuming. Experiments show that the below four steps of the algorithm are contributing most amount of time in DPM execution. (1) HOG Pyramid generation (2) Root and Part Score Calculation (3) Distance Transform and (4) Score Summation. The measurement of time contribution of each step generally gives us an idea to focus on the most time-consuming part of the algorithm. The execution time of a specific portion of the algorithm is calculated using the built in functions in MATLAB as well as CUDA programming. A pseudo code of calculating execution time for a portion of code is shown in Fig. 3. The amount of execution time for each portion of code is shown in Fig. 12. The graph shows that HOG pyramid computation is taking the most amount of time after that root and parts score calculation is more time-consuming. The distance transform and score summation takes rest amount of time in DPM execution.
We tested and evaluated the algorithm with each type of parallelization scheme discussed in section 2.4. The effect of different parallelization schemes is shown in Fig. 13. Comparison of the results using each type of parallelization scheme in Fig. 13 shows that the effect of both loosely unrolled and tightly unrolled parallelization schemes are interchangeable. The hybrid parallelization approach outperforms the result in execution time. As discussed in section 2.4., the hybrid approach of parallelization scheme utilizes the performance of both the CPU and GPU to improve the execution time of the algorithm. Our experimentation shows that execution time of the algorithm varies with a change in input image size and change of block size i.e. a number of threads per CUDA block.

Execution time of each significant step of DPM before and after optimization.

Comparison of different parallelization scheme.
The increase in image size requires more computation due to a large number of pixels. Therefore, the execution time of the algorithm increases with increase in image size. Figure 14 shows the effect of increasing image size to the amount of execution time for all the four variants of DPM implementation. The graph shows interesting phenomena. i.e. The execution time of the Multi-Core CPU implementation increases exponentially with the input image size. While for the GPU implementation the graph nature is much smoother. Because, higher the image size, more we can parallelize its execution. Especially, for our optimized Tegra K1 implementation, the graph behavior is much smoother. i.e. for a 6x increase in the CPU implementation, there is only 2x increase for our optimized implementation. Therefore our implementation would contribute more for high-resolution images as compared to the GTX TITAN [20] and CPU implementation.

Execution time w.r.t varying image size.
In Fig. 15 the graph shows execution time w.r.t. a number of threads in each block for all the three variants of DPM implementation on GPU. The number of threads in each block is changed and execution time is calculated. We observed that for a block size of 8×8 our implementation exhibits the best performance. This is because the block size we choose for HOG feature extractions is 8×8 and the direct mapping of HOG pyramid to the CUDA block produces no idle threads on GPU side, utilizing the maximum capability of GPU.

Execution time w.r.t varying block size.
The graph shows that for a block size of more than 8×8, the algorithm consumes more amount of time. The increase in a number of threads in each block of GPU will produce idle threads within a block and reduces the maximum capability of GPU. Therefore the execution time of the algorithm gets increased. Similarly, when the number of threads in each block is decreased, the algorithm will produce an overhead on CPU side requiring most of the iterative calculation from CPU thus, most of the time, the GPU waits for the CPU instruction which may increase the total execution time. So, the best choice of block size for our implementation is 8×8.
The proposed work has resulted in the development of a vehicle detection scheme for a low-powered embedded GPU. Proposed three parallelization schemes for vehicle detection using DPM are implemented on an embedded GPU and proved to be acceptably accurate along with less computational complexity. Our best parallelization scheme is capable of running at 66.6 fps (15 ms per frames) for an image size of 640×480, thus confirming the real-time processing capability of proposed schemes. From a computational point of view, our implementation is better than another state-of-the-art implementation. Despite the very good performance, we need further improvements in computational time for practical use. For hardware devices without shared memory between CPU and GPU, a small speed gain can be achieved by streamlining the process of copying images between the two processing units.
The DPM detector we used in this implementation has limited capability of detecting partially occluded objects. Visual inspection of our detector running over the dataset [20] as shown in Fig. 8 indicate that objects that are more than 50% occluded are not detected. One of the approaches for improving the detection rate of occluded objects is to train a set of occlusion-specific classifiers each with a specific occlusion-level. Extending the current techniques to multi-class handling is of direct use in practical applications. With the addition of multi-class handling, there is an entirely new set of problems and possible solutions for improving eficiency. For example, if our goal is to detect both trafic signs and trafic lights we expect that there is a strong spatial correlation between the two. While our method is originally suggested for one-class detections, we believe it is interesting to generalize the idea to encode multi-class object detection and speed up the overall detection process.
