Abstract
In recent years we have witnessed significant progress in the performance of object detection in images. This advance stems from the use of rich discriminative features produced by deep models and the adoption of new training techniques. Although these techniques have been extensively used in mainstream deep learning-based models, their impact on alternative, computationally more efficient, ensemble-based approaches remains an open issue. In this paper we evaluate the impact of data augmentation, bounding box refinement and multi-scale processing in the context of multi-class Boosting-based object detection. Our experiments show that the use of these training advancements significantly improves object detection performance.
Introduction
Visual object detection is the first step in many relevant Computer Vision problems such as facial landmark detection [1], body pose estimation [2], self-driving cars [3], vehicle type analysis [4], text reading [5], etc. The object detection pipeline can be described in terms of three basic components: 1) image features: Haar-like features, Histograms of Oriented Gradients (HoG) features, Integral Channel Features (ICF), Locally Decorrelated Channel Features (LDCF), Convolutional Neural Networks (CNNs), etc.; 2) classification algorithm: AdaBoost, RealBoost, Support Vector Machines (SVMs), Neural Nets, etc.; 3) detection strategy: sliding window, candidate window proposals and bounding box regression, direct regression from image features (Single Shot Detectors), Hough voting, etc.
Since the publication of the seminal work of Viola and Jones [6], the performance of object detection algorithms has improved by changing several parts of the detection pipeline. Visual features have evolved from being hand-crafted [7] to automatically selected from a pool with Boosting [8], Random Forest [9] or learned with a CNN [10]. The brute force sliding window approach [6, 7, 8] has evolved into fast region proposal algorithms [5, 10] and more recently Single Shot Detectors [11, 12]. The Boosting approach has received much attention because it is computationally very efficient and achieves very good performance in various object detection problems such as pedestrians [13, 14], multi-view faces [15] or multi-view cars [16]. The key to its success is the use of the feature selection capabilities of Boosting together with different pooling strategies [13] and rich image descriptions such as the ICF [8]. Further, object detection has evolved from detecting a single object category to being able to detect multiple categories and different views at the same time [17]. The usual framework for Boosting-based object detection uses binary classification (e.g. AdaBoost). In this regard, inherently multi-class detection problems, such as face [15] or car [16] detection, have usually been solved with binary classifiers: either with a monolithic detector (i.e. Object-vs-Background) or with a detector per object view or positive class (i.e. using
Powered by the use of deep neural nets [18, 19], modern object detection algorithms have achieved remarkable precision [10, 11, 12]. This is due to the use of discriminative features produced by deep models and the adoption of new training strategies. However, these approaches require advanced computational resources, such as Graphics Processing Units (GPUs), to achieve real-time performance. This prevents their use in devices with limited computing power or energy supply, such as aerial vehicles, micro-robots or mobile phones. Hence the need to develop very efficient object detection algorithms, such as those based on Boosting.
One such methodological advance in object detection is the refinement of the bounding box of the detected objects. Classifiers are usually trained with a fixed bounding box size and aspect ratio (AR). The bounding box refinement step is crucial to achieve top performance when dealing with objects whose aspect ratio changes with their pose or configuration. Modern CNN-based detectors such as Faster RCNN [10], YOLO [11] and SSD [12] perform bounding box regression. Also, some of the best results in the KITTI car detection benchmark [20, 21] iteratively adjust the detection bounding box. In Section 4 we introduce a bounding box refinement scheme. The most similar approach to ours was introduced by Juránek et al. [22]. They use a Real AdaBoost binary classifier for car detection. In their work the 3D orientation of the car is estimated using a scheme similar to our bounding box aspect ratio estimation. However, in their solution the bounding box aspect ratio is fixed and depends on the car orientation. This is a limitation, since cars with the same orientation may have different aspect ratios.
Another training improvement is the augmentation of training data by producing realistic image transformations that do not change their labels. In the SSD detector [12] the newly generated data accounts for an 8.8% increase in performance on the VOC 2007 test dataset. It is also a common strategy in the top performing car detectors in the KITTI dataset [21]. An alternative way of using additional data is to pre-train the model on a related but different dataset. All top performing algorithms in KITTI car detection use models pre-trained on the ImageNet classification problem [23, 20, 21]. In the Boosting detection literature it is typical to use geometric augmentation with image translation, rotation [8], horizontal image flip [24] and aspect ratio changes [25].
Recently, theoretically sound results in the context of multi-class Boosting provide new tools to address the unbalanced and asymmetric classification problems arising in object detection [24]. In this paper we endow this algorithm with modern training strategies, such as data augmentation, bounding box regression and multi-scale processing and evaluate the increase in performance achieved with these improvements.
Multi-class boosting algorithm
A Boosting classification algorithm is a supervised learning scheme that builds a binary prediction model by combining a collection of simpler base models or weak learners [26]. It receives as input a set of
BAdaCost: Cost-sensitive multi-class Boosting classification
Cost-sensitive classification endows the traditional Boosting scheme with the capability to modify pair-wise class boundaries. In this way, we can reduce the number of errors between positive classes (e.g. different target orientations) and improve recall when object classes have different aspect ratios. To this end we use BAdaCost [27] (Boosting Adapted for Cost matrix), a recently introduced multi-class cost-sensitive Boosting classifier. In this section we briefly introduce it.
Costs are encoded in a
Let
In a cost-sensitive classification problem each value
The margin,
BAdaCost resorts to the CMELF Eq. (2) for evaluating classifications encoded with margin vectors. The expected loss is minimized using a stage-wise additive gradient descent approach. The strong classifier that arises has the following structure:
where
When building an object detector it is necessary to have a confidence measure or detection score. In BAdaCost the predicted costs incurred when classifying sample x in one of the K classes are given by the vector:
From now on, in multi-class detection problems, we assume that the background (negative) class has label
This score has desirable properties for detection problems: 1)
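The effect of a cost matrix on the final decision can be illustrated with a generic sketch. This is not the BAdaCost margin-vector formulation: it assumes that class probability estimates are available and applies the standard minimum expected cost rule. The function name, the three-class setup, the probabilities and the cost values are all hypothetical.

```python
import numpy as np

def min_cost_class(probs, cost):
    """Return the label with the lowest expected cost.

    probs[i]   : estimated probability that the sample belongs to class i
    cost[i, j] : cost of predicting class j when the true class is i
    """
    expected = probs @ cost  # expected[j] = sum_i probs[i] * cost[i, j]
    return int(np.argmin(expected)), expected

# Hypothetical problem: 0 = background, 1 = frontal view, 2 = side view.
probs = np.array([0.10, 0.50, 0.40])

# With 0-1 costs the rule reduces to picking the most probable class.
zero_one = 1.0 - np.eye(3)
label_01, _ = min_cost_class(probs, zero_one)

# Hypothetical asymmetric costs: labelling a true side view as frontal
# is penalised five times more than any other error.
asym = np.array([[0.0, 1.0, 1.0],
                 [1.0, 0.0, 1.0],
                 [1.0, 5.0, 0.0]])
label_asym, _ = min_cost_class(probs, asym)

print(label_01, label_asym)
```

With these illustrative numbers the asymmetric costs move the decision away from the most probable class, which is the kind of pair-wise boundary shift that cost-sensitive classification is meant to provide.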
Data augmentation
One of the main problems when training an object detector is the limited amount of training data, which can cause the classifiers to overfit. The solution adopted in the literature is to generate new synthetic data from the training set. This is known as data augmentation (DA).
Sample KITTI photometric data augmentation. In the first column we display the original training image, followed by 8 generated images.
We model the object to be detected as multiple positive classes depending on the orientation, see Fig. 6. To augment and balance our dataset we increase the number of images in the small classes. To this end we sequentially apply a combination of basic photometric changes to the image RGB values (see Fig. 1). We start by generating a random number
Brightness change, Contrast change, Saturation change, Hue Change.
Otherwise:
Brightness change, Saturation change, Hue Change, Contrast change.
These photometric changes are implemented as:
Brightness change. Generate a random number
Contrast change. Generate a random number
Saturation change. Generate a random number
Hue change. Generate a random number
As we show in Section 6.1 the generation of synthetic data is crucial to achieve a 5.5% improvement in detection results (see Table 2, Moderate). In our detection experiments some views have fewer examples. The use of photometric augmentation allows us to balance the dataset. Further, it allows us to increase the classifier capacity with more and deeper decision trees. This is consistent with similar results in the problem of pedestrian detection [25].
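The photometric pipeline described in this section can be sketched as follows. This is an illustration under several assumptions, not the implementation used in the experiments: the order-selection threshold (0.5), all parameter ranges and the function names are hypothetical, and the hue change is approximated by rotating the chroma plane in YIQ space rather than by an explicit HSV conversion.

```python
import numpy as np

# RGB -> YIQ matrix; rotating the (I, Q) chroma plane approximates a hue shift.
_RGB2YIQ = np.array([[0.299, 0.587, 0.114],
                     [0.596, -0.274, -0.322],
                     [0.211, -0.523, 0.312]])
_YIQ2RGB = np.linalg.inv(_RGB2YIQ)

def _gray(img):
    return (img @ _RGB2YIQ[0])[..., None]  # luma, shape (H, W, 1)

def brightness(img, rng, max_delta=0.2):   # assumed range
    return img + rng.uniform(-max_delta, max_delta)

def contrast(img, rng, lo=0.7, hi=1.3):    # assumed range
    mean = _gray(img).mean()
    return (img - mean) * rng.uniform(lo, hi) + mean

def saturation(img, rng, lo=0.7, hi=1.3):  # assumed range
    g = _gray(img)
    return (img - g) * rng.uniform(lo, hi) + g

def hue(img, rng, max_deg=15.0):           # assumed range
    t = np.deg2rad(rng.uniform(-max_deg, max_deg))
    rot = np.array([[1.0, 0.0, 0.0],
                    [0.0, np.cos(t), -np.sin(t)],
                    [0.0, np.sin(t), np.cos(t)]])
    return img @ (_YIQ2RGB @ rot @ _RGB2YIQ).T

def photometric_augment(img, rng):
    """img: float RGB array with values in [0, 1], shape (H, W, 3)."""
    if rng.random() < 0.5:  # assumed threshold for choosing the order
        ops = (brightness, contrast, saturation, hue)
    else:
        ops = (brightness, saturation, hue, contrast)
    out = img
    for op in ops:
        out = np.clip(op(out, rng), 0.0, 1.0)
    return out

rng = np.random.default_rng(0)
image = rng.random((48, 84, 3))
augmented = [photometric_augment(image, rng) for _ in range(8)]
```

Because none of the operations moves pixels, the bounding box and view label of each generated image are those of the original, so the generated images can be added directly to the under-represented view classes.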
Bounding box refinement
As we show in our experiments, refining the bounding box estimation increases the performance of our detector. To this end we modify the multi-class Boosting classifier introduced in Section 2.
The classifier learns
On each tree node
where Then we classify with the minimum cost rule on each leaf node
where
During the training phase of the baseline BAdaCost algorithm, for every decision tree and leaf node
Bounding box aspect ratio estimation. We learn an ensemble of multi-class cost-sensitive trees. We estimate the AR distribution using all the trees from 
The baseline BAdaCost detector computes the mean AR of each class from the ground truth training data bounding boxes in a vector Our new procedure to estimate the aspect ratio follows a similar approach to the computation of
We correct each window keeping the height constant and adapting the AR.
Note that in option B we may exclude the first
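The window correction step can be sketched as below. Two conventions are assumed because they are not stated in the surviving text: the aspect ratio is taken as width divided by height, and the corrected window stays horizontally centred on the original one. The function name is illustrative.

```python
def adjust_aspect_ratio(box, ar):
    """Resize a detection window to aspect ratio `ar`, keeping its height.

    box : (x, y, w, h) with (x, y) the top-left corner
    ar  : target aspect ratio, assumed to be width / height
    """
    x, y, w, h = box
    new_w = ar * h                 # height is kept, width follows the AR
    cx = x + w / 2.0               # assumed: keep the horizontal centre
    return (cx - new_w / 2.0, y, new_w, h)

# A window detected with the base AR of 1.75 and re-shaped to a
# hypothetical estimated AR of 1.0 (e.g. a near-frontal view).
detected = (100.0, 50.0, 84.0, 48.0)
corrected = adjust_aspect_ratio(detected, 1.0)
print(corrected)
```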
Multi-scale detection. We apply classifiers trained with different base scale (i.e. blue, green and red rectangles) to detect objects with various dimensions.
Car parts and location of the features selected by the detector.
Location of features selected by different car detectors. Top, middle and bottom images show respectively the LDCF features of the detector trained with 48 
Multi-scale detection
The classifiers used in object detection are typically learned with a fixed scale. Both Boosting-based [8, 30, 15, 13, 25, 24] and CNN-based [10, 11] detectors are trained with a fixed base image size. However, objects of the same class at different sizes usually need different features to be detected. CNN-based detectors overcome this problem using feature maps extracted from layers with different resolutions [12, 31]. Boosting-based detectors apply a fixed size classifier in a sliding window over a pyramid of feature maps extracted at decreasing image resolutions. This enables the detection of objects at different sizes. The detection rate also improves when using a classifier trained at several base sizes [32, 16].
In this paper we use a multi-scale (MS) detection approach training different BAdaCost detectors at different base sizes. For the KITTI car dataset, our multi-scale approach trains three classifiers with 1.75 aspect ratio: 48
As we show in Fig. 5, features selected at different scales show better definition of interesting areas as the resolution increases. In the car problem, these features are mainly around the rear lights and the edges that define the car shape at different orientations (see Fig. 4). In Section 6 we show that we achieve a significant improvement in recall using detectors trained at different scales.
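The relation between the base size of a sliding-window classifier and the object sizes it can reach through the image pyramid can be sketched as follows. The sketch assumes a pyramid without upsampling and eight scales per octave, and uses a 1242 x 375 image; these values, the 96-pixel base size and the function names are illustrative assumptions, not the configuration of the experiments.

```python
import math

def pyramid_scales(img_h, img_w, model_h, model_w, n_per_oct=8):
    """Scales s <= 1 at which the resized image still contains the window."""
    max_shrink = min(img_h / model_h, img_w / model_w)
    if max_shrink < 1.0:
        return []  # the window does not fit even at full resolution
    n = int(math.floor(n_per_oct * math.log2(max_shrink))) + 1
    return [2.0 ** (-i / n_per_oct) for i in range(n)]

def detectable_heights(img_h, img_w, model_h, model_w, n_per_oct=8):
    """Object heights, in original-image pixels, matched at each scale."""
    scales = pyramid_scales(img_h, img_w, model_h, model_w, n_per_oct)
    return [model_h / s for s in scales]

# Base size 48 x 84 (AR 1.75) versus a hypothetical larger base size.
small = detectable_heights(375, 1242, 48, 84)
large = detectable_heights(375, 1242, 96, 168)
print(min(small), min(large))
```

Under these assumptions a classifier cannot respond to objects smaller than its own base size, so a small base size is needed for distant objects, while a larger base size sees nearby objects at a resolution where finer features are available.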
Experiments
To run our experiments we added BAdaCost’s cost-sensitive decision trees and multi-class detection to Piotr Dollar’s Matlab Toolbox.1 Our implementation is publicly available at
In the experiments we use detection problems where the Bounding Box AR of the target object changes with the view angle. Car and cyclist detection in the KITTI dataset [34] are good examples of this kind of problem. The database presents three subsets: easy, moderate and hard (easy
For data augmentation we use the following parameters (see Section 3):
Car detection experiments
The car classes we use in our experiments. Note the relation between classes, i.e. car view orientation, and bounding box aspect ratios.
The cost matrix, C, used in the KITTI car dataset experiments.
Here we improve the results in our previous work [36] by increasing the number of view classes (
During training, we perform 4 rounds of hard negatives mining with the KITTI training image subset (KITTI-train90). The best number of cost-sensitive trees is
The cost matrix is set to weigh up gross errors between view classes. This is important because estimating the wrong class will produce a Bounding Box with the wrong AR (e.g. frontal car,
AP
Aspect ratio AP results depending on the first tree used in the estimation.
Precision-recall curves and (AP
First, we train only one detector with the mean aspect ratio of each view class stored in the tree leaves. We set the detection threshold to an intersection over union (IoU) value of 0.7, as established in the KITTI benchmark. Then, we use different AR estimation strategies. First we test the AR estimation algorithm introduced in Section 4 with different values of
Second, we compare different AR estimation strategies (we use here
To further analyze the performance of our procedure (Estimated-AR) with respect to the baseline (Fixed-Class-Mean), we have performed an additional experiment varying the IoU threshold (see Table 1). The good behavior of our approach is more evident when we look for higher overlapping in the detection. With a minimum required IoU
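The sensitivity of the IoU criterion to the aspect ratio of the detection can be illustrated with a short sketch. The box coordinates are hypothetical and the helper names are illustrative; boxes are given by their corner coordinates.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_match(detection, ground_truth, threshold):
    """A detection counts as correct when its IoU reaches the threshold."""
    return iou(detection, ground_truth) >= threshold

# Hypothetical ground truth of 100 x 50 pixels (AR 2.0) and a detection
# with the same centre and height but a wrong AR of 1.2 (60 x 50 pixels).
ground_truth = (0.0, 0.0, 100.0, 50.0)
detection = (20.0, 0.0, 80.0, 50.0)

print(iou(detection, ground_truth))
print(is_match(detection, ground_truth, 0.7))
print(is_match(detection, ground_truth, 0.5))
```

In this example a perfectly centred detection with the correct height is still rejected at an IoU threshold of 0.7 only because its aspect ratio is wrong, whereas it would be accepted at a looser threshold of 0.5.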
Car detection ablation study in KITTI’s validation set (KITTI-train10)
Key: stronger regularization (I), Bounding Box estimation (BB), Data Augmentation (DA), and multi-scale detection (MS), average precision with
Precision-recall curves and (AP
In an ablation analysis shown in Fig. 10 and Table 2, we evaluate different detector configurations and use synthetic data to enhance even further the performance of our detector. Data augmentation allows us to avoid overfitting and increase the Boosting classifier capacity with deeper decision trees using a finer quantization in car orientation. We use
To further improve the AP of the detector we have to take into account that a car at different resolutions may not need the same features. Therefore, we train three detectors at different base sizes: 48
Car detection comparison experiment in KITTI-testing evaluation server
Best BAdaCost results are shown in bold and overall best results in italic.
Car detection comparison in KITTI-testing using the evaluation server. We report (AP
Since the KITTI dataset is not very large, the best CNN-based detectors use a VGG network [37] pretrained with ImageNet [38] and finetuned with KITTI. Thus, they are not fully comparable to the Boosting methods trained only with KITTI. Moreover, there are various methods that use LIDAR or stereo image pair data. However, we are only interested in algorithms comparable with ours. In Fig. 11 and Table 3 we show the results of the top three detectors that use a single RGB image: TuSimple [23], DeepManta [20] and RRC [21]. The reason for their success is the use of convolutional features adapted to the problem at hand. However, the need for a powerful GPU can be a deciding factor in engineering applications that run on resource-limited devices. In these cases, the Boosting algorithms presented in this paper represent a computationally efficient alternative.
We have also compared the execution time of RRC [21], one of the leading deep learning-based car detectors, with our Boosting-based detector on a CPU. We have compiled the implementation of RRC3 with OpenBLAS support4, which adds thread-based parallelism to matrix operations. On average RRC takes 82 seconds to process an image, with 100% CPU usage (Intel Xeon E5620 2.4 GHz). Such long processing time is caused by the large size of KITTI images (1242
Finally, we compare our approach with other Boosting algorithms in KITTI testing (see Table 3): SubCat [16], Regionlets [17] and spLBP [39]. We use LDCF features [33], like SubCat, whereas Regionlets and spLBP benefit from a more discriminative set of features. However, we cannot fairly compare the Regionlets result with ours, since the KITTI experiment is not described in the original paper. Similarly, the spLBP paper does not describe the images used for training their test result. With all this in mind, with respect to Regionlets, our method has better precision up to a recall of 0.7 (see Fig. 11, Moderate). Compared with SubCat, our approach is better in the Easy subset and with recall values below 0.7.
Features selected by BAdaCost in the cyclists detection problem (top left) and sample training images (rest). Note that most of the features are around the rider and the wheels.
Cyclist detection experiments
For completeness, in this section we evaluate the proposed technique on cyclist detection in the KITTI dataset. As for cars, the shrinkage is 0.05 and each weak learner is trained with a fraction of 1/32 of the features. There are only 1027 cyclist examples in the KITTI-train90 subset. Given the few available training images, the number of view classes is set to
Cyclist detection ablation study
Key: Stronger regularization (I), Bounding Box estimation (BB), Data Augmentation (DA), and multi-scale detection (MS), average precision with
Cyclist detection experiment comparison in KITTI-testing evaluation server
Best BAdaCost results are shown in bold and overall best results in italic.
Precision-recall curves and (AP
Again, we train one detector with the mean aspect ratio of each view class stored in the tree leaves. We use KITTI-train90 for training and KITTI-train10 for testing. We set the detection threshold to an IoU of 0.5, as established in the KITTI benchmark for cyclists, which demands less bounding box precision than the 0.7 used for cars. We therefore expect the bounding box adjustment procedure to have less impact on this detection problem.
We show in Fig. 12 sample training data and the selected features. In Fig. 13 and Table 4 we display the result of this final set of experiments. Our detections deteriorate if we include the bounding box adjustment procedure, “BAdaCost
Finally, to improve AP
As in the car case, the best CNN-based cyclist detectors that use only RGB images [21, 40, 23] use a VGG network pretrained on ImageNet and finetune it with the KITTI data. This makes their results not fully comparable to those of Boosting methods trained only on the KITTI dataset. In the cyclist case, because of the lack of training data, almost all methods take advantage of the available LIDAR measurements and/or stereo pairs. We use only RGB image data. In Table 5 we show the results of the three best published cyclist detectors on KITTI testing using only RGB information: RRC [21], TuSimple [23] and SubCNN [40].
In Table 5 we also include Regionlets [17], the top Boosting result in the KITTI evaluation server. However, we cannot fairly compare this result with ours, since the KITTI experiment is not described in the paper.
In summary, our results in this section support the use of new techniques borrowed from the deep learning literature for training Boosting based detectors.
Conclusions
Detection algorithms have evolved over time by changing various components of their pipeline. Some of these improvements, however, have been exploited only in the context of modern deep neural nets. In this paper we have improved the performance of Boosting-based detectors by using data augmentation, refining the detection bounding box, and using multi-scale processing. This is a relevant result in the construction of detectors, given the computational efficiency of this family of algorithms.
In the experiments we show that Boosting-based approaches significantly improve their performance with respect to the multi-class baseline Boosting scheme [24] when using the new training techniques. We achieve an improvement of 7.1% and 5% respectively in the AP
In spite of the remarkable accuracy achieved by deep learning-based detectors, we believe there is still room for research in approaches based on more efficient traditional machine learning techniques and their application in different problems [41]. In the future we will investigate better box regression algorithms, the use of efficient and more discriminative image features and detection strategies better than the brute force sliding window.
Footnotes
See BdCost
Acknowledgments
This work was funded by the Spanish Ministerio de Economía y Competitividad, project TIN2016-75982-C2-2-R. José M. Buenaposada was also partially funded by the Comunidad de Madrid project RoboCity2030-DIH-CM (S2018/NMT-4331).
