Abstract
Treating all samples equally is the generic paradigm in 3D object detection. Although some works discriminate between samples when training object detectors, whether a sample detects its target GT (Ground Truth) during training has never been studied. In this work, we first point out that distinguishing the samples that detect their target GT from those that do not is beneficial for improving performance measured in terms of mAP (mean Average Precision). We then propose a novel approach named DW (Detected Weight). The proposed approach dynamically calculates and assigns different weights to detected and undetected samples, suppressing the former and promoting the latter. The approach is simple, computationally cheap, and can be integrated with existing weighting approaches. Further, it can be applied to almost all 3D detectors, and even to 2D detectors, because it is independent of network structure. We evaluate the proposed approach with six state-of-the-art 3D detectors on two datasets. The experimental results show that the proposed approach improves mAP by 0.5–3.8 percentage points (1.6 on average).
Introduction
With the development of 3D sensors and increased requirements for 3D understanding [22], 3D recognition [24], object detection [6,10], and segmentation [16] have become important problems in computer vision. Many applications, such as autonomous driving [26,27], robotics [26–28], and augmented reality [28], depend heavily on 3D understanding [3–5,7,8,11,15,17,20,23,25,29]. 3D object detection is usually the first and crucial step, as well as a basic problem, for these tasks [22]. With advances in convolutional neural networks (CNNs) and deep learning techniques, some works have achieved state-of-the-art 3D detection performance on point clouds or RGB-D images. In this work, we aim to improve 3D object detection performance by assigning different weights to different samples during training.
Most training procedures for 3D detectors generate a massive number of samples. These samples are usually divided into two categories: positive and negative. If the IoU (Intersection over Union) between a sample and its target GT (Ground Truth) box reaches a specific threshold, the sample is positive; otherwise, it is negative. Whether positive or negative, usually only a subset of the samples is selected for training because of limited computing resources. This is usually implemented by assigning weight 0 to the unselected samples and weight 1 to the selected ones. For a selected negative sample, only its classification is trained, because it does not overlap enough with any GT box and is not assigned one. For a selected positive sample, both its classification and its bounding box regression are trained, because it is assigned a GT as its target.
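The positive/negative split and the 0/1 selection weights described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular detector: boxes are taken to be axis-aligned (real 3D detectors usually also handle rotation), and the helper names `iou_3d_axis_aligned`, `label_samples`, and `selection_weights` as well as the default threshold are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def iou_3d_axis_aligned(a: Box, b: Box) -> float:
    """IoU of two axis-aligned 3D boxes (rotation is ignored in this sketch)."""
    inter = 1.0
    for k in range(3):
        lo = max(a[k], b[k])
        hi = min(a[k + 3], b[k + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def label_samples(
    boxes: Sequence[Box], gts: Sequence[Box], pos_thr: float = 0.5
) -> List[Tuple[bool, int]]:
    """Return (is_positive, target_gt_index) per sample; negatives get index -1."""
    labels = []
    for box in boxes:
        ious = [iou_3d_axis_aligned(box, gt) for gt in gts]
        best = max(range(len(gts)), key=lambda j: ious[j]) if gts else -1
        if best >= 0 and ious[best] >= pos_thr:
            labels.append((True, best))
        else:
            labels.append((False, -1))
    return labels


def selection_weights(num_samples: int, selected: Set[int]) -> List[float]:
    """Weight 1 for samples selected for training, weight 0 for the rest."""
    return [1.0 if i in selected else 0.0 for i in range(num_samples)]
```

For example, a sample that covers the lower half of a 2 x 2 x 2 GT box has an IoU of 0.5 with it and is labeled positive at `pos_thr=0.5`, while a sample far from every GT box is labeled negative.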
At inference, 3D object detectors output many detections, each composed of a classification and a 3D bounding box; some of them detect a GT and others do not. A prediction must meet two conditions to be judged as detected: (1) its classification must be correct, i.e., the score of the correct class must be the highest among all classes; (2) the IoU between its 3D bounding box and a GT box must reach a threshold, such as 0.25, 0.5, or 0.7, depending on the dataset or scene. Similarly, the samples generated during training can also be judged as detected or undetected: the samples meeting the two conditions above are detected, and the others are undetected. A detected sample indicates that the detector is already able to detect its target GT correctly and precisely, and an undetected sample indicates that the detector is not yet able to do so. The contribution of the detected samples raises the score of the correct class and the IoU with their target GT boxes. However, to some extent, these improvements do not improve object detection performance, because these samples already detect their target GT. The contribution of the undetected samples is relatively more important, because they are not yet able to detect their target GT. Consequently, the detected samples should contribute less to the detector, and the undetected samples should contribute more.
Based on the analysis above, we propose an approach that assigns small weights to the detected samples and large weights to the undetected samples during the training of 3D detectors. The proposed approach, named DW (Detected Weight), makes detected samples contribute less and undetected samples contribute more. As a result, more GT can be correctly and precisely detected, and object detection performance improves. For the first time, a sample’s weight is computed based on whether it is detected, which differs from existing methods and is the main contribution of this work.
The rest of this paper is organized as follows: Section 2 describes related work on 3D object detection and approaches that assign different weights to different samples; Section 3 describes our approach in detail; Section 4 describes the experiments and analyzes the results; finally, Section 5 concludes our work.
Related work
The main task of 3D object detection is to localize and recognize objects in a 3D scene [14]. Some works have achieved state-of-the-art performance in 3D object detection [19]. In this section, we describe related work on 3D object detection, as well as approaches that assign different weights to different samples.
VoteNet [15] is an end-to-end network that combines deep point set networks with Hough voting. Through the voting mechanism, new points near object centers are generated, grouped, and finally aggregated to generate box proposals. In detail, the input point cloud is passed through a backbone network, and a set of seed points is sampled. Votes are then generated from the sampled seed points to reach object centers. Finally, vote clusters emerge near object centers and are aggregated to generate box proposals. ImVoteNet [14] builds on VoteNet. It is specialized for RGB-D scenes and combines 2D votes in images with 3D votes in point clouds. Both geometric and semantic features are extracted from the 2D images and lifted to 3D using camera parameters. A multi-tower network structure is introduced to balance the information from the 2D and 3D sources and to fully utilize the 2D and 3D features. ImVoxelNet [19] is a fully convolutional method based on posed monocular or multi-view RGB images. Unlike existing detectors that use multi-view inputs only for inference, ImVoxelNet uses multi-view inputs for both training and inference. A voxel representation of the 3D space is constructed and used to detect objects in both indoor and outdoor scenes. This is implemented by choosing between an indoor and an outdoor head that share the same meta-architecture. Final predictions are generated from 3D feature maps. The part-aware and aggregation neural network (Part-A2 net) [21] is based on PointRCNN. It consists of two stages: the part-aware stage and the part-aggregation stage. The former predicts high-quality 3D proposals and precise intra-object part locations. The latter re-evaluates the box and refines its location.
MVX-Net [22] proposes PointFusion and VoxelFusion, two simple and effective early-fusion approaches that combine the RGB and point cloud modalities based on the VoxelNet architecture. SECOND [25] is an improved sparse convolution method that significantly increases the speed of both training and inference. It also proposes a new form of angle loss regression and a new data augmentation approach. The former aims to improve orientation estimation performance, and the latter aims to improve convergence speed and performance.
The method of assigning different weights to different samples has been widely used in the training of object detectors. As mentioned before, selected samples are assigned weight 1 and the unselected samples are applied weight 0 during the training process. The contribution of a bounding box (
Gradient Harmonizing Mechanism (GHM) [9] is a counting-based approach that suppresses the gradients generated by negatives and easy positives. The loss of a sample is suppressed if its gradient norm is similar to the gradient norms of many other samples, as follows:
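As a rough illustration of this counting idea only (it is not GHM's exact formulation and does not stand in for its equation), the sketch below splits gradient norms into equal-width bins and weights each sample inversely to the population of its bin. The assumptions are that gradient norms lie in [0, 1], that ten bins are used, and that weights are normalized by the number of non-empty bins so that they sum to the number of samples; the function name `ghm_style_weights` is hypothetical.

```python
from typing import List, Sequence


def ghm_style_weights(grad_norms: Sequence[float], num_bins: int = 10) -> List[float]:
    """Weight each sample inversely to how many samples share its gradient-norm bin.

    Assumes every gradient norm lies in [0, 1]. Samples in crowded bins
    (many similar gradient norms) receive small weights.
    """
    n = len(grad_norms)
    # Map each gradient norm to a bin index; a norm of exactly 1.0 goes to the last bin.
    bin_of = [min(int(g * num_bins), num_bins - 1) for g in grad_norms]
    counts = [0] * num_bins
    for b in bin_of:
        counts[b] += 1
    nonempty = sum(1 for c in counts if c > 0)
    # Assumed normalization: dividing by the number of non-empty bins
    # makes the weights sum to n.
    return [n / (counts[b] * nonempty) for b in bin_of]
```

With three samples whose gradient norm is 0.05 and one sample whose gradient norm is 0.95, the three crowded samples each receive a weight of about 0.67 and the isolated sample receives a weight of 2.0.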
Detected weight
In this section, we first describe the motivation for DW. We then describe how to calculate the weight of a sample depending on whether it detects its GT, and how to combine the weight with the loss functions of various 3D detectors.
Motivation
As mentioned before, a sample must meet two conditions to be judged as detected: (1) its classification must be correct, i.e., the score of the correct class must be the highest among all classes; (2) the IoU between its 3D bounding box and its target GT box must reach a threshold. Detected samples are already able to detect their target GT correctly and precisely. In contrast, undetected samples are still unable to detect their target GT. The contribution of the former raises the score of the correct class and the IoU with their target GT boxes. However, to a certain degree, these improvements do not improve object detection performance, because these samples already detect their target GT correctly and precisely. In contrast, the contribution of the latter is relatively more important, because they are still unable to detect their target GT. Therefore, the former should contribute less to the detector, and the latter should contribute more. However, most existing methods do not consider this issue and do not distinguish detected samples from undetected samples during training. Although Focal Loss [12] dynamically assigns more weight to hard examples based on the predicted probability for the ground-truth class [13], it only considers classification and does not consider whether a sample detects its target GT, i.e., it cannot consider whether the classification is correct and, at the same time, the IoU reaches the threshold.
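The two detection conditions, and the contrast with a classification-only weight, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, "reaching" the threshold is assumed to mean IoU greater than or equal to it, and `focal_modulating_factor` is the standard (1 - p_t)^γ factor of Focal Loss with the commonly used default γ = 2.

```python
from typing import Sequence


def is_detected(
    scores: Sequence[float], gt_class: int, iou: float, iou_thr: float
) -> bool:
    """A sample is detected if its top-scoring class is the GT class
    and its IoU with the target GT box reaches the threshold."""
    predicted = max(range(len(scores)), key=lambda c: scores[c])
    return predicted == gt_class and iou >= iou_thr


def focal_modulating_factor(p_t: float, gamma: float = 2.0) -> float:
    """Focal Loss modulating factor; depends only on the predicted
    probability p_t of the ground-truth class, not on IoU."""
    return (1.0 - p_t) ** gamma
```

Consider a sample with class scores [0.1, 0.8, 0.1], GT class 1, and an IoU of 0.20 against a threshold of 0.25. The sample is undetected because its box does not overlap enough with the GT box, yet its focal modulating factor is only about 0.04, because the factor looks at classification alone.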
In this work, we consider for the first time whether a sample detects its target GT during training, and propose a novel weighting approach named DW (Detected Weight), which assigns small weights to detected samples and large weights to undetected samples. During training, 3D detectors propose many boxes. The IoU between the proposed boxes and the GT boxes is then computed, and the proposed boxes that have a high IoU with a certain GT box are selected and used as samples, as shown in Fig. 1. In the figure, the green boxes are GT boxes, and the red boxes are the bounding boxes of samples. As can be seen in subfigure (a), the left sample’s bounding box overlaps enough with its target GT box, and its classification is also correct, i.e., the sample detects its target GT correctly and precisely. In contrast, the right sample’s bounding box does not overlap enough with its target GT box and its classification is also incorrect, i.e., the sample does not detect its target GT. Subfigure (b) is the result if we assign the same weight 1 to the two samples in this iteration. As can be seen in subfigure (b), the left sample overlaps more with its target GT box, and the score of its correct class is higher. The right sample also overlaps more with its target GT box, and the score of its correct class is higher. However, its IoU is still lower than the specific threshold, and its classification is still incorrect, i.e., it is still unable to detect its target GT. Subfigure (c) is the result if we assign a small weight to the left sample and a large weight to the right sample in this iteration. As can be seen in subfigure (c), the left sample can still detect its target GT although it is suppressed compared to subfigure (b). The right sample, however, is promoted: its IoU reaches the specific threshold and its classification is correct, i.e., it can detect its target GT.
As a result, more GT can be detected and the performance measured in terms of mAP is improved.

Assigning small weights to detected samples and large weights to undetected samples can detect more GT and improve mAP. Subfigure (a) shows the detections before an iteration. Subfigure (b) is the result if we assign the same weight 1 to the two samples in this iteration. Subfigure (c) is the result if we assign a small weight to the left sample and a large weight to the right sample in this iteration.
For each selected sample, we first judge whether its classification is correct and calculate its IoU with its target GT box. If its classification is correct and its IoU reaches the specific threshold, it is judged as detected; otherwise, it is judged as undetected. For sample i, we calculate its weight as follows:
Here α and β are modulating factors used to adjust the weight,
Here
Here
As can be seen from the formulas above, if i is an undetected sample, i.e., its classification is incorrect or its IoU is lower than the specific threshold, then
We set

The weight dynamically decreases as wi and
Because their network structures differ, existing 3D detectors have their own specific loss functions. These loss functions usually involve many terms, some of which are not directly used to predict 3D bounding boxes, such as the per-point classification or semantic loss, the vote loss, etc. Consequently, we only combine DW with the terms that are directly used to predict 3D bounding boxes, such as the classification loss and regression loss for the bounding box, the classification or regression loss for the direction of the bounding box, and so on. We directly multiply these terms by DW. In this sense, DW can be integrated with existing weights, such as focal loss. Further, DW is independent of the network structure of 3D detectors. Thus, it can be applied to almost all 3D detectors, and even to 2D detectors.
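The combination described above can be sketched as follows. This is an illustration under stated assumptions rather than the modified MMDetection3D code: the function `dw_weighted_loss` and its argument names are hypothetical, the DW values are passed in as precomputed numbers because their formula is not reproduced in this sketch, and the terms are summed without any normalization.

```python
from typing import Sequence


def dw_weighted_loss(
    box_cls_losses: Sequence[float],
    box_reg_losses: Sequence[float],
    selection_weights: Sequence[float],
    dw_weights: Sequence[float],
    other_terms: Sequence[float] = (),
) -> float:
    """Multiply the box-related loss terms of each sample by its DW weight.

    selection_weights are the existing 0/1 (or other) per-sample weights.
    other_terms are loss terms not directly used to predict 3D boxes
    (e.g. a vote loss); they are added unchanged.
    """
    weighted = 0.0
    for cls_l, reg_l, w_sel, w_dw in zip(
        box_cls_losses, box_reg_losses, selection_weights, dw_weights
    ):
        # DW multiplies the existing weight, so the two compose.
        weighted += w_sel * w_dw * (cls_l + reg_l)
    return weighted + sum(other_terms)
```

For instance, with two selected samples whose box losses are 1.5 and 2.5, a placeholder DW weight of 0.5 for the first (detected) sample and 2.0 for the second (undetected) sample gives a box loss of 5.75, so the undetected sample dominates the update.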
Experiments
In this section, we evaluate DW experimentally. As mentioned before, the proposed weight can be applied to almost all 3D detectors. Thus, we use the indoor dataset Sun RGBD and the outdoor dataset KITTI to carry out experiments. The following six state-of-the-art 3D detectors are adopted as baselines: VoteNet [15], ImVoteNet [14], MVXNet [22], Part-A2 [21], SECOND [25], and ImVoxelNet [19]. We compare and analyze their original performance and the performance achieved with DW. Additionally, we carry out parameter selection and ablation studies, and show qualitative results.
Datasets
Sun RGBD is an indoor, monocular RGB-D dataset for 3D scene understanding [14]. It consists of 37 object categories and more than 10k RGB-D images with annotations.
KITTI is a challenging outdoor 3D detection benchmark [30], including 7k training samples and 7k test samples that have images and lidar point clouds. The training samples are further divided into a train split and a val split of nearly 4k images each. It consists of 3 object categories: car, pedestrian, and cyclist.
Baselines and implementation
We apply DW to the selected detectors, and train and test them on the two datasets above. Concretely, the indoor detectors VoteNet and ImVoteNet are trained and tested on Sun RGBD, and the outdoor detectors MVXNet, Part-A2, SECOND, and ImVoxelNet are trained and tested on KITTI, as shown in Table 1.
Detectors selected to carry out experiment
Each selected detector has its own specific loss function. As mentioned before, we only combine DW with the terms that are directly used to predict 3D bounding boxes, such as the classification loss and regression loss for the bounding box, the classification or regression loss for the direction of the bounding box, and so on. To implement DW, we adopt and modify the code of MMDetection3D [1] on GitHub, which is based on PyTorch. We find that VoteNet and ImVoteNet have exactly the same loss function and use the same code to calculate it in MMDetection3D. Thus, we study the terms of the loss function and determine which terms need to be combined with DW and which do not, as shown in Table 2, in which the terms to be combined are marked with ✓ and the terms not to be combined are marked with ×. MVXNet, Part-A2, SECOND, and ImVoxelNet likewise have exactly the same loss function and use the same code to calculate it in MMDetection3D. We determine which of its terms need to be combined with DW and which do not, as shown in Table 3.
Combination of the loss function for VoteNet and ImVoteNet
Combination of the loss function for MVXNet, Part-A2, SECOND and ImVoxelNet
We add or modify only about 20 lines of code for each loss function, so DW is very easy to implement. The additional computation is negligible compared to the total computation for training.
For the original parameters, we follow the setting of MMDetection3D. As mentioned in Section 3.2, we set
Result for different value of β
Sun RGBD has two evaluation protocols: IoU = 0.25 and IoU = 0.50. We adopt IoU = 0.25 so that more samples can be affected by DW. For KITTI, we follow its standard evaluation protocol, IoU = 0.70. The values of it for different detectors are shown in Table 5.
The values of
Experimental results on Sun RGBD and KITTI are shown in Tables 6–8, in which the results with the suffix -DW are achieved with DW, and the results without the suffix are the original results. The original results are taken from MMDetection3D, except for VoteNet and ImVoteNet, which we retrained and tested ourselves because MMDetection3D does not provide AP@0.25 or AP@0.50 for these two detectors. Note that MVXNet, Part-A2, and SECOND are trained and tested on the 3 classes of KITTI, while ImVoxelNet is trained and tested only on car, following MMDetection3D.
As can be seen in the three tables, DW improves mAP by 0.5–3.8 percentage points, and by 1.6 percentage points on average. The effect of DW on mAP is pronounced, especially on Sun RGBD with the evaluation protocol IoU = 0.50. This is because DW suppresses detected samples and promotes undetected samples. Thus, more samples detect their target GT and mAP improves.
Results of AP@0.25 on Sun RGBD
Results of AP@0.50 on Sun RGBD
Results on KITTI
DW involves two critical factors: the classification weight wci and the IoU weight wii. First, we consider only the classification weight wci. That is, for detected sample i, we calculate its weight as follows:
Then we consider only the IoU weight wii and calculate the weight as follows:
As in parameter selection, we perform the ablation studies with VoteNet. Table 9 shows the experimental results of the ablation studies, in which the results with the suffix -DW-c are achieved with equation (9), and the results with the suffix -DW-i are achieved with equation (10). As can be seen in the table, the three results with the evaluation protocol IoU = 0.25 are similar. However, the results with the evaluation protocol IoU = 0.50 differ greatly. Concretely, the results achieved by the approach considering only the IoU weight are almost the same as those achieved by the approach considering both the IoU weight and the classification weight, with the latter slightly better than the former. Both approaches clearly outperform the approach considering only the classification weight. That is, the IoU weight is much more important than the classification weight and plays the major role in DW.
Ablation results with VoteNet on Sun RGBD
We present some representative results generated by SECOND-DW in Fig. 3. For comparison, results generated by SECOND are also shown in the figure. The 3D bounding boxes generated by SECOND-DW are marked in blue in the left subfigures, and those generated by SECOND are marked in green in the right subfigures. As can be seen in the figure, the proposed DW approach detects more objects while also avoiding false detections.

Comparison of SECOND-DW with SECOND. The 3D bounding boxes generated by SECOND-DW are marked in blue in the left subfigures, and those generated by SECOND are marked in green in the right subfigures. The comparison shows that the DW approach detects more objects while also avoiding false detections.
In this work, we point out that distinguishing detected samples from undetected samples is beneficial for improving performance measured in terms of mAP. Consequently, we propose a novel approach that computes and assigns different weights to samples based on whether they detect their GT, suppressing detected samples and promoting undetected samples. The approach, named DW (Detected Weight), is simple, computationally cheap, and can be integrated with existing weighting approaches. Finally, we perform experiments with six state-of-the-art 3D detectors on two datasets. The experimental results show that the proposed approach has a pronounced effect on mAP. For future work, we plan to evaluate DW with state-of-the-art 2D detectors. Additionally, we plan to apply the idea to other computer vision applications, such as image classification and segmentation.
Acknowledgement
The author would like to thank the editors and reviewers for their consideration, comments, and suggestions.
