Abstract
Detecting vehicles at night is critical to both driver assistance systems and autonomous driving systems. In this paper, we propose a deep network scheme, assisted by light information and with good generalization, to detect vehicles at night. Our approach is divided into two branches, the object stream and the pixel stream. The object stream generates a batch of bounding boxes, and the pixel stream uses vehicle light information to calibrate the bounding boxes of the object stream. In the object stream, we propose a new structure, Direction Attention Pooling (DAP), to improve the accuracy of the prior boxes. DAP introduces an attention mechanism. The feature maps obtained from the backbone network are fed into two branches. One branch obtains direction perception information through an IRNN layer, and the other branch learns attention weights. The weights are multiplied with the direction perception features in an element-wise manner. In the pixel stream, we propose a corner localization algorithm based on Bayes' theorem to obtain more accurate corners from the vehicle light pixels. The locations of the corners are treated as a discrete random variable. Given the mask of the object, we solve for the probability distribution of the object's corner, and the corner with the highest probability is taken as the correct corner. On the nighttime vehicle detection datasets CHUK and SYSU, our method achieves accuracies of 97.2% and 96.86%, outperforming other state-of-the-art methods by at least 0.31% and 0.34%.
Keywords
Introduction
Intelligent Transport Systems (ITS) aim to achieve optimum traffic efficiency by avoiding traffic problems to the extent possible. Automated systems fueled by Advanced Driver-Assistance Systems (ADAS) have been shown to reduce road fatalities by minimizing human error [1]. Although traffic volume is lower at night, statistics from the U.S. National Highway Traffic Safety Administration (NHTSA) show that 52% of fatal crashes occur after dark [2]. Furthermore, several studies support the conclusion that collisions are generally more severe at nighttime than during the day [3, 4]. Nighttime vehicle detection is therefore a major area of interest within ITS. In this paper, we introduce a new deep network scheme that aims to improve the accuracy of vehicle detection at night.
There are a number of mature image feature extraction approaches for daytime scenes. An early method is the scale-invariant feature transform (SIFT) [61]. Its features are invariant to scale and rotation and remain stable when the illumination and the 3D camera viewpoint change, so they are suitable for different images or scenes. The subsequent speeded-up robust features (SURF) algorithm [62] simplifies the Gaussian second-order differential in SIFT so that the convolution smoothing operation reduces to additions and subtractions, which gives SURF good robustness and low time complexity. Other subsequent algorithms are also widely used, such as histogram of oriented gradients (HOG) [5], local binary patterns (LBP) [6], and the deformable parts model (DPM) [7]. However, nighttime images have low contrast between background and object, which increases the difficulty of object detection. In addition, low luminosity obscures some vehicle features (e.g., edge, color, and shape). Without enhancement suited to nighttime images, traditional image feature extraction methods designed for daytime images do not perform well in nighttime scenes [8–10]. Therefore, to achieve good performance, additional auxiliary information should be added during training.
Daytime object detection methods rely heavily on the color information, gradient features, and texture features of objects. At night, however, these visual features are weakened, and the most conspicuous areas are the bright areas, mainly the car light areas. It is therefore necessary to assist nighttime vehicle detection with vehicle light saliency information. At present, some night vehicle detection methods are based on light detection [11–13]. However, this kind of method may miss vehicle lights or capture street lights and traffic lights in complex scenes, which demonstrates the difficulty of accurate vehicle light detection. To solve this problem, Chen et al. [10] proposed a strategy to distinguish vehicle lights from street lights. Specifically, they use the Nakagami [14] image to locate vehicle light areas and an RPN (Region Proposal Network) based on CNN (Convolutional Neural Network) feature maps to obtain vehicle object proposals. The two results are then combined to generate the RoIs (regions of interest). We observe the fusion segmentation image of the Nakagami [49] image and the HSI (Hue-Saturation-Intensity) image and find that the vehicle light areas are very salient. Vehicle light information is very important in our work, so we use the fusion segmentation image of the Nakagami [49] image and the HSI image as the instrument for obtaining vehicle light information to assist nighttime vehicle detection. A similar idea, in which the result of one image processing step is used to fine-tune another, exists in the image dehazing task. Singh et al. [67] proposed a gradient profile prior to remove haze from remote sensing images. The coarsely estimated atmospheric veil is refined by a gradient-based guided image filter, and transmission map estimation is combined with atmospheric light estimation to obtain the final image. In follow-up work, they refined the transmission map by developing a local activity-tuned anisotropic diffusion based filter [68] and by using a guided L0 filter [69].
Nighttime vehicle detectors based on nighttime image enhancement methods and SVM (Support Vector Machine) classifiers have achieved state-of-the-art results [8, 15]. On the SYSU nighttime vehicle detection dataset, the Combining Extraction method [8] achieves an accuracy of 94.71%, which is 0.33% higher than the second best method. On the CHUK nighttime vehicle detection dataset, the Tensor Decomposition method [15] achieves an accuracy of 95.48%, which is 1.41% higher than the second best method. These methods are traditional computer recognition methods and have the following limitations. First, they require a process of feature extraction and representation, after which the features are fed into a learning algorithm to train the classifier. Manual feature design requires a lot of experience, and the quality of the feature engineering affects the quality of the learned classifier. Second, a reasonable classifier must be chosen. It is very difficult to satisfy both requirements well enough to achieve the best effect. Image feature extraction based on a CNN structure avoids the design of complex manual features. The experimental results of frameworks such as Faster R-CNN [16], YOLO [17] and SSD [18] show that CNNs can achieve good performance in object detection. Table 1 shows an attribute comparison with the CNN methods. However, in the night vehicle detection task, using only CNN feature extraction combined with an SVM classifier cannot achieve better results than manual feature extraction [8]. In this paper, we propose a new deep network that is suitable for night vehicle detection tasks and improves the accuracy of night vehicle detection.
Attribute comparison with the CNN methods
CornerNet [19] is a classic object detection network that balances speed and accuracy and does not require a large number of anchor hyperparameters. An improvement based on CornerNet is therefore suitable for our nighttime vehicle detection task, which has high requirements for accuracy and speed. In addition, the detection box is rectangular, and the corner positions capture the information of the rectangle. CornerNet is an anchor-free object detection method that detects an object bounding box as a pair of keypoints: the top-left corner and the bottom-right corner. Fig. 1 illustrates the overall pipeline of our approach.

Architecture of nighttime vehicle detection network. A convolutional backbone network applies Direction Attention Pooling to output two corner heatmaps. Similar to CornerNet, a pair of detected corners and the similar embeddings are used to detect a potential bounding box. The instance mask obtained from fusion image segmentation is used for refining the corners by the Bayes corner localization algorithm.
In the training phase, RGB images are fed to the object stream and the pixel stream respectively. In the object stream, the features are extracted by the backbone network and then pass through a special pooling layer, Direction Attention Pooling. The network outputs a heatmap for all top-left corners, a heatmap for all bottom-right corners, and an embedding vector for each detected corner. A potential bounding box is obtained by matching the embedding vectors of the top-left corners and the bottom-right corners. In the pixel stream, the HSI segmentation image and the Nakagami segmentation image obtained from the RGB image are fused to obtain the segmentation image of the vehicle lights. An accurate bounding box must contain all pixels of the light mask, so pixel-level information has the potential to benefit the detection task. To this end, we develop a formulation for corner localization based on Bayes' theorem. In this formulation, we use the vehicle light segmentation from the pixel stream to select a more accurate bounding box for each object.
The contributions of this paper are summarized as follows:
1) Unlike the previous methods [8, 15], which extract manual features from the image and then use an SVM to classify them, we use a deep network to detect vehicles at night. A new special pooling layer, Direction Attention Pooling, is proposed to determine the corner points more accurately.
2) We use the vehicle light segmentation image to fine-tune the bounding boxes of nighttime vehicle detection.
The rest of this paper is organized as follows. Related work is described in Section II. The details of the proposed nighttime vehicle detection approach based on deep network including network structure, loss function and corner localization algorithm based on Bayes theorem are introduced in Section III. The experiment results are shown and discussed in Section IV. Finally, Section V provides the conclusions and discusses future work.
Early night vehicle detection was based on the detection of headlights and taillights [20], which are the most salient areas of a night image. Pham et al. [72] segmented the bright blobs in the captured image that may be vehicle lights and then proposed a machine learning-based method to classify whether the bright blobs are headlights, taillights, or other illuminant objects. The detected vehicle lights are subsequently tracked to further facilitate the determination of the vehicle position. Following the common pipeline [20], methods for vehicle light detection are divided into four parts: blob detection, feature extraction, light pairing, and classification. Antonio López et al. [12] used an image sensor to obtain different appearance features, which were used as the input of a classifier-based module. For each candidate blob, the module generates a similarity to vehicle lighting to determine whether the blob is a vehicle light. Eum et al. [21] integrated the partial region of the auto-exposure (AE) image confined by the lane detection information with the low-exposure (LE) image. This approach improves the detection of distant light blobs. VeDANt [22] employed AdaBoost cascade classifiers to detect hypothesis blobs using gray-scale images only. The Nakagami image combined with an HSI segmentation method has been used to locate vehicle light areas [10]. Blob feature extraction is used for light pairing and classification [23–25]. Through these blob feature extraction methods, one can obtain the luminance, appearance, geometric and other attributes of blobs. Pairing is used to distinguish vehicle light areas from other luminous areas. Andrea Fossati et al. [26] proposed a pair filter to filter out pairs of candidate light blobs that cannot belong to a vehicle. Yen-Lin Chen et al. [27] grouped the luminous regions with a projection-based spatial clustering method to obtain potential paired vehicle lights. Classification is used to distinguish between vehicle lights, other light sources, and background. The most commonly used methods are boosting [12] and support vector machines [23].
Instead of transforming nighttime vehicle detection into vehicle light detection, some methods use effective feature extraction and machine learning to detect the whole vehicle at night directly. Hulin Kuang et al. [8] proposed a nighttime vehicle detection approach that combines vehicle light areas obtained from the Nakagami segmentation image and object proposals from Edge Boxes [28] with nighttime image enhancement based on an improved MSR (Multi-Scale Retinex, an image enhancement method). In subsequent research [9], they developed a nighttime image enhancement method inspired by the retinal mechanism, together with a weighted feature fusion technique. A later study [29] proposed a Bayes saliency-based object proposal generator for nighttime images to generate an exact set of proposals that are more likely to be the locations of the vehicles. In their latest study, Hulin Kuang et al. [15] presented a vehicle detection system based on tensor decomposition that focuses on four types of vehicles. They selected proposals by feature ranking after tensor decomposition and extracted features only from the selected proposals. For the night vehicle detection task, Hossein Tehrani et al. [30] optimized the visible and latent structure of the detected object when using deformable part models (DPM) to capture salient features.
Since the development of deep learning, CNNs have shown good performance in object detection tasks, and a series of excellent frameworks have emerged. Examples include the two-stage detection networks RCNN [31], Faster R-CNN [16], FPN [32], Cascade R-CNN [33], and CBNet [73], and the single-stage networks YOLO [17, 35], SSD [18], EfficientDet [36], and DETR (DEtection TRansformer) [74]. One-stage detectors place anchor boxes densely over an image and generate final box predictions by scoring the anchor boxes. The anchor box structure has two disadvantages: first, the number of anchor boxes is large, which easily causes an imbalance between positive and negative samples; second, it introduces many hyperparameters, such as quantity, size, and aspect ratio. Therefore, anchor-free methods [19, 75], which balance accuracy and speed, began to emerge. In the field of autonomous driving, CNN feature extraction has achieved good performance [40–42] in the daytime vehicle detection task on the KITTI [43] dataset. For example, Recurrent Rolling Convolution (RRC) [63] enhances single vehicle detection under occlusion, and Fan Yang et al. [64] proposed Scale Dependent Pooling (SDP) to improve the accuracy of proposals and Cascaded Rejection Classifiers (CRC) to eliminate negative samples and thus improve detection accuracy. Feeding only CNN features into an SVM to train a nighttime vehicle classifier [31] is also effective. The latest innovative methods use a GAN (Generative Adversarial Network) structure. Shao et al. [70] proposed an FTE (feature translate-enhancement) module based on CycleGAN and an OD (object detection) module to improve the accuracy of vehicle detection at night. Long Chen et al. [10] proposed a method that obtains nighttime vehicle proposals by fusing the light regions with the proposals produced by an RPN and then classifies them to get vehicle bounding boxes. In subsequent research, Miao et al. [71] combined the MSR image enhancement method with YOLOv3 to improve the accuracy of night object detection. Following CornerNet [19] and [10], we use vehicle light information to assist network training. However, a problem arises when this advanced object detection network is applied to our nighttime vehicle detection task: when detecting a single class of objects, repeated detections occur. Specifically, when the correct objects are detected, a larger bounding box covering them is also produced; an example is shown in Fig. 2. To address this problem, we design improvements in both the object stream and the pixel stream.

Repeated detection. Red indicates the correct bounding boxes, and blue indicates the repeated bounding box.
Our method takes RGB images as input, which are passed to the object stream and the pixel stream respectively. The object stream detects the candidate regions with a convolutional network, which predicts the heatmaps of the corner locations of the candidate regions. In detecting the heatmaps, we draw on the attention mechanism. Its core goal is to select, from a large amount of information, the information that is most critical to the current task, which agrees with our goal of finding the key points in the picture. The attention mechanism is also widely used in object detection. For example, SENet [65], the last ImageNet champion model, proposed a new convolution operator that learns a new feature map from the input feature map through convolution kernels. AC-FPN [66] designs a new network structure called the attention-oriented context feature pyramid network. By fusing features with multiple different receptive fields, this structure not only enlarges the receptive field of objects but also makes use of their context information. In addition, DSC [44] is used for the shadow detection task: for each feature grid, the information of its surrounding grids is gathered so that shadows can be distinguished. Each corner has its own embedding vector. By matching the embedding vectors of a top-left corner and a bottom-right corner, we can determine whether the two corners belong to the same object. The pixel stream uses the vehicle light segmentation image to refine the candidate regions generated in the object stream into more accurate bounding boxes.
Our proposed nighttime vehicle detection method includes two important parts: 1) Object Stream; and 2) Pixel Stream. They are described in detail in this section.
Object stream
The object stream is used to detect the candidate regions in the nighttime image. Its structure is similar to that of CornerNet. The features of the original image are extracted by the backbone (Hourglass) network and then pass through Direction Attention Pooling to obtain direction perception features. These features then pass through a ReLU layer and a convolution layer and are split into three branches: heatmaps for locating corner keypoints, embeddings for pairing, and offsets for position calibration. The procedure is shown in Fig. 3.

The main structure of object stream. The features are processed separately into three branches for different usages: generating the heatmaps, extracting embeddings and predicting offsets.
The most critical structure used by CornerNet to detect key points is a special layer called Corner Pooling. Its mechanism is as follows. Take the detection of top-left corners as an example (the bottom-right case is similar). Two parallel feature maps from the backbone network are used: one is passed through horizontal pooling and the other through vertical pooling, and the output is the element-wise sum of the two pooled maps. In horizontal pooling, the value of a given grid in the feature map is set to the maximum among the values to its right. Similarly, vertical pooling selects the maximum among the values below the grid. The scheme is shown in Fig. 4. The principle is that the intersection of the upper boundary and the left boundary of an object is its top-left corner. The maximum value in Corner Pooling reflects the information of the top and left boundaries.

The top-left Corner Pooling layer.
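The top-left Corner Pooling described above can be sketched in a few lines of plain Python. This is a minimal illustration on single-channel feature maps stored as nested lists, not the implementation used in the experiments. The function name `top_left_corner_pool` is hypothetical, and the sketch assumes, as in CornerNet [19], that the maximum is taken over the grid itself together with the grids to its right (or below it).

```python
def top_left_corner_pool(horizontal_map, vertical_map):
    """Top-left Corner Pooling on two equally sized 2-D maps (lists of rows)."""
    rows, cols = len(horizontal_map), len(horizontal_map[0])
    pooled_h = [row[:] for row in horizontal_map]
    pooled_v = [row[:] for row in vertical_map]

    # Horizontal pooling: sweep right-to-left so that each grid holds the
    # maximum of itself and every grid to its right.
    for i in range(rows):
        for j in range(cols - 2, -1, -1):
            pooled_h[i][j] = max(pooled_h[i][j], pooled_h[i][j + 1])

    # Vertical pooling: sweep bottom-to-top so that each grid holds the
    # maximum of itself and every grid below it.
    for j in range(cols):
        for i in range(rows - 2, -1, -1):
            pooled_v[i][j] = max(pooled_v[i][j], pooled_v[i + 1][j])

    # The layer output is the element-wise sum of the two pooled maps.
    return [[pooled_h[i][j] + pooled_v[i][j] for j in range(cols)]
            for i in range(rows)]


if __name__ == "__main__":
    h_map = [[2, 1, 3], [0, 4, 1]]
    v_map = [[1, 0, 2], [3, 1, 0]]
    print(top_left_corner_pool(h_map, v_map))
```

Because only the running maximum survives each sweep, every other value along the row or column is discarded.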
Fig. 2(b) shows that the straight line extending horizontally from the bottom-right corner of the blue bounding box intersects the lower boundary of the silver car, and the line extending vertically intersects the right boundary of another red car. This repeated detection bounding box (blue) therefore does satisfy the operating criteria of Corner Pooling, which shows that Corner Pooling is flawed and can locate corner points at wrong positions. Law et al. [19] report that replacing the predicted heatmaps with the ground-truth heatmaps improves the AP from 38.4% to 73.1%, indicating that the main bottleneck of CornerNet is corner detection. We analyze the reasons and find that Corner Pooling focuses only on the grid with the maximum value and ignores the information of the other grids, although not all the information carried by the other pixels is useless.
Inspired by the shadow detection method DSC [44], we propose Direction Attention Pooling (DAP). The direction-aware spatial context (DSC) module pays attention to the difference between each pixel and its surrounding pixels. DSC focuses on the information carried by the other grids in the four directions of a pixel and gives them different attention weights. Our proposed Direction Attention Pooling differs from it: instead of perceiving all the surroundings, we sense only the right and down directions (for the top-left corner), as shown in Fig. 5. The features of the original image after passing through the backbone network are fed into two branches, one for learning the attention weights W and the other for computing the direction-aware spatial context with a spatial RNN. To learn spatial context in a direction-aware manner, we formulate the direction-aware attention mechanism in a spatial RNN to learn attention weights and generate features. A spatial RNN is a special RNN structure that can sense the context information in other spatial directions. The operation of the structure is shown in Fig. 6. Input feature

Schematic illustration of Direction Attention Pooling (top-left corner).

Schematic illustration of how information propagates in the IRNN in two directions.
Formulas (1) and (2) are proposed in [45]. Our method differs from the original use of the IRNN. Sean Bell et al. [45] originally applied the IRNN layer twice so that each grid of the feature map obtains a global receptive field. We focus only on the information in the right and down directions of each grid. When detecting the heatmap of the top-left corner, the features obtained from the backbone network pass through the IRNN layer to produce two new features, $h_{right}$ and $h_{down}$. Then, we split the $W$ learned by the other branch into two maps of attention weights, denoted by $W_{right}$ and $W_{down}$. The two weight maps $W_{right}$ and $W_{down}$ are multiplied with the spatial context features $h_{right}$ and $h_{down}$ in an element-wise manner. Finally, we concatenate the features from the two directions. This is the overall structure and operating mechanism of Direction Attention Pooling. Algorithm 1 is the Direction Attention Pooling pseudocode.
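The data flow of Direction Attention Pooling for the top-left corner can be illustrated with the following sketch. It is not Algorithm 1 and makes several assumptions: the feature map has a single channel and is stored as nested lists, the attention maps `w_right` and `w_down` are given rather than learned, and the IRNN recurrence is taken to be the commonly used form with a scalar recurrent weight `alpha` and a ReLU, $h_{i,j} = \max(\alpha\, h_{i,j+1} + x_{i,j},\, 0)$ for the right direction, which may differ in detail from Formulas (1) and (2). All function names are hypothetical.

```python
def irnn_sweep_right(feature, alpha=1.0):
    """Each grid accumulates context from the grids to its right.

    Assumed recurrence: h[i][j] = max(alpha * h[i][j+1] + x[i][j], 0).
    The right-most column keeps its input value.
    """
    rows, cols = len(feature), len(feature[0])
    h = [row[:] for row in feature]
    for i in range(rows):
        for j in range(cols - 2, -1, -1):
            h[i][j] = max(alpha * h[i][j + 1] + feature[i][j], 0.0)
    return h


def irnn_sweep_down(feature, alpha=1.0):
    """Each grid accumulates context from the grids below it."""
    rows, cols = len(feature), len(feature[0])
    h = [row[:] for row in feature]
    for j in range(cols):
        for i in range(rows - 2, -1, -1):
            h[i][j] = max(alpha * h[i + 1][j] + feature[i][j], 0.0)
    return h


def direction_attention_pool(feature, w_right, w_down, alpha=1.0):
    """Weight the two directional context maps and stack them as channels."""
    h_right = irnn_sweep_right(feature, alpha)
    h_down = irnn_sweep_down(feature, alpha)
    weighted_right = [[w * h for w, h in zip(wr, hr)]
                      for wr, hr in zip(w_right, h_right)]
    weighted_down = [[w * h for w, h in zip(wd, hd)]
                     for wd, hd in zip(w_down, h_down)]
    # Concatenation along the channel axis: a list of two 2-D maps.
    return [weighted_right, weighted_down]
```

Unlike max pooling, the recurrence lets every grid to the right of or below a location contribute to its response, and the attention maps decide how much each direction counts.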
For each corner, there is one ground-truth positive location, and the other locations are negative. $p_{cij}$ denotes the probability of the pixel at location $(i, j)$ belonging to a class $c$ object. In our task, there are $C$ types, and no background. $y_{cij}$ is the ground-truth heatmap augmented with the unnormalized Gaussians
Formulas (3) and (4) are proposed in [19], where N is the number of objects in an image, and α and β are the hyper-parameters which control the contribution of each point (we set α to 2 and β to 4 in all experiments).
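As an illustration of how α and β act, the sketch below evaluates the focal-loss variant defined in CornerNet [19], which we assume is the form referred to here: a positive location ($y_{cij} = 1$) contributes $(1 - p_{cij})^{\alpha} \log p_{cij}$ and any other location contributes $(1 - y_{cij})^{\beta} p_{cij}^{\alpha} \log(1 - p_{cij})$. The function name and the small `eps` clamp are illustrative additions, and the normalizer is taken to be the number of positive locations, which equals N when each object contributes one positive location.

```python
import math


def corner_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-12):
    """Focal-loss variant of CornerNet on heatmaps indexed as [c][i][j].

    pred holds probabilities p_cij; gt holds y_cij, where 1 marks a
    positive location and values below 1 come from the Gaussian bumps.
    """
    total, num_pos = 0.0, 0
    for pred_c, gt_c in zip(pred, gt):
        for pred_row, gt_row in zip(pred_c, gt_c):
            for p, y in zip(pred_row, gt_row):
                p = min(max(p, eps), 1.0 - eps)  # keep log() finite
                if y == 1:
                    total += (1 - p) ** alpha * math.log(p)
                    num_pos += 1
                else:
                    # (1 - y)^beta reduces the penalty near a positive.
                    total += (1 - y) ** beta * p ** alpha * math.log(1 - p)
    return -total / max(num_pos, 1)


if __name__ == "__main__":
    far = corner_focal_loss([[[0.5, 0.5]]], [[[1, 0.0]]])
    near = corner_focal_loss([[[0.5, 0.5]]], [[[1, 0.9]]])
    print(far, near)  # the Gaussian-weighted case is penalized less
```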
Because the feature extraction process downsamples the original image, the output is generally smaller than the original image. The original image position (x, y) is mapped to position
Formulas (5) and (6) are proposed in [19], where $o_k$ is the offset, and $x_k$ and $y_k$ are the coordinates of corner $k$. $o_k$ is the true value and
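The offset computation can be illustrated as follows, assuming the CornerNet [19] definitions: with downsampling factor $n$, a corner at $(x_k, y_k)$ falls into heatmap cell $(\lfloor x_k/n \rfloor, \lfloor y_k/n \rfloor)$, its true offset is the fractional part lost by the floor, and predicted offsets are trained with a smooth L1 loss averaged over the corners. The function names are hypothetical.

```python
import math


def corner_offset(x, y, n):
    """Return the heatmap cell of a corner and the offset lost by flooring."""
    cell_x, cell_y = math.floor(x / n), math.floor(y / n)
    return (cell_x, cell_y), (x / n - cell_x, y / n - cell_y)


def smooth_l1(a, b):
    """Smooth L1 distance between two scalars."""
    d = abs(a - b)
    return 0.5 * d * d if d < 1.0 else d - 0.5


def offset_loss(true_offsets, pred_offsets):
    """Average smooth L1 loss over corners; offsets are (dx, dy) pairs."""
    total = sum(smooth_l1(t[0], p[0]) + smooth_l1(t[1], p[1])
                for t, p in zip(true_offsets, pred_offsets))
    return total / max(len(true_offsets), 1)


if __name__ == "__main__":
    cell, offset = corner_offset(103, 58, 4)
    print(cell, offset)                            # (25, 14) (0.75, 0.5)
    print(offset_loss([offset], [(0.25, 0.5)]))    # 0.125
```

Adding the predicted offset back to the cell index and multiplying by $n$ recovers the corner position at the input resolution.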
The embedding vector is used for matching, to determine whether a detected top-left corner and a detected bottom-right corner belong to the same object. The relationship between the heatmap feature map and the embedding vectors is shown in Fig. 7. According to Newell et al. [47], the matching principle is that the distance between the embeddings of the top-left corner and the bottom-right corner of the same object is small. Therefore, in training we match the corners of the same object with a "pull" loss and separate the corners of different objects with a "push" loss. $e_{t_k}$ and $e_{b_k}$ are the embeddings of the top-left and bottom-right corners of object $k$, and $e_k$ is the average of $e_{t_k}$ and $e_{b_k}$. $\Delta$ is set to 1. Formulas (7) and (8) are proposed in [47].

The network outputs a heatmap for corners and embedding vectors.
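The "pull" and "push" terms can be sketched as follows, assuming the form used in CornerNet [19] with one-dimensional embeddings: the pull loss averages $(e_{t_k} - e_k)^2 + (e_{b_k} - e_k)^2$ over the $N$ objects, and the push loss averages $\max(0, \Delta - |e_k - e_j|)$ over all ordered pairs of different objects. The function name is hypothetical.

```python
def pull_push_losses(top_left, bottom_right, delta=1.0):
    """Pull and push losses for 1-D corner embeddings.

    top_left[k] and bottom_right[k] are the embeddings of the two
    corners of object k.
    """
    n = len(top_left)
    centers = [(t + b) / 2.0 for t, b in zip(top_left, bottom_right)]

    # Pull: bring both corner embeddings of an object toward their mean.
    pull = sum((t - c) ** 2 + (b - c) ** 2
               for t, b, c in zip(top_left, bottom_right, centers))
    pull /= max(n, 1)

    # Push: keep the means of different objects at least delta apart.
    push = 0.0
    for k in range(n):
        for j in range(n):
            if j != k:
                push += max(0.0, delta - abs(centers[k] - centers[j]))
    if n > 1:
        push /= n * (n - 1)
    return pull, push


if __name__ == "__main__":
    # Two objects whose mean embeddings are more than delta apart.
    print(pull_push_losses([0.0, 1.0], [0.2, 1.4]))
```

At inference, a top-left and a bottom-right corner are paired when the distance between their embeddings is small.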
In nighttime images for autonomous driving, the various light sources are salient information, and light information has a strong positive impact on vehicle detection. In this section, we demonstrate how to enhance the detection results with pixel-level vehicle light information. First, we use the same method as [10] to obtain the segmentation image of the vehicle lights, as shown in Fig. 8. The contrast of the original image in Fig. 8(a) is enhanced by the method of [48] to obtain Fig. 8(b), which makes the distinction between object and background more obvious. Because the Nakagami image [49] emphasizes a light and its surrounding scattering area, the processed Nakagami image in Fig. 8(c) can be used to detect the light locations in the image. The segmentation result shown in Fig. 8(d) is obtained from the Nakagami image with an adaptive threshold [25]. The Nakagami image contains other lights besides the vehicle lights, so the vehicle lights must be distinguished from the other lights. When the Nakagami image is converted to the HSI color space, different lights have different hue values. According to the hue distribution in [10], most turn lights and tail lights lie in the ranges 0 to 0.05, 0.45 to 0.6, and 0.96 to 1. We set the threshold based on these ranges. The segmentation result in the HSI color space is shown in Fig. 8(e). We fuse the HSI segmentation and the Nakagami image segmentation with an AND (&) operation to obtain Fig. 8(f). Then, for the fusion image, we use the 8-connected component labeling method to obtain the vehicle light locations. We use this method to preprocess all the images in the dataset.

Vehicle lights segmentation image acquisition process: (a) original image, (b) enhanced image, (c) Nakagami image, (d) Nakagami image segmentation, (e) HSI segmentation, (f) fusion image.
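The last three steps of this pipeline (hue thresholding, AND fusion, and 8-connected component labeling) can be sketched in plain Python. The sketch assumes that the Nakagami segmentation is already available as a binary mask and that hue values are normalized to [0, 1]; the contrast enhancement, Nakagami imaging, and adaptive thresholding steps are not reproduced. All function names are hypothetical.

```python
from collections import deque

# Hue ranges of turn lights and tail lights reported in [10].
HUE_RANGES = ((0.0, 0.05), (0.45, 0.6), (0.96, 1.0))


def hue_mask(hue):
    """Mark pixels whose hue falls into one of the vehicle light ranges."""
    return [[any(lo <= h <= hi for lo, hi in HUE_RANGES) for h in row]
            for row in hue]


def fuse_masks(nakagami_mask, hsi_mask):
    """Pixel-wise AND of the two segmentation masks."""
    return [[bool(a) and bool(b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(nakagami_mask, hsi_mask)]


def label_8_connected(mask):
    """Label 8-connected components; returns (label map, component count)."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for i in range(rows):
        for j in range(cols):
            if not mask[i][j] or labels[i][j]:
                continue
            count += 1
            labels[i][j] = count
            queue = deque([(i, j)])
            while queue:  # breadth-first flood fill over 8 neighbors
                r, c = queue.popleft()
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = r + dr, c + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and mask[nr][nc] and not labels[nr][nc]):
                            labels[nr][nc] = count
                            queue.append((nr, nc))
    return labels, count
```

Each labeled component is one candidate vehicle light region, and its pixel coordinates give the light mask used by the corner localization step.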
The processing above yields the vehicle light areas of the original image. Intuitively, the more vehicle light pixels a bounding box contains, the more likely it is to be the best bounding box. In practice, however, most of the bounding boxes detected by the object stream contain all the vehicle light pixels. Moreover, most of the repeated detections described in the related work also contain all the vehicle light pixels. Therefore, the detection accuracy cannot be improved simply by checking whether, or how many, vehicle light pixels are included. Next, we introduce how to use the pixel-level information of the vehicle lights to calibrate the bounding boxes of vehicle detection. Inspired by the mask based boundary modification module (MBRM) [50], we develop a formulation for corner localization based on Bayes' theorem.
Although the bounding boxes detected in the object stream have a certain location error, they provide rich prior information for the final accurate bounding box. Therefore, our formulation combines the detection results of the object stream with the segmentation result of the pixel stream. Specifically, we consider the locations of the corners as a discrete random variable. From the probabilistic perspective, an object corner location is the argmax of the probability of a coordinate where the corner is located, namely
We now take the calculation of the top-left corner as an example. After the corner embedding vectors are matched in the object stream, each top-left corner has been matched with a bottom-right corner, so a top-left corner can represent a bounding box. Following Bayes' theorem, we have
For $P(Z = z_{x,y})$, the definition is similar to section 3.1.1, and it reflects the distance between the detection point and the ground-truth. We simply adopt a discrete Gaussian distribution

Calculation diagram for a single-object detection bounding box and a repeated-detection bounding box.
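The structure of the Bayes corner localization can be sketched as a maximum a posteriori search over candidate corner coordinates. The prior is a discrete Gaussian centered on the corner detected by the object stream, as described above. The likelihood of the light mask given a corner is passed in as a function, because its exact form is not reproduced here; the `likelihood` argument, the `sigma` value, and the toy likelihood in the example are all hypothetical.

```python
import math


def gaussian_prior(detected, candidates, sigma=1.0):
    """Discrete Gaussian prior over candidates, centered on the detection."""
    weights = [math.exp(-((x - detected[0]) ** 2 + (y - detected[1]) ** 2)
                        / (2.0 * sigma ** 2))
               for x, y in candidates]
    total = sum(weights)
    return [w / total for w in weights]


def map_corner(detected, candidates, likelihood, sigma=1.0):
    """Return the candidate with the highest posterior, and the posterior.

    posterior(z) is proportional to likelihood(z) * prior(z) (Bayes' rule).
    """
    prior = gaussian_prior(detected, candidates, sigma)
    joint = [p * likelihood(c) for p, c in zip(prior, candidates)]
    evidence = sum(joint)
    if evidence == 0.0:
        # No candidate is supported by the mask: keep the detected corner.
        return detected, prior
    posterior = [v / evidence for v in joint]
    best = max(range(len(candidates)), key=posterior.__getitem__)
    return candidates[best], posterior


if __name__ == "__main__":
    candidates = [(0, 0), (1, 0), (2, 0)]
    toy = {(0, 0): 0.1, (1, 0): 0.2, (2, 0): 0.9}  # hypothetical values
    corner, posterior = map_corner((1, 0), candidates, toy.get)
    print(corner, posterior)
```

With a flat likelihood the detected corner is returned unchanged, so the mask information only moves a corner when it favors a nearby coordinate strongly enough to outweigh the prior.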
Training and inference
We implement our method in PyTorch. The network is randomly initialized under the default setting of PyTorch, except that we use a pretrained Hourglass backbone [19]. As we apply focal loss, we follow [47] to set the biases in the convolution layers that predict the corner heatmaps. In training, we use Hourglass as the backbone network to extract features. We set the input resolution of the network to 511 × 511, which leads to an output resolution of 128 × 128. To reduce overfitting, we adopt standard data augmentation techniques including random horizontal flipping, random scaling, random cropping, and random color jittering, which includes adjusting the brightness, saturation, and contrast of an image. Finally, we apply PCA [59] to the input image. We use Adam [60] to optimize the full training loss:
We use the Sun Yat-sen University Night-time Vehicle Dataset. This dataset contains 5576 images with over 12000 nighttime vehicles in two different traffic situations, i.e., urban roads with and without streetlights. All these images were captured by driving recorders mounted on the front windshield of vehicles. We used 4460 images as training samples and 1116 images as test samples. The dataset is used to evaluate the proposed object proposal approach. To verify the generalization ability of the method, we also tested on another dataset collected at night in Hong Kong, named CHUK, and compared with other detection methods. This is a multi-class dataset which contains 836 high resolution images with pixels. We use random cropping, flipping, and mirroring to augment and expand the dataset.
Results for Nighttime Vehicle Detection
Results of the Object Stream Proposal
In our method, the object stream provides the proposals for the nighttime vehicle detection task. The performance of the proposed method is measured with the detection rate versus number of proposals (ROIs) curve, as shown in Fig. 10. The method in [8] is superior to the original Edge Boxes [28]. The method based on Bayesian saliency in [29] is superior to the other methods based on vehicle light detection in [51–53]. The method proposed in [15], which combines Edge Boxes, a local contrast feature, and image region similarity, performs better than those in [8, 29]. In this paper, we therefore compare the proposed method with that of [15]. With only 18 proposals, our proposal generation method reaches a 100% detection rate, while the method in [15] reaches the same detection rate with 20 proposals. Moreover, when the number of proposals is less than 20, our method achieves a higher detection rate with the same number of proposals. These results show that our proposed method outperforms the other methods.
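One point of a detection rate versus number of proposals curve can be computed as sketched below. The sketch assumes that a ground-truth vehicle counts as detected when at least one of the top-k ranked proposals overlaps it with an IoU of at least 0.5; this threshold and the function names are assumptions, not values taken from the evaluation protocol.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def detection_rate(images, k, iou_threshold=0.5):
    """Fraction of ground-truth boxes covered by the top-k proposals.

    images is a list of (ranked_proposals, ground_truth_boxes) pairs,
    with the proposals sorted from most to least confident.
    """
    hit = total = 0
    for proposals, ground_truths in images:
        top_k = proposals[:k]
        for gt in ground_truths:
            total += 1
            if any(iou(p, gt) >= iou_threshold for p in top_k):
                hit += 1
    return hit / total if total else 0.0


if __name__ == "__main__":
    data = [([(0, 0, 10, 10), (20, 20, 30, 30)],
             [(0, 0, 10, 10), (21, 21, 30, 30)])]
    print([detection_rate(data, k) for k in (1, 2)])  # [0.5, 1.0]
```

Sweeping k over the number of proposals and plotting the returned values gives the curve; a method is better when it reaches a given detection rate with fewer proposals.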

When our method is applied to vehicle detection at night, we use the curve of miss rate versus false positives per image (FPPI) to evaluate detection performance against some of the most advanced vehicle detection methods. On this curve, at the same FPPI, a lower miss rate indicates a better method. In Fig. 11(a), we compare the detection results of CornerNet, our method without the pixel stream, and our method with the pixel stream. The last shows a lower miss rate than the other two at the same FPPI, from which we infer that the pixel stream is effective for the detection task. To further verify the role of image enhancement in vehicle detection, we compare the miss rate versus FPPI curves of the different methods on the test set in Fig. 11(b). The curve of our method lies below all the other curves, i.e., it shows a lower miss rate than the other five methods at the same FPPI. These results also show that our method is more effective than the other methods for vehicle detection at night.
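As an illustration of how such a curve is obtained, the sketch below sweeps the score threshold over a pooled list of detections. It assumes that each detection has already been labelled as a true or false positive by an upstream matching step in which every true positive corresponds to a distinct ground-truth vehicle; the function name and input layout are hypothetical.

```python
def miss_rate_vs_fppi(detections, num_ground_truth, num_images):
    """Return (FPPI, miss rate) points, one per score threshold.

    detections: list of (score, is_true_positive) pooled over the test set.
    """
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    curve = []
    true_pos = false_pos = 0
    for _score, is_tp in ordered:
        if is_tp:
            true_pos += 1
        else:
            false_pos += 1
        fppi = false_pos / num_images                  # false positives per image
        miss_rate = 1.0 - true_pos / num_ground_truth  # fraction of vehicles still missed
        curve.append((fppi, miss_rate))
    return curve


# Toy data: four detections over two images containing four vehicles.
points = miss_rate_vs_fppi(
    [(0.9, True), (0.8, False), (0.7, True), (0.4, False)],
    num_ground_truth=4,
    num_images=2,
)
```

Lowering the score threshold moves along the curve toward higher FPPI and lower miss rate, so a method whose curve lies below another's misses fewer vehicles at the same false-positive budget.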

Comparison of the detection performance of our proposed method and other methods. (a) Detection performance of our proposed method. (b) Comparison with some state-of-the-art night-time vehicle detection approaches.
Fig. 12 shows some examples of the detection results of our proposed method. The bounding box shown for each object is the one with the highest IoU with the ground truth. From Fig. 12, we can see that the proposed method can detect vehicles of different sizes [Fig. 12(a)], in different numbers [Fig. 12(b)], and against different backgrounds [Fig. 12(c)]. Even partially occluded and blurred vehicles are detected successfully. Multiple vehicle classes in the CHUK dataset can also be detected [Fig. 12(d)]. The yellow, green and red bounding boxes represent buses, cars and taxis, respectively. The proposed method detects cars, taxis, buses and minibuses successfully. However, there are also misclassifications, for instance between taxis and cars that look similar [Fig. 12(e)]. For example, the left car in the left picture is detected as a taxi, and the leftmost taxi in the right picture is detected as a car.

The detection results of our proposed method.
In this section, we report the results of our proposed nighttime vehicle detection method and compare it with some of the most advanced detection methods in terms of vehicle recognition accuracy and vehicle detection performance, as shown in Table 2. We name our method DANet+BCL (Direction Attention Network and Bayes Corner Localization). The accuracy is the ratio of correctly detected bounding boxes to all ground-truth bounding boxes. A detected bounding box is judged correct if its IoU with a ground-truth box is greater than 0.5. In each dataset, we retain for every image the same number of detected bounding boxes as there are ground-truth boxes. Many effective nighttime detection methods rely on manual feature extraction. For example, the vehicle light blob classifiers [8, 72] combine vehicle light detection and image enhancement, [9] uses a bio-inspired image enhancement method to enhance candidate regions, and [15] conducts feature extraction based on tensor decomposition. There are also methods based on the CNN framework, such as CNN feature extraction combined with SVM classification, CNN features fused with Nakagami image features [10], the single-stage network YOLOv3 [34] and image-enhanced YOLOv3 [71]. The GAN-based nighttime object detection method FTE [70] is also included for comparison. From Table 2, we can see that the accuracy of our proposed method on the CHUK dataset is 97.2%, which is 0.31% and 1.62% higher than that of the best manual feature extraction method, Tensor Decomposition [15], and the best CNN detection method [10], respectively. On the SYSU dataset, the accuracy of our proposed method is 96.86%, which is 0.34% and 1.65% higher than that of the best manual feature extraction method and the best CNN detection method, respectively.
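The accuracy protocol described above can be sketched as follows. The one-to-one matching between detections and ground-truth boxes is an assumption, since the text only states the IoU criterion and the number of retained boxes; the box format (x1, y1, x2, y2) and the function names are likewise illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def accuracy(images, iou_threshold=0.5):
    """Ratio of correct detections to all ground-truth boxes.

    images: list of (detections, ground_truths); each detection is (score, box).
    """
    correct = total = 0
    for detections, ground_truths in images:
        total += len(ground_truths)
        # Retain as many top-scoring detections as there are ground-truth boxes.
        ranked = sorted(detections, key=lambda d: d[0], reverse=True)
        kept = ranked[:len(ground_truths)]
        unmatched = list(ground_truths)
        for _score, box in kept:
            best = max(unmatched, key=lambda g: iou(box, g), default=None)
            if best is not None and iou(box, best) > iou_threshold:
                correct += 1
                unmatched.remove(best)  # assumed: each ground truth is matched once
    return correct / total


# Toy data: two images, three ground-truth vehicles, two of them detected.
toy = [
    ([(0.9, (0, 0, 10, 10)), (0.8, (20, 20, 30, 30)), (0.3, (50, 50, 60, 60))],
     [(0, 0, 10, 10), (22, 20, 32, 30)]),
    ([(0.7, (0, 0, 10, 10))],
     [(5, 5, 15, 15)]),
]
toy_accuracy = accuracy(toy)
```

Because the number of retained detections equals the number of ground-truth boxes, every missed vehicle is paired with one wasted detection slot, so this accuracy penalizes misses and false alarms together.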
Comparisons of Accuracy (%) on SYSU Dataset and CUHK Dataset
We tested the inference speed of our algorithm and the other solutions on a single NVIDIA Titan X GPU. Table 3 compares the detection times. For each method, we first train on the training set and verify on the test set, and stop training when the accuracy reaches the value reported in the corresponding paper. Then, to ensure a fair comparison, we set the input image size of every method to 511 × 511 and record the inference time for each image. For our method, the average inference time is 227 ms per image. Using the Hourglass-52 backbone speeds up inference: our Hourglass-52 model takes an average of 180 ms to process one image, which is faster than Hourglass-104. The methods based on manual feature extraction do not gain much speed from the GPU. For example, the method of [15] needs 180 ms per image on the CPU and 165 ms per image on the GPU. This is because data transfer is costly and the GPU handles data transfer more slowly than the CPU, while the GPU's strength, computation in deeper and wider networks, is not clearly reflected in simple parameter calculations. Compared with the original CornerNet, our method does not significantly increase the training time, because learning the attention weights and computing the IRNN layer in the attention pooling module run in parallel. In addition, the IRNN computation is faster than that of a CNN, whose element-wise calculation is time-consuming.
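A per-image timing measurement of this kind can be sketched as below. The warm-up run and the helper name `average_inference_ms` are assumptions and not part of the protocol described above; `detector` stands for any callable that blocks until its result is available, which on a GPU usually requires an explicit synchronization inside the callable.

```python
import time


def average_inference_ms(detector, images, warmup=1):
    """Mean wall-clock inference time per image, in milliseconds."""
    for image in images[:warmup]:
        detector(image)  # warm-up runs are executed but not timed
    start = time.perf_counter()
    for image in images:
        detector(image)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(images)


def dummy_detector(image):
    """Stand-in for a real network: returns one box covering the whole image."""
    return [(0, 0, len(image[0]), len(image))]


# Toy "images" resized to the common 511 x 511 input size used in the comparison.
toy_images = [[[0] * 511 for _ in range(511)] for _ in range(3)]
mean_ms = average_inference_ms(dummy_detector, toy_images)
```

Timing all images in one loop and dividing by their count averages out per-call jitter, which matters when the differences between methods are a few tens of milliseconds.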
Detection Time of Methods Compared in This Paper
Since our method is based on a deep network, we also compare it with advanced deep networks. We take the open-source pretrained networks and fine-tune them on the training set of the CHUK dataset. Training is stopped when the losses on the training set and the test set no longer decrease and are sufficiently small. Table 4 shows a comparison with state-of-the-art detectors on the CHUK test-dev set. We use AP metrics to characterize the performance of a detector. AP is the average precision over all categories. The precision is the ratio of correctly detected bounding boxes to all detected bounding boxes, where a detected box is judged correct if its IoU with the ground truth is greater than a threshold. We calculate the AP with the IoU threshold set to 0.3, 0.5 and 0.8. Compared with the reference CornerNet [19], our proposed method shows a significant improvement. For example, under the same settings, our Hourglass-101 model (i.e., the input image resolution is 511 × 511 and the backbone is Hourglass-101) reports test AP0.3, AP0.5 and AP0.8 of 98.43%, 97.2% and 80.68%, respectively, and its AP, AP0.3, AP0.5 and AP0.8 are 3.45%, 3.09%, 2.83% and 4.45% higher than those of the original CornerNet, respectively. When the lighter network (Hourglass-52) is used, the AP improvement over CornerNet is 4.4% (from 86.96% to 91.27%). We also replaced the backbone with ResNet-101 [55], and the corresponding AP is 90.63%. These results demonstrate the effectiveness of our method and of the Hourglass backbone. Our method achieves 92.1% test AP. Compared with the traditional two-stage approach Faster-RCNN/FPN [32], our method is 4.93% higher. Compared with the other, relatively advanced two-stage approaches D-FCN [56], Mask R-CNN [57], Cascade-RCNN [33] and CBNet [73], our method improves by 4.33%, 2.46%, 2.01% and 1.5%, respectively. As for single-stage networks, our method is 7.93% and 5.41% higher than YOLOv3 and SSD513 [18], respectively.
Among anchor-free networks, our AP is 4.96% and 1.2% higher than FCOS [39] and FreeAnchor [75], respectively. We also perform better than a series of improved networks based on CornerNet, such as CornerNet-Lite [58] and CenterNet [37]. Even compared with the most advanced single-stage networks DETR [74] and EfficientDet [36], the AP of our Hourglass-101 model is 3.68% and 0.4% higher, respectively.
Comparisons with object detection network
Our work improves night vehicle detection mainly in two parts: the direction attention pooling layer of the object stream, and the pixel stream. To analyze the contribution of each individual component, we conduct ablation studies. The criterion is the accuracy at an IoU threshold of 0.5. We split each of the two modules into its details, as shown in Table 5. The direction attention pooling layer is divided into two components: learning the attention weights, and using the IRNN layer. The pixel stream is likewise divided into two components: using pixels to assist training, and the Bayes corner localization algorithm. We take CornerNet as the original method. In the following experiments, we use the original network, remove the modules under discussion, and then retrain it. In the first experiment, we use CornerNet as the object stream and add the pixel stream; the accuracy is 95.12%, an increase of 0.75% over the original 94.37%. In the second experiment, we use only the object stream containing the direction attention pooling layer. The result is 94.67%, which is 0.3% higher than the original. In the third experiment, we obtain the bounding boxes from the object stream. Instead of using the Bayes-based corner localization algorithm, we sift the boxes by the number of vehicle light pixels covered by each bounding box: the top-10 bounding boxes containing the most vehicle light pixels are taken as the best bounding boxes. The result is 94.26%, which is worse than that of the second experiment. The reason is that some bounding boxes cover the lights of other objects and therefore cover many pixels, so these improper boxes are retained. The fourth experiment is the method proposed in this paper, which improves on the original method by 2.83%. In the fifth experiment, we replace the IRNN layer in the direction attention pooling layer of the object stream with two 3 × 3 convolution layers.
The final result is 93.38%, which is 3.82% lower than that of the method proposed in this paper. In the sixth experiment, the attention weights in the direction attention pooling layer are not learned, and the final result is 89.72%. In both the fifth and sixth experiments, the performance is even worse than the original, which shows that the attention weights must work together with the IRNN layer to obtain a better effect. According to these experiments, the attention pooling of the object stream has a great impact on the performance of the algorithm.
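The pixel-counting baseline of the third experiment, and the reason it retains improper boxes, can be illustrated with a minimal sketch. The binary mask layout, the inclusive (x1, y1, x2, y2) box convention and the function names are assumptions made for illustration.

```python
def count_light_pixels(mask, box):
    """Number of vehicle light pixels (mask value 1) covered by a box.

    mask is a list of rows of 0/1 values; box is (x1, y1, x2, y2), inclusive.
    """
    x1, y1, x2, y2 = box
    return sum(sum(row[x1:x2 + 1]) for row in mask[y1:y2 + 1])


def top_k_by_light_pixels(mask, boxes, k=10):
    """Keep the k boxes that cover the most vehicle light pixels."""
    ranked = sorted(boxes, key=lambda b: count_light_pixels(mask, b), reverse=True)
    return ranked[:k]


# Toy mask with the lights of two separate vehicles on row 1.
light_mask = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
tight_left = (0, 0, 2, 2)   # covers only the left vehicle's lights
tight_right = (5, 0, 7, 2)  # covers only the right vehicle's lights
too_wide = (0, 0, 7, 2)     # spans both vehicles
best = top_k_by_light_pixels(light_mask, [tight_left, too_wide, tight_right], k=1)
```

In this toy case the box spanning both vehicles covers the most light pixels and is ranked first, which mirrors the failure mode described above: counting pixels rewards boxes that swallow the lights of neighboring objects.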
The Result of Ablation Study
In this paper, we propose a new method for vehicle detection at night. Because night images have low brightness and low contrast between the background and the object, unmodified CNN feature extraction tends to suffer from information loss, which occurs mainly during downsampling. For this reason, most night vehicle detection methods use complex manual feature extraction. Our method remedies this problem by adding vehicle light information. We fuse Nakagami segmentation and HSI segmentation to obtain the vehicle light segmentation image, and then use the Bayes Corner Localization algorithm to fine-tune the bounding boxes of the detection network. We improve the reference network CornerNet by introducing direction attention pooling into the corner detection module, and obtain better detection performance than the original. To verify the generalization of the model, we experimented on two different datasets and achieved state-of-the-art accuracy. Finally, we also compared our method with current advanced networks and showed superior performance.
The existing night vehicle datasets suffer from insufficient scenes and low resolution. In future work, we plan to build a new dataset covering more scenes and more categories at a higher resolution. In addition, the brightness of the training images is still sufficient for relatively reliable detection. However, there are extreme circumstances in reality where there is hardly any light. How to handle the detection task in such situations will be investigated next.
Footnotes
Acknowledgments
This work is supported by the National Key R&D Program of China (2018YFC0309400), the National Natural Science Foundation of China (61871188), and the Guangzhou City Science and Technology Research Projects (201902020008).
The SYSU dataset can be found at wwww.carlib.net/?page_id=35
