Abstract
The Pixel-Based Adaptive Segmenter with Confidence Measurement (PBASCM) is proposed for vehicle detection in complex urban traffic scenes to efficiently address deficiencies of the background subtraction model, which is easily contaminated by slow-moving or temporarily stopped vehicles. The background is modeled based on the history of recently observed pixel values and each pixel in the background model is assigned a confidence measurement based on the current traffic state. The foreground decision depends on an adaptive threshold, whereas the background model is updated based on the current traffic state and whether the corresponding pixel point is in the confidence period. Using real-world urban traffic videos, the overall results of detection accuracy analyses demonstrated that PBASCM achieved better performance in both qualitative and quantitative evaluations, compared with other state-of-the-art methods. PBASCM can accurately detect slow-moving or temporarily stopped vehicles, and the similarity and F-measure results for PBASCM were 0.839 and 0.912 higher, respectively, than those obtained by other methods in a traffic light sequence during the daytime. Thus, our experimental results demonstrate that PBASCM is effective and suitable for real-time vehicle detection in complex urban traffic scenes.
Keywords
Introduction
The extraction of foreground targets from a complex background is very important for traffic control departments and it has many applications in public security in cities. Automatic detection of moving foreground targets in urban traffic has been studied widely by the video surveillance community and various methods for foreground detection have been developed, such as the frame difference method, the optical flow method and the Background Subtraction (BS) method. The frame difference approach, which compares the difference between consecutive frames in a video image sequence, is simple and fast, but it performs poorly when the light changes or when vehicles are motionless. The optical flow technique is based on the motion projected onto an image surface. However, this method is very sensitive to noise and it cannot be employed for vehicle detection in real time because of its high computational cost. BS is an effective technique for detecting moving objects in image sequences acquired by stationary cameras. In BS methods, the background models are computed and the input video frames are then compared to their current background models. Finally, the regions with significant differences are marked as the foreground [1, 2].
Many BS methods have been proposed and used in previous studies to detect vehicles in video sequences of traffic scenes. Among these, the Gaussian mixture model (GMM) is the most widely used statistical model because of its robustness and effectiveness [3]. However, to avoid the model’s practical disadvantages, many researchers have introduced improved GMM methods. For example, to reduce the processing time, Zivkovic et al. used a scheme to dynamically determine the appropriate number of Gaussian models for each pixel based on the observed scene dynamics [4], while Salvadori et al. [5] proposed two novel integer computer arithmetic techniques for updating Gaussian parameters. In [6–8], spatial information was incorporated into temporal information to build a spatiotemporal GMM that handles highly dynamic environments. To facilitate the better regularization of background adaptation for GMM and to resolve the tradeoff, an adaptive learning rate was proposed in [9] for each Gaussian model, which does not affect the stability. A new rate control scheme based on high-level feedback, which uses different learning rate settings for four pixel types in the real foreground, was used in [10, 11]. Shah et al. [12] proposed a new local parameter learning algorithm for the GMM background model. Chen [13] presented a new self-adaptive GMM, which uses a dynamic learning rate with adaptation to cope with sudden variations in scene illumination. Recently, other statistical distributions based on GMM have been proposed for moving target recognition. For example, a mixture of asymmetric Gaussian distributions [14] and the Dirichlet process GMM [15] have been employed to enhance the flexibility and adaptation of the mixture model in real scenarios, respectively. GMM has been used to effectively describe scenes with smooth behavior and limited variation, and the improved GMMs can handle critical situations robustly and adaptively.
However, in urban traffic scenes with rapid variations or nonstationary properties, GMM and its improved models are still affected by a tradeoff between robustness to background changes and sensitivity to foreground abnormalities.
Many non-probabilistic BS models, which are usually produced by collecting a set of recently observed historical pixel values, have also been employed widely to detect foreground objects. The Codebook (CB) model is one of the most popular real-time BS methods for moving object detection [16, 17]. In the CB model, the background model for each pixel is represented by many code words and the number of code words varies with the activity of the pixel. Adaptive Light-Weight (ALW) is another real-time algorithm, which is highly robust in challenging non-static backgrounds [18]. The Visual Background extractor (ViBe) was introduced by Barnich et al. [19]; it represents the background model of each pixel based on observed neighboring samples and uses a random scheme to update the background model. In ViBe+ [20], the gradient information is used to improve the update strategy. According to perceptual characteristics of the human visual system, Han et al. proposed an improved ViBe method that uses an adaptive distance threshold for each background sample [21]. To improve the robustness of ViBe, Cheng et al. proposed a background model re-initialization method based on the detection of sudden luminance changes [22]. Similar to ViBe, the Pixel-Based Adaptive Segmenter (PBAS) proposed by Hofmann et al. [23] employs a random update strategy, but the update probability depends on a learning parameter, which can change dynamically over time for each separate pixel. Unfortunately, these non-probabilistic BS models yield unsatisfactory results for urban traffic scenes and they cannot handle slow-moving vehicles or those that have stopped temporarily. Moreover, Vargas et al. [24–27] proposed the Sigma-delta with Confidence Measurement (SCM) algorithm, which is based on the original sigma-delta BS method, for detecting vehicles in urban traffic scenes. However, this model cannot handle multi-modal backgrounds and it is slow to obtain the initial background model.
The urban domain is more challenging in terms of the traffic density, high degree of occlusion, and variety of road users. The development of a reliable and robust foreground object detection algorithm for urban traffic scenes therefore remains an open problem [28]. To efficiently address deficiencies of the BS model, which is easily contaminated by slow-moving or temporarily stopped vehicles, we propose the PBAS with Confidence Measurement (PBASCM) for vehicle detection in complex urban traffic scenes. We use a traffic flow evaluation scheme, as employed in SCM [24–27], together with a background model and a foreground decision rule similar to those found in PBAS. Based on these schemes, the proposed method can detect slow-moving or temporarily stopped vehicles more precisely and more completely than state-of-the-art BS methods because it can adapt successfully to the complex urban traffic environment.
The remainder of this paper is organized as follows. In Section 2, we introduce the PBAS, and in Section 3, we introduce the PBASCM for BS in urban traffic environments. In Section 4, we describe the experiments and results obtained using our proposed method, compared with other state-of-the-art BS methods in different traffic scenarios. Finally, we give our conclusions in Section 5.
Pixel-Based Adaptive Segmenter (PBAS)
In [23], the PBAS uses the history of N recently observed image values as the background model, as employed in SACON [32], and a similar random update rule to that used by ViBe. A highly efficient background modeling method can be obtained using two controllers with feedback loops for the decision threshold and the learning parameter. For every pixel (x, y), the background model B (x, y) is defined by an array of N recently observed image values b M (x, y), M ∈ [1, N], as follows:
$$B(x, y) = \{ b_1(x, y), \ldots, b_M(x, y), \ldots, b_N(x, y) \}$$
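A pixel (x, y) is determined as belonging to the background if its pixel value I (x, y) is closer than a certain decision threshold R (x, y) to at least # min of the N background values. Thus, the foreground segmentation is defined as
$$F(x, y) = \begin{cases} 1, & \#\{ M : \mathrm{dist}(I(x, y), b_M(x, y)) < R(x, y) \} < \#_{\min} \\ 0, & \text{otherwise} \end{cases}$$
where dist(·, ·) denotes the distance between two pixel values and F (x, y) = 1 marks a foreground pixel.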
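The decision rule above can be illustrated with a short NumPy sketch. It is not the authors' implementation: the function name, the array layout, the absolute grey-level difference used as the distance, and the default value of `n_min` are all assumptions made for illustration.

```python
import numpy as np


def pbas_foreground_mask(frame, background, R, n_min=2):
    """Return F(x, y) for every pixel: 1 = foreground, 0 = background.

    frame:      (H, W) grey-level image I(x, y)
    background: (N, H, W) stack of background samples b_M(x, y)
    R:          (H, W) per-pixel decision threshold R(x, y)
    n_min:      required number of close samples (#min); the default is an assumption
    """
    # Assumed distance measure: absolute grey-level difference.
    dist = np.abs(background.astype(np.float64) - frame.astype(np.float64))
    # Number of background samples that lie closer than R(x, y) to the pixel.
    close = (dist < R).sum(axis=0)
    # Foreground when fewer than n_min samples are close enough.
    return (close < n_min).astype(np.uint8)
```

Because the comparison is broadcast over the sample axis, the whole frame is classified in one pass, which fits the per-pixel nature of the model.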
PBAS with Confidence Measurement (PBASCM)
In this section, we describe the proposed PBASCM method of BS for completely and precisely detecting slow-moving or temporarily stopped vehicles in complex urban traffic scenes. In the PBASCM method, a confidence measurement is introduced for each pixel in the background model to quantify the current trust value of each background pixel. The traffic flow states in the traffic scene are evaluated to determine whether the background model should be updated. This updating scheme effectively prevents contamination of the background model in complex urban traffic scenes. PBASCM uses an array of N recently observed image values b′ M (x, y), M ∈ [1, N], for each pixel to describe the background model B′ (x, y), which can be initialized using recently observed image values in a specified time interval at location (x, y), as follows:
$$B'(x, y) = \{ b'_1(x, y), \ldots, b'_M(x, y), \ldots, b'_N(x, y) \}$$
After the initialization of the background model, an updating mechanism based on a confidence period is established at the pixel level to prevent the background model from being polluted by complex conditions such as slow-moving vehicles, temporarily stopped vehicles and congestion. The confidence period c (x, y), which is referred to as the confidence measurement, is set for the background model at location (x, y). The higher the value of the confidence measurement c (x, y), the less need there is to update the corresponding pixel’s background model. In addition, the stability and reliability of the corresponding location is determined by the parameter h (x, y), which is the number of times that the pixel’s state changes from the background to the foreground or from the foreground to the background. A low value for h (x, y) indicates that the background model is stable and reliable, whereas a high value for h (x, y) indicates that the background model needs to be updated to obtain a more stable background model. We use a similar scheme to that employed in SCM to evaluate traffic flow states. The detection ratio d (x, y)/f (x, y) ∈ [0, 1] is used to partition states of the traffic scene into “very light”, “light”, “moderate”, “heavy” and “very heavy”, where d (x, y) is the number of frames in which the pixel was classified as foreground and f (x, y) is the number of frames elapsed in the current confidence period. This partitioning method is effective in discriminating between qualitatively different traffic conditions with rather fuzzy boundaries. In [24–27], the partition of complex traffic conditions is defined as follows:
where p (x, y) is the condition of complex urban traffic scenes. At the end of each confidence period, the value of c (x, y) must be updated according to current urban traffic conditions and the stability at location (x, y). If h (x, y)/f (x, y) < τ d (τ d is a threshold and τ d = 0.3), which implies that the background model is reliable and that the current background model should be retained, then the updating of c (x, y) is defined as follows:
Otherwise, if h (x, y)/f (x, y) ≥ τ d , which indicates that the background model is unstable and the current background model should be updated to adapt to the dynamic scene, then the updating of c (x, y) is defined as follows:
After the incoming pixel has been classified, the background model needs to be updated according to changes in the background, such as lighting changes, shadows and moving objects, including trees and slow-moving or temporarily stopped vehicles. When traffic conditions are considered to be suitable, it is necessary to update the background model in a selective manner to handle any background changes accurately. The traffic conditions for a pixel location are considered to be suitable when the confidence measurement decreases to a minimum and the pixel value, which may be in the foreground of the corresponding location, is then used to update the background model; otherwise, no update is performed. When f (x, y) < c (x, y), the pixel in the current frame is within the confidence period. Updating only occurs when the refresh period expires (i.e., every P frames, where P = 10 in this study), F t (x, y) = 0, and the traffic condition is p (x, y) = 0. When f (x, y) = c (x, y), the pixel in the current frame is at the end of the confidence period. In this case, if h (x, y)/f (x, y) < τ d , F t (x, y) = 0 and p (x, y) is equal to 0, 1 or 2, then the background model can be updated. When the confidence period expires, h (x, y) holds the number of state changes during the last c (x, y) frames. If h (x, y)/f (x, y) < τ d and p (x, y) = 0, which means that the state of the pixel at this location is highly reliable at this moment, then it makes sense to take advantage of the current scene to update the background model. If h (x, y)/f (x, y) < τ d and p (x, y) is equal to 1 or 2, then the background can be updated with a low risk of “contaminating” the background model. By contrast, if h (x, y)/f (x, y) ≥ τ d , then regardless of whether F t (x, y) = 0 or F t (x, y) = 1, the state of the pixel is unstable at this location and traffic conditions may not be evaluated reliably, so the background model should be updated only when p (x, y) = 0. If p (x, y) is above 2, then regardless of h (x, y)/f (x, y), the background model should not be updated due to the high risk of “contamination.” However, in the case where the confidence period decreases to a minimum, background updating is forced, which is a necessary mechanism that aims to prevent the model from becoming locked into an obsolete background model.
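The update conditions described above can be summarized as a single per-pixel predicate. The following is a minimal sketch under stated assumptions rather than the authors' code: the function name and argument names are hypothetical, the refresh instant is assumed to be every frame count that is a multiple of P, and `c_min` is a hypothetical placeholder for the minimum confidence value at which updating is forced.

```python
def should_update(f, c, h, p, is_foreground, tau_d=0.3, refresh=10, c_min=None):
    """Decide whether the background model at one pixel may be updated.

    f: frames elapsed in the current confidence period, f(x, y)
    c: length of the confidence period, c(x, y)
    h: number of background/foreground state changes, h(x, y)
    p: traffic state index p(x, y), 0 = very light ... 4 = very heavy
    is_foreground: True when F_t(x, y) = 1
    c_min: hypothetical minimum of c(x, y) that forces an update
    """
    if f < c:
        # Within the confidence period: update only at refresh instants,
        # for background pixels, and under very light traffic.
        return f > 0 and f % refresh == 0 and not is_foreground and p == 0

    # End of the confidence period (f == c).
    if c_min is not None and c <= c_min:
        return True  # forced update, avoids locking into an obsolete model
    if p > 2:
        return False  # heavy traffic: high risk of contamination
    stable = f > 0 and (h / f) < tau_d
    if stable:
        # Reliable pixel state: update with background pixels for p in {0, 1, 2}.
        return not is_foreground
    # Unstable pixel state: update only under very light traffic.
    return p == 0
```

Keeping the rule in one function makes the precedence explicit: the forced update is checked before the traffic state, so a pixel whose confidence has reached the minimum is refreshed even in heavy traffic.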
If the updating action of the background model is taken at time t, then the pixel value I t (x, y) in the current frame is used to update B′ (x, y). A background sample value b′ M (x, y) (M ∈ 1, …, N) is selected uniformly at random and replaced by the current pixel value I t (x, y). This allows the current pixel value to be blended into B′ (x, y). At the same time, we also update the randomly selected neighboring pixel (x′, y′) ∈ N (x, y) where the neighboring update strategy is similar to the PBAS, i.e., the corresponding value b′ M (x′, y′) in the background model B′ (x′, y′) is replaced by its current pixel value I t (x′, y′).
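The sample replacement step can be sketched as follows. This is an illustrative sketch only: the function name and array layout are assumptions, and the neighborhood N (x, y) is assumed here to be the 8-connected neighborhood, clamped at the image border.

```python
import numpy as np


def update_background(background, frame, x, y, rng):
    """Blend the current pixel into B'(x, y) and into one random neighbour.

    background: (N, H, W) array of samples b'_M, modified in place
    frame:      (H, W) current image I_t
    rng:        numpy.random.Generator used for the random choices
    Returns the coordinates (x', y') of the neighbour that was updated.
    """
    n, height, width = background.shape
    # Replace one sample, chosen uniformly at random, by the current value.
    background[rng.integers(n), y, x] = frame[y, x]

    # Assumed neighbourhood: the 8 pixels surrounding (x, y).
    offsets = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
    dx, dy = offsets[rng.integers(len(offsets))]
    nx = min(max(x + dx, 0), width - 1)   # clamp to the image border
    ny = min(max(y + dy, 0), height - 1)
    # The neighbour's model receives the neighbour's own current value.
    background[rng.integers(n), ny, nx] = frame[ny, nx]
    return nx, ny
```

For example, calling `update_background(background, frame, x, y, np.random.default_rng())` for each pixel that passes the update test replaces exactly one sample at (x, y) and one sample at the chosen neighbor.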
Experiments and results
The local traffic police detachment of Jining city in Shandong Province provided urban traffic scene videos, which were recorded using traffic surveillance cameras over a one-week period. The videos were captured using charge-coupled device cameras (720P) installed at three different intersections, between 7:00 am and 10:00 pm. These scenes included a wide range of weather and illumination conditions. The ground truth was constructed manually for the real-world urban traffic scene videos to enable an objective quantitative evaluation of the segmentation results. In this study, we used three typical situations in the urban traffic videos, which we designated as “Traffic Light Sequence in Daytime (TLSD)”, “Strong Shadows on a sunny day with Waving Trees (SSWT)” and “Traffic Light Sequence at Night (TLSN)”, where the scenes included a number of slow-moving and temporarily stopped vehicles. These scenes were used for testing in comparative experiments, i.e., qualitative analyses and quantitative analyses. The proposed PBASCM approach was tested rigorously using this urban traffic video dataset and compared with six state-of-the-art, real-time BS methods: GMM [3], CB [17], ALW [18], SDC [26] (the sigma-delta method with confidence measurement of Vargas et al.), ViBe [19] and PBAS [23]. The PBASCM method does not implement a shadow removal mechanism, so the pixels in shadow or reflection areas were considered as belonging to the foreground in these experiments. Moreover, the parameters of each method were set to the optimum values according to the authors’ recommendations [3, 26]. All of the methods tested in this study were implemented using Visual C++ on a PC with a 2.66 GHz Pentium(R) Dual-Core CPU E5300 and 2 GB RAM.
Qualitative evaluations
The qualitative evaluation to compare different methods as a subjective measurement of efficiency was based on a visual assessment of binary object detection masks for all of the test video sequences in Figs. 1–3. Each figure includes five representative frames for each sequence. Note that the object masks used for all methods were pixel-based and morphological post-processing was not applied.
The first sequence from the urban traffic scene videos (TLSD) is shown in Fig. 1, where the detection masks for CB, GMM, ALW, SDC, ViBe, PBAS and the proposed PBASCM method are in the third row to the ninth row, respectively. Parts of the original representative frames are shown in the first row and the corresponding ground truth in the second row. The first column in Fig. 1 shows frame 657, in which a vehicle is passing the intersection and a pedestrian is waiting for the green light. With the exception of SDC, all of the methods successfully achieved detection in this frame. However, with SDC, the moving vehicles present in the image at the beginning of this sequence were not “forgotten” and they produced ghost vehicles. The second column represents the beginning of the red light in frame 938, where some of the vehicles moving in one direction have stopped while the others are slowing down to wait for the green light in the same direction, whereas in the other direction, the pedestrians and vehicles are passing the intersection at different speeds. The third column shows the middle of the red light sequence in frame 1236, where some vehicles have remained stationary for nearly 12 seconds in front of the red light. The results show that these waiting vehicles disappeared completely or partly using GMM, ALW, ViBe and PBAS, whereas PBASCM completely and precisely acquired foreground objects without ghost vehicles. The fourth column shows the end of the red light sequence in frame 1729, where the vehicles that had stopped at the light have started to move again on green. The final column shows part of the next red light cycle in frame 2910. For the representative frames of TLSD in Fig. 1, PBASCM completely and precisely detected slow-moving or temporarily stopped vehicles.
Another sequence featuring strong shadows on a sunny day with waving trees (SSWT) is shown in Fig. 2. In frame 340, a white vehicle has stopped at the beginning of the red light, while a vehicle queue forms gradually in frames 380, 850 and 1230. The vehicles at the front of the queue were barely detected by CB, GMM, ALW, ViBe and PBAS, whereas PBASCM and SDC acquired all of the vehicles waiting in the queue. However, SDC produced ghost vehicles and also detected waving trees as the foreground. Frame 2910 shows a new queue that formed in the next red light cycle. Because of the slow-moving or temporarily stopped vehicles, the GMM, ALW, ViBe and PBAS methods failed to generate effective detection masks and they misjudged the vehicle queue as the background. CB and SDC could not handle the waving trees effectively as the background. In contrast to these methods, the proposed PBASCM method achieved superior detection results.
Finally, we evaluated nighttime conditions using the traffic light sequence at night (TLSN). The main problem in this situation is that vehicle headlights on the road surface and traffic lights above crossroads produce strong glare, in addition to the more obvious image noise and interference from street lamps. Five frames of TLSN with a growing traffic level were chosen to highlight the differences among the compared methods. In particular, frames 980, 1230, 1290, 2000, and 2400 are displayed in Fig. 3. At frame 980, a vehicle is stopping to wait for the green light, and this vehicle has remained in the lane for around 10 seconds in frame 1230. GMM, ALW, ViBe and PBAS ignored this waiting vehicle partly or completely. All of the methods except GMM detected the moving vehicle in frame 1290. Parts of the next red light cycle are shown in frames 2000 and 2400. Clearly, the PBASCM method obtained the most satisfactory results for moving vehicles as well as slow-moving or temporarily stopped vehicles. This is because the PBASCM method updates the background model based on whether the current pixel point is in a suitable traffic flow state.
Quantitative evaluation
In addition to the qualitative assessment above, a quantitative evaluation of the different methods is also important. The metrics of Recall, Precision, F-measure and Similarity [31] are used to evaluate the performance of the background subtraction methods at a pixel level, where higher values for these metrics indicate better performance. The Recall metric gives the percentage of true positives detected compared with the total number of positive pixels in the ground truth, as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where TP and FN are the numbers of true positive and false negative pixels, respectively.
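For reference, the four metrics can be computed from a binary detection mask and its ground truth as in the sketch below. It assumes the standard pixel-level definitions (Precision = TP/(TP + FP), F-measure = 2 · Recall · Precision/(Recall + Precision), Similarity = TP/(TP + FP + FN)); the function name and the convention of returning 0 for an empty denominator are choices made for this illustration.

```python
import numpy as np


def segmentation_metrics(mask, truth):
    """Pixel-level Recall, Precision, F-measure and Similarity.

    mask, truth: arrays of the same shape, non-zero = foreground.
    A metric with a zero denominator is reported as 0.0 (an assumed convention).
    """
    mask = np.asarray(mask, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.count_nonzero(mask & truth)    # foreground detected correctly
    fp = np.count_nonzero(mask & ~truth)   # background marked as foreground
    fn = np.count_nonzero(~mask & truth)   # foreground that was missed

    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = recall + precision
    f_measure = 2 * recall * precision / denom if denom else 0.0
    similarity = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
        "similarity": similarity,
    }
```

Averaging the returned values over all frames that have a ground-truth reference gives per-sequence figures of the kind reported in the tables.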
The averages of these four metrics for TLSD, SSWT and TLSN using CB, GMM, ALW, SDC, ViBe, PBAS and PBASCM are shown in Tables 1–3, respectively. These metrics were calculated using all of the ground-truth references, i.e., frames 500 to 7500 in TLSD, frames 500 to 5000 in SSWT and frames 500 to 4500 in TLSN. The bold values in Tables 1–4 indicate the top performing method for each performance metric. As illustrated in Tables 1–3, the results for TLSD showed that the F-measure and similarity results using PBASCM were up to 0.912 and 0.839 higher, respectively, than the other methods. The results for SSWT showed that the F-measure and similarity results using PBASCM were up to 0.657 and 0.512 higher, respectively, than the other methods. The results for TLSN showed that the similarity results using PBASCM were up to 0.411 higher than the other methods. Thus, the average metrics in Tables 1–3 clearly demonstrate that the PBASCM method obtained better F-measure and similarity results compared with the other methods for each test video sequence. Clearly, PBASCM handled the slow-moving and temporarily stopped vehicles well.
Finally, we compared the processing speed of the different methods to assess their suitability for real-time applications in urban traffic scenes. Full-resolution images (1280 × 720: width × height) were sub-sampled to a resolution of 288 × 144 before processing. Table 4 compares the results obtained by each method in terms of the average processing speed, which is expressed as the number of frames per second (fps). Table 4 shows that ALW processed the highest number of frames per second. Furthermore, PBASCM, SDC, ViBe, GMM and CB had fairly good processing speeds. However, PBAS is not suitable for real-time applications due to its low processing speed.
Conclusions
The BS model is easily contaminated by slow-moving or temporarily stopped vehicles, so we tested various vehicle detection methods using complex urban traffic scenes to resolve this problem in an efficient manner. First, we proposed the PBASCM method, where each pixel in the background model is assigned a confidence measurement according to the current traffic state. The foreground decision depends on an adaptive threshold and the background model is updated based on the current traffic state and whether the current pixel point is within the confidence period. Second, using three real-world urban traffic videos, we conducted experiments using PBASCM, as well as CB, GMM, ALW, SDC, ViBe and PBAS for comparison. Our experimental results showed that PBASCM outperformed the other methods when dealing with slow-moving or temporarily stopped vehicles. Third, we used the F-measure and similarity metrics to evaluate the performance of PBASCM and the other methods. The F-measure and similarity results using PBASCM were up to 0.912 and 0.839 higher, respectively, than the other methods for the traffic light sequence during the daytime. The F-measure and similarity results using PBASCM were up to 0.572 and 0.411 higher, respectively, than the other methods for the traffic light sequence during the nighttime. The F-measure and similarity results using PBASCM were up to 0.657 and 0.512 higher, respectively, for the sequence with strong shadows on a sunny day with waving trees. Thus, our theoretical analysis and experimental results demonstrate that the proposed PBASCM method is suitable for handling slow-moving and temporarily stopped vehicles. The results of the qualitative and quantitative analyses demonstrated that our proposed method yielded the most complete and precise detection results compared with the traditional state-of-the-art methods in an urban traffic environment, and its processing speed makes it suitable for real-time applications in urban traffic scenes.
The precise definitions of traffic conditions and the updating parameter for the confidence measurement are difficult to determine. In fact, many experiments are required to select appropriate values for these parameters. Therefore, in our future research, we will focus on eliminating the influence of these parameters on the proposed PBASCM method. Furthermore, illumination changes and shadows in the foreground require further consideration.
