Abstract
Occlusion handling is a challenging problem in object tracking, and most existing methods fail to handle it well in complex image sequences. This paper presents a scene adaptive tracking algorithm for handling occlusion. We decompose tracking into target translation prediction and scale prediction. A kernelized correlation filter with an adaptive update scheme is adopted to estimate the target position. The adaptive online update scheme exploits the sensitivity of the confidence score to occlusion and reduces false updates during occlusion. The target scale is estimated by a correlation filter trained with ridge regression. Extensive experimental results on 29 challenging occlusion sequences show that the proposed tracking approach achieves an average overlap precision (OP) of 72.2%, an improvement of 7.6% over DSST. On the OTB-50 dataset, our tracking approach also outperforms several state-of-the-art trackers.
Introduction
Object tracking is one of the fundamental problems in computer vision, with a wide range of applications such as surveillance, security and motion analysis. Although many visual tracking algorithms have been proposed, visual tracking remains a challenging task because of factors such as occlusion, scale variation, deformation, background clutter and out-of-plane rotation.
Current tracking algorithms can generally be categorized as either generative or discriminative methods. Generative methods [1–3] learn an appearance model and treat tracking as finding the regions most similar to that model. Ross et al. [1] proposed an incremental visual tracking (IVT) method that models the target appearance as a low-dimensional PCA subspace, where the subspace is updated adaptively with the historical and sequential appearance variations. Kwon et al. [2] proposed the visual tracking decomposition (VTD) method, in which the observation models are built by sparse principal component analysis (SPCA) of a set of feature templates. Comaniciu et al. [3] proposed the mean-shift tracker, which tracks the object by computing the Bhattacharyya similarity between the histogram of a candidate object and the histogram of the object model. The mean-shift tracker is well known for its robustness against partial occlusion and scale variations using the color histogram. However, these generative methods model only the appearance of the object and neglect the background, which carries important information for distinguishing the target from its surroundings.
Unlike generative trackers, discriminative approaches formulate tracking as a classification problem that distinguishes the tracked target from the background [4–6]. Zhang [4] proposed a real-time compressive tracker that formulates the task as binary classification in the compressed domain. Departing from the tracking-by-detection framework, Kalal et al. [5] proposed the TLD tracker, which decomposes the tracking task into three sub-tasks: tracking, learning and detection. The TLD tracker builds a long-term tracking system based on P-N learning while running in real time. Babenko [6] proposed the MIL tracker, which requires little parameter tuning and uses multiple instance learning instead of traditional supervised learning. Recently, trackers based on correlation filters have proven to be highly efficient and to achieve robust tracking performance. Bolme et al. [7] proposed learning a minimum output sum of squared error (MOSSE) filter for visual tracking. By using the fast Fourier transform (FFT), the MOSSE tracker is highly efficient and runs at hundreds of frames per second. Several extensions have been proposed that considerably improve tracking accuracy, including kernelized correlation filters (KCF) [8], the color name tracker (CN) [9] and the discriminative scale space tracker (DSST) [10, 11].
Although existing correlation filter trackers are highly efficient, they cannot handle occlusion well. In this paper, we aim to build a correlation filter tracker that can handle partial occlusion and run in real time. Our key idea is to employ the fast discriminative KCF to estimate the target location, and to construct a new adaptive update scheme that takes occlusion into consideration. In addition, we decompose the tracking task into translation estimation and scale estimation, which improves performance considerably. The experiments demonstrate the effectiveness of our tracking approach.
The paper is organized as follows. We describe related work in Section 2. Our algorithm framework and the adaptive update scheme are discussed in Section 3. We give the experimental results of our method in Section 4, and Section 5 concludes the paper.
Related work
In recent years, correlation filters have been applied to object tracking. The MOSSE tracker is a classic correlation filter algorithm that performs well under changes in rotation and lighting. Henriques et al. [8] proposed the KCF tracker, a correlation filter tracker that exploits the circulant structure of tracking-by-detection together with kernel methods and multi-channel features. The DSST tracker [10, 11] learns multi-scale correlation filters on HOG features to track the object. However, these trackers do not address adaptive model updating under occlusion, so they tend to drift when the target is occluded.
To handle occlusion, many trackers divide the target into separate parts to obtain a stable target position. Kwon et al. [12] presented a local patch-based appearance model and provided an efficient algorithm to evolve the topology between local patches by online update. Zhang [13] presented a tracker that handles partial occlusion by robust part matching among multiple frames. Zhao [14] proposed an adaptive template update scheme that uses local sparse representation to detect occlusions during the tracking sequence. Yu [15] proposed a robust PCA algorithm that selects a subset of image pixels to compute coefficients, which avoids false updates under occlusion and noise. Liu [16] proposed a part-based visual tracker based on adaptive correlation filters. However, the computational complexity of these methods is high, which makes them difficult to run in real time. Instead of using separate parts, we use a global response value to indicate occlusion, which allows the tracker to run in real time.
Scene adaptive tracking
The target translation prediction
As the basis of our tracker, a KCF is trained as the translation filter, which predicts the target location. We use HOG and intensity features to learn the KCF. The KCF tracker is trained using an image patch x of size M × N, and considers all cyclic shifts x_{m,n}, (m, n) ∈ {0, ⋯ , M - 1} × {0, ⋯ , N - 1} as the training examples. The goal of KCF is to train a linear model f(z) = 〈w, z〉, which indicates the probability of image patch z being the tracked target, by minimizing the squared error over samples
Here F() denotes the FFT and α is a matrix consisting of the coefficients α_{m,n}. The kernel correlation operation k^{x,x′} is defined as
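For reference, the standard KCF formulation of Henriques et al. [8] gives the objective and closed-form solution below. The original equations are not reproduced in this text, so the notation is an assumption chosen to agree with the surrounding definitions: φ is the kernel feature map, y(m, n) is the Gaussian regression target for shift (m, n), and λ0 is the regularization weight.

```latex
\begin{align}
w &= \arg\min_{w} \sum_{m,n} \left| \langle \phi(x_{m,n}), w \rangle - y(m,n) \right|^{2}
     + \lambda_0 \lVert w \rVert^{2}, \\
w &= \sum_{m,n} \alpha_{m,n}\, \phi(x_{m,n}), \qquad
\alpha = \mathcal{F}^{-1}\!\left( \frac{\mathcal{F}(y)}{\mathcal{F}(k^{x,x}) + \lambda_0} \right).
\end{align}
```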
Here the bar
The new position of the target is therefore found by searching for the location of the maximal confidence score in y.
We update the KCF by simple linear interpolation with an adaptive learning rate η. The update scheme is defined as
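The training and detection steps described above can be sketched in a few lines of NumPy. This is a minimal single-channel illustration under stated assumptions, not the authors' implementation: the function names, the label bandwidth `s`, and the division of the squared distance by the patch size (a convention from common KCF code) are all assumptions, and a random array stands in for the HOG and intensity features.

```python
import numpy as np

def gaussian_labels(shape, s=2.0):
    """Gaussian regression target y whose peak sits at cyclic shift (0, 0)."""
    # fftfreq(n) * n yields the signed cyclic offsets 0, 1, ..., -2, -1.
    rows = np.fft.fftfreq(shape[0]) * shape[0]
    cols = np.fft.fftfreq(shape[1]) * shape[1]
    return np.exp(-0.5 * (rows[:, None] ** 2 + cols[None, :] ** 2) / s ** 2)

def gaussian_correlation(x, z, sigma):
    """Gaussian kernel between z and every cyclic shift of x (single channel)."""
    cross = np.real(np.fft.ifft2(np.conj(np.fft.fft2(x)) * np.fft.fft2(z)))
    # Squared distance per shift, divided by the patch size (assumed
    # normalisation, borrowed from common KCF implementations).
    dist = (np.sum(x ** 2) + np.sum(z ** 2) - 2.0 * cross) / x.size
    return np.exp(-np.maximum(dist, 0.0) / sigma ** 2)

def train(x, y, sigma=0.5, lam=1e-2):
    """Return the ridge-regression coefficients alpha in the Fourier domain."""
    k = gaussian_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def detect(alpha_f, x_model, z, sigma=0.5):
    """Return the confidence map for patch z and the location of its maximum."""
    k = gaussian_correlation(x_model, z, sigma)
    response = np.real(np.fft.ifft2(alpha_f * np.fft.fft2(k)))
    peak = np.unravel_index(np.argmax(response), response.shape)
    return response, (int(peak[0]), int(peak[1]))

rng = np.random.default_rng(0)
x = rng.uniform(-0.5, 0.5, size=(32, 32))   # stand-in for a feature patch
alpha_f = train(x, gaussian_labels(x.shape))
z = np.roll(x, shift=(3, 5), axis=(0, 1))   # same patch, cyclically shifted
response, peak = detect(alpha_f, x, z)
print(peak)                                 # prints (3, 5)
```

The location of the maximum recovers the cyclic shift applied to the patch, which is how the translation filter converts the confidence map into a new target position.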
Occlusion is one of the main challenges in visual tracking and often causes the tracker to fail to relocate the object. In general, the maximal confidence score drops when the target suddenly becomes partially occluded. Therefore, the maximal confidence score reflects changes in occlusion. Let y_max denote the maximal confidence score of the confidence map on an image patch. Figure 1 shows how the maximal confidence score changes when the target is occluded in the Faceocc2 sequence.

The confidence maps for the Faceocc2 image sequence. Images (b), (d) and (f) are the confidence maps of the sequence images (a), (c) and (e), respectively. Here y_max is the maximal confidence score.
However, the maximal confidence score cannot be used directly as the learning rate, since it only reflects the absolute degree of occlusion. We therefore use the ratio of the maximal confidence scores of adjacent frames to represent the relative degree of occlusion. The learning rate η0 is defined as
Here the part
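The idea of driving the model update by the ratio of maximal confidence scores between adjacent frames can be sketched as follows. The exact definition of the learning rate is not reproduced in this text, so the rule below (a base rate multiplied by the clipped score ratio) is a hypothetical stand-in for illustration only; `base_rate`, `adaptive_learning_rate` and `interpolate` are assumed names and values, not the authors' formula.

```python
import numpy as np

def adaptive_learning_rate(y_max, y_max_prev, base_rate=0.02, eps=1e-12):
    """Hypothetical rule: shrink the base rate when the peak score drops."""
    ratio = y_max / max(y_max_prev, eps)    # relative change between frames
    return base_rate * float(np.clip(ratio, 0.0, 1.0))

def interpolate(model, new, eta):
    """Linear interpolation update: (1 - eta) * old model + eta * new estimate."""
    return (1.0 - eta) * model + eta * new

eta_clear = adaptive_learning_rate(0.80, 0.80)      # peak score unchanged
eta_occluded = adaptive_learning_rate(0.20, 0.80)   # peak score collapsed
model = interpolate(np.zeros(4), np.ones(4), eta_occluded)
```

With this assumed rule, a sharp drop in the peak score yields a small learning rate, so an occluded frame contributes little to the filter, while an unchanged score leaves the base rate intact.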
A 1-dimensional scale correlation filter is used to estimate the target scale. We use HOG features to learn the scale filter. Let H × W be the target size and N be the number of scales
where t is the frame index and η1 is a learning rate. For each s_i ∈ S, the confidence score can be calculated as
We present an outline of our method in Algorithm 1.
Proposed tracking algorithm
Parameter setup
We name our proposed tracker “SAT” (Scene Adaptive Tracking). In the translation filter, the standard deviation of the Gaussian kernel is set to 0.5. In the scale filter, the standard deviation of the desired correlation output is set to 0.25 of the target size. The regularization parameters in SAT are set to λ0 = 10^{-2} in (1) and λ1 = 10^{-4} in (8). The search window for translation estimation is set to 1.4 times the target size. The scale learning rate η1 is set to 0.03 in formulas (10) and (11). The number of scales is |S| = 33, with a self-adaptive scale factor z. Given a target of size, the self-adaptive scale factor can be set to
This strategy adjusts the scale factor adaptively.
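To make the scale search concrete, the sketch below enumerates candidate target sizes for |S| = 33 scales around the current size, following the scale pyramid convention of DSST [10, 11]. The fixed scale factor 1.02 is the DSST default and only stands in for the self-adaptive factor, whose formula is not reproduced in this text; the function names and the example target size are likewise assumptions.

```python
import numpy as np

def scale_candidates(height, width, num_scales=33, scale_factor=1.02):
    """Return the scale factors a^n and the corresponding candidate sizes."""
    # Exponents are centred on zero: -16, ..., 0, ..., 16 for 33 scales.
    exponents = np.arange(num_scales) - (num_scales - 1) / 2
    factors = scale_factor ** exponents
    sizes = np.stack([factors * height, factors * width], axis=1)
    return factors, sizes

def best_scale(factors, scores):
    """Pick the scale factor whose confidence score is largest."""
    return float(factors[int(np.argmax(scores))])

factors, sizes = scale_candidates(height=80, width=40)
# Toy confidence scores that favour the candidate two steps above the centre.
scores = -np.abs(np.arange(len(factors)) - 18)
chosen = best_scale(factors, scores)
```

In a tracker, the scores would come from the scale correlation filter evaluated on features extracted at each candidate size; here they are toy values used only to exercise the selection step.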
We evaluate the proposed algorithm on a large benchmark dataset [19] that contains 50 videos. From this benchmark, we adopt the 29 sequences annotated with “occlusion” as the occlusion dataset. We compare our algorithm with 9 state-of-the-art trackers: DSST [10, 11], KCF [8], Struck [20], VTD [2], TLD [5], IVT [1], CT [4], MIL [6] and CXT [21]. For the occlusion dataset, we report the overlap precision at a threshold of 0.5, which corresponds to the PASCAL evaluation criterion. In addition, we provide two kinds of plots, the Precision Plot and the Success Plot, to evaluate all trackers, where trackers are ranked using the area under the curve (AUC). The Precision Plot shows the ratio of frames whose center location error (CLE) is below a given threshold. The Success Plot [10] is based on the overlap precision (OP), i.e., the percentage of frames in which the bounding box overlap exceeds a threshold t, plotted over all thresholds t ∈ [0, 1]. All trackers in this paper are implemented in Matlab 2013 on an Intel i5-3210 2.50 GHz CPU with 4 GB RAM.
The occlusion dataset includes 29 sequences, which also exhibit other challenges such as illumination variation, deformation and background clutter. Table 1 shows the per-video OP at a threshold of 0.5 for SAT and the 9 state-of-the-art trackers. Among the compared trackers, our SAT algorithm performs well with an average OP of 72.2%, which outperforms the DSST algorithm by 7.6%. In the sequences david3, faceocc1, jogging-1, tiger1 and tiger2, the main challenge is occlusion. The SAT algorithm handles the occlusion changes well on those sequences.
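The evaluation measures described above can be sketched as follows. This is a minimal illustration, not the benchmark toolkit: boxes are assumed to be (x, y, width, height) tuples, the 20-pixel default for the precision threshold is the common OTB convention rather than a value stated here, and the function names and toy boxes are assumptions.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    inter = max(iw, 0.0) * max(ih, 0.0)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def overlap_precision(pred, gt, threshold=0.5):
    """Fraction of frames whose bounding box overlap exceeds the threshold."""
    overlaps = np.array([iou(p, g) for p, g in zip(pred, gt)])
    return float(np.mean(overlaps > threshold))

def distance_precision(pred, gt, threshold=20.0):
    """Fraction of frames whose center location error is within the threshold."""
    pc = np.array([(p[0] + p[2] / 2, p[1] + p[3] / 2) for p in pred])
    gc = np.array([(g[0] + g[2] / 2, g[1] + g[3] / 2) for g in gt])
    cle = np.linalg.norm(pc - gc, axis=1)
    return float(np.mean(cle <= threshold))

def success_auc(pred, gt, num_thresholds=21):
    """Area under the success plot, approximated by the mean OP over t in [0, 1]."""
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    return float(np.mean([overlap_precision(pred, gt, t) for t in thresholds]))

# Toy two-frame example: a perfect box, then a box offset by half its width.
gt = [(0, 0, 10, 10), (5, 0, 10, 10)]
pred = [(0, 0, 10, 10), (0, 0, 10, 10)]
op = overlap_precision(pred, gt)   # overlaps are 1.0 and 1/3
auc = success_auc(pred, gt)
```

In the toy example, only the first frame passes the 0.5 overlap threshold, so the OP is 0.5, while both frames fall within the 20-pixel center error threshold.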
Per-video overlap precision (OP) (%) on the occlusion dataset. The red fonts indicate the best performance, the blue fonts indicate the second best ones, and the green fonts indicate the third best ones
The resulting precision and success plots of the one-pass evaluation (OPE) [18] are shown in Fig. 2. In the precision plot, the precision score of the SAT algorithm is 0.7, which outperforms the KCF algorithm by 2.5% and the DSST algorithm by 5.3%. In the success plot, the proposed SAT algorithm achieves a score of 0.565, which outperforms the DSST algorithm by 3.3% and the KCF algorithm by 5.1%. This indicates that the scene adaptive update strategy of the filter improves accuracy under occlusion. Our algorithm provides promising results compared with several trackers in the literature in both the success plot and the precision plot on the occlusion dataset.

Precision and success plots on the occlusion dataset. The legend contains the AUC score for each tracker. The proposed SAT tracker performs favorably against the state-of-the-art trackers on the occlusion dataset.
To further evaluate the robustness of SAT, we set up a comparison on the benchmark dataset (OTB-50) [19], which covers challenging attributes such as occlusion, out-of-plane rotation, deformation and background clutter. The resulting precision and success plots of OPE are shown in Fig. 3, which shows that our tracker outperforms the state-of-the-art trackers on the OTB-50 dataset. In the precision plot, the precision score of the SAT algorithm is 0.705, which outperforms the DSST algorithm by 3.0% and the KCF algorithm by 3.1%. In the success plot, the success score of the SAT algorithm is 0.575, which also outperforms the DSST algorithm.

Precision and success plots on OTB-50 benchmark dataset.
In addition, we report results for the deformation attribute in Fig. 4. On the deformation sequences, the KCF method performs well with a precision score of 0.671 and a success score of 0.534, while the SAT algorithm achieves 0.749 and 0.605, outperforming the KCF algorithm. Therefore, the SAT algorithm is also robust to deformation.

Precision and success plots on the deformation sequences of the benchmark dataset.
Figure 5 visualizes the tracking results of our method and the trackers DSST, KCF, Struck and CT on the challenging sequences carscale, david3, freeman4, jogging-1 and tiger2. The results show that our tracker adapts better to occlusion of the target while maintaining high precision. On the OTB-50 dataset, our SAT algorithm runs at 39.4 frames per second, which indicates that the algorithm can run in real time in most cases.

A visualized comparison of our tracker with four state-of-the-art trackers. The frames are from carscale, david3, freeman4, jogging-1 and tiger2, from top to bottom, respectively.
In this paper, we propose a scene adaptive tracking algorithm based on correlation filters in a tracking-by-detection framework. The tracking task is decomposed into target translation prediction and scale prediction. A kernelized correlation filter based on multidimensional features is adopted to estimate the target position. We present an adaptive online update scheme for the kernelized correlation filter. The target scale is estimated by a correlation filter trained with ridge regression. Experimental results show that the SAT algorithm performs favorably against several state-of-the-art trackers on both the occlusion dataset and the OTB-50 dataset while running in real time. Moreover, the SAT algorithm is also robust to deformation.
Acknowledgments
This work was supported by the Shandong Natural Science Foundation (ZR2013FL018) and the National Natural Science Foundation of China (61773244, 61772319, 61472227).
