Abstract
Visual tracking is a challenging task in computer vision. In this paper, we present a general-purpose framework for robust tracking. We propose to couple one-shot learning with online discriminative learning to address the fundamental stability-plasticity dilemma in tracking. A one-shot learner trained offline on large-scale datasets serves as a stable detector that does not suffer from model drift, while an online discriminative learner serves as the tracker and adapts to significant appearance changes. Based on this framework, we design a baseline tracking model to verify its effectiveness. In practice, a deep Siamese network trained offline acts as the one-shot learner, which can re-detect the target in case of tracking drift or failure. A correlation classifier that incorporates a translation model and a scale model acts as the online learner. By coupling offline and online learning, the simple baseline tracker achieves a good balance between stability and adaptivity without time-consuming optimization. Experimental results on a large-scale benchmark dataset demonstrate the effectiveness of the proposed framework, within which the designed baseline tracker outperforms many state-of-the-art methods in both precision and robustness.
Introduction
Visual object tracking is a challenging research topic in computer vision with a variety of practical applications, including human-computer interaction and intelligent surveillance. Although great progress has been made in recent years, visual tracking remains very challenging because of drastic appearance changes caused by illumination variation, deformation, heavy occlusion and so on. Numerous methods have been proposed to address these issues, but a perfect tracker that can deal with all of these challenging factors does not exist [1].
To enhance the robustness of tracking, most state-of-the-art tracking research falls into two categories. The first constructs new appearance models that aim to increase the discrimination of the target against background noise. Many methods fall into this category, such as subspace learning [2], compressive sensing models [3], color features [4] and sparse coding [5]. Recently, representation learning methods have shown great advantages over hand-crafted features [6]. In the tracking community, trackers based on features extracted from auto-encoder networks [7] and convolutional neural networks [8] have been proposed. Though effective in some respects, powerful feature representations still often fail in situations such as heavy occlusion and out-of-view.
Another important line of research introduces state-of-the-art machine learning methods into the tracking community. These approaches are often referred to as tracking-by-detection or tracking-by-classification. Representative methods include multiple instance learning [9, 10], boosting methods [11], ensemble learning [12], support vector machines [13], random forests [14], correlation classifiers [15] and so on. Thanks to advances in machine learning, these trackers are more robust than traditional motion estimation methods [16]. However, since most of these trackers label samples by themselves during online running, they are liable to be fooled by false samples and thus suffer from the well-known drift problem.
In this paper, we treat the tracking problem from a new perspective. We start from the definition of tracking: given the bounding box defining the object of interest in the first video frame, the tracking task is to automatically output the bounding box of the target in the subsequent frames. From this definition, we derive two observations about the essentials of tracking. First, tracking is a one-shot learning problem, since only one labeled sample (the target patch in the first frame) is offered. All samples collected from later frames are unreliable because they are labeled by the tracker itself according to the overlap rate between the samples and the tracked regions. If a tracked region deviates from the ground truth, small errors accumulate. This is a positive feedback process that enlarges the errors until the tracker drifts and fails. To alleviate this problem, a stable one-shot learner that uses only the single reliable example in the first video frame is a potential solution for robust tracking. In fact, some intelligent agent is needed to keep the "stable information" during the tracking procedure, and a one-shot learner may be the best candidate for this role. It should be noted that one-shot learning is itself a challenging topic: how can a learner find and recognize the target object based on only one labeled sample? According to the No Free Lunch Theorem, offline training of the one-shot learner to assimilate auxiliary information is necessary.
The second observation is that, unlike the classic one-shot learning problem, tracking must recognize and locate the target object in successive images with temporal and spatial continuity. The one-shot learner may not be able to exploit the "dynamic information" that lies in this temporal and spatial continuity, and thus fails to locate the target when it undergoes drastic appearance changes in certain frames. To compensate for the one-shot learner trained offline, an online discriminative learner is needed to exploit more information during online tracking and adapt to significant appearance variations.
Based on the analysis above, the tracking problem can be effectively addressed by carefully coupling two types of intelligent agents. The first agent is based on one-shot learning and explores the stable information of the target in a video sequence. The second agent is based on online learning and explores the dynamic information that lies in consecutive frames. Note that this strategy fuses offline learning and online learning. These observations serve as general-purpose guidelines for designing a robust tracker.
Following the guidelines derived above, we have implemented a baseline tracker to evaluate its tracking performance. For the one-shot learner, we adopt a Siamese neural network trained offline on a large number of image pairs from video sequences. The fully connected layer of the Siamese network plays the role of a learned metric function instead of a manually defined one, which gives better robustness. Throughout the tracking process, the labeled example of the target in the first frame is stored in the network, so the stable information is kept and explored. For the online learner, we adopt an enhanced version of the correlation classifier with a translation model and a scale model. The online learner is trained with samples collected during online running to adapt to target appearance changes. When a tracking failure is detected, the one-shot learner re-detects the target, clears the accumulated error and resets the online learner. Through the coupling of these two learners, robust tracking can be achieved. Experimental results on a large-scale tracking benchmark demonstrate the competitiveness of the baseline tracker and thus support the proposed framework and guidelines for tracker design.
In summary, this paper makes the following contributions: (1) a novel general-purpose framework for visual object tracking is proposed, with useful guidelines derived from the fundamental definition of tracking; (2) a baseline tracker is designed and tested on challenging benchmark video sequences, achieving high precision and robustness.
Related work
Visual object tracking is one of the most challenging problems in computer vision. The key issue is how to make a tracker robust to the various kinds of noise in a video sequence and less sensitive to the tracking drift problem. To achieve robust tracking, several methods with cooperative mechanisms have been proposed.
In [17], the authors propose to combine the template of the target in the first frame with an updated template to reduce drift. The TLD model [14] adopts a revised version of the Lucas-Kanade method for motion estimation and a random fern detector trained online as compensation. This strategy is also adopted in [18, 19]. In [20], the authors propose to fuse online classifiers from different time spans to keep more historical information, which works well for alleviating drift. However, all of these methods rely only on information from online running and cannot counter the effect of false samples, which may contaminate the model itself.
To exploit extra information, some methods adopt pre-trained models for tracking. These methods can be roughly divided into two types. The first type adopts networks without online fine-tuning. In [21], the authors use a convolutional neural network (CNN) to learn a saliency map to locate the target. In [8, 22], hierarchical convolutional features are combined with correlation filters to construct a robust ensemble framework. In [23, 24], metric learning via CNNs is explored for tracking. All the networks in the above work are trained offline and used as either a feature extractor or a matching function without online optimization. These methods avoid time-consuming optimization during online tracking but may lose some adaptivity. The other type fine-tunes the network online to increase its adaptivity. In [7], the authors train an auto-encoder network offline on a large number of images and use samples collected online to fine-tune the network to adapt to appearance changes. In [25], a lightweight CNN is proposed that is trained entirely during online running. These methods are time-consuming and liable to be affected by false samples, which leads to inferior tracking performance.
Our method differs significantly from the work presented above in that we focus on exploring both online and offline information and coupling them in a compact manner. Our model incorporates general feature representations learned from large-scale image datasets as well as temporal consistency between video frames. We name our tracker OOT (short for Offline-Online Tracker or One-shot-Online Tracker).
Approach
General framework
We observe that, for robust tracking, the information to be explored should be decoupled into two parts: the stable information and the dynamic information. The stable information is extracted solely from the one labeled sample in the first frame. A learner that uses the stable information should never be updated online, so that it keeps a fixed false positive detection rate. This stable learner never suffers from error accumulation or model drift, but it may give a false positive estimate. The dynamic information lies in the temporal and spatial continuity between consecutive frames in a sequence. A learner that uses the dynamic information should be updated online to adapt to appearance changes that the tracker has never seen before. This dynamic learner is adaptive to brand-new candidates, but it may suffer from the drift problem, which leads to tracking failure.
By decoupling the information into two types explored by two independent learners, a robust tracker can be constructed by carefully coupling these two learners. The proposed framework is illustrated in Fig. 1. In practice, the guidelines for realizing a tracker based on the proposed framework are: (1) adopt one-shot learning as the stable learner, which can incorporate information from both the labeled sample of the target and extra training images; (2) design a robust online learning algorithm as the dynamic learner, which is updated properly to adapt to appearance changes.

Illustration of the proposed framework for robust object tracking. Two independent learners explore stable and dynamic information respectively to construct a robust tracking system.
We develop a baseline tracking model based on the proposed framework to verify its effectiveness. A correlation classifier with a translation model and a scale model is adopted as the online learner. During tracking, the classifier is trained online to adapt to appearance changes of the target. Since correlation classifiers realize dense sampling very efficiently and alleviate label ambiguity, the online learner is quite robust. To further reduce the risk of tracking drift, we design a conservative update strategy to train the online learner.
For the one-shot learner, we adopt the Siamese network [26], which has two branches. The network is trained offline with image pairs from a large-scale dataset. The fully connected layer is learned as a metric function that outputs the similarity between two images. The network weights are frozen throughout the tracking procedure, and the network is able to re-detect the target when the online learner drifts.
One-shot learner
We exploit the Siamese CNN architecture as the one-shot learner. The network has two branches: the exemplar branch and the instance branch. Each branch has an AlexNet-like structure [27] and extracts hierarchical features from image patches. The exemplar branch takes images of size 127 × 127, while the instance branch takes incoming images of size 255 × 255.
We adopt a curated dataset and training strategy similar to [24] to train our network. The dataset is drawn from the ImageNet Video dataset [28], which includes 800,000 objects from 2820 videos. The image pairs fed to the two-stream network are taken from two frames of a video that are N frames apart. Both the exemplar image and the corresponding search image are centered at the target and normalized to the input scale of the network. Pairs with the highest similarity are given the ground-truth label "+1", while those with the lowest similarity are given "-1".
After the labeled target patch is given in the first frame, it is fed to the exemplar branch and gets its embedding φ (
The one-shot learner keeps its weights frozen during tracking, which inherently avoids error accumulation and model drift. However, tracking failure may still occur because of similar objects in the background. It has been observed that the one-shot learner may wrongly switch its bounding box to an object whose appearance is similar to the target, especially when the true target undergoes drastic appearance variations. The reason is that the one-shot learner does not explore enough temporal information, which is precisely the key strength of the online learner.
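To make the re-detection role of the one-shot learner concrete, the following is a minimal sketch of a similarity search with a fixed first-frame embedding. It is not the network described above: the learned metric (the fully connected layer) is replaced by plain cosine similarity as a stand-in, and the function names and feature shapes are hypothetical choices made for illustration.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-12):
    """Stand-in for the learned metric: cosine similarity of two feature blocks."""
    a = a.ravel()
    b = b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def redetect(exemplar_emb, search_feat, metric=cosine_similarity):
    """Slide the fixed exemplar embedding over a search feature map.

    exemplar_emb: (C, h, w) embedding of the first-frame target (never updated).
    search_feat:  (C, H, W) features of the search region, with H >= h and W >= w.
    Returns the score map and the (row, col) of the best-matching window.
    """
    _, h, w = exemplar_emb.shape
    _, H, W = search_feat.shape
    scores = np.empty((H - h + 1, W - w + 1))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            window = search_feat[:, i:i + h, j:j + w]
            scores[i, j] = metric(exemplar_emb, window)
    best = np.unravel_index(np.argmax(scores), scores.shape)
    return scores, (int(best[0]), int(best[1]))
```

Because the exemplar embedding is computed once and never changed, errors made in later frames cannot leak into it. The same sketch also shows the weakness noted above: any window whose features resemble the exemplar scores highly, whether or not it is the true target.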
Online learner
The online learner models the temporal and spatial relationship of the target appearance in consecutive frames. We use the correlation classifier (also referred to as a correlation filter) as the online learner. Two important issues need to be treated carefully: the feature representation of the samples and the online training strategy. To enhance discrimination, we use the convolution layers of the Siamese network in the one-shot learner as the feature extractor. Training proceeds incrementally on reliable frames.
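As a rough illustration of how such an online learner operates, the sketch below implements a multi-channel correlation filter in the widely used ridge-regression form, with a running-average update of the numerator and denominator. This standard form is an assumption, since it is not necessarily identical to the classifier evaluated in this paper, and the class and function names are hypothetical. The defaults for the regularization parameter and the learning rate follow the values reported in the experimental setup ($10^{-4}$ and 0.01).

```python
import numpy as np


def gaussian_label(h, w, sigma):
    """2D Gaussian-shaped label with its peak at the patch centre."""
    ys = np.arange(h) - h // 2
    xs = np.arange(w) - w // 2
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.exp(-(yy ** 2 + xx ** 2) / (2.0 * sigma ** 2))


class CorrelationClassifier:
    """Multi-channel correlation filter trained in the Fourier domain."""

    def __init__(self, lam=1e-4, eta=0.01):
        self.lam = lam    # regularization parameter
        self.eta = eta    # learning rate of the incremental update
        self.num = None   # per-channel numerator, shape (C, H, W)
        self.den = None   # shared denominator, shape (H, W)

    def _terms(self, x, y):
        X = np.fft.fft2(x, axes=(-2, -1))
        Y = np.fft.fft2(y)
        num = Y[None] * np.conj(X)
        den = np.sum(X * np.conj(X), axis=0).real
        return num, den

    def update(self, x, y):
        """Train on the first sample, then blend in each new sample."""
        num, den = self._terms(x, y)
        if self.num is None:
            self.num, self.den = num, den
        else:
            self.num = (1 - self.eta) * self.num + self.eta * num
            self.den = (1 - self.eta) * self.den + self.eta * den

    def respond(self, z):
        """Response map for candidate features z of shape (C, H, W)."""
        Z = np.fft.fft2(z, axes=(-2, -1))
        R = np.sum(self.num * Z, axis=0) / (self.den + self.lam)
        return np.fft.ifft2(R).real
```

With a centred label, the peak of the response map sits at the patch centre when the target has not moved, and it shifts by the target displacement otherwise. The peak value can serve as the confidence score of the online learner.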
Let $x_k$ denote the feature map extracted from the k-th convolution layer and $y$ be the 2D Gaussian-shaped label matrix with zero mean and standard deviation proportional to the target size. Let $X_k = \mathcal{F}(x_k)$, $Y = \mathcal{F}(y)$, where $\mathcal{F}(\cdot)$ denotes the fast Fourier transform (FFT). Then the correlation classifier
During online running, the correlation classifier is trained in an incremental way using new samples
The online learner is applied to candidate regions in the new frame. Let $z_k$ denote the output of the k-th layer of the Siamese network; then the response map is computed as:
To reduce the risk of drift, the weight
After the translation location has been determined, we construct a target patch pyramid centered at that location for scale estimation. Let $M \times N$ be the target size and $S = \{s_1, s_2, \ldots, s_n\}$ denote the set of scale factors. For each $s_i \in S$, we extract the features of a patch of size $s_i M \times s_i N$. Let $y_{s_i}$ denote the correlation response map for the target with scale factor $s_i$; then the optimal scale $s^*$ is estimated as:
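A minimal sketch of this scale search is given below. It assumes that the optimal scale is the one whose response map attains the largest peak, which is a common convention for correlation-filter scale models. The helper names and the rounding of patch sizes to whole pixels are hypothetical.

```python
import numpy as np


def patch_sizes(M, N, scales):
    """Patch size (rows, cols) to extract for each scale factor."""
    return [(int(round(s * M)), int(round(s * N))) for s in scales]


def estimate_scale(response_maps, scales):
    """Return the scale factor with the highest response peak, and that peak."""
    peaks = [float(np.max(r)) for r in response_maps]
    best = int(np.argmax(peaks))
    return scales[best], peaks[best]
```

In use, each patch in the pyramid would be resized to the filter's input size, passed through the feature extractor and the correlation classifier, and the resulting response maps handed to `estimate_scale`.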
After the ground-truth patch of the target in the first frame is given, it is fed to the one-shot learner and its embedding is kept as the stable information through the entire tracking process. Meanwhile the online learner begins its running with the labeled sample as its initial status.
The one-shot learner never suffers from the drift problem, since it does not use information from online tracking at all. Having been trained on a large number of images, the one-shot learner is still robust to certain distortions of the target even though it relies on only one labeled sample. However, the one-shot learner may fail to track the target and switch to very similar objects in the background, because it lacks the capability of temporal exploration. The online learner is good at modeling temporal information and is adaptive to drastic appearance variations, but it inevitably suffers from the drift problem. Their characteristics are shown in Fig. 2.

Robust tracking via the coupling of the one-shot learner and the online learner. The one-shot learner is stable but may switch to similar objects during tracking. The online learner is adaptive but may suffer from tracking drift. Through the complementarity of these two agents, the whole tracking model can be both robust and adaptive and thus achieves high precision. Red: success, Yellow: failure.
Based on the pros and cons of these two learners, robust tracking is performed through their coupling and complementarity, as shown in Fig. 2. The online learner acts as the main tracker, since it can explore temporal information between consecutive frames, and gives its estimate according to (7). To deal with the drift issue, the one-shot learner acts as a re-detection partner that re-detects the target in case of tracking failure.
In normal scenarios, the online learner is quite robust and produces high confidence scores. When the target suffers from significant background noise, the confidence score drops. If the score falls below a threshold, the one-shot learner is activated to re-detect the target and reset the tracker.
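The per-frame decision described above can be sketched as a small control function. The two thresholds use the values reported in the experimental setup (0.15 for re-detection, 0.35 for updating). Treating the update threshold as "update the online learner only when the confidence exceeds it" is an assumption made for this sketch, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CouplingConfig:
    redetect_threshold: float = 0.15  # below this, activate the one-shot learner
    update_threshold: float = 0.35    # above this, update the online learner


def coupling_step(confidence, online_box, redetect_fn, cfg=CouplingConfig()):
    """Choose the output box and the action for the online learner.

    confidence:  confidence score of the online learner for this frame.
    online_box:  bounding box estimated by the online learner.
    redetect_fn: zero-argument callable that runs the one-shot learner
                 and returns a re-detected bounding box.
    Returns (box, action) with action in {"update", "hold", "reset"}.
    """
    if confidence < cfg.redetect_threshold:
        # Tracking failure: trust the stable learner and reset the online one.
        return redetect_fn(), "reset"
    if confidence > cfg.update_threshold:
        # Reliable frame: safe to train the online learner on this sample.
        return online_box, "update"
    # Uncertain frame: keep the estimate but skip the model update.
    return online_box, "hold"
```

Keeping a band between the two thresholds means that moderately uncertain frames neither contaminate the online model nor trigger a costly re-detection.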
We take the video sequence named

Tracking confidence scores with and without the coupling of the one-shot learner, which re-detects the target in case of tracking drift or failure. Without the one-shot learner, the confidence score decreases to low values under heavy occlusion and out-of-view. By exploring the stable information via the one-shot learner, the target can be re-captured, the confidence score rises to a relatively high level, and the tracker is more certain about its estimate.
With the coupling of the one-shot learner, the target can be re-detected when it reappears after heavy occlusion. The confidence score of the tracker rises to a relatively high level, which indicates that the tracker is less affected by background noise and is again confident about the tracking result.
Setup
For tracking performance evaluation, we carry out experiments on the well-known Online Visual Tracking benchmark [29], which includes 50 challenging video sequences. All of the sequences involve representative challenging factors for visual tracking, such as illumination variation, occlusion, background clutter and fast motion.
For comparison, we run several state-of-the-art tracking algorithms with the same initial position of the target. These algorithms include MEEM [20], KCF [15], DSST [30], TLD [14], Struck [13] and ASLA [5]. The KCF and Struck trackers are based on an online correlation filter and a support vector machine (SVM), respectively, without a complementary self-correction mechanism. The MEEM and TLD trackers adopt an ensemble strategy to enhance overall robustness. The DSST and ASLA trackers can handle affine transformation well. For a fair comparison, the publicly available source code or binaries of these trackers on the benchmark are used.
For the online learner, we set the regularization parameter λ in (8) to $10^{-4}$ and the learning rate η in (8) to 0.01. The update threshold ξ is set to 0.35. For the one-shot learner, the threshold that activates it for re-detection is set to 0.15. The algorithm is implemented in MATLAB on an Intel i7 PC with 16 GB of memory. Note that the parameters of all trackers in the comparison are kept fixed throughout the experiments.
Evaluation metric
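For quantitative evaluation of tracking performance, there are two widely used metrics: center location error (CLE) and overlap rate (OR). The center location error is defined as the Euclidean distance (in pixels) between the center of the tracking result and that of the ground truth for each frame. The overlap rate is defined as $|r_t \cap r_g| / |r_t \cup r_g|$, where $r_t$ is the tracked bounding box, $r_g$ is the ground-truth bounding box, and $|\cdot|$ denotes the number of pixels in a region.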
Based on these two metrics, we follow the evaluation criteria of the benchmark [29], in which the success plot and the precision plot are adopted for evaluating each tracker. The success plot shows the ratio of successfully tracked frames over an entire video. A frame is regarded as successfully tracked if its overlap rate is larger than a given threshold. Varying the threshold gradually from 0 to 1 generates a plot of the success rate against the overlap threshold for each tracker. The precision plot shows the percentage of frames whose center location error is within a given threshold. Tracking algorithms are ranked by the Area Under Curve (AUC) for the success plot and by the precision score at a specific center location error threshold (20 pixels) for the precision plot.
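The two metrics and the derived scores can be sketched as follows. The sketch assumes boxes in (x, y, width, height) form, takes the overlap rate to be the intersection-over-union of the two boxes, and approximates the AUC by the mean success rate over evenly spaced thresholds. The function names and the number of thresholds are hypothetical, and this is not the benchmark toolkit itself.

```python
import math


def center_error(a, b):
    """Euclidean distance in pixels between the centres of two boxes."""
    ax, ay = a[0] + a[2] / 2.0, a[1] + a[3] / 2.0
    bx, by = b[0] + b[2] / 2.0, b[1] + b[3] / 2.0
    return math.hypot(ax - bx, ay - by)


def overlap(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    inter = max(iw, 0.0) * max(ih, 0.0)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def success_curve(pred, gt, n_thresholds=21):
    """Success rate at each overlap threshold between 0 and 1."""
    thresholds = [i / (n_thresholds - 1) for i in range(n_thresholds)]
    ious = [overlap(p, g) for p, g in zip(pred, gt)]
    rates = [sum(v > t for v in ious) / len(ious) for t in thresholds]
    return thresholds, rates


def success_auc(pred, gt, n_thresholds=21):
    """Area under the success curve, approximated by the mean success rate."""
    _, rates = success_curve(pred, gt, n_thresholds)
    return sum(rates) / len(rates)


def precision_at(pred, gt, threshold=20.0):
    """Fraction of frames whose centre location error is within the threshold."""
    errs = [center_error(p, g) for p, g in zip(pred, gt)]
    return sum(e <= threshold for e in errs) / len(errs)
```

For example, with ground-truth boxes of 10 × 10 pixels, a prediction shifted horizontally by 5 pixels has a center location error of 5 and an overlap rate of 1/3, so it counts as a success only for overlap thresholds below 1/3.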
Overall performance
The overall tracking performance of the trackers in terms of the precision plot and the success plot is shown in Fig. 4. The trackers are ranked according to their performance scores in the legend of each plot.

Overall performance of the proposed tracker and several state-of-the-art trackers on the Online Visual Tracking Benchmark. The trackers are ranked by their performance scores shown in the legend.
It can be seen from Fig. 4 that the proposed OOT tracker outperforms the other state-of-the-art tracking methods in both the success plot and the precision plot. The good average performance of our tracker on the benchmark demonstrates the effectiveness of the proposed tracking framework. The coupling strategy strikes a good balance between stability and adaptivity, which enhances overall tracking robustness. Since the OOT tracker is just a simple baseline implementation without further optimization, better precision may be achievable with more advanced machine learning techniques for one-shot learning and online learning or with more powerful feature representations.
To further evaluate the performance under different conditions, we also compute the success plots and precision plots of each tracker for different challenging factors. In the tracking benchmark [29], all the video sequences are annotated with 11 different attributes, which are typical challenging factors for visual tracking. We report the results on four representative attributes in terms of the success plot and the precision plot, as shown in Fig. 5. Our OOT tracker ranks first on all four of these attributes. The results show that the OOT tracker is quite adaptive to significant appearance changes (rotations) and robust to background noise (occlusion, clutter). This is attributed to the coupling of the two independent learners with complementary roles.

Tracking performance of seven state-of-the-art trackers on four representative attributes of the benchmark videos. (a) Tracking performance scores based on success plot. (b) Tracking performance scores based on precision plot.
For qualitative analysis, we also plot the tracking results as bounding boxes in some key frames from several challenging video sequences, as shown in Fig. 6. The compared state-of-the-art trackers are KCF, MEEM, Struck and TLD. From Fig. 6, it can be observed that our OOT tracker performs well under motion blur (couple), background clutter (football), occlusion (lemming), out-of-view (jogging-2) and illumination variation (shaking and singer2). The KCF tracker cannot handle out-of-view cases, since it has no self-correction component. The TLD tracker can re-detect objects but often mistakes similar objects for the target itself. In addition, it does not adopt online learning for the main tracker. The MEEM tracker is a good long-term tracker that can restore a drifted tracker to its uncontaminated state by storing historical snapshots. However, it is still unable to explore the stable information adequately and relies merely on online learning. Overall, our tracker shows better robustness.

Tracking results of OOT tracker and several state-of-the-art trackers on some challenging video sequences. These video sequences include: couple, football, jogging-2, lemming, shaking and singer2.
Conclusion
In this paper, we propose a general-purpose framework for designing a robust tracker. We first examine the nature of tracking and then derive guidelines for tracker design. We propose to couple one-shot learning and online learning to tackle the difficulties of tracking. Based on the guidelines, we develop a baseline model to construct a robust tracker. Our analysis and the experimental results show that combining offline and online learners to explore both stable information and dynamic information is of great importance for robust tracking. More powerful and efficient one-shot learning methods and online learning algorithms can readily be integrated into the proposed framework.
Acknowledgements
This work was supported by The Fundamental Research Funds for the Central Universities (Grant No. 500418754).
