Adaptive updating correlation filter based on multi features

Abstract

Traditional discriminant correlation filters fuse hand-crafted features with deep features for tracking, which introduces considerable feature redundancy and leads to model overfitting when the parameters are updated. Moreover, it is difficult to handle the various complex situations in video by using the same label map for different features. In addition, the model updating strategy of traditional methods is rather rigid: the model is updated every frame or every several frames. Unlike traditional response map fusion methods, this paper proposes a multi-layer features correlation filter algorithm that estimates the target model from multiple perspectives. Separate label maps are used for different features, and an adaptive model updating strategy is proposed. The proposed tracker achieves leading performance on the OTB2013, OTB2015 and VOT2016 datasets.

Keywords

Object tracking, discriminant correlation filter, multi-layer features, model updating

1. Introduction

Target tracking is one of the important research directions in the field of computer vision. It is widely used in daily life, for example in video surveillance, tracking navigation and unmanned driving [1, 2]. The appearance of the target is easily disturbed by illumination, deformation, fast motion, occlusion and so on, which increases the difficulty of target tracking. In addition, the lack of training samples and the real-time requirement of the algorithm are also great challenges [3, 4].

Target tracking methods can be roughly divided into two categories: online update methods and offline learning methods. Offline learning methods mainly include fully convolutional Siamese networks (SiamFC) [5] and correlation filter networks (CFNet) [6]. These methods are trained offline with large-scale image pairs and do not update the model. Their testing speed is relatively fast, but their accuracy is generally lower than that of online methods. Online updating methods are mainly discriminant correlation filters (DCF) [7, 8, 9, 10], which train a regressor by exploiting the properties of circular correlation and performing operations in the Fourier domain.

In recent years, discriminant correlation filter based methods have achieved good results in target tracking competitions. Most of these methods fuse deep features with traditional features [11], or integrate shallow features with high-level semantic features extracted by a neural network [12, 13]. Fusing deep features with traditional features introduces considerable feature redundancy and easily causes model overfitting when the parameters are updated. The dimension of the features can be reduced by principal component analysis (PCA) [14] and other dimension reduction algorithms. However, methods based on overly complex features make it harder to fine-tune the hyper-parameters. Trackers based on deep features make better use of the available information, but the fused features may not cope with all challenges (illumination, deformation, fast motion, occlusion and so on). In addition, the performance of a single tracker may be unstable across different video sequences [15].

High-level semantic features are more robust to changes in object shape, while shallow features are good for precise positioning. If the two kinds of features are merged directly, the high-level features will overwhelm the shallow features. In this paper, a multi stage features adaptive updating correlation filter method is proposed. Instead of merging deep features with traditional features, the proposed algorithm first fuses the features of the deep network by weighting, and then solves a tracker for each of the fused features. The final result is selected by a robust decision. Because the outputs of multiple trackers are considered in the decision stage, the system not only makes full use of the feature information but also improves robustness when dealing with different video sequences. In addition, ECO [9] and similar methods make use of the multi-layer features of a deep network, but it is unreasonable for high-level semantic features and shallow features to share the same label map. The method proposed in this paper takes the differences between deep and shallow features into account and assigns them different label maps. In the model updating stage, we adopt an adaptive update instead of a fixed update, so the algorithm can better adapt to changes in the target and its environment.

The contributions can be summarized as follows. 1) We compute the parameters of the correlation filters separately for four weighted combinations of features from different network layers. 2) Unlike previous methods, which use the same label map as the regression target, we use different label maps to adapt to the differences between features. 3) It is difficult for a single tracker to deal with all types of situations in a video sequence, so we obtain tracking results from the different fused features and select the final result by a robust decision. 4) We propose an adaptive model updating strategy.

2. Discriminant correlation filters (DCF)

Tracking methods that combine the discriminant correlation filter (DCF) with convolutional features have gained much attention in benchmark tests. DCF trains a correlation filter from a series of training samples. In object tracking, the training sample is taken from the first frame. Suppose the sample is ${\bm{x}}$, of size $M\times N$. All cyclically shifted samples of ${\bm{x}}$ can be represented by ${\bm{X}}=\{{\bm{x}}_{m,n}\},m\in[0,M-1],n\in[0,N-1]$. The expected output for the cyclically shifted sample ${\bm{x}}_{m,n}$ is given by the Gaussian label ${\bm{y}}_{m,n}$. The label map ${\bm{y}}$ contains the target score for each location of the cyclically shifted samples. The filter ${\bm{f}}$ is trained by minimizing the following regression error:

$\displaystyle\mathop{\min}\limits_{\bm{f}}\sum\limits_{m=1,n=1}^{M,N}\sum\limits_{c=1}^{d}\left\|\varphi_{c}(x_{m,n}){\bm{f}}_{c}-y_{m,n}\right\|_{2}^{2}+\lambda\sum\limits_{c=1}^{d}\left\|{\bm{f}}_{c}\right\|_{2}^{2}$ (1)

$\lambda$ ($\lambda\geqslant 0$) is the regularization coefficient, which prevents overfitting of the filter, $d$ is the number of feature channels, and $\varphi_{c}({\bm{x}}_{m,n})$ denotes the $c$-th feature channel of ${\bm{x}}_{m,n}$. The solution of the filter can be obtained according to [16].

$\displaystyle\hat{\bm{f}}_{c}^{\ast}=\frac{\hat{\bm{y}}\odot\hat{\varphi}_{c}^{\ast}\left(x\right)}{\sum\limits_{i=1}^{d}{\hat{\varphi}_{i}^{\ast}\left(x\right)\odot\hat{\varphi}_{i}\left(x\right)}+\lambda}$ (2)

where $\odot$ is the element-wise product, the hat $\hat{\ }$ denotes the discrete Fourier transform and $\hat{\varphi}_{c}^{\ast}(x)$ is the complex conjugate of the $c$-th feature channel of ${\bm{x}}$ in the Fourier domain. During testing, the object is located according to the response map: the location of the maximum value in the response map is taken as the center of the object. The response of the filter on a sample ${\bm{z}}$ can be expressed by the following formula, where $F^{-1}$ is the inverse Fourier transform.

$\displaystyle R=F^{-1}\left({\sum\limits_{c=1}^{d}\hat{\bm{f}}_{c}\odot\hat{\bm{z}}_{c}^{\ast}}\right)$ (3)

To reduce the boundary effects [8] produced by the cyclically shifted samples, a Hanning window is usually applied to the signal. The model is updated online as follows:

$\displaystyle\hat{\bm{A}}_{c}^{t}=\left({1-\alpha}\right)\hat{\bm{A}}_{c}^{t-1}+\alpha\hat{\bm{y}}\odot(\hat{\bm{x}}_{c}^{t})^{\ast}$
$\displaystyle\hat{\bm{B}}_{c}^{t}=\left({1-\alpha}\right)\hat{\bm{B}}_{c}^{t-1}+\alpha\sum\limits_{i=1}^{d}{(\hat{\bm{x}}_{i}^{t})^{\ast}\odot\hat{\bm{x}}_{i}^{t}}$ (4)
$\displaystyle(\hat{\bm{f}}_{c}^{t})^{\ast}=\frac{\hat{\bm{A}}_{c}^{t}}{\hat{\bm{B}}_{c}^{t}+\lambda}$

where $\alpha$ is the learning rate and $t$ is the index of the current frame.
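As an illustration of Eqs. (2)-(4), the following NumPy sketch trains a multi-channel filter, evaluates the response of Eq. (3) and applies the running update of Eq. (4). It is a minimal sketch and not the implementation used in this paper: the function names, the random stand-in features and the values chosen for $\sigma$, $\lambda$ and $\alpha$ are assumptions made for illustration, and the Hanning window is omitted.

```python
import numpy as np

def gaussian_label(M, N, sigma):
    """Gaussian label map whose peak sits at index (0, 0), wrapped circularly."""
    dm = np.minimum(np.arange(M), M - np.arange(M))[:, None]
    dn = np.minimum(np.arange(N), N - np.arange(N))[None, :]
    return np.exp(-(dm ** 2 + dn ** 2) / (2.0 * sigma ** 2))

def init_model(feat, y):
    """feat: (d, M, N) features. Returns numerator A (d, M, N) and denominator B (M, N)."""
    X = np.fft.fft2(feat, axes=(-2, -1))
    Y = np.fft.fft2(y)
    A = Y[None, :, :] * np.conj(X)            # y_hat * conj(x_hat_c), numerator of Eq. (2)
    B = np.sum(np.conj(X) * X, axis=0).real   # sum_i conj(x_hat_i) * x_hat_i
    return A, B

def update_model(A, B, feat, y, alpha):
    """Running average of numerator and denominator, as in Eq. (4)."""
    A_new, B_new = init_model(feat, y)
    return (1 - alpha) * A + alpha * A_new, (1 - alpha) * B + alpha * B_new

def response(A, B, feat_z, lam):
    """Response map of Eq. (3); A / (B + lam) is the conjugate filter of Eq. (2)."""
    Z = np.fft.fft2(feat_z, axes=(-2, -1))
    f_hat = np.conj(A / (B + lam))
    return np.fft.ifft2(np.sum(f_hat * np.conj(Z), axis=0)).real

# Hypothetical data: 4 feature channels on a 32 x 32 grid.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 32, 32))
y = gaussian_label(32, 32, sigma=2.0)
A, B = init_model(x, y)

z = np.roll(x, (3, 5), axis=(1, 2))           # search sample: circular shift of x
R = response(A, B, z, lam=1e-4)
peak = np.unravel_index(np.argmax(R), R.shape)
```

With NumPy's FFT sign convention, evaluating Eq. (3) literally places the peak at the negative of the circular shift between ${\bm{z}}$ and ${\bm{x}}$, so in this sketch a sample shifted by (3, 5) on a 32 x 32 grid peaks at (29, 27).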

3. Multi stage features adaptive updating correlation filter

In the field of object detection, many methods combine shallow features with deep features by means of fusion [17, 18]. High-level semantic features have good robustness, and the shallow features are helpful to the accuracy of location.

3.1 Multi-layer features weighted fusion

This paper uses VGG19 [19] (with the fully connected layers removed) to extract features. $C_{1}$, $C_{2}$ and $C_{3}$ denote the output features of Conv2-2, Conv3-4 and Conv5-4 respectively. Because the spatial size and channel dimension differ between convolution layers, the features need to be pre-processed first. Firstly, the $C_{2}$ and $C_{3}$ features are reduced to 128 dimensions by PCA (the same number of dimensions as $C_{1}$), which removes redundant features and improves the running speed of the algorithm. Then, $C_{2}$ and $C_{3}$ are resized to the spatial size of $C_{1}$ by interpolation.

To make full use of the feature information of the deep network, we fuse the pre-processed $C_{1}$, $C_{2}$ and $C_{3}$ features with weights and obtain four fused features. Traditional methods obtain only one fused feature for training a single tracker. However, it is difficult for a single tracker to deal with the many complex situations in videos.

$\displaystyle\textit{feature1}=w_{1,1}C_{1}+w_{1,2}C_{2}$
$\displaystyle\textit{feature2}=w_{2,1}C_{2}+w_{2,2}C_{3}$
$\displaystyle\textit{feature3}=w_{3,1}C_{1}+w_{3,2}C_{3}$ (5)
$\displaystyle\textit{feature4}=w_{4,1}C_{1}+w_{4,2}C_{2}+w_{4,3}C_{3}$

We use the multiple fused features to train multiple trackers. In this way, the proposed method can cope with the interference present in video sequences. The weights of each fused feature sum to 1. We train each tracker by minimizing the regression error, and during the fusion stage the weights are obtained according to the regression error.
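The weighted fusion of Eq. (5) can be sketched as follows. This is only an illustration under stated assumptions: the helper names `fuse` and `fused_features` are hypothetical, the constant arrays stand in for pre-processed VGG19 features, and equal weights are used as placeholders because the procedure that derives the weights from the regression error is not spelled out here.

```python
import numpy as np

def fuse(layers, weights):
    """Weighted sum of same-shaped feature maps; weights are normalized to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * c for wi, c in zip(w, layers))

def fused_features(C1, C2, C3, W):
    """Return feature1..feature4 of Eq. (5); W holds one weight tuple per fused feature."""
    combos = [(C1, C2), (C2, C3), (C1, C3), (C1, C2, C3)]
    return [fuse(c, w) for c, w in zip(combos, W)]

# Stand-ins for pre-processed features: 128 channels on a common spatial grid.
C1 = np.full((128, 8, 8), 1.0)
C2 = np.full((128, 8, 8), 2.0)
C3 = np.full((128, 8, 8), 3.0)

# Placeholder weights (assumption): equal weighting within each combination.
W = [(0.5, 0.5), (0.5, 0.5), (0.5, 0.5), (1 / 3, 1 / 3, 1 / 3)]
features = fused_features(C1, C2, C3, W)
```

Each of the four arrays in `features` would then be used to train its own correlation filter.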

3.2 Multimodel estimation

The performance of a single tracker may sometimes be unstable. Unlike the response map fusion methods [9, 10, 12, 20], the proposed method estimates the target from multiple perspectives using the multi-layer fused features of the VGG19 network. To make full use of the feature information, different features should be given different label maps.

In experiments, we find that the appropriate standard deviation $\sigma$ of the Gaussian label function depends on the convolution layers. When $\sigma$ is set relatively large for high-level semantic features and relatively small for shallow features, the tracking results are significantly better than with a single $\sigma$ shared by all features.
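A small sketch of feature-specific label maps is given below. The $\sigma$ values are hypothetical placeholders chosen only to show a narrower label for the shallower fusion and a wider one for the deeper fusion; they are not the settings used in the experiments.

```python
import numpy as np

def gaussian_label(M, N, sigma):
    """Gaussian label map of size M x N whose peak (value 1) sits at the center."""
    m = np.arange(M) - M // 2
    n = np.arange(N) - N // 2
    return np.exp(-(m[:, None] ** 2 + n[None, :] ** 2) / (2.0 * sigma ** 2))

# Hypothetical sigma per fused feature: smaller for shallow, larger for deep layers.
SIGMAS = {"feature1": 1.0, "feature2": 2.0, "feature3": 1.5, "feature4": 1.5}
labels = {name: gaussian_label(32, 32, s) for name, s in SIGMAS.items()}
```

A larger $\sigma$ tolerates coarser localization, which matches the lower spatial precision of high-level features.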

We use the target scale estimation method of the DSST tracker [34] to estimate the size of the bounding box. Four correlation filters are obtained from the corresponding fused features. The bounding boxes of the object produced by the different fused features in the $t$-th frame are denoted $B_{1}^{t}$, $B_{2}^{t}$, $B_{3}^{t}$ and $B_{4}^{t}$ respectively. After the four bounding boxes are estimated, we calculate the Intersection-over-Union (IoU) between each pair of boxes by Eq. (6).

$\displaystyle\textit{Overlap}_{i,j}^{t}=\frac{B_{i}^{t}\cap B_{j}^{t}}{B_{i}^{t}\cup B_{j}^{t}}$ (6)
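The pairwise overlap of Eq. (6) can be computed as in the sketch below. The box format (x, y, w, h) with (x, y) as the top-left corner and the function names are assumptions made for illustration.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x, y, w, h), with (x, y) the top-left corner."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))   # width of the intersection
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))   # height of the intersection
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def pairwise_overlap(boxes):
    """Symmetric matrix of IoU values between all pairs of boxes."""
    n = len(boxes)
    overlap = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            overlap[i, j] = overlap[j, i] = iou(boxes[i], boxes[j])
    return overlap

# Four hypothetical boxes, one per fused feature.
boxes = [(10, 10, 40, 60), (12, 11, 40, 60), (9, 8, 42, 62), (30, 30, 40, 60)]
O = pairwise_overlap(boxes)
```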

In order to prevent the occurrence of extreme cases, the non-linear Gauss function is used to estimate the score between the bounding boxes.

$\displaystyle S_{i,j}^{t}=e^{-\frac{1}{1-\textit{Overlap}_{i,j}^{t}}}$ (7)

The variance of the scores between the bounding box $B_{i}^{t}$ and other bounding boxes can be expressed as:

$\displaystyle V_{i}^{t}=\sqrt{\frac{1}{3}\sum\limits_{j\neq i}^{4}{\left({S_{i,j}^{t}-\frac{1}{\Delta t}\sum\limits_{k=t-\Delta t+1}^{t}{S_{i,j}^{k}}}\right)}}$ (8)

Considering the temporal nature of video, adjacent frames are more similar, so more recent frames are given larger weights than earlier ones. We set the weights $\beta_{k=t-\Delta t+i}=p^{i-1},(p>1)$.

$\displaystyle\bar{V}_{i}^{t}=\frac{1}{\sum_{k}{\beta_{k}}}\sum_{k=t-\Delta t+1}^{t}{\beta_{k}V_{i}^{k}}$ (9)
$\displaystyle\bar{M}_{i}^{t}=\frac{1}{\sum_{k}{\beta_{k}}}\sum_{k=t-\Delta t+1}^{t}{\beta_{k}\left({\frac{1}{3}\sum_{j\neq i}^{4}{S_{i,j}^{k}}}\right)}$ (10)
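The temporal weighting shared by Eqs. (9) and (10) is a normalized weighted average over the last $\Delta t$ frames. The sketch below shows that averaging step only; the function names and the sample values are hypothetical.

```python
import numpy as np

def temporal_weights(delta_t, p):
    """Weights p**(i-1) for i = 1..delta_t, ordered from the oldest to the newest frame."""
    return p ** np.arange(delta_t, dtype=float)

def weighted_history_mean(values, p):
    """Normalized weighted average of per-frame values, oldest frame first."""
    values = np.asarray(values, dtype=float)
    beta = temporal_weights(len(values), p)
    return float(np.sum(beta * values) / np.sum(beta))

# Example: three frames of a per-frame quantity with p = 2.
v_bar = weighted_history_mean([1.0, 2.0, 3.0], p=2.0)   # weights 1, 2, 4
```

Because $p>1$, the newest frame dominates the average, which is the intended effect of the weighting.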

The robustness score of the $i$ -th estimation bounding box $B_{i}^{t}$ with other estimation bounding boxes in frame $t$ can be expressed as Eq. (11).

$\displaystyle\textit{Score}_{i}^{t}=\frac{\bar{M}_{i}^{t}}{\bar{V}_{i}^{t}}$ (11)

The size of the object varies only slightly between adjacent frames, so the object's bounding box should change smoothly from frame to frame.

$\displaystyle\textit{Scale}_{i}^{t}=\exp\left({-\frac{\left\|{o\left({B_{i}^{t}}\right)-o\left({B_{i}^{t-1}}\right)}\right\|^{2}}{2\left({w\left({B_{i}^{t}}\right)+h\left({B_{i}^{t}}\right)}\right)^{2}}}\right)$ (12)

In Eq. (12), $o$ represents the central coordinate value, $w$ and $h$ represent the width and height of the target frame respectively.

The final bounding box of the object is determined by the robustness score: the box with the highest score is chosen as the bounding box in the current frame. In Eq. (13), $\theta$ is the parameter that trades off the Score and Scale terms.

$\displaystyle{R}_{i}^{t}=\theta\textit{Score}_{i}^{t}+\left({1-\theta}\right)\textit{Scale}_{i}^{t}$ (13)
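The smoothness term of Eq. (12) and the final selection of Eq. (13) can be sketched as follows. The robustness scores are passed in as given numbers, and the box format (x, y, w, h), the function names and the value of $\theta$ are assumptions made for illustration.

```python
import numpy as np

def center(box):
    """Center o(B) of a box given as (x, y, w, h) with (x, y) the top-left corner."""
    x, y, w, h = box
    return np.array([x + w / 2.0, y + h / 2.0])

def smoothness(box_t, box_prev):
    """Scale term of Eq. (12): penalizes center jumps relative to the box size."""
    d2 = np.sum((center(box_t) - center(box_prev)) ** 2)
    w, h = box_t[2], box_t[3]
    return float(np.exp(-d2 / (2.0 * (w + h) ** 2)))

def select_box(scores, boxes_t, boxes_prev, theta):
    """Combine Score and Scale as in Eq. (13) and return the index of the best box."""
    scales = np.array([smoothness(b, bp) for b, bp in zip(boxes_t, boxes_prev)])
    R = theta * np.asarray(scores, dtype=float) + (1.0 - theta) * scales
    return int(np.argmax(R)), R

# Hypothetical inputs: four candidate boxes and their robustness scores.
prev = [(0, 0, 5, 5)] * 4
curr = [(3, 4, 5, 5), (1, 0, 5, 5), (0, 1, 5, 5), (20, 20, 5, 5)]
best, R = select_box([0.2, 0.9, 0.5, 0.4], curr, prev, theta=0.5)
```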

3.3 Adaptive update

Most discriminant correlation filter methods update the filter parameters every frame [8, 10, 12]. When the object moves too fast or deforms too much, incorrect predictions lead to model drift and even to tracking failure. Therefore, some methods update the parameters at intervals of several frames [9], which not only reduces the update frequency but also reduces the possibility of drift. However, with a fixed update frequency it is difficult to respond flexibly to changes in the object. In this paper, an adaptive update mechanism is proposed.

The peak-to-sidelobe ratio (PSR) reflects the reliability of tracking results [21]. The PSR is defined as follows.

$\displaystyle\textit{PSR}=\frac{R_{\max}-\textit{mean}\left(R\right)}{\sigma\left(R\right)}$ (14)

Online learning can effectively adapt to the changing situations of object in video. If the PSR of the current frame is larger than the average PSR of the historical frame, the selection of the current learning rate is appropriate. If it is lower than the average PSR of the historical frame, it means that the current learning rate needs to be updated.

$\displaystyle\alpha^{t}=\left\{\begin{array}{ll}\alpha^{t-1},&\textit{PSR}^{t}>\frac{\varphi}{t}\sum_{i}{\textit{PSR}^{i}}\\ \alpha^{t-1}\cdot\left({\textit{PSR}^{t}/\left({\frac{1}{t}\sum_{i}{\textit{PSR}^{i}}}\right)^{\tau}}\right),&\textit{otherwise}\\ \end{array}\right.$ (15)

$\varphi$ controls the threshold, and $\tau$ regulates the strength of the adjustment in Eq. (15).
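Eqs. (14) and (15) can be sketched as below. The sketch assumes that the PSR history covers frames 1 to $t$ with the current frame included, and the values of $\varphi$, $\tau$ and the initial learning rate are placeholders, not the settings used in the experiments.

```python
import numpy as np

def psr(R):
    """Peak-to-sidelobe ratio of a response map, following Eq. (14)."""
    return float((R.max() - R.mean()) / R.std())

def adapt_learning_rate(alpha_prev, psr_history, phi, tau):
    """Learning rate of Eq. (15); psr_history lists PSR values with the current frame last."""
    psr_t = psr_history[-1]
    mean_psr = float(np.mean(psr_history))
    if psr_t > phi * mean_psr:
        return alpha_prev                        # tracking is reliable: keep the rate
    return alpha_prev * psr_t / mean_psr ** tau  # otherwise rescale by the PSR ratio

# Hypothetical example: the current PSR (4) is below the historical mean (8).
alpha = adapt_learning_rate(0.02, [10.0, 10.0, 4.0], phi=1.0, tau=1.0)
```

In this example the learning rate is halved, so an unreliable frame contributes less to the model.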

To make the proposed method easier to understand, its framework is depicted in Fig. 1.

Figure 1.

Flowchart of the proposed tracking algorithm.

4. Experiments

To verify the effectiveness of the proposed method, extensive experiments have been carried out on the OTB2013 [22], OTB2015 [23] and VOT2016 [24] datasets. The experimental platform is an Intel(R) Core(TM) i7-8700 CPU @ 3.2 GHz with 8 GB memory and a GeForce GTX 1080Ti GPU, and the method is implemented with MatConvNet. In the OTB2013 and OTB2015 experiments, we compare the proposed approach with 15 recent state-of-the-art trackers: MDNet [25], CCOT [10], CREST [20], ECO-HC [9], BACF [26], LCT [27], SRDCF [28], CF2 [13], HDT [29], Staple [30], CNN-SVM [31], SAMF [32], MEEM [33], DSST [34] and KCF [8]. All tracking methods are evaluated by the distance precision (at a threshold of 20 pixels). In addition, we report overlap success plots on these datasets using one-pass evaluation (OPE).

4.1 Evaluation on OTB2013

The OTB2013 database has 50 video sequences and 51 test scenarios. These video sequences involve 11 attributes of target tracking: illumination change, scale change, occlusion, deformation, motion blur, fast motion, in-plane rotation, out-of-plane rotation, out-of-view, background interference and low resolution. Each video sequence is annotated with two or more attributes. Figure 2a and b show the precision plots and success plots respectively. The threshold of the precision plots is set to 20 pixels. That is, the tracking result in a frame is considered successful if the distance between the predicted center of the bounding box and the center of the ground truth is within 20 pixels. The success plots of OPE are ranked by the area under the curve (AUC).
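The two evaluation measures can be sketched as follows. The function names are hypothetical, and the 21 evenly spaced overlap thresholds with the AUC approximated by the mean success rate are assumptions about the evaluation protocol rather than details stated in this paper.

```python
import numpy as np

def distance_precision(pred_centers, gt_centers, threshold=20.0):
    """Fraction of frames whose center location error is within the threshold (pixels)."""
    pred = np.asarray(pred_centers, dtype=float)
    gt = np.asarray(gt_centers, dtype=float)
    err = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(err <= threshold))

def success_auc(overlaps, thresholds=None):
    """Area under the success plot, approximated by the mean success rate."""
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 21)   # assumed sampling of overlap thresholds
    overlaps = np.asarray(overlaps, dtype=float)
    rates = [np.mean(overlaps > t) for t in thresholds]
    return float(np.mean(rates))

# Hypothetical per-frame results for a four-frame sequence.
pred = [(0, 0), (0, 0), (0, 0), (0, 0)]
gt = [(3, 4), (15, 20), (6, 8), (24, 32)]        # center errors: 5, 25, 10, 40 pixels
dp = distance_precision(pred, gt)
auc = success_auc([0.9, 0.4, 0.7, 0.1])
```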

Figure 2.

Precision plots and Success plots for trackers on OTB2013.

As can be seen from Fig. 2a, the precision of our algorithm is higher than that of all other methods except MDNet, which is slightly higher. Figure 2b shows that the success rate of the proposed method is the best among all compared methods. It is worth noting that MDNet is a deep learning method that is trained offline in an end-to-end manner. Although the precision of MDNet is higher than that of the proposed method, MDNet is very slow at test time: it runs at less than 1 fps, far from real-time requirements. When tested on a new video sequence, the results of MDNet may also be worse. The proposed method not only has good robustness but also reaches a test speed of 12.4 fps.

4.2 Evaluation on OTB2015

The OTB2015 database has 98 video sequences and 100 test scenarios. Compared with OTB2013, OTB2015 adds video sequences, which makes the performance evaluation of tracking algorithms more reliable.

Figure 3.

Precision plots and Success plots for trackers on OTB2015.

Figure 3 shows the results of all trackers on the OTB2015 database. In the precision plots, the precision of the proposed method is the highest. In the success plots, the area under the curve (AUC) of the proposed method is also the highest: it outperforms MDNet by 1.4% and is nearly 8% higher than CREST (ICCV 2017) and BACF (CVPR 2017).

To compare the performance of each tracker on videos with different attributes, we select videos with occlusion, out-of-view, background interference and fast motion as the test database. The experimental results in Fig. 4 show that the precision of the proposed method is slightly lower than that of MDNet and CCOT under fast motion, but its success-plot AUC is the highest. In addition, the proposed method achieves the best performance under occlusion, out-of-view and background interference.

4.3 Evaluation on VOT2016

The VOT2016 database is one of the most commonly used databases in the field of object tracking. It consists of short video sequences but is more challenging than the OTB database. In the experiment, accuracy, failure rate and expected average overlap (EAO) are used to measure the performance of tracking algorithms. We use the official VOT results for comparison. The methods compared in this section are MDNet, CCOT, Staple, SRDCF, SSAT [35], EBT [36] and MLDF [24].

Table 1 lists our method along with several trackers from the VOT2016 report. The proposed method ranks first according to EAO, surpassing the second-placed CCOT by 0.051 (0.382 versus 0.331). Table 1 also shows that our method performs best on every indicator except accuracy, where it is slightly below SSAT. The failure rate reflects the number of times the tracker loses the object, and our method has the lowest failure rate among the compared methods.

Figure 4.

Precision plots and Success plots in video sequences under occlusion, out of view, background clutter, fast motion.

4.4 Visualization of tracking results

To show the tracking results intuitively, we present the results on the Dragon Baby, Biker and Tiger2 video sequences in Fig. 5.

Table 1
Indicators of tracking methods for VOT 2016 dataset

Algorithm	EAO	Accuracy	Failure rate
Proposed	0.382	0.571	0.75
MDNet	0.257	0.541	1.20
CCOT	0.331	0.539	0.85
Staple	0.295	0.544	1.35
SRDCF	0.247	0.535	1.42
SSAT	0.321	0.577	1.04
EBT	0.291	0.465	0.90
MLDF	0.311	0.480	0.83

Figure 5.

Tracking results.

Figure 5 shows that our method outperforms CREST (ICCV 2017) on these videos under both fast change and occlusion. When the appearance of the object varies greatly, such as in frame 71 of the Biker sequence, our method can still track the object effectively. Overall, our algorithm performs favorably against CREST, DSST, MEEM and KCF.

5. Conclusion

It is difficult for a single correlation filter to deal with all situations in target tracking. The method in this paper makes full use of the feature information of the network and, in addition, introduces an adaptive updating method. High-level semantic features are more robust to changes in object shape, while shallow features are more accurate for localization. The proposed algorithm first fuses the features by weighting and then trains a tracker for each fused feature by minimizing the regression error. In the decision stage, the final result is selected from the outputs of the multiple trackers. In this manner, the method makes full use of the deep feature information and also improves the robustness of the system. In addition, our method takes the differences between high-level and shallow features into account by setting different label maps. For the update strategy, an adaptive update is adopted instead of a fixed update. In the future, we will focus on how to further enhance the robustness of the proposed method and how to effectively combine offline learning with the online update model.

References

1. Lee K.H. and Hwang J.N., On-road pedestrian tracking across multiple driving recorders, IEEE Transactions on Multimedia 17(9) (2015), 1429–1438.

2. Chen M.Y. and Zhu C.A., Covariance intersection multirobot object tracking algorithm based on self-adaption SR-CKF, Journal of Computational Methods in Sciences and Engineering 18(2) (2018), 479–489.

3. A.F., Luo, Tian X.M. et al., A twofold siamese network for real-time object tracking, In Proceedings of the European Conference on Computer Vision, 2018, pp. 4834–4843.

4. Yan et al., High performance visual tracking with siamese region proposal network, In Proceedings of the European Conference on Computer Vision, 2018, pp. 8971–8980.

5. Bertinetto, Valmadre, Henriques J.F. et al., Fully-convolutional siamese networks for object tracking, In Proceedings of the European Conference on Computer Vision, 2016, pp. 850–865.

6. Valmadre, Bertinetto, Henriques João et al., End-to-end representation learning for correlation filter based tracking, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5000–5008.

7. Zhu, Zou et al., End-to-end flow correlation tracking with spatial-temporal attention, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 548–557.

8. Henriques J.F., Caseiro, Martins et al., High-speed tracking with kernelized correlation filters, IEEE Transactions on Pattern Analysis and Machine Intelligence 37(3) (2015), 583–596.

9. Danelljan, Bhat, Shahbaz Khan and Felsberg, ECO: Efficient convolution operators for tracking, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6931–6939.

10. Danelljan, Robinson, Khan F.S. and Felsberg, Beyond correlation filters: learning continuous convolution operators for visual tracking, In Proceedings of the European Conference on Computer Vision, 2016, pp. 472–488.

11. Danelljan, Hager, Shahbaz Khan and Felsberg, Learning spatially regularized correlation filters for visual tracking, IEEE International Conference on Computer Vision Workshop, 2015, pp. 4310–4318.

12. Fan, Zhuang et al., Correlation filters with weighted convolution responses, IEEE International Conference on Computer Vision Workshop, 2017, pp. 1992–2000.

13. Huang J.B., Yang and Yang M.H., Hierarchical convolutional features for visual tracking, IEEE International Conference on Computer Vision Workshop, 2015, pp. 3074–3082.

14. Turk and Pentland, Eigenfaces for recognition, J Cogn Neurosci 3(1) (1991), 71–86.

15. Wang, Zhou W.G. and Tian, Multi-cue correlation filters for robust visual tracking, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4844–4853.

16. Danelljan, Hager, Khan F.S. and Felsberg, Discriminative scale space tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 39(8) (2017), 1561–1575.

17. Lin T.Y., Piotr, Girshick et al., Feature section, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 936–944.

18. Zhang, Wen, Bian et al., Single-shot refinement neural network for object detection, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4203–4212.

19. Simonyan and Zisserman, Very deep convolutional networks for large-scale image recognition, Computer Science, 2014.

20. Song, Gong et al., CREST: Convolutional residual learning for visual tracking, IEEE International Conference on Computer Vision, 2017, pp. 2574–2583.

21. Bolme D.S., Beveridge J.R., Draper B.A. et al., Visual object tracking using adaptive correlation filters, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2544–2550.

22. Lim and Yang M.H., Online object tracking: a benchmark, IEEE Conference on Computer Vision and Pattern Recognition, 2013.

23. Lim and Yang M.-H., Object tracking benchmark, IEEE Transactions on Pattern Analysis and Machine Intelligence 37(9) (2015), 1834–1848.

24. Kristan, Matas, Leonardis et al., A novel performance evaluation methodology for single-target trackers, IEEE Transactions on Pattern Analysis and Machine Intelligence 38(11) (2016), 2137–2155.

25. Nam and Han, Learning multi-domain convolutional neural networks for visual tracking, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4293–4302.

26. Galoogahi H.K., Fagg and Lucey, Learning background-aware correlation filters for visual tracking, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1144–1152.

27. Yang, Zhang et al., Long-term correlation tracking, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5388–5396.

28. Danelljan, Gustav, Khan F.S. et al., Learning spatially regularized correlation filters for visual tracking, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4310–4318.

29. Y.K., Zhang S.P., Qin et al., Hedged deep tracking, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4303–4311.

30. Bertinetto, Valmadre, Golodetz et al., Staple: Complementary learners for real-time tracking, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1401–1409.

31. Hong, You, Kwak and Han, Online tracking by learning discriminative saliency map with convolutional neural network, International Conference on Machine Learning, 2015.

32. and Zhu, A scale adaptive kernel correlation filter tracker with feature integration, In Proceedings of the European Conference on Computer Vision, Springer, Cham, 2014.

33. Zhang J.M., S.G. and Sclaroff, MEEM: Robust tracking via multiple experts using entropy minimization, In Proceedings of the European Conference on Computer Vision, 2014.

34. Danelljan, Häger, Khan F.S. et al., Discriminative scale space tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 39(8) (2017), 1561–1575.

35. Bibi, Mueller and Ghanem, Target response adaptation for correlation filter tracking, In Proceedings of the European Conference on Computer Vision, 2016, pp. 419–433.

36. Zhu, Porikli and Li, Beyond local search: tracking objects everywhere with instance-specific proposals, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 943–951.
