Abstract
Infrared target tracking has become increasingly important for a range of applications in recent years. However, it remains a challenging task because only limited information can be obtained from infrared images. Inspired by the excellent performance of deep trackers, we propose a novel tracker based on MDNet. Because prior information is valuable for target tracking, a modified back-propagation (BP) network is used to predict the scale of the target during tracking. The predicted scale is used to generate candidate windows for online learning, which improves the performance of the tracker. To evaluate the proposed tracking algorithm, we performed experiments on the VOT-TIR2016 and AMCOM infrared data. The experimental results demonstrate that our algorithm provides a 1.94% relative gain in accuracy and a 21.4% relative gain in robustness on VOT-TIR2016 compared with MDNet.
Introduction
Visual tracking is sometimes considered a solved task, but many applied projects show that robust and accurate infrared (IR) target tracking methods are scarce [1]. IR imaging is a non-contact, passive detection technology that produces images by sensing the radiation of the target [2]. The main advantages of thermal cameras are their ability to see in total darkness, their robustness to illumination changes and shadow effects, and their reduced privacy intrusion [3]. With the development of IR technologies, thermal cameras are now commonly used not only for military purposes but also in civilian applications such as surveillance systems [2, 4]. However, in contrast to visual images, IR images generally have low spatial resolution, poor signal-to-noise ratios (SNR), and a lack of textural information (see Fig. 1) [5]. Besides the poor imaging quality compared with visual images, the available IR target datasets are quite limited, which makes it difficult or even impossible to train a very deep network for tracking [6]. Although great progress has been made, tracking IR targets reliably in applied projects remains challenging.

(a) Visual image and (b) IR image.
Scholars have done a great deal of work on IR target tracking in the past decade [7–10]. Meanwhile, numerous strong trackers have been proposed for visual tracking. Despite this progress in both speed and accuracy, real-time, high-quality IR target tracking algorithms remain scarce. Multi-Domain Network (MDNet) [11], the winner of VOT2015, is composed of shared layers and multiple branches of domain-specific layers. Domains correspond to individual training sequences, and each branch is responsible for binary classification to identify the target in its domain. The network contains only six layers, which is substantially smaller than common neural networks such as AlexNet [12] and VGG-Nets [13].
Our approach builds on two major observations from prior work. First, the architecture of MDNet is highly efficient for IR target tracking because it contains only six layers. On the one hand, it has been shown that lower layers carry more discriminative information, which is suitable for target localization [14]. On the other hand, a small network usually runs faster. Second, the precision of online learning is closely related to tracking precision, and poor-quality online learning can even cause tracking failure. However, online tracking in MDNet is performed by evaluating candidate windows randomly sampled around the previous target state. The scale of the target is strongly related to its scales in the preceding frames and does not change suddenly, so a random sampling policy cannot make full use of this prior information. The main contributions of this paper can be summarized as follows:
As our first contribution, we propose an IR tracking algorithm called SPMDNet (Scale Prediction based MDNet). It provides a 1.94% relative gain in accuracy and a 21.4% relative gain in robustness on VOT-TIR2016 compared with MDNet.
As our second contribution, we design a modified BP network to predict the scale of the target in the next frame online. Feeding the predicted scale back into the tracker improves tracking performance. Scale prediction is also meaningful for research on IR image sequences.
The remainder of this paper is organized as follows. Section 2 reviews existing research on both IR trackers and visual trackers and explains the motivation for predicting the scale of the target during online learning. Section 3 presents the model structure of our approach and its details. Section 4 compares the tracking performance of our approach with state-of-the-art algorithms from VOT-TIR2016. Comprehensive experiments demonstrate that our approach improves accuracy and robustness at the same time.
IR trackers
IR sensors and cameras produce images with a low SNR, which leads to some differences between IR trackers and visual trackers. Many scholars have made notable contributions in this area. In [15], a morphological filter is used to remove noise and clutter in the tracking window, but it can only deal with background noise. Mean-shift based approaches have been widely used in IR target tracking and provide a general optimization solution in this field [9]. However, they cannot track the target properly when it is affected by various disturbances. In [16], AM-FM consistency checks were used for IR target tracking, which improves performance to some extent on challenging IR data sequences, including the well-known AMCOM forward-looking infrared (FLIR) dataset. Despite this progress in IR target tracking, tracking performance still cannot satisfy the requirements of applied projects.
Deep trackers
The rapid development of deep learning has promoted the development of visual tracking [17]. In [18–20], a convolutional neural network (CNN) is used as the feature extractor, and all of these methods adopt a correlation filter as their base tracker. The state-of-the-art performance of deep trackers benefits from the expressive power of CNN features. C-COT [21] employs an implicit interpolation method to solve the learning problem in the continuous spatial domain. ECO [22] improves on C-COT [21] in both accuracy and speed. While these deep trackers achieve high accuracy and robustness, they often suffer from a high computational burden. MDNet [11] is composed of five shared layers. This small architecture is suitable for IR target tracking, because IR images usually have low quality. The input images in IR target tracking are also usually much smaller than those in visual tracking; for example, frames in the AMCOM dataset are 128×128 pixels. Deep features tend to be less effective for target localization, since spatial information is diluted as the network goes deeper. For these reasons, we chose MDNet as the basic architecture of our approach.
Modified BP neural network for scale prediction
In MDNet, online tracking is performed by evaluating candidate windows randomly sampled around the previous target state. This random sampling policy cannot make full use of prior information, and the errors accumulated during online learning decrease tracking precision. In this paper, we use a modified BP network to predict the scale of the target during online learning. The input is a series of tracking results before the current frame, and the output is the predicted scale of the target in the current frame. In this way, the precision and robustness of the tracker can be improved noticeably.
An artificial neural network does not need the mathematical form of the mapping between input and output to be specified in advance. It learns the mapping through training and, given an input, produces the result closest to the expected output. A BP neural network is a multi-layer feed-forward neural network whose main characteristics are forward signal transmission and backward error propagation [23]. Since scale changes vary from sequence to sequence, we can only predict the scale in the next frame from the tracking results before the current frame. As a nonlinear modeling and forecasting method, the neural network has been widely used in many domains thanks to its good nonlinear modeling ability, high fitting accuracy, flexible and effective learning, fully distributed storage structure and hierarchical model structure [24]. In this paper, we use a modified BP neural network to predict the scale during online learning.
Scale prediction based MDNet
This section describes the architecture of our algorithm and the evaluation method used. The architecture of our approach is based on MDNet. Scale prediction during online learning makes full use of prior information by generating reasonable learning samples, which improves the accuracy and robustness of the tracker.
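To make the idea of scale-guided sampling concrete, the following is a minimal sketch, not the exact sampling rule of SPMDNet. It assumes that candidate centres are jittered with Gaussian noise around the previous target centre and that candidate sizes are jittered log-normally around the predicted width and height. The function name `sample_candidates` and the parameters `n`, `pos_sigma` and `scale_sigma` are hypothetical, and their default values are illustrative only.

```python
import math
import random

def sample_candidates(prev_box, pred_wh, n=256, pos_sigma=0.3,
                      scale_sigma=0.05, seed=None):
    """Draw n candidate boxes (x, y, w, h) around the previous target state.

    prev_box: (x, y, w, h) of the target in the previous frame.
    pred_wh:  (width, height) predicted for the current frame.
    All default parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    x, y, w, h = prev_box
    pred_w, pred_h = pred_wh
    cx, cy = x + w / 2.0, y + h / 2.0           # previous target centre
    spread = pos_sigma * (pred_w + pred_h) / 2.0
    boxes = []
    for _ in range(n):
        # Jitter the centre with Gaussian noise proportional to target size.
        ccx = rng.gauss(cx, spread)
        ccy = rng.gauss(cy, spread)
        # Jitter the size log-normally around the predicted size,
        # keeping the predicted aspect ratio.
        s = math.exp(rng.gauss(0.0, scale_sigma))
        cw, ch = pred_w * s, pred_h * s
        boxes.append((ccx - cw / 2.0, ccy - ch / 2.0, cw, ch))
    return boxes

candidates = sample_candidates((50.0, 50.0, 20.0, 10.0), (22.0, 11.0), seed=0)
print(len(candidates), candidates[0])
```

Centring the size distribution on the prediction rather than on the previous size is the only difference from purely random sampling in this sketch; a smaller `scale_sigma` expresses more trust in the predictor.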
Network architecture
As shown in Fig. 2, the model architecture can be divided into two parts: the shared layers and the domain-specific layers. The input of the network is a 107×107 grayscale image. There are five hidden layers: three convolutional layers and two fully connected layers. The domain-specific part consists of K fully connected branches corresponding to K domains. Compared with AlexNet [12] and VGG-Nets [13], this network architecture is tiny. Unlike detection, target tracking mainly focuses on distinguishing the target from the background [11]. As a result, the network does not need to be very deep. A shallow network is clearly effective, especially when online learning is required during tracking.

Illustration of the SPMDNet.
The architecture of the modified BP network used to predict the scale of the target is illustrated in Fig. 3. The BP network contains two layers: a hidden layer and an output layer. The input is the scale information from the frames before the current frame, and the output is the predicted scale in the current frame. The input has dimension m and the output is a single value, the predicted scale of the target.

Illustration of the BP network.
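The forward pass of such a network can be sketched as follows. This is only an illustration of the m-input, single-output structure described above: the hidden size, the tanh activation, the linear output unit and the weight initialization are assumptions, and the names `init_bp_net` and `predict_scale` are hypothetical.

```python
import math
import random

def init_bp_net(m, hidden=8, seed=0):
    """Small random weights for an m -> hidden -> 1 network (sizes assumed)."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-0.1, 0.1) for _ in range(m)] for _ in range(hidden)],
        "b1": [0.0] * hidden,
        "W2": [rng.uniform(-0.1, 0.1) for _ in range(hidden)],
        "b2": 0.0,
    }

def predict_scale(net, past_scales):
    """Forward pass: m past scales in, one predicted scale out."""
    # Hidden layer: weighted sum of the inputs followed by tanh (assumed).
    hidden = [
        math.tanh(sum(w * s for w, s in zip(row, past_scales)) + b)
        for row, b in zip(net["W1"], net["b1"])
    ]
    # Output layer: a single linear unit.
    return sum(w * h for w, h in zip(net["W2"], hidden)) + net["b2"]

m = 49
net = init_bp_net(m)
past_widths = [20.0 + 0.1 * k for k in range(m)]
# The network is untrained here, so the printed value carries no meaning.
print(predict_scale(net, past_widths))
```

Width and height would each be fed through such a network separately, since the two dimensions are predicted independently.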
A classical BP network has well-known disadvantages. Its performance function is usually the MSE (Mean Square Error), which is well suited to Gaussian-distributed errors. However, the scale prediction error does not follow a Gaussian distribution. Compared with MSE, MCC (Maximum Correntropy Criterion) is more suitable for scale prediction. Correntropy [25] is a measure of the similarity between two random variables X and Y. It is defined in Equation (1).
According to the theory of correntropy, when the correntropy reaches its maximum, the forecast error is at its minimum and an optimized BP network model is obtained.
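To illustrate why correntropy is less sensitive to non-Gaussian errors than MSE, the sketch below uses the usual sample estimator from the correntropy literature, the mean of a Gaussian kernel applied to the errors, which may differ in normalization from Equation (1). The kernel width `sigma` and the toy data are assumptions chosen for illustration.

```python
import math

def correntropy(pred, target, sigma=1.0):
    """Sample correntropy: mean Gaussian kernel of the prediction errors.

    The kernel's normalization constant is omitted, so the maximum is 1.0.
    """
    errors = [p - t for p, t in zip(pred, target)]
    return sum(math.exp(-e * e / (2.0 * sigma ** 2)) for e in errors) / len(errors)

def mse(pred, target):
    """Mean squared error, for comparison."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

target = [1.00, 1.02, 1.04, 1.06, 1.08]   # toy scale sequence
close = [t + 0.01 for t in target]         # small error on every frame
spiky = list(close)
spiky[2] += 5.0                            # one gross outlier

print("MSE        :", mse(close, target), mse(spiky, target))
print("correntropy:", correntropy(close, target), correntropy(spiky, target))
```

A single outlier can lower the correntropy by at most 1/N, because each kernel value lies between 0 and 1, whereas its squared error dominates the MSE. Training with MCC means maximizing this quantity instead of minimizing the MSE.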
As shown in Fig. 4, the input training data are the scales of the target from frame t - 1 back to frame t - m - n. The training data therefore contain 2(m + n) values, where the factor 2 accounts for the width and the height of the target, which are predicted separately. The BP network outputs a single value at a time. As shown in Fig. 4, the training data are rearranged into n items, each containing m values. In our algorithm, m = 49 and n = 51. The network makes 200 predictions to produce candidate windows for online learning.

Training data used for training of the BP network.
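The rearrangement of the scale history into training items can be sketched as a sliding window. The sketch assumes that the history for one dimension is ordered from oldest to newest and covers the m + n frames from t - m - n to t - 1, and that the target of each item is the scale immediately following its m inputs. The function name `make_training_items` is hypothetical.

```python
def make_training_items(history, m):
    """Split a scale history into n = len(history) - m sliding-window items.

    history: past scales of one dimension, oldest first.
    Returns (inputs, targets), where inputs[i] holds m consecutive scales
    and targets[i] is the scale that immediately follows them.
    """
    n = len(history) - m
    if n <= 0:
        raise ValueError("history must contain more than m scales")
    inputs = [list(history[i:i + m]) for i in range(n)]
    targets = [history[i + m] for i in range(n)]
    return inputs, targets

m, n = 49, 51
widths = [20.0 + 0.1 * k for k in range(m + n)]   # toy width history
X, y = make_training_items(widths, m)
print(len(X), len(X[0]), len(y))   # 51 49 51
```

The same function would be applied once to the width history and once to the height history.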
The overall procedure of the scale prediction algorithm is presented in Algorithm 1.
The further implementation details are described below.
Bounding box regression has been widely used in object detection. Based on the predicted scale of the target, a simple linear regression model is used to predict the target location. The task is to find the optimal target position.
In contrast to the numerous benchmarks in visual tracking, benchmarks designed for IR tracking are rare. To measure performance, we compared our algorithm with 18 strong algorithms from VOT-TIR2016. All tracking algorithms were evaluated with the VOT2016 evaluation toolkit.
Accuracy
The accuracy measure is computed from the overlap between the predicted bounding box and the ground truth. Based on the definition of IoU (Intersection over Union), the accuracy in VOT is defined as follows.
Here $\Phi_t(i, k)$ denotes the accuracy in the $k$-th repetition of tracker $i$, and $N_{rep}$ denotes the number of repetitions. As a result, the accuracy of tracker $i$ can be defined as follows.
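As an illustration of these quantities, the sketch below computes the IoU of two axis-aligned boxes and averages per-frame overlaps first over repetitions and then over frames. It assumes boxes in (x, y, w, h) format and ignores details of the VOT toolkit such as which frames count as valid; the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def average_over_repetitions(overlaps):
    """overlaps[k][t] is the overlap at frame t in repetition k.

    Returns the per-frame overlap averaged over all repetitions.
    """
    n_rep = len(overlaps)
    return [sum(frame) / n_rep for frame in zip(*overlaps)]

def tracker_accuracy(overlaps):
    """Average the repetition-averaged overlap over all frames."""
    per_frame = average_over_repetitions(overlaps)
    return sum(per_frame) / len(per_frame)

print(iou((0, 0, 10, 10), (5, 0, 10, 10)))
print(tracker_accuracy([[0.5, 0.7], [0.7, 0.9]]))
```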
To make the results easier to read, the accuracy is modified as follows. S is a constant term that is reported together with the accuracy results.
Robustness measures how often a tracker fails during tracking. In VOT, robustness is defined as follows.
Because the accuracy-robustness evaluation cannot make full use of the raw data, a further index, the expected average overlap (EAO), was proposed. EAO focuses only on accuracy as defined by overlap. It is calculated as follows.
Implementation details and dataset
Our tracker is implemented in MATLAB with the MatConvNet toolbox and runs at around 1.38 fps on a single NVIDIA GTX 1070 GPU with 4GB memory under Linux.
We evaluated the proposed tracking algorithm on the well-known public IR dataset VOT-TIR2016, which contains 25 challenging video sequences. The IR sequences in the dataset were collected from nine different sources using ten different types of sensors, and originate from industry, research institutes and two EU (European Union) projects [1]. In the VOT challenge protocol, the tracker is re-initialized whenever tracking fails, and the evaluation module reports both accuracy and robustness, which correspond to the bounding box overlap ratio and the number of failures, respectively. In addition, we evaluated our algorithm on two sequences from the well-known FLIR dataset AMCOM. The AMCOM sequences were obtained from an airborne, moving platform and suffer from abrupt discontinuities in motion, which makes them closer to the conditions of applied projects.
Evaluation on the VOT-TIR2016
First, we made a qualitative evaluation of our algorithm and other trackers. In the VOT challenge protocol, a tracker is re-initialized whenever tracking fails. To compare the trackers more directly, we do not re-initialize them in this experiment. Figure 5 summarizes qualitative comparisons of our approach with 11 ranked trackers from VOT-TIR2016. Tracking boxes of different colors correspond to different trackers, and failing frames are marked with × symbols in different colors. Compared with the other trackers, our approach tracks the target more reliably, even in cases of camera motion, severe occlusion and fast motion.

A visualization of the tracking results of our algorithm and 11 other ranked trackers on nine challenging sequences.
The sequence-pooled AR rank plot is obtained by concatenating the results from all sequences and creating a single rank list, while the attribute-normalized AR rank plot is created by ranking the trackers over each attribute and averaging the rank lists. Table 1 shows the accuracy ranks with respect to the visual attributes. Our algorithm outperforms all the compared algorithms on most attributes, including camera motion, motion change and size change. These attributes are all closely related to the scale of the target, which indicates that our scale prediction is effective. The average accuracy of our approach ranks first on VOT-TIR2016. The average scores over the different attributes show that the accuracy of our algorithm increased by approximately 1.94 percent. Table 2 shows the robustness ranking. Our algorithm outperforms all the compared algorithms, which demonstrates its robustness. Compared with the original MDNet, the robustness of our algorithm increased by roughly 21.4 percent. To make the ranking clear, the first, second and third best scores are highlighted in red, blue and green, respectively. In summary, scale prediction clearly improves both the accuracy and the robustness of our tracker.
The average scores and ranks of accuracy in different attributes: Camera motion (CM), Dynamics change (DC), Motion change (MC), Occlusion (O), Size change (SC)
The average scores and ranks of robustness in different attributes
Figure 6 shows the robustness-accuracy ranking of all trackers for different attributes, so that robustness and accuracy can be compared side by side. Our approach performs very well in the AR plot. Figure 7 shows the failures and overall overlap for different attributes; in terms of failures, our approach performs very well on nearly all attributes. The overall overlap of our algorithm ranks first for the attributes camera motion, motion change and size change, which indicates that the accuracy of our tracker holds across different attributes.

The robustness-accuracy ranking plots for the visual attributes Camera motion, Dynamics change, Motion change, Occlusion, Size change and Empty.

The failures and overall overlap ranking plots for the visual attributes Camera motion, Dynamics change, Motion change, Occlusion, Size change and Empty.
As illustrated in Table 1, Table 2, and Figs. 6 and 7, our algorithm performs very well in both accuracy and robustness. It achieves better accuracy than all other methods, even with fewer re-initializations. The results also show that our tracker is stable in various challenging situations.
The overall criterion, expected overlap, also plays an important role in evaluation. The expected average overlap curve is given by the bounding-box overlap averaged over a set of sequences of a certain length, as defined in Equation (9). As shown in Fig. 8(a), our approach is consistently better than the other trackers. The expected overlap score is obtained by integrating the EAO curve over an interval of typical sequence lengths, from 223 to 509 frames. Figure 8(b) summarizes the EAO curves and the ranking of the trackers. The ranking of the expected overlap score also shows that our approach performs well in both accuracy and robustness. The right-most tracker is the top performer according to the VOT2016 expected average overlap values. As illustrated in Table 3, our approach ranks second among all the trackers.

Expected overlap curve (a) and EAO graph (b), with trackers ranked from right to left.
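The relationship between the expected overlap curve and the single EAO score can be sketched as follows. This is a simplified illustration rather than the procedure of the VOT toolkit: it assumes that every run contributes to every sequence length and that frames after a run ends count as zero overlap, whereas the toolkit handles short segments with more care. The function names are hypothetical.

```python
def expected_overlap_curve(runs, max_len):
    """Expected average overlap for each sequence length 1..max_len.

    runs: list of per-frame overlap lists from runs without re-initialization.
    curve[ns - 1] is the mean over runs of the average overlap on the
    first ns frames; frames past the end of a run count as zero overlap
    (a simplifying assumption).
    """
    curve = []
    running = [0.0] * len(runs)          # cumulative overlap of each run
    for ns in range(1, max_len + 1):
        total = 0.0
        for r, overlaps in enumerate(runs):
            if ns <= len(overlaps):
                running[r] += overlaps[ns - 1]
            total += running[r] / ns
        curve.append(total / len(runs))
    return curve

def eao_score(curve, n_lo, n_hi):
    """Average the curve over sequence lengths n_lo..n_hi (inclusive)."""
    window = curve[n_lo - 1:n_hi]
    return sum(window) / len(window)

runs = [[0.8] * 600, [0.6] * 300]        # toy data: second run ends early
curve = expected_overlap_curve(runs, 600)
print(eao_score(curve, 223, 509))
```

In this toy example the run that ends early pulls the curve down for longer sequence lengths, which is how failures reduce the EAO score even though only overlap is measured.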
Expected overlap analysis
In this section, the evaluation is performed without re-initialization, which is also called unsupervised evaluation. The overlap of all trackers for different attributes is shown in Table 4 and Fig. 9. Our algorithm ranks first among the 18 trackers for the attributes Camera motion, Motion change, Occlusion and Size change, and second for Dynamics change and Empty. Overall, its overlap exceeds that of the other trackers. The precision plot shows the percentage of frames in which the tracked location is within a given threshold distance of the ground truth. Our approach also shows the best performance in this plot.
Overlap overview in different attributes

Precision at different thresholds (left) and overall overlap (right) for different attributes.
In addition, we evaluated our algorithm on two sequences from the well-known FLIR dataset AMCOM. The sequences are in grayscale format and each frame is 128×128 pixels. We employ one-pass evaluation (OPE) in this subsection, and use the success plot and precision plot to evaluate the tracking results. Figure 10 shows the results of our algorithm and five strong trackers: MDNet [11], LSST [26], TLD [27], KCF [28] and LCT [29]. For the sequence lwir_1608, our algorithm performs very well on precision and ranks second in success rate. For the sequence lwir_1913, it ranks second in location precision and first in success rate among the compared trackers. These results indicate that our approach works well on the AMCOM dataset. In particular, our tracker is more accurate than MDNet on both sequences.
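The precision plot described above can be computed as in the following sketch. It assumes boxes in (x, y, w, h) format and measures the tracked location by the Euclidean distance between box centres, which is the common convention; the function names and toy boxes are hypothetical.

```python
import math

def center_error(pred_box, gt_box):
    """Euclidean distance between the centres of two (x, y, w, h) boxes."""
    px, py, pw, ph = pred_box
    gx, gy, gw, gh = gt_box
    return math.hypot((px + pw / 2.0) - (gx + gw / 2.0),
                      (py + ph / 2.0) - (gy + gh / 2.0))

def precision_curve(pred_boxes, gt_boxes, thresholds):
    """Fraction of frames whose centre error is within each threshold."""
    errors = [center_error(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return [sum(e <= t for e in errors) / len(errors) for t in thresholds]

gt = [(0, 0, 10, 10)] * 4
pred = [(0, 0, 10, 10), (3, 4, 10, 10), (30, 40, 10, 10), (6, 8, 10, 10)]
print(precision_curve(pred, gt, [5.5, 20, 100]))   # [0.5, 0.75, 1.0]
```

The centre errors in this toy example are 0, 5, 50 and 10 pixels, so two of four frames fall within 5.5 pixels and three of four within 20 pixels.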

Comparison of DPR and OSR with five other trackers on AMCOM.
We also made a qualitative evaluation of our algorithm and the other trackers. Fig. 11 summarizes the qualitative comparison of our approach with five trackers on two sequences from the AMCOM dataset. In sequence lwir_1608, our algorithm tracks the target reliably throughout, with excellent precision. In sequence lwir_1913, the SNR of the target decreases severely from frame 139 as the target swerves to the right. The low SNR results in tracking failure for LCT, KCF and MDNet. Our algorithm can still track the target because scale prediction makes full use of prior information. The error accumulated during online learning is reduced to some extent, which is crucial for long-term tracking performance.

A visualization of the tracking results of our algorithm and five other trackers on two sequences from AMCOM.
SPMDNet is a modified tracking algorithm based on MDNet that predicts the target scale during tracking with an improved BP network. As the experiments above show, our approach achieves better precision and robustness across different attributes. Sampling candidates according to the predicted scale during online learning makes full use of prior information, which has a crucial effect on tracker performance. Our algorithm also takes advantage of the lightweight architecture of MDNet.
Conclusions
In this paper, a novel tracking algorithm based on MDNet was proposed. The scale of the target is predicted during tracking by an improved BP network, which makes full use of prior information. Scale prediction improves tracking precision and robustness substantially in comparison with MDNet. Extensive experimental results show that the proposed algorithm performs favorably against state-of-the-art methods in terms of accuracy and robustness.
