Abstract
To improve tracking performance in complex environments, strengthening target features has become an important line of work. This paper proposes a Siamese network target tracking algorithm based on a parallel channel attention mechanism (PCAM), which combines a feature cascade algorithm with visual attention. First, the characteristics of the SENet and ECA networks are analyzed. Second, a parallel channel attention mechanism is constructed on the basis of the ECA module, integrating global average pooling and maximum pooling. This mechanism addresses both the reduced channel correlation of the SENet module and the enhancement of target feature information. Third, the output of the channel attention model is used as the input of the spatial attention model, which effectively complements the channel attention mechanism. By calculating weights for different spatial locations, the structural relations among spatial locations are modeled and the feature expression ability of the model is enhanced. Finally, the algorithm is evaluated on the standard datasets OTB100, OTB2013, OTB2015, VOT2016 and VOT2018. Experimental results show that PCAM extracts features more effectively in complex environments, achieves higher tracking accuracy and robustness, and compares favorably with the other algorithms tested.
Introduction
Visual target tracking is widely applied in intelligent surveillance, human-computer interaction, unmanned vehicles, and other fields [1, 2]. With the rapid development of deep learning, the capability and accuracy of target tracking have greatly improved. However, target tracking remains a challenging task, especially in real-world applications: moving targets are often affected by illumination variation, scale variation, background clutter and heavy occlusion in unconstrained recording environments [3]. In addition, the appearance of non-rigid targets may change significantly due to extreme variations in posture.
Recently, deep learning has performed excellently in target tracking, and
network models such as the classical convolutional networks AlexNet and VGGNet, the Residual
Network (ResNet), and the Region Proposal Network (RPN) have achieved good
results on target tracking test datasets. Deep Convolutional Neural Networks
(DCNN) generate a hierarchical, three-dimensional representation of the shallow
and deep layers of the image. In a DCNN, the shallow features express
appearance and the deep features express the semantics of the
image, which improves the constancy of the target appearance representation. Siamese network-based
target tracking has become a key research direction in recent years because it
balances computing speed and tracking accuracy. Because of long-term tracking and
occlusion, the disappearance of the target is almost unavoidable during
tracking, so a more precise and efficient network structure is required when the
target is occluded or must be tracked over the long term. To address this problem,
SiamRPN [4], SiamRPN
To enable target tracking algorithms to focus on the spatial locations and channels that are most useful for tracking, the attention mechanism (AM) [8] is applied to target tracking. The basic idea of visual AM is to focus on the important information in an image and ignore the unimportant information. AM is widely used in target detection: during training it lets the model learn to represent spatial, feature-channel and background information, which enhances the target feature representation of convolutional neural networks. Many researchers have introduced AM into the Siamese network. First, in the channel attention module, different channels of the image are given different weights, so that the appearance semantics of the image features become richer and more attention is paid to the channel features of foreground targets. Second, in the spatial attention module, different weights are assigned to different spatial positions on the image feature map; the weights at the locations of foreground targets are increased, and the foreground targets are thus highlighted.
Wang et al. [9] proposed a
residual attention network, which uses the attention module in an encoder-decoder
style. By redefining the feature map, the algorithm not only performs better
but is also more robust to noise. This algorithm enhances the key features of the
image through the residual attention mechanism and adapts well to target
tracking in complex environments such as occlusion, background clutter and
illumination variation. However, with the added attention module, the real-time
performance of the algorithm still needs to be improved. Hu et al. [10] used the features of the average
pooling layer to compute the attention between channels. References [11, 12] both use a residual network as the
backbone network and both add an efficient channel attention module after the
backbone network to increase the usage of first-frame information; the module
computes the global average pooling (GAP) and maximum pooling of the input
features in each channel. In reference [11], interactive learning across channels is
performed without reducing the number of channels, thus providing more information
about target features and further suppressing interfering features, which
effectively addresses situations such as large deformation and rotation during target
tracking. However, in reference [12], the channel dependencies are limited to K (K
Recently, AM has been widely applied in target classification and tracking tasks. Hu [10] proposed the Squeeze-and-Excitation (SE) network structure, which consists of a squeeze step and an excitation step. By modeling the relevance of feature channels, the SE network automatically learns the importance of each channel, enhances the important channel features and suppresses the unimportant ones, and thereby improves image classification accuracy. Using the attention mechanism, Woo et al. [17] proposed the Siamese network tracker RASNet, but the tracker uses only template information, which limits the representation of target features. Detail features and attention are computed and used separately in the target templates and search templates, which limits its performance.
In reference [18], the authors effectively fused deep features with shallow features. However, the feature fusion in reference [18] only cascades the three layers of features in the last stage of the ResNet-50 network, and the feature information within each channel is processed equally, which does not reflect how important each piece of feature information is for target localization and tracking. As a result, the channel representation does not focus on the target information: features that are important for tracking are not enhanced, and unimportant features are not suppressed. How to reflect the feature weights of different channels and different locations in the target image has therefore become the main research topic in target feature reinforcement, since enhanced target features are of great significance for subsequent target localization and tracking. In this paper, visual AM is introduced by adding a channel AM and a spatial AM on top of the feature cascade of reference [18]. Existing channel attention mechanisms reduce the dimensionality of the input features by either GAP or maximum pooling, focusing on either global information or detail information, which results in less complete feature extraction. In this paper, based on the cited ECA module, GAP and max pooling are applied in parallel, and a parallel channel AM is proposed.
The general structure of the algorithm
To make better use of the visual attention mechanism for target feature representation in the Siamese network, a parallel AM is added to the feature fusion of reference [18]. First, the 5-stage ResNet-50 is used as the backbone network of the Siamese network. The three layers of features in the last stage are cascaded to obtain the fused features, which are then used as the input of the parallel attention mechanism module to improve the mutual attention between the template area and the search area.
The output of the parallel channel attention mechanism is the input of the spatial and channel attention module, and the features after spatial location and channel enhancement are finally output. The overall structure is shown in Fig. 1.
Network structure of Siamese parallel channels attention mechanism.
The network should extract channel features quickly and efficiently during target tracking and model the deep information of the tracked target from different angles through different channel features, thereby improving the feature extraction ability of the model. For different tracking targets, the importance of each feature channel varies, and the response of different channel features to different targets and tracking scenes also differs; moreover, there are interdependencies between channels.
Changes in the appearance of the tracked target can be accurately expressed by the deep semantic features of a CNN. Different channels contribute differently to the target: some channels make the target more discriminative, while others have little effect on it. Therefore, studying how different channels represent the target, in particular enhancing the important features, attenuating the unimportant ones, and modeling the relationship between them, is important for improving the discriminative power of the model and the efficiency of tracking. The channel AM is introduced for this purpose, and the discriminative power of the model is improved by selecting feature channels according to their importance.
SENet block and ECA (Efficient Channel Attention) module.
First, the channel AM of SENet is analyzed. Then, following the ECA (Efficient Channel Attention) module in ECA-Net [19], the interrelationship between deep feature channels is modeled to obtain the importance weight vector of the channel features on top of the cascade feature fusion.
SENet's channel attention mechanism uses a multi-layer perceptron to calculate the channel weights, as shown in Fig. 2a.
The channel attention model contains three parts. The first part is the summation and squeezing of global spatial feature information by global average pooling to generate the respective channel features. The statistical value of each channel feature is
where Eq. (1)
completes the conversion of the input
The channel features derived from Eq. (1) are global channel information, which increases the receptive field of the target.
The second part is excitation. The first fully connected layer reduces the
dimension of channels from
After squeezing and excitation, the feature weights of each channel domain and
the interrelationships between channels, as well as activation
where
The extensive use of fully connected layers increases the number of parameters, which lowers computational speed and hurts the real-time performance of the algorithm.
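The squeeze and excitation parts described above can be illustrated with a minimal pure-Python sketch. The toy sizes, the random weights, the bias-free layers and the helper names (`global_avg_pool`, `linear`, `se_weights`) are assumptions made for illustration only; the ReLU between the two fully connected layers and the final Sigmoid follow the usual SENet design.

```python
import math
import random

def global_avg_pool(feature_map):
    """Squeeze: average each channel's H x W grid into one scalar."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]

def linear(weights, vec):
    """Bias-free fully connected layer; each row of weights gives one output."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def se_weights(feature_map, w_reduce, w_expand):
    """Excitation: FC (C -> C/r), ReLU, FC (C/r -> C), Sigmoid."""
    z = global_avg_pool(feature_map)
    hidden = [max(0.0, h) for h in linear(w_reduce, z)]
    return [1.0 / (1.0 + math.exp(-s)) for s in linear(w_expand, hidden)]

random.seed(0)
C, H, W, r = 8, 4, 4, 4  # toy sizes (assumed); r is the reduction ratio
x = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]
w2 = [[random.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]
weights = se_weights(x, w1, w2)  # one weight in (0, 1) per channel
```

The two weight matrices together hold 2·C·(C/r) values, which is the parameter cost attributed to the fully connected layers.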
The third part is the attention module. The output features
where
The channel features are found to be periodic after visualization, so we only
consider the information interaction between the current channel and its
neighboring
As in SENet, first, the global spatial feature information of the initial input
features
The weight vectors of each channel obtained after the Sigmoid activation function are:
where
The 1-d convolution kernel
Given
In SENet as described above, two FC layers with dimensionality reduction are used to
calculate the weights, which reduces the number of channel parameters and the
complexity of the model. However, the dimensionality reduction also has side effects on the
prediction of channel attention and reduces the relevance between channels.
To solve these two problems, the ECA module of ECA-Net is introduced
into the Siamese network, which defines an adaptive convolution kernel
Both SENet and ECA modules compress feature information by global average pooling, but global average pooling is only effective for the overall recognition of the target, and some detailed features of the target will be lost, while maximum pooling is effective for the detailed feature information of the target. In view of this, a parallel channel AM is constructed on the basis of ECA module with the fusion of global average pooling and maximum pooling. It addresses both the problem of channel relevance reduction of SENet module and the problem of target feature information enhancement. The structure of the parallel channel AM is shown in Fig. 3.
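To make the contrast between the two modules concrete, the following sketch computes an adaptive kernel size from the channel count and compares the number of learnable weights. The mapping k = |log2(C)/γ + b/γ| rounded to the nearest odd number with γ = 2 and b = 1 is the default reported for ECA-Net [19], and the reduction ratio of 16 is the common SENet default; both are assumed here rather than taken from this paper's settings, and the function names are hypothetical.

```python
import math

def eca_kernel_size(channels, gamma=2, b=1):
    """Odd 1-D kernel size derived from the channel count (ECA-Net defaults assumed)."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1

def se_fc_params(channels, reduction=16):
    """Weights in SENet's two bias-free FC layers: C*(C/r) + (C/r)*C."""
    return 2 * channels * (channels // reduction)

for c in (64, 256, 512):
    # The 1-D convolution needs only k weights, independent of C.
    print(c, eca_kernel_size(c), se_fc_params(c))
```

For C = 256 this gives a kernel of size 5, that is 5 weights, against 8192 weights in the two fully connected layers.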
Parallel channel attention mechanism
The parallel channel attention mechanism combines global max pooling (GMP) and global average pooling (GAP) to fuse channel features, so that attention is paid jointly to the global and detail information of the target features. The detail features are extracted by GMP and the global overall features by GAP. Their combination strengthens the channel attention mechanism, improves the extraction of important features and suppresses the influence of unimportant features, thus improving the generalization ability of channel feature extraction.
In this parallel mechanism, first, the upper and lower branches apply GMP and GAP respectively to obtain a one-dimensional feature vector that serves as the input of a one-dimensional convolution layer. Second, a one-dimensional convolution with an adaptive kernel size k is performed to achieve cross-channel information interaction. Then, the weight vector of each branch is obtained via the Sigmoid function, and the two weight vectors are added element by element. Finally, the original input features are multiplied element by element by the summed weights to obtain the fused target feature map.
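The processing order just described can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation: the zero padding, the use of a separate kernel per branch, the plain (unweighted) sum of the two weight vectors, and all function names are assumptions.

```python
import math

def gap(fm):
    """Global average pooling: one scalar per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fm]

def gmp(fm):
    """Global max pooling: one scalar per channel."""
    return [max(max(row) for row in ch) for ch in fm]

def conv1d_same(vec, kernel):
    """1-D convolution across channels with zero padding (output length = input length)."""
    pad = len(kernel) // 2
    out = []
    for i in range(len(vec)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - pad
            if 0 <= idx < len(vec):
                acc += w * vec[idx]
        out.append(acc)
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def parallel_channel_attention(fm, k_avg, k_max):
    """GAP and GMP branches -> 1-D conv -> Sigmoid -> element-wise sum -> rescale input."""
    w_avg = [sigmoid(v) for v in conv1d_same(gap(fm), k_avg)]
    w_max = [sigmoid(v) for v in conv1d_same(gmp(fm), k_max)]
    w = [a + m for a, m in zip(w_avg, w_max)]
    out = [[[w[c] * v for v in row] for row in fm[c]] for c in range(len(fm))]
    return w, out

# Toy feature map with 3 channels of size 2 x 2 and identity-like kernels (assumed values).
fm = [[[0.0, 0.0], [0.0, 0.0]],
      [[1.0, 1.0], [1.0, 1.0]],
      [[0.0, 2.0], [0.0, 2.0]]]
w, out = parallel_channel_attention(fm, [0.0, 1.0, 0.0], [0.0, 1.0, 0.0])
```

In the third channel the average is 1 and the maximum is 2, so the two branches produce different weights for the same channel, which is the kind of complementary information the parallel design is meant to exploit.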
The parallel channel attention mechanism works in the following steps.
Step 1: Calculate the GAP and GMP of feature map
where,
The respective feature vectors are obtained after the following calculations.
where,
Step 2: The interrelationships between adjacent channels are calculated by one-dimensional convolution, respectively.
where,
Step 3: After passing through the Sigmoid function, we obtain the weight vector, respectively.
where,
Step 4: The two weight vectors are weighted and summed channel by channel, and the results are as follows.
Step 5: The input features of the image
The spatial AM is an effective complement to the channel AM. It pays more attention to the spatial location information of the image: it calculates weights for different spatial locations and builds structural links between them, which enhances the feature expression ability of the model. The structure of the spatial AM is shown in Fig. 4.
Spatial AM.
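As a rough illustration of per-location weighting, the sketch below scores each spatial position from the channel-wise average and maximum at that position and rescales the features with a Sigmoid mask. This simplified form, which uses two scalar mixing weights instead of a learned convolution, is an assumption for illustration and is not the exact module of Fig. 4; the names `spatial_attention`, `w_avg` and `w_max` are hypothetical.

```python
import math

def spatial_attention(fm, w_avg=1.0, w_max=1.0):
    """Return a per-location mask in (0, 1) and the rescaled feature map."""
    C, H, W = len(fm), len(fm[0]), len(fm[0][0])
    mask = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            vals = [fm[c][i][j] for c in range(C)]
            # Combine channel-wise mean and max into one score per location.
            score = w_avg * sum(vals) / C + w_max * max(vals)
            mask[i][j] = 1.0 / (1.0 + math.exp(-score))
    out = [[[fm[c][i][j] * mask[i][j] for j in range(W)] for i in range(H)]
           for c in range(C)]
    return mask, out

# Two channels, one row, two locations: the second location has stronger responses.
fm = [[[0.0, 4.0]], [[0.0, 2.0]]]
mask, out = spatial_attention(fm)
```

The location with stronger responses across channels receives the larger weight, so it is emphasized in the output while weakly responding locations are attenuated.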
The spatial attention module takes the fused features
A feature map that is sensitive to position importance is obtained after
element-by-element multiplication of Eq. (17) with
where
The loss function defined in the whole training process of the network is as follows:
Where,
Experimental environment and data set
The experiments are conducted using Python 3.7 and PyTorch 1.2, and
the configuration of the running platform is: Intel(R) Xeon(R) CPU E5-2660 V2 @3.50
GHz
The model is trained offline on the GOT10K dataset, which consists of over
1.5 million manually annotated bounding boxes and over 10,000 video
sequences of real moving objects, covering more than 560 categories in total, of
which the validation and test sets each contain more than 180 video sequences.
The initial value of the learning rate for offline training is 0.01, and the
learning rate decays exponentially, starting from 10
In this paper, the algorithms are evaluated on the widely used standard datasets OTB100, OTB2013, OTB2015, VOT2016, and VOT2018, where OTB100, OTB2013, and OTB2015 are used for quantitative analysis, and VOT2016 and VOT2018 are used for qualitative analysis. The accuracy and robustness of the PCAM algorithm are verified by comparison experiments with the feature cascade algorithm in Section 2 and with existing prevailing algorithms.
Ablation experiment: Quantitative analysis on the OTB dataset
To demonstrate the effectiveness of the modules designed in this paper,
an ablation experiment was conducted, in which the overlap rate and
center error are analyzed quantitatively on OTB100, OTB2013, and OTB2015.
Comparison experiments are carried out on the above three datasets with
the existing prevailing algorithms, including SiamDW-RPN [20], SiamRPN
Tracking precision results of different algorithms on OTB2013
Tracking precision comparison of different algorithms on OTB2015
The tracking accuracy and success rate curves of 8 different
algorithms on three kinds of data sets.
Figure 5 shows that the tracking precision and success rate of the PCAM algorithm are superior to those of most of the compared algorithms on all three datasets. On OTB100 the success rate of PCAM is 76%, slightly lower than the 76.9% of DSiam-Att [21], while on OTB2013 and OTB2015 PCAM is higher than the other algorithms. In addition, the tracking center error of PCAM is better than that of the other algorithms, which indicates that the parallel channel AM plays an important role in accurate localization and feature enhancement of the target.
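For reference, the overlap rate and center error behind these precision and success figures can be computed as in the following sketch. The (x, y, w, h) box format, the 20-pixel center-error threshold and the 0.5 overlap threshold are commonly used OTB conventions assumed here; they are not stated in this paper, and the function names are hypothetical.

```python
import math

def iou(a, b):
    """Overlap rate (intersection over union) of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def center_error(a, b):
    """Euclidean distance in pixels between the two box centers."""
    return math.hypot((a[0] + a[2] / 2) - (b[0] + b[2] / 2),
                      (a[1] + a[3] / 2) - (b[1] + b[3] / 2))

def precision(preds, gts, threshold=20.0):
    """Fraction of frames whose center error is within the threshold."""
    return sum(center_error(p, g) <= threshold for p, g in zip(preds, gts)) / len(gts)

def success_rate(preds, gts, threshold=0.5):
    """Fraction of frames whose overlap rate exceeds the threshold."""
    return sum(iou(p, g) > threshold for p, g in zip(preds, gts)) / len(gts)

# Two toy frames: a slightly shifted prediction and a badly drifted one.
gts = [(0, 0, 10, 10), (0, 0, 10, 10)]
preds = [(1, 1, 10, 10), (30, 30, 10, 10)]
print(precision(preds, gts), success_rate(preds, gts))
```

Sweeping the thresholds over a range and plotting the resulting fractions yields precision and success curves of the kind shown in Fig. 5.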
The tracking accuracy of the algorithm is compared on OTB2013 and OTB2015 across 11 different video attributes: Background Clutter (BC), Occlusion (OCC), Illumination Variation (IV), Scale Variation (SV), Deformation (DEF), Fast Motion (FM), Motion Blur (MB), In-plane Rotation (IPR), Out-of-plane Rotation (OPR), Low Resolution (LR), and Out of View (OV). The tracking precision of the different algorithms for the 11 attributes on the OTB2013 dataset is shown in Table 1. The tracking precision of the different algorithms for the 11 attributes on the OTB2015 dataset is shown in Table 2.
As seen from Tables 1 and 2, on the OTB2013 dataset PCAM outperforms the other algorithms on 7 of the 11 video attributes, namely the complex cases BC, OCC, SV, FM, OPR, LR and OV. On the OTB2015 dataset the results differ slightly from OTB2013, and PCAM performs relatively better in the following seven cases: BC, OCC, SV, DEF, FM, LR, and OV. For the other four attributes on each of OTB2013 and OTB2015, the tracking accuracy of PCAM is the second highest. Compared with the CFFA algorithm in Section 2 of this paper, the tracking accuracy improves by 2%, 2.2%, 1.1%, 2.8%, and 5.9% in the five cases BC, OCC, SV, FM, and LR on the OTB2013 dataset, and by 1.1%, 0.5%, 0.4%, 0.9%, and 5% in the same five cases on the OTB2015 dataset.
In this paper, ResNet-50, a five-stage network that requires a moderate amount of computation, is used as the backbone network of the Siamese network. The added parallel channel attention module and spatial attention module improve the overall performance of the network and have little impact on the overall running speed of the algorithm. The tracking speeds of the proposed algorithm and the comparison algorithms on the datasets OTB100, OTB2013 and OTB2015 are shown in Table 3.
Tracking speed comparison
As can be seen from the table, the running speed of the proposed algorithm is
almost the same as that of SiamRPN
The VOT2016 and VOT2018 datasets each contain 61 sets of high-resolution video image sequences, which have more stringent requirements for the network model and more comprehensive performance evaluation criteria.
In this chapter, the PCAM is compared with DSiam-Att [21], CFFA [18], SiamRPN
Results of different algorithms on VOT2016 and VOT2018
Tracking precision comparison.
EAO comparison.
Qualitative analysis and comparison of tracking results of different
algorithms in different scenarios.
As seen in Table 4, in terms of accuracy, PCAM is on par with DSiam-Att [21] on the VOT2016 dataset and slightly lower than DSiam-Att [21] on the VOT2018 dataset, but both are better than the other comparison algorithms. In terms of precision, on VOT2016 PCAM is better than the other comparison algorithms, exceeding DSiam-Att [21], CFFA [18], and ECO [25] by more than 1%, while on VOT2018 it is slightly inferior to DSiam-Att [21], giving the second-best performance. In terms of EAO, PCAM has better overall performance than the other algorithms. Compared with the CFFA [18] algorithm, it improves in every aspect.
As can be seen from Figs 6 and 7, PCAM shows some advantages in both accuracy and EAO. The quantitative comparison shows that PCAM performs well on evaluation indexes such as tracking precision, tracking accuracy, success rate, robustness and expected average overlap, and that it is robust to complex and variable tracking environments.
In order to have a more intuitive and clear view of the tracking performance of
the PCAM in different complex environments, 10 representative scenarios in
OTB2015 are selected for comparison experiments in this paper. The chosen
trackers for comparison experiments are DSiam-Att [21], CNNSCAM [23], and SiamRPN
Background clutters. In scenarios Basketball,
Bolt2, Football, Liquor and other video sequences with similar
backgrounds, the tracking accuracy of PCAM and DSiam-Att is better.
However, for CNNSCAM and SiamRPN
Out of view. In Bird1 and Liquor, the target is out
of view or partially out of view. Especially in Bird1, the tracking
results of the four algorithms are not satisfactory due to the
influence of clouds, but in contrast, the PCAM is effective in the
follow-up tracking. The other three algorithms perform poorly
in follow-up tracking; in particular, the target was
lost after 130 frames, and it was not detected and tracked
well when it reappeared.
IPR. In the scenes Basketball, Biker, and Jump, the
target appears to rotate in plane, causing both the state and size
of the target to change. In Basketball and Biker the tracking
performance of all four algorithms is satisfactory, but SiamRPN
Occlusion. In the scenes Basketball, Football
and Girl, the target is occluded. In Basketball, the target features
differ from those of the occluder, and all four algorithms are
able to recognize and track the target successfully once the occlusion is
removed. In Football and Girl, however, the target
is quite similar to the occluder, which causes target tracking
to fail. For the cases of motion blur, camera shake and illumination variation,
the PCAM and DSiam-Att have relatively better tracking performance;
although the tracking precision of CNNSCAM and SiamRPN
In this paper, the algorithm is designed and implemented on the basis of the feature cascade of CFFA [18]. To make full use of channel and spatial features when extracting target features from the cascaded feature fusion, the SENet block and ECA module are analyzed, and the advantages of both are used to construct a parallel channel AM, which enhances important features and suppresses unimportant ones to improve the generalization ability of the model. The output of the channel attention model is used as the input of the spatial attention model. The spatial AM is an effective complement to the channel attention mechanism: it calculates weights for different spatial locations and builds structural links between them, which enhances the feature expression ability of the model. Finally, comparison experiments show that the proposed algorithm has certain advantages in tracking precision and tracking success rate.
Funding
This work was supported by Youth Top Talent Project of Hebei province, Hebei Province “333 Talents Project” funding project (A202101102), and Plan Project of Shijiazhuang Science and Technology (NO. 221130321A).
