Research on Siamese network algorithm based on parallel channel attention mechanism for target tracking

Abstract

To improve tracking performance in complex environments, strengthening the target features has become an important line of work. This paper proposes a Siamese network target tracking algorithm based on a parallel channel attention mechanism (PCAM), which combines a feature cascade algorithm with visual attention. Firstly, the characteristics of the SENet and ECA networks are analyzed. Secondly, a parallel channel attention mechanism is constructed on the basis of the ECA module by integrating global average pooling and global maximum pooling. It addresses both the reduced channel correlation of the SENet module and the need to enhance target feature information. Thirdly, the output of the channel attention model is used as the input of a spatial attention model, which complements the channel attention mechanism: by calculating weight values for different spatial locations, the structural relation between spatial locations is modeled and the feature expression ability of the model is enhanced. Finally, the algorithm is evaluated on the standard data sets OTB100, OTB2013, OTB2015, VOT2016 and VOT2018. Experimental results show that the PCAM extracts features more effectively in complex environments and achieves higher tracking accuracy and robustness than the compared algorithms.

Keywords

Image processing, target tracking, Siamese network, parallel channel attention mechanism, spatial attention

1. Introduction

Visual target tracking is widely applied in fields such as intelligent surveillance, human-computer interaction, and unmanned vehicles [1, 2]. With the rapid development of deep learning, the capability and accuracy of target tracking have greatly improved. However, target tracking remains a challenging task, especially in real-world applications, where moving targets are often affected by illumination variation, scale variation, background clutter and heavy occlusion in unconstrained recording environments [3]. In addition, the appearance of non-rigid targets may change significantly due to extreme variations in posture.

Recently, deep learning has performed very well in target tracking, and network models such as the classical convolutional networks AlexNet and VGGNet, the Residual Network (ResNet), and the Region Proposal Network (RPN) have achieved good results on target tracking test datasets. Deep Convolutional Neural Networks (DCNN) generate a hierarchical, three-dimensional representation of the image across shallow and deep layers. In a DCNN, the shallow features express the appearance of the image and the deep features express its semantics, which improves the constancy of the target appearance representation. Siamese network-based target tracking has become a key research direction in recent years because it balances computing speed and tracking accuracy. In long-term tracking and under occlusion, losing the target is almost unavoidable, so a more precise and efficient network structure is required when the target is occluded or must be tracked over a long period. To address this problem, SiamRPN [4], SiamRPN $++$ [5] and DaSiamRPN [6] have extended the search strategy from local to global to detect whether a target is lost during tracking. Qin et al. [7] designed a template update method aimed at solving the problem of target loss due to occlusion.

To enable target tracking algorithms to focus on the features that are useful for tracking, in both spatial location and channel, the attention mechanism (AM) [8] is applied to target tracking. The basic idea of visual AM is to focus on the important information in an image and ignore the unimportant information. AM is widely used in target detection, where it lets the model learn to model spatial, feature-channel and background information during training, and thereby enhances the feature representation of targets by convolutional neural networks. Many researchers have introduced AM into Siamese networks. First, the channel attention module assigns different weights to different channels of the image, which enriches the appearance semantics of the image features and directs more attention to the channel features of foreground targets. Second, the spatial attention module assigns different weights to different spatial positions on the image feature map, increasing the spatial location weights of the foreground targets and thus highlighting them.

Wang et al. [9] proposed a residual attention network that uses the attention module in an encoding and decoding mode. By redefining the feature map, the algorithm not only performs better but is also more robust to noise. It enhances the key features of the image through the residual attention mechanism and adapts well to target tracking in complex environments with occlusion, background clutter and illumination variation. However, with the attention module added to the network, the real-time performance of the algorithm still needs to be improved. Hu et al. [10] used the features of an average pooling layer to compute the attention between channels. In references [11, 12], a residual network is used as the backbone, and an efficient channel attention module is added after the backbone to increase the use of first-frame information; the module computes the global average pooling (GAP) and maximum pooling of the input features in each channel. In reference [11], interactive learning across channels is performed without reducing the number of channels, which provides more information about target features and further suppresses distracting features, effectively handling situations such as large deformation and rotation during tracking. In reference [12], however, the channel dependencies are limited to $K$ ( $K<9$ ) adjacent channels to speed up channel modeling. These studies attend to the feature expression of each channel through channel attention, but ignore the importance of each feature point for the overall feature. Song [13] proposed an adaptive Siamese network tracking algorithm for overall feature channel recognition. The algorithm uses ResNet22 as the backbone network, adds an efficient channel AM in the fourth convolutional layer of the Conv3 stage, and uses an overall feature recognition function to calculate the global information after feature extraction, obtaining the dependencies between individual channels in the overall features. To improve the model's attention to key features, enlarge the foreground contribution, suppress background features and make full use of spatial information, many researchers have combined channel attention with spatial attention and proposed Spatial-Temporal Attention Networks or Joint Attention Networks, thus enhancing the ability of convolutional networks to discriminate positive samples. In references [14, 15, 16], a global joint attention mechanism is designed for further processing of the extracted features to enhance the discriminative ability of the network.

Recently, AM has been widely applied in target classification and tracking tasks. Hu et al. [10] proposed the Squeeze-and-Excitation (SE) network structure, which consists of a squeeze step and an excitation step. By modeling the relevance of feature channels, the SE network automatically learns the importance of different channels, enhancing important channel features and suppressing unimportant ones, and thus improves image classification accuracy. Using the attention mechanism, Woo et al. [17] proposed the Siamese network tracker RASNet, but the tracker uses only template information, which limits the representation of target features. Detail features and attention are computed and used separately in the target templates and search templates, which limits its performance.

In reference [18], the authors effectively fused deep features with shallow features. However, the feature fusion in [18] only cascades the three layers of features in the last stage of the ResNet-50 network and treats the feature information within each channel equally, which does not reflect how important different feature information is for target localization and tracking. As a result, the focus on target information is not reflected in the channel representation: features that are important for tracking are not enhanced, and unimportant features are not suppressed. How to reflect the feature weights of different channels and different locations in the target image has therefore become the main research topic in target feature reinforcement, since enhanced target features are of great significance for subsequent localization and tracking. In this paper, visual AM is introduced by adding channel AM and spatial AM to the feature cascade of reference [18]. Existing channel attention mechanisms reduce the dimensionality of the input features by either GAP or maximum pooling, focusing on either global information or detail information, which results in less complete feature extraction. In this paper, based on the ECA module, GAP and max pooling are applied in parallel, and a parallel channel AM is proposed.

2. The general structure of the algorithm

To make better use of the visual attention mechanism for target feature representation in the Siamese network, a parallel AM is added to the feature fusion of reference [18]. The 5-stage ResNet-50 is used as the backbone network of the Siamese network. The 3-layer features in the last stage are cascaded to obtain the fused features, which are then used as the input of the parallel attention mechanism module to improve the mutual attention between the template area and the search area.

The output of the parallel channel attention mechanism is the input of the spatial and channel attention module, and the features after spatial location and channel enhancement are finally output. The overall structure is shown in Fig. 1.

Figure 1.

Network structure of Siamese parallel channels attention mechanism.

3. Parallel channels AM

In target tracking, the network should extract channel features quickly and efficiently and use different channel features to model the deep information of the tracked target from different angles, thereby improving the feature extraction ability of the model. For different tracking targets, the importance of each feature channel varies, and different channel features respond differently to different targets and tracking scenes; moreover, there are interdependencies between channels.

Changes in the appearance of the tracked target can be accurately expressed by the deep semantic features of a CNN. Different channels contribute differently to the target: some channels make the target more discriminative, while others have little effect. Therefore, studying how different channels represent the target, in particular enhancing the important features, attenuating the unimportant ones, and modeling their relationship, is important for improving the discriminative power of the model and the efficiency of tracking. For this reason, the channel AM is introduced, and the discriminative power of the model is improved through importance selection of the feature channels.

Figure 2.

SENet block and ECA (Efficient Channel Attention) module.

Firstly, the channel AM of SENet is analyzed; then, following the ECA (Efficient Channel Attention) module in ECA-Net [19], the interrelationship between deep feature channels is modeled and the importance weight vector of the channel features is obtained on the basis of the cascade feature fusion.

3.1 SENet block

SENet's channel attention mechanism uses a multi-layer perceptron to calculate the channel weights, as shown in Fig. 2a.

The channel attention model contains three parts. The first part is the summation and squeezing of global spatial feature information by global average pooling to generate the respective channel features. The statistical value of each channel feature is

$\displaystyle z_{c}=F_{\textit{squeeze}}(u_{c})=\frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W}u_{c}(i,j)$ (1)

where Eq. (1) converts the input of size $H\times W\times C$ into an output of size $1\times 1\times C$ , $H$ and $W$ denote the spatial size of the input feature, and $u_{c}$ is the $c$-th channel of the input feature.

The channel features derived from Eq. (1) are global channel information, which increases the receptive field of the target.

The second part is excitation. The first fully connected layer reduces the number of channels from $C$ to $C/r$ ( $r$ is the dimensionality reduction ratio), which reduces the computational effort. The output dimension remains unchanged after passing through a ReLU function. The second fully connected layer restores the number of channels to $C$ and is followed by a Sigmoid function.

After squeezing and excitation, the feature weights of each channel domain and the interrelationships between channels, as well as activation $s$ , are obtained during training.

$\displaystyle s=F_{\textit{excitation}}(z,W)=\textit{Sig}(g(z,W))=\textit{Sig}(\mathbf{W}_{2}\,\text{ReLU}(\mathbf{W}_{1}z))$ (2)

where $\mathbf{W}_{1}\in R^{\frac{C}{r}\times C}$ and $\mathbf{W}_{2}\in R^{C\times\frac{C}{r}}$ are the weights of the first and second fully connected layers, respectively.

The extensive use of fully connected layers increases the number of parameters, which lowers the computational speed and the real-time performance of the algorithm.

The third part is the attention module. The output features $O_{c}$ are obtained by multiplying the activation $s_{c}$ with the original channel features $u_{c}$ channel by channel.

$\displaystyle O_{c}=F_{\textit{Sc}}(u_{c},s_{c})=s_{c}u_{c}$ (3)

where $s_{c}$ denotes the $c$-th channel of activation $s$ .
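
The three parts of the SE block in Eqs. (1) to (3) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names, the nested-list layout (channels first) and the toy weights are assumptions made here, and the fully connected layers are taken to be bias-free, as Eq. (2) suggests.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def matvec(m, v):
    # Plain matrix-vector product; m is a list of rows.
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def se_block(u, w1, w2):
    """u: C channels, each an H x W nested list.
    w1: (C/r) x C weights, w2: C x (C/r) weights (no bias, an assumption)."""
    # Squeeze, Eq. (1): global average pooling per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in u]
    # Excitation, Eq. (2): FC -> ReLU -> FC -> Sigmoid.
    s = sigmoid(matvec(w2, relu(matvec(w1, z))))
    # Scale, Eq. (3): O_c = s_c * u_c.
    out = [[[s_c * x for x in row] for row in ch] for s_c, ch in zip(s, u)]
    return z, s, out

# Toy input: C = 2 channels of size 2 x 2, reduction ratio r = 2 (C/r = 1).
u = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, 0.0], [0.0, 8.0]]]
w1 = [[0.5, 0.5]]        # hypothetical weights of the first FC layer
w2 = [[1.0], [-1.0]]     # hypothetical weights of the second FC layer
z, s, out = se_block(u, w1, w2)
print(z, s)
```

In this toy case the second channel receives a small weight and is attenuated, while the first channel is passed through almost unchanged.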

3.2 ECA module

The channel features are found to be periodic after visualization, so we only consider the information interaction between the current channel and its neighboring $k$ channels. An adaptive convolution kernel size $k$ is defined, and information is exchanged across channels by a fast 1-d convolution of size $k$ . The structure of the ECA module is shown in Fig. 2b.

As in SENet, the global spatial feature information of the initial input features $F$ is first summed and squeezed by GAP to generate the respective channel features. The resulting feature vector $\mathbf{f}=(f_{1},f_{2},\ldots,f_{c})$ is used as the input of the one-dimensional convolutional layer, and the weight vector $\mathbf{w}=(w_{1},w_{2},\ldots,w_{c})$ is obtained after activation by the Sigmoid function. The weight vector is then multiplied with the initial input features $F$ element by element to obtain the features filtered by the attention module.

The weight vectors of each channel obtained after the Sigmoid activation function are:

$\displaystyle\bm{\omega}_{i}=\sigma\left(\sum_{j=1}^{k}w^{j}y_{i}^{j}\right),\quad y_{i}^{j}\in\Omega_{i}^{k}$ (4)

where $\Omega_{i}^{k}$ denotes the set of $k$ adjacent channels of $y_{i}$ , $\sigma$ denotes the Sigmoid function, $\mathbf{y}=\frac{1}{H\times W}\sum_{i=1,j=1}^{W,H}\chi_{ij}$ is the channel-wise global average, and $\bm{\chi}\in R^{W\times H\times C}$ denotes the output of the convolutional block. To keep the cross-channel interaction both efficient and effective, the band matrix $\mathbf{W}_{k}$ is used to learn the channel attention:

$\displaystyle\mathbf{W}_{k}=\begin{bmatrix}w^{1,1}&\cdots&w^{1,k}&0&0&\cdots&\cdots&0\\ 0&w^{2,2}&\cdots&w^{2,k+1}&0&\cdots&\cdots&0\\ \vdots&\vdots&\vdots&\cdots&\ddots&\vdots&\vdots&\vdots\\ 0&\cdots&0&0&\cdots&w^{C,C-k+1}&\cdots&w^{C,C}\\ \end{bmatrix}$

The size $k$ of the 1-d convolution kernel grows with the channel dimension $C$ . $C$ is modeled as a non-linear (exponential) function of $k$ :

$\displaystyle C=\phi(k)=2^{(\gamma*k-b)}$ (5)

Given $C$ , the adaptive kernel size $k$ is obtained by inverting Eq. (5):

$\displaystyle k=\psi(C)=\left|\frac{\log_{2}(C)}{\gamma}+\frac{b}{\gamma}\right|_{\textit{odd}}$ (6)

where $\left|t\right|_{\textit{odd}}$ denotes the odd number nearest to $t$ . In this paper, $\gamma=2$ and $b=1$ .
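
As a worked example of Eq. (6), the kernel size can be computed for a few channel counts. The sketch below uses the slope parameter of Eq. (5) (called `gamma` in the code) and offset `b` with the values 2 and 1 given in the text. The function name `eca_kernel_size` is hypothetical, and the rounding rule is an assumption: "nearest odd number" is implemented here by truncating to an integer and moving an even result up to the next odd one, which is one common reading and may differ from the authors' implementation.

```python
import math

def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1-d kernel size from Eq. (6).

    Assumption: the 'odd' operator truncates and then bumps even values up by one.
    """
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 == 1 else t + 1

# Channel counts chosen only for illustration.
for c in (64, 128, 256, 512, 1024, 2048):
    print(c, eca_kernel_size(c))
```

Under this rounding rule the kernel stays small (3 to 7) over a wide range of channel counts, which is why the cross-channel convolution adds very few parameters.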

In SENet, two FC layers with dimensionality reduction are used to calculate the weights, which reduces the channel parameters and the complexity of the model. However, the dimensionality reduction also has side effects on the prediction of channel attention and reduces the relevance between channels. To solve these two problems, the ECA module of ECA-Net is introduced into the Siamese network. It defines an adaptive convolution kernel size $k$ by considering each channel and its nearest neighbors while avoiding dimensionality reduction, and achieves cross-channel information interaction through a fast 1-d convolution of size $k$ .
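
The link between the band matrix $\mathbf{W}_{k}$ and the fast 1-d convolution can be checked numerically. The sketch below makes several assumptions that are not stated in the text: the $k$ weights are shared by all channels, the window is centred on the current channel, $k$ is odd, and the channel vector is zero-padded at both ends. All function names are hypothetical.

```python
def conv1d_same(y, kernel):
    """1-d cross-correlation along the channel axis with zero padding (odd k)."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(y) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k)) for i in range(len(y))]

def band_matrix(kernel, channels):
    """C x C band matrix whose rows hold the shared kernel around the diagonal."""
    k = len(kernel)
    pad = k // 2
    m = [[0.0] * channels for _ in range(channels)]
    for i in range(channels):
        for j in range(k):
            col = i + j - pad
            if 0 <= col < channels:
                m[i][col] = kernel[j]
    return m

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

y = [1.0, 2.0, 3.0, 4.0]      # pooled channel descriptors (toy values)
kernel = [1.0, 1.0, 1.0]      # hypothetical shared weights, k = 3
via_conv = conv1d_same(y, kernel)
via_matrix = matvec(band_matrix(kernel, len(y)), y)
print(via_conv, via_matrix)
```

Both routes give the same result, but the convolution only stores $k$ weights, whereas the two fully connected layers of the SE block store $2C^{2}/r$ weights.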

3.3 Parallel channel attention mechanism

Both SENet and ECA modules compress feature information by global average pooling, but global average pooling is only effective for the overall recognition of the target, and some detailed features of the target will be lost, while maximum pooling is effective for the detailed feature information of the target. In view of this, a parallel channel AM is constructed on the basis of ECA module with the fusion of global average pooling and maximum pooling. It addresses both the problem of channel relevance reduction of SENet module and the problem of target feature information enhancement. The structure of the parallel channel AM is shown in Fig. 3.

Figure 3.

Parallel channel attention mechanism

The Global Max_Pool (GMP) and Global Avg_Pool (GAP) are combined to generate the feature fusion of the channel by parallel channel attention mechanism to achieve the joint attention of global and detail information of the target features. The detail features are extracted by GMP and the global overall features are extracted by GAP, and the combination of them enhances the attention mechanism of the channel, improves the extraction of important features and suppresses the influence of non-important features, thus improving the generalization ability of the channel feature extraction.

In this parallel channel processing mechanism, firstly, GMP and GAP are used by the upper and lower parallel channel attention branches, respectively, to obtain a one-dimensional feature vector as the input of the one-dimensional convolution layer; secondly, a one-dimensional convolution with an adaptive kernel size $k$ is performed to achieve cross-channel information interaction; then, the respective weight vectors are obtained via the Sigmoid function and added element by element; finally, the original input features are multiplied element by element with the summed weight vector to get the fused target feature map.

The parallel channel attention mechanism works in the following steps.

Step 1: Calculate the GAP and GMP of feature map $\mathbf{F}^{H\times W\times C}$ as follows:

$\displaystyle\mathbf{y}_{\textit{Gap}}=\frac{1}{H\times W}\sum_{i=1,j=1}^{W,H}\chi_{ij}$ (7)

$\displaystyle\mathbf{y}_{\textit{Gmp}}=\max_{i,j}\chi_{ij}$ (8)

where, $\bm{\chi}\in R^{W\times H\times C}$ is the output vector of the convolution block.

The respective feature vectors are obtained after the following calculations.

$\displaystyle\mathbf{y}_{\textit{Gap}}=(y_{1},y_{2},\ldots,y_{c})$ (9)

$\displaystyle\mathbf{y}_{\textit{Gmp}}=(y_{1}^{\prime},y_{2}^{\prime},\ldots,y_{c}^{\prime})$ (10)

where $y_{i},y_{i}^{\prime}\in R$ are the pooled responses of the $i$-th channel, $i=1,2,3,\ldots,C$ .

Step 2: The interrelationships between adjacent channels are calculated by one-dimensional convolution, respectively.

$\displaystyle\mathbf{f}_{\textit{Gap}}=\sum_{j=1}^{k}w^{j}y_{i}^{j},\quad y_{i}^{j}\in\Omega_{i}^{k}$ (11)

$\displaystyle\mathbf{f}_{\textit{Gmp}}=\sum_{j=1}^{k}w^{j}y_{i}^{\prime j},\quad y_{i}^{\prime j}\in\Omega_{i}^{k}$ (12)

where, $\Omega_{i}^{k}$ denotes the set of $k$ adjacent channels of $y_{i}$ .

Step 3: The weight vector of each branch is obtained after passing through the Sigmoid function.

$\displaystyle\bm{\omega}_{Ai}=\sigma\left(\sum_{j=1}^{k}w^{j}y_{i}^{j}\right),\quad y_{i}^{j}\in\Omega_{i}^{k}$ (13)

$\displaystyle\bm{\omega}_{Mi}=\sigma\left(\sum_{j=1}^{k}w^{j}y_{i}^{\prime j}\right),\quad y_{i}^{\prime j}\in\Omega_{i}^{k}$ (14)

where, $\sigma$ denotes the Sigmoid function.

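Step 4: The two weight vectors are weighted and summed channel by channel:

$\displaystyle\bm{\omega}=\lambda_{A}\bm{\omega}_{Ai}\oplus\lambda_{M}\bm{\omega}_{Mi}$ (15)

where $\lambda_{A}$ and $\lambda_{M}$ are the weighting coefficients of the GAP and GMP branches, and $\oplus$ denotes element-wise addition.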

Step 5: The input features of the image $F^{H\times W\times C}$ are multiplied channel by channel with the feature weights obtained from Step 4 to obtain the fused features, which are calculated as follows.

$\displaystyle\mathbf{S}^{H\times W\times C}=\mathbf{F}^{H\times W\times C}\otimes\bm{\omega}$ (16)
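
Steps 1 to 5 can be put together in a small plain-Python sketch. It is meant only to make the data flow concrete and is not the authors' implementation. The assumptions are: features are stored channels-first as nested lists, the 1-d convolution is centred and zero-padded, each branch may use its own kernel (pass the same list twice to share it), and the branch coefficients default to 0.5, a value chosen here because the text does not give one.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conv1d_same(y, kernel):
    """Centred, zero-padded 1-d cross-correlation over channels (odd kernel size)."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(y) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k)) for i in range(len(y))]

def pcam(feature, kernel_avg, kernel_max, lam_a=0.5, lam_m=0.5):
    """feature: C channels, each an H x W nested list. Returns (weights, fused)."""
    # Step 1, Eqs. (7)-(10): global average and global max pooling per channel.
    y_gap = [sum(sum(r) for r in ch) / (len(ch) * len(ch[0])) for ch in feature]
    y_gmp = [max(max(r) for r in ch) for ch in feature]
    # Step 2, Eqs. (11)-(12): 1-d convolution across neighbouring channels.
    f_gap = conv1d_same(y_gap, kernel_avg)
    f_gmp = conv1d_same(y_gmp, kernel_max)
    # Step 3, Eqs. (13)-(14): Sigmoid gives one weight per channel and branch.
    w_a = [sigmoid(v) for v in f_gap]
    w_m = [sigmoid(v) for v in f_gmp]
    # Step 4, Eq. (15): weighted element-wise sum of the two weight vectors.
    w = [lam_a * a + lam_m * m for a, m in zip(w_a, w_m)]
    # Step 5, Eq. (16): rescale every channel of the input by its weight.
    fused = [[[w_c * x for x in row] for row in ch] for w_c, ch in zip(w, feature)]
    return w, fused

# Toy feature map with C = 3 channels of size 2 x 2.
feature = [[[0.0, 0.0], [0.0, 0.0]],
           [[1.0, 2.0], [3.0, 4.0]],
           [[-1.0, 0.0], [0.0, 5.0]]]
kernel = [0.25, 0.5, 0.25]   # hypothetical learned weights, k = 3
weights, fused = pcam(feature, kernel, kernel)
print(weights)
```

The third channel shows why the two poolings are complementary: its average is small because most entries are near zero, while its maximum keeps the single strong response.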

4. Spatial attention mechanism

The spatial AM is an effective complement to the channel AM. It pays more attention to the spatial location information of the image: it calculates weight values for different spatial locations and builds structural links between them, which enhances the feature expression ability of the model. The structure of the spatial AM is shown in Fig. 4.

Figure 4.

Spatial AM.

The spatial attention module takes the fused features $\mathbf{S}^{H\times W\times C}$ obtained from the channel attention module as input features. Firstly, the input fused channel domain features are compressed by average pooling and maximum pooling, and connected by the connection operation to merge the results; secondly, the multi-channel features are compressed into a single channel by the 1 $\times$ 1 convolution operation to eliminate the influence of inter-channel information dependence on the spatial attention; and finally, the spatial weights of the image are obtained by normalization through the Sigmoid activation function and then they are multiplied with $\mathbf{S}^{H\times W\times C}$ element by element to generate target features of different weights $\mathbf{T}^{H\times W\times C}$ . The spatial attention weight matrix is as follows.

$\displaystyle\mathbf{A}(\mathbf{S})=\sigma\left(f^{1\times 1}\left(\left[\textit{AvgPool}(\mathbf{S}^{H\times W\times C});\textit{MaxPool}(\mathbf{S}^{H\times W\times C})\right]\right)\right)$ (17)

The feature map that is sensitive to position importance is obtained by element-by-element multiplication of the weights in Eq. (17) with $\mathbf{S}^{H\times W\times C}$ .

$\displaystyle\mathbf{T}^{H\times W\times C}=\mathbf{S}^{H\times W\times C}\otimes\mathbf{A}(\mathbf{S})$ (18)

where $\sigma$ is Sigmoid, $f^{1\times 1}$ is the convolutional layer with a convolutional kernel size of $1\times 1$ , and $\left[\textit{AvgPool}(\mathbf{S}^{H\times W\times C});\textit{MaxPool}(\mathbf{S}^{H\times W\times C})\right]$ denotes the connecting operation after pooling.
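
Eqs. (17) and (18) can be illustrated with the following plain-Python sketch. It assumes that the average and maximum pooling are taken across the channel axis at every spatial position, so that the concatenated map has two channels and the $1\times 1$ convolution reduces to two scalar weights plus an optional bias. The function name and the weight values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def spatial_attention(s, w_avg, w_max, bias=0.0):
    """s: C channels, each an H x W nested list. Returns (A, T).

    Assumption: pooling runs over the channel axis; the 1x1 convolution
    over the two pooled maps is the scalar pair (w_avg, w_max) plus bias.
    """
    c, h, w = len(s), len(s[0]), len(s[0][0])
    attn = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            vals = [s[ch][i][j] for ch in range(c)]
            avg_pool = sum(vals) / c
            max_pool = max(vals)
            # Eq. (17): Sigmoid of the 1x1 convolution of [AvgPool; MaxPool].
            attn[i][j] = sigmoid(w_avg * avg_pool + w_max * max_pool + bias)
    # Eq. (18): every channel is rescaled by the same spatial weight map.
    t = [[[s[ch][i][j] * attn[i][j] for j in range(w)] for i in range(h)]
         for ch in range(c)]
    return attn, t

# Toy fused feature S with C = 2 channels of size 2 x 2.
s_feat = [[[0.0, 1.0], [2.0, 3.0]],
          [[0.0, -1.0], [4.0, 1.0]]]
attn, t_feat = spatial_attention(s_feat, w_avg=0.5, w_max=0.5)
print(attn)
```

Positions where both pooled responses are large receive weights close to 1, while positions with no response stay at the neutral value 0.5 when the bias is zero.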

5. Loss function

The loss function defined in the whole training process of the network is as follows:

$\displaystyle l(y,\nu)=\log(1+e^{-y\nu})$ (19)

where $y\in\{+1,-1\}$ is the true label of each point on the response map, with $+1$ denoting a positive sample and $-1$ denoting a negative sample, and $\nu$ is the actual score between the sample image and the search image. The probability of a positive sample is $\frac{1}{1+e^{-\nu}}$ and the probability of a negative sample is $1-\frac{1}{1+e^{-\nu}}$ , so Eq. (19) follows directly from the cross-entropy formula.
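
The relation between Eq. (19) and cross entropy can be verified numerically. The sketch below assumes that the probability of a positive sample is the Sigmoid of the score $\nu$, and that the loss of a whole response map is the mean of the point-wise losses; the latter is a common choice and is not stated in the text. Function names are hypothetical.

```python
import math

def logistic_loss(y, v):
    """Eq. (19): l(y, v) = log(1 + exp(-y * v)), with y in {+1, -1}."""
    return math.log(1.0 + math.exp(-y * v))

def cross_entropy(y, v):
    """Binary cross entropy, assuming p(positive) = sigmoid(v)."""
    p = 1.0 / (1.0 + math.exp(-v))
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def response_map_loss(labels, scores):
    """Mean point-wise loss over a response map (assumed reduction)."""
    pairs = [(y, v) for lr, sr in zip(labels, scores) for y, v in zip(lr, sr)]
    return sum(logistic_loss(y, v) for y, v in pairs) / len(pairs)

# Toy 2 x 2 response map: one positive location, three negative ones.
labels = [[1, -1], [-1, -1]]
scores = [[2.0, -1.5], [0.3, -3.0]]
print(response_map_loss(labels, scores))
for y, v in [(1, 2.0), (-1, 2.0), (1, -0.7), (-1, -0.7)]:
    print(y, v, logistic_loss(y, v), cross_entropy(y, v))
```

For every label and score the two functions agree up to floating-point error, which is the equivalence the text refers to.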

6. Results and analysis of experiments

6.1 Experimental environment and data set

The experiments are conducted in the framework of Python 3.7 and PyTorch 1.2, and the configuration of running platform is: Intel(R) Xeon(R) CPU E5-2660 V2 @3.50 GHz $\times$ 40, NVIDIA GTX 1080Ti GPUs $\times$ 2, and 24 GB of RAM.

The network is trained offline on the GOT10K dataset, which consists of more than 10,000 video sequences of real moving objects with over 1.5 million manually annotated bounding boxes, covering more than 560 categories in total; the validation and test sets each contain more than 180 video sequences. The learning rate for offline training decays exponentially from an initial value of 10 ${}^{-2}$ to 10 ${}^{-5}$ .
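
An exponential decay between the two stated learning rates can be written as a geometric sequence. The sketch below assumes that the rate is updated once per epoch and uses a hypothetical number of epochs, since neither the update interval nor the number of epochs is given in the text.

```python
import math

def exponential_lr(initial=1e-2, final=1e-5, num_epochs=20):
    """Learning rates that decay geometrically from `initial` to `final`.

    num_epochs is a hypothetical value; the paper does not state it.
    """
    ratio = (final / initial) ** (1.0 / (num_epochs - 1))
    return [initial * ratio ** epoch for epoch in range(num_epochs)]

schedule = exponential_lr()
print(schedule[0], schedule[-1])
```

Each epoch multiplies the rate by the same constant factor, so the schedule is a straight line on a logarithmic axis.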

In this paper, the algorithms are evaluated by widely used standard datasets OTB100, OTB2013, OTB2015, VOT2016, and VOT2018, where OTB100, OTB2013, and OTB2015 are used for quantitative analysis, and VOT2016 and VOT2018 are used for qualitative analysis. The accuracy as well as the robustness of the PCAM algorithm is verified by comparison experiments with the feature cascade algorithm in Section 2 and existing prevailing algorithms.

6.2 Ablation experiment: Quantitative analysis on the OTB dataset

To demonstrate the effectiveness of the proposed modules, an ablation experiment was conducted, with a quantitative analysis of overlap rate and center error on OTB100, OTB2013, and OTB2015. The PCAM is compared on these three datasets with existing prevailing algorithms, including SiamDW-RPN [20], SiamRPN $++$ [5], DSiam-Att [21], SRDCF [22], CFFA [18], CNNSCAM [23], and SiamFC [24], and the results are shown in Fig. 5.

Table 1
Tracking precision results of different algorithms on OTB2013

Attributes	Ours	DSiam-Att [21]	CFFA [18]	SiamRPN $++$ [5]	SiamDW-RPN [20]	CNNSCAM [23]	SRDCF [22]	SiamFC [24]
BC	0.822	0.813	0.802	0.785	0.768	0.803	0.775	0.710
OCC	0.843	0.839	0.821	0.791	0.813	0.806	0.770	0.740
IV	0.819	0.823	0.817	0.827	0.790	0.795	0.752	0.759
SV	0.831	0.802	0.820	0.795	0.816	0.789	0.743	0.775
DEF	0.817	0.822	0.813	0.778	0.793	0.781	0.727	0.755
FM	0.807	0.793	0.779	0.769	0.757	0.763	0.710	0.726
MB	0.821	0.829	0.815	0.793	0.763	0.721	0.699	0.711
IPR	0.795	0.814	0.790	0.776	0.787	0.732	0.658	0.695
OPR	0.803	0.797	0.780	0.780	0.765	0.754	0.702	0.690
LR	0.912	0.895	0.851	0.860	0.876	0.858	0.793	0.810
OV	0.843	0.839	0.822	0.826	0.835	0.822	0.795	0.680

Table 2

Tracking precision comparison of different algorithms on OTB2015

Attributes	Ours	DSiam-Att [21]	CFFA [18]	SiamRPN $++$ [5]	SiamDW-RPN [20]	CNNSCAM [23]	SRDCF [22]	SiamFC [24]
BC	0.824	0.813	0.813	0.789	0.795	0.803	0.765	0.710
OCC	0.830	0.830	0.825	0.782	0.791	0.787	0.741	0.759
IV	0.798	0.812	0.815	0.817	0.795	0.783	0.724	0.755
SV	0.831	0.820	0.827	0.795	0.802	0.789	0.752	0.775
DEF	0.811	0.817	0.801	0.775	0.791	0.779	0.728	0.764
FM	0.815	0.815	0.806	0.783	0.799	0.768	0.745	0.788
MB	0.822	0.829	0.810	0.795	0.783	0.745	0.671	0.725
IPR	0.800	0.821	0.802	0.772	0.775	0.712	0.660	0.710
OPR	0.781	0.787	0.780	0.775	0.732	0.751	0.685	0.690
LR	0.915	0.902	0.865	0.872	0.869	0.853	0.757	0.805
OV	0.835	0.827	0.812	0.815	0.826	0.823	0.810	0.675

Figure 5.

The tracking accuracy and success rate curves of 8 different algorithms on three kinds of data sets.

Figure 5 shows that the tracking precision and success rate of the PCAM algorithm compare favorably with the other algorithms on the three data sets. On OTB100 the success rate of the PCAM is 76%, slightly lower than the 76.9% of DSiam-Att [21], while on OTB2013 and OTB2015 the PCAM outperforms the other algorithms. In addition, the tracking center error of the PCAM is better than that of the other algorithms, which indicates that the parallel channel AM plays an important role in accurate localization and feature enhancement of the target.

The tracking precision of the algorithms is compared on OTB2013 and OTB2015 across 11 video attributes: Background Clutter (BC), Occlusion (OCC), Illumination Variation (IV), Scale Variation (SV), Deformation (DEF), Fast Motion (FM), Motion Blur (MB), In-plane Rotation (IPR), Out-of-plane Rotation (OPR), Low Resolution (LR), and Out of View (OV). The tracking precision of the different algorithms for the 11 attributes on the OTB2013 dataset is shown in Table 1, and that on the OTB2015 dataset is shown in Table 2.

As seen from Tables 1 and 2, on the OTB2013 dataset the PCAM achieves the highest tracking precision on 7 of the 11 video attributes (BC, OCC, SV, FM, OPR, LR and OV). On the OTB2015 dataset the picture differs slightly: the PCAM has the highest precision on BC, SV, LR and OV and ties with DSiam-Att [21] for the highest precision on OCC and FM. On most of the remaining attributes the PCAM ranks second (DEF, MB and IPR on OTB2013; DEF, MB and OPR on OTB2015), and its weakest attribute is IV, where SiamRPN $++$ [5] is the most precise. Compared with the CFFA algorithm [18], the tracking precision improves by 2.0, 2.2, 1.1, 2.8 and 6.1 percentage points on BC, OCC, SV, FM and LR in the OTB2013 dataset, and by 1.1, 0.5, 0.4, 0.9 and 5.0 percentage points on the same five attributes in the OTB2015 dataset.

ResNet-50, a five-stage network, is used as the backbone of the Siamese network and requires a moderate amount of computation. The added parallel channel attention module and spatial attention module improve the overall performance of the network and have little impact on the overall running speed of the algorithm. The tracking speeds of the proposed algorithm and the compared algorithms on the data sets OTB100, OTB2013 and OTB2015 are shown in Table 3.

Table 3

Tracking speed comparison

Tracker	OTB100	OTB2013	OTB2015
Ours	80	85	78
SiamDW-RPN [20]	75	78	70
SiamRPN $++$ [5]	80	86	75
DSiam-Att [21]	70	70	65
SRDCF [22]	35	37	35
CFFA [18]	50	65	60
CNNSCAM [23]	65	60	55
SiamFC [24]	80	90	85

As can be seen from Table 3, the running speed of the proposed algorithm is almost the same as that of SiamRPN $++$ , slightly higher on the OTB2015 data set (78 versus 75), and more than twice that of SRDCF. The proposed algorithm also runs faster than most of the compared trackers. In summary, it achieves a balance between tracking speed and performance.

6.3 Experimental analysis on the VOT dataset

The VOT2016 and VOT2018 datasets each contain 60 high-resolution video sequences, and they place more stringent requirements on the network model and provide more comprehensive performance evaluation criteria.

In this section, the PCAM is compared with DSiam-Att [21], CFFA [18], SiamRPN $++$ [5], SiamRPN [4], SRDCF [22], SASiam [26], ECO [25], DaSiamRPN [6], SPM [27] and SiamFC [24] on the VOT2016 and VOT2018 datasets to verify its performance in terms of Accuracy, Robustness, and Expected Average Overlap (EAO). The accuracy comparison and the EAO comparison are shown in Figs 6 and 7.

Table 4
Results of different algorithms on VOT2016 and VOT2018

	VOT2016			VOT2018
Tracker	Accuracy	Robustness	EAO	Accuracy	Robustness	EAO
Ours	0.65	0.19	0.441	0.63	0.22	0.423
DSiam-Att [21]	0.65	0.20	0.439	0.64	0.21	0.420
CFFA [18]	0.62	0.20	0.437	0.59	0.23	0.409
SiamRPN++ [5]	0.60	0.23	0.424	0.60	0.23	0.414
SiamRPN [4]	0.56	0.26	0.344	0.49	0.46	0.244
SASiam [26]	0.54	0.34	0.291	0.50	0.46	0.236
ECO [25]	0.55	0.20	0.375	0.48	0.27	0.280
DaSiamRPN [6]	0.61	0.22	0.411	0.56	0.34	0.326
SPM [27]	0.62	0.21	0.434	0.58	0.30	0.338
SRDCF [22]	0.54	0.39	0.325	0.51	0.43	0.301
SiamFC [24]	0.53	0.46	0.235	0.50	0.59	0.188

Figure 6.

Tracking precision comparison.

Figure 7.

EAO comparison.

Figure 8.

Qualitative analysis and comparison of tracking results of different algorithms in different scenarios.

As seen in Table 4, in terms of accuracy, the PCAM equals DSiam-Att [21] on the VOT2016 dataset and is slightly lower than DSiam-Att [21] on the VOT2018 dataset, but in both cases it is better than the other comparison algorithms. In terms of robustness, for which lower values are better in Table 4, the PCAM is the best on VOT2016, about 0.01 lower than DSiam-Att [21], CFFA [18] and ECO [25], while on VOT2018 it is slightly inferior to DSiam-Att [21] and ranks second. In terms of EAO, the PCAM has the best comprehensive performance on both datasets. Compared with the CFFA [18] algorithm, it improves on every metric.
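The rankings stated above can be checked against Table 4 with a few lines of Python. The sketch below re-enters the table values and assumes, as the table suggests, that accuracy and EAO are better when higher and robustness is better when lower; `table4`, `rank_of` and `beats` are illustrative names introduced here.

```python
# Rows copied from Table 4: (accuracy, robustness, EAO) on VOT2016, VOT2018.
table4 = {
    "Ours":      ((0.65, 0.19, 0.441), (0.63, 0.22, 0.423)),
    "DSiam-Att": ((0.65, 0.20, 0.439), (0.64, 0.21, 0.420)),
    "CFFA":      ((0.62, 0.20, 0.437), (0.59, 0.23, 0.409)),
    "SiamRPN++": ((0.60, 0.23, 0.424), (0.60, 0.23, 0.414)),
    "SiamRPN":   ((0.56, 0.26, 0.344), (0.49, 0.46, 0.244)),
    "SASiam":    ((0.54, 0.34, 0.291), (0.50, 0.46, 0.236)),
    "ECO":       ((0.55, 0.20, 0.375), (0.48, 0.27, 0.280)),
    "DaSiamRPN": ((0.61, 0.22, 0.411), (0.56, 0.34, 0.326)),
    "SPM":       ((0.62, 0.21, 0.434), (0.58, 0.30, 0.338)),
    "SRDCF":     ((0.54, 0.39, 0.325), (0.51, 0.43, 0.301)),
    "SiamFC":    ((0.53, 0.46, 0.235), (0.50, 0.59, 0.188)),
}
DATASETS = ("VOT2016", "VOT2018")
# Assumption: robustness is a failure measure, so lower is better.
HIGHER_IS_BETTER = (True, False, True)  # accuracy, robustness, EAO


def rank_of(tracker, d, m):
    """1-based rank on dataset d and metric m; tied trackers share a rank."""
    own = table4[tracker][d][m]
    values = [row[d][m] for row in table4.values()]
    if HIGHER_IS_BETTER[m]:
        return 1 + sum(v > own for v in values)
    return 1 + sum(v < own for v in values)


def beats(a, b):
    """True if tracker a is strictly better than b on every metric."""
    for d in range(len(DATASETS)):
        for m, higher in enumerate(HIGHER_IS_BETTER):
            x, y = table4[a][d][m], table4[b][d][m]
            if not (x > y if higher else x < y):
                return False
    return True


ranks = {name: tuple(rank_of("Ours", d, m) for m in range(3))
         for d, name in enumerate(DATASETS)}
print(ranks)
print(beats("Ours", "CFFA"))
```

Because ties share a rank, the equal accuracy of the PCAM and DSiam-Att on VOT2016 places both first on that metric.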

As can be seen from Figs 6 and 7, the PCAM shows some advantages in both accuracy and EAO. The quantitative experiments show that the PCAM performs well on evaluation indexes such as the tracking precision of the target center, success rate, accuracy, robustness and expected average overlap, and that it is robust to complex and variable tracking environments.

6.4 Qualitative analysis

To give a more intuitive view of the tracking performance of the PCAM in different complex environments, 10 representative scenarios in OTB2015 are selected for comparison experiments. The trackers chosen for comparison are DSiam-Att [21], CNNSCAM [23], and SiamRPN++ [5]. The scenarios are Basketball (FM, target similarity and occlusion, IPR), Biker (IPR), Bird1 (deformation, OV), BlurBody (MB, camera shake), Bolt2 (FM, BC), Car1 (IV), Football (BC, occlusion), Girl (SV, rotation, occlusion), Jump (IPR), and Liquor (BC, OV, SV). Figure 8 shows the comparison of the four algorithms, including the PCAM. The comparison shows that the PCAM has certain advantages in complex scenarios and localizes and tracks the target more accurately.

(1) Background clutters. In Basketball, Bolt2, Football, Liquor and other video sequences with similar backgrounds, the PCAM and DSiam-Att track more accurately. CNNSCAM and SiamRPN++ lose the target in the latter part of Bolt2, Football and Liquor because of the background clutter, whereas the PCAM is still able to locate and track it properly.
(2) Out of view. In Bird1 and Liquor, the target is fully or partially out of view. In Bird1 in particular, the tracking results of all four algorithms are unsatisfactory because of the influence of clouds, but the PCAM remains effective in the follow-up tracking. The other three algorithms perform poorly in the follow-up tracking: they lose the target after 130 frames and do not re-detect and track it well when it reappears.
(3) IPR. In Basketball, Biker and Jump, the target rotates in the image plane, which changes both its state and its size. In Basketball and Biker the tracking performance of all four algorithms is satisfactory, but SiamRPN++ shows relatively poor tracking accuracy in Jump after the 60th frame because of the large target rotation.
(4) Occlusion. In Basketball, Football and Girl, the target is occluded. In Basketball, the target features differ from those of the occluder, and all four algorithms recognize and track the target successfully once the occlusion is removed. In Football and Girl, however, the target is quite similar to the occluder, which causes the target tracking to fail.
(5) Motion blur, camera shake and illumination variation. In these cases the PCAM and DSiam-Att have relatively better tracking performance; although the tracking precision of CNNSCAM and SiamRPN++ is inferior, they do not lose the target.

7. Conclusion

In this paper, the algorithm is designed and implemented on the basis of the feature cascade of CFFA [18]. To make full use of channel features and spatial features and to enhance the extraction of target features on top of cascade feature fusion, the SENet block and the ECA module are analyzed, and the advantages of both are used to construct a parallel channel attention mechanism, which enhances important features and suppresses unimportant ones to improve the generalization ability of the model. The output of the channel attention model is used as the input of the spatial attention model, so that the spatial attention mechanism effectively complements the channel attention mechanism. It calculates the weight values of different spatial locations and builds structural links between the spatial location information, which enhances the feature expression ability of the model. Finally, comparison experiments show that the proposed algorithm has certain advantages in tracking precision and tracking success rate.
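The data flow summarized above (parallel average- and max-pooled channel descriptors, an ECA-style 1-D convolution across channels, then spatial attention applied to the channel-attended features) can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: the shared smoothing kernel, the fusion of the two branches by addition, and the replacement of the spatial convolution by a weighted sum of the mean and max maps are all assumptions made for brevity.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv1d_same(values, kernel):
    """1-D convolution over neighbouring channels with zero padding."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(values) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(values))]


def parallel_channel_attention(feat, kernel=(0.25, 0.5, 0.25)):
    """Scale each channel by a weight built from two pooled descriptors."""
    avg_desc = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feat]
    max_desc = [max(max(row) for row in ch) for ch in feat]
    # Assumption: both branches share one kernel and are fused by addition.
    fused = [a + m for a, m in zip(conv1d_same(avg_desc, kernel),
                                   conv1d_same(max_desc, kernel))]
    weights = [sigmoid(v) for v in fused]
    scaled = [[[w * v for v in row] for row in ch]
              for w, ch in zip(weights, feat)]
    return scaled, weights


def spatial_attention(feat, w_avg=1.0, w_max=1.0):
    """Scale each location by a weight from channel-wise mean and max maps."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    # Assumption: a weighted sum stands in for the spatial convolution.
    mask = [[sigmoid(w_avg * sum(feat[k][i][j] for k in range(c)) / c
                     + w_max * max(feat[k][i][j] for k in range(c)))
             for j in range(w)] for i in range(h)]
    out = [[[mask[i][j] * feat[k][i][j] for j in range(w)]
            for i in range(h)] for k in range(c)]
    return out, mask


# Toy feature map with 3 channels of size 2 x 2.
feat = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
    [[0.0, 3.0], [3.0, 0.0]],
]
channel_out, channel_w = parallel_channel_attention(feat)
out, mask = spatial_attention(channel_out)
print(channel_w)
print(mask)
```

In a trained network the kernel and the spatial weights would be learned parameters; fixed values are used here only so that the example runs on its own.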

Footnotes

Funding

This work was supported by the Youth Top Talent Project of Hebei Province, the Hebei Province “333 Talents Project” funding project (A202101102), and the Plan Project of Shijiazhuang Science and Technology (No. 221130321A).

References

1. Liu, Chu, Liu, et al. GSM: Graph similarity model for multi-object tracking. In: Twenty-Ninth International Joint Conference on Artificial Intelligence. 2020. pp. 530-536.

2. Liu, Wang. Target tracking algorithm based on deep learning and multi-video monitoring. In: Proceedings of the 5th International Conference on Systems and Informatics. Los Alamitos: IEEE Computer Society Press; 2018. pp. 440-444.

3. Chen, Ouyang, et al. Gradnet: Gradient-guided network for visual object tracking. In: IEEE International Conference on Computer Vision. 2019. pp. 6162-6171.

4. Yan, et al. High performance visual tracking with Siamese region proposal network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Munich: CVPR; 2018. pp. 8971-8980.

5. Wang, et al. SiamRPN++: Evolution of siamese visual tracking with very deep networks. In: IEEE Conference on Computer Vision and Pattern Recognition. Long Beach: CVPR; 2019. pp. 4282-4291.

6. Zhu, Wang, et al. Distractor-aware siamese networks for visual object tracking. In: European Conference on Computer Vision. Munich: ECCV; 2018. pp. 101-117.

7. Qin, Zhang, Chang, et al. ACSiamRPN: Adaptive Context Sampling for Visual Object Tracking. Electronics. 2020; 9(9): 1528-1538.

8. Wang, Teng, Xing, et al. Learning Attentions: Residual Attentional Siamese Network for High Performance Online Visual Tracking. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: CVPR; 2018. pp. 4854-4863.

9. Wang, Jiang, Qian, et al. Residual attention network for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition. Hawaii: CVPR; 2017. pp. 3156-3164.

10. Shen, Samuel, et al. Squeeze-and-excitation networks. In: IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: CVPR; 2018. pp. 7132-7141.

11. Bai, Zhang, Wang. Target tracking algorithm based on efficient attention and context awareness. Journal of Beijing University of Aeronautics and Astronautics. First online paper. 2021.

12. Shao. Siamese Object Tracking Algorithm Combining Residual Connection and Channel Attention Mechanism. Journal of Computer-Aided Design & Computer Graphics. First online paper. 2021.

13. Song, Yang, et al. An adaptive Siamese network tracking algorithm based on global feature channel recognition. Journal of Zhejiang University (Engineering Science). 2021; 55(5): 966-975.

14. Zhang. Siamese Network with Multi-Attention Map for Visual Object Tracking. Journal of Signal Processing. 2020; 36(9): 1557-1566.

15. Cheng, Cui, Song, et al. Object Tracking Algorithm Based on Temporal-Spatial Attention Mechanism. Computer Science. First online paper. 2021.

16. Zhang, Cheng, et al. Siamese Network Combined with Attention Mechanism for Object Tracking. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences. 2020; XLIII-B2-2020: 1315-1322.

17. Woo, Park, Lee, et al. CBAM: Convolutional block attention module. In: Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: 2018. pp. 1352-1368.

18. Han, Wan, Wang, et al. Research on Object Tracking Algorithm Based on Cascading Feature Fusion of Siamese Network. Computer Engineering and Applications. 2022; 58(6): 208-218.

19. Wang, Zhu, et al. ECA-Net: efficient channel attention for deep convolutional neural networks. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. pp. 1-12.

20. Zhang, Peng. Deeper and Wider Siamese Networks for Real-Time Visual Tracking. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: CVPR; 2019. pp. 4586-4595.

21. Xiong, Huang, et al. Deformable Siamese Attention Networks for Visual Object Tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. 2020. pp. 6728-6737.

22. Martin, Gustav, Fahad, et al. Learning spatially regularized correlation filters for visual tracking. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE; 2015. pp. 1-10.

23. Liu, Xie, Huang, et al. Spatial and Channel Attention Mechanism Method for Object Tracking. Journal of Electronics & Information Technology. 2021; 43(9): 2569-2576.

24. Bertinetto, Valmadre, Henriques, et al. Fully-convolutional Siamese networks for object tracking. In: European Conference on Computer Vision. Amsterdam: ECCV; 2016. pp. 850-856.

25. Zhang, Wang. Object tracking in Siamese network with attention mechanism and Mish function. Academic Journal of Computing & Information Science. 2021; 4(1): 75-81.

26. Jack, Luca, Henriques, et al. End-to-end representation learning for correlation filter based tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. Hawaii: CVPR; 2017. pp. 2805-2813.

27. Gao, Zhang, et al. Graph Convolutional Tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. Long Beach: CVPR; 2019. pp. 4649-4659.
