AdapTrack: An adaptive FairMOT tracking method applicable to marine ship targets

Abstract

Ship tracking at sea is hampered by complex sea conditions and by frequent occlusion between ships, which strongly degrades tracker performance. We therefore propose AdapTrack, a tracking method for marine targets built on FairMOT (On the Fairness of Detection and Re-Identification in Multiple Object Tracking). First, the Trivial Augment search strategy is used to randomly select suitable data augmentation methods and strengths. Then, within the FairMOT tracking framework, we change the selection of positive and negative samples from a two-dimensional Gaussian distribution with equal variances to a two-dimensional Gaussian distribution with different variances that is limited by the ground-truth bounding box (bbox). This improves how well the detector fits ship targets. In addition, we use the double-threshold strategy of ByteTrack (Multi-Object Tracking by Associating Every Detection Box) to divide the detection bboxes, which improves matching and inference speed. In the first stage of data association, the cost matrix for the high-scoring bboxes is computed with the Re-identification (Re-ID) model. In the second stage, the Intersection over Union (IOU) cost matrix is computed after merging the low-scoring detection bboxes with the detection bboxes left unmatched in the first stage. The method achieves a Multiple Object Tracking Accuracy (MOTA) of 36.1, an Identification F-Score (IDF1) of 47.3, and 28.79 Frames Per Second (FPS) on the Singapore-Marine-dataset. Experiments show that the method alleviates Identification Switches (ID-switches) and maintains real-time tracking of ship targets in complex and changeable sea conditions.

Keywords

Ship detection; FairMOT; Track strategy; Re-ID model

1. Introduction

Object tracking has always been an important direction in the field of computer vision. It combines video processing, automatic system control, informatics, and other disciplines. Visual target tracking has broad application scenarios, such as intelligent monitoring, traffic supervision, and marine monitoring. Ship tracking at sea provides auxiliary information on ships, which enhances the safety of life at sea and improves the safety and efficiency of navigation. On the one hand, the difficulty of target tracking at sea lies in establishing a tracker model that can adapt to the complex and changeable sea environment, ship occlusion, and other difficult scenes. On the other hand, because of limited computing power and practical application requirements, the target must be tracked stably while real-time tracking speed is maintained. Therefore, the study of the moving target tracking problem has strong practical significance and application value.

Traditional target tracking algorithms include Mean Shift [5], Particle Filter [18], and Kalman Filter [25]. Because the offset of the target between two adjacent frames is very small, Mean Shift starts from the target position in the previous frame and infers the required offset of the target bounding box (bbox). It marks the new target bbox through an image similarity measure. Particle Filter is well suited to nonlinear and non-Gaussian motion models. It draws a large number of samples, assigns appropriate weights to the sampling points, and tracks along the direction of increasing weight. Kalman Filter is mainly used for linear motion models. From the motion state of the current frame and the observation of the next frame, it predicts a target bbox with higher reliability. Because of the limited feature representation of these methods, their predictions of target trajectories are not suitable for some difficult scenarios.

Due to the rapid development of the Convolutional Neural Network (CNN), visual object tracking has achieved unprecedented progress. Visual object trackers are broadly divided into two categories: detection-based trackers and template-matching-based trackers. Detection-based tracking divides the tracker into detection and data association, as in Jointly learns the Detector and Embedding model (JDE) [24] and On the Fairness of Detection and Re-Identification in Multiple Object Tracking (FairMOT) [34]. Template-matching-based methods are mainly represented by Siamese networks such as Siamese multi-object tracking (SiamMOT) [21]. The two categories differ in the matching metrics they use.

However, the above methods are all general target-tracking models. They do not specifically improve model performance according to the attributes of the target bboxes. We take FairMOT as the baseline model. In the Singapore-Marine-dataset (SMD) [19], the sea and sky background is complex, the weather is diverse, and the illumination changes greatly. The ship targets in the shore-based videos deform rapidly, occlusion is severe, and there are many small targets. Furthermore, the tracking algorithm in FairMOT makes little use of low-confidence target bboxes, which leads to a serious Identification Switch (ID-switch) phenomenon. Our contributions are therefore as follows. (1) We use the Trivial Augment (TA) [17] automatic data augmentation strategy and reorganize the data augmentation methods in TA according to the characteristics of the SMD. This enhances the expressive power of the subsequent convolutional neural network, which helps to detect difficult objects in complex situations. (2) Because an anchor-based network has difficulty alleviating the ID-switch phenomenon, we use an anchor-free network and propose an adaptive positive and negative sample selection strategy based on the target bbox to optimize model performance. (3) Because the low matching rate of low-confidence target bboxes leads to the ID-switch phenomenon, we propose a dual-threshold target tracking strategy. By combining the motion model and the appearance model, the inference time of the algorithm is reduced.

The rest of this article is organized as follows. Section 2 reviews related work. Section 3 gives a detailed description of the algorithm implementation. Section 4 is the experimental verification and analysis. Section 5 presents the conclusions and prospects of this paper.

2. Related work

Multi-target tracking algorithms can generally be divided into two categories: Tracking-by-Detection (TBD) and Detection-Free Tracking (DFT). A Tracking-by-Detection algorithm does not need the target bbox to be initialized manually, so it is more suitable for ship target tracking in bay areas with heavy traffic. On the one hand, because the maritime environment is complex and changeable and ship postures vary, automatic data augmentation algorithms can reduce the time cost. On the other hand, a new strategy that selects positive and negative samples according to the length and width of the ship target can fully exploit the expressive ability of the model. Finally, following a high-performance detector with a real-time association algorithm both simplifies the computation of the tracker and improves its performance.

2.1. Automatic data augmentation methods

In the field of image recognition, common data augmentation methods have been manually designed and widely used. Different datasets require different augmentation methods to get the best results, and designing them manually can require significant time and expertise. Automatic data augmentation aims to free practitioners from a large number of verification experiments. It selects the optimal data augmentation method by designing a data augmentation search strategy. Learning augmentation policies from data (AutoAugment) [6] was the first automatic data augmentation search strategy. It designs a complete augmentation policy composed of multiple sub-policies. The search for an augmentation policy consists of a controller and a policy optimization algorithm. The augmentation policies found by this algorithm are very effective, but the training time of the policy optimization algorithm is too long. Subsequent methods mainly address this long training time, such as Fast AutoAugment [15], Practical Automated Data Augmentation With a Reduced Search Space (RandAugment) [7], and TA. Fast AutoAugment employs a density-matching algorithm to find the best augmentation policy. RandAugment shows that the optimal data augmentation is strongly related to model and training set size. It simplifies the search space and adopts a fixed-strength strategy, so the only hyperparameters are the number of data augmentation methods applied and their strength. However, the search cost of the optimized NAS-based automatic data augmentation methods still cannot be ignored. Therefore, we adopt TA’s almost cost-free data augmentation strategy and replace the data augmentation methods in it.

2.2. Object detection models

In the field of ship target detection, commonly used detection datasets consist of grayscale images generated from synthetic aperture radar (SAR) or of images captured by cameras. SAR satellites have all-day, all-weather, and global coverage observation capabilities, and their resolution has reached the sub-meter level, at best 0.15 meters. With this remarkable advantage, they have great prospects in the fields of disaster prevention, emergency rescue, and land resource monitoring. For example, the depthwise separable convolution neural network (DS-CNN) for high-speed SAR ship detection [32] improves the inference speed of SAR ship image detection while reducing the number of model parameters. The grid convolutional neural network (G-CNN) for high-speed ship detection in SAR images [31] has great application value in real-time maritime disaster rescue and emergency military strategy formulation. Lite-YOLOv5 [28], a lightweight deep learning detector for on-board ship detection in large-scene Sentinel-1 SAR images, integrates a histogram-based pure backgrounds classification (HPBC) module, a shape distance clustering (SDC) module, a channel and spatial attention (CSA) module, and a hybrid spatial pyramid pooling (H-SPP) module to improve detection performance. The work in [27] proposes a group-wise feature enhancement-and-fusion network with dual-polarization feature enrichment (GWFEF-Net) for better dual-polarization SAR ship detection. However, it is still difficult for SAR images to capture the shape of moving objects in real time. Therefore, we currently use RGB images taken by cameras for model training and testing. Detection models based on RGB images are roughly divided into two categories: anchor-based and anchor-free.

A bbox extracted from a preset anchor to predict the target information is likely to be misaligned with the target center. Therefore, an anchor-based detection network is not suitable for learning Re-ID features. Unifying landmark localization with end to end object detection (DenseBox) [11] is the first anchor-free detector. The algorithm builds the bbox from the corner positions. A circle at the center of the sample delineates the area of positive sample points, and the pixels on the boundary between positive and negative samples are removed. Detecting objects as paired keypoints (CornerNet) [14] gathers feature information at the corners through a corner max-pooling layer. Positive sample points are assigned by generating a Gaussian distribution over the ground truth corners. However, at least two corners are required to determine a target box, so CornerNet needs a time-consuming paired-corner grouping strategy to generate the final bbox. Fully convolutional one-stage object detection (FCOS) [23] assigns positive samples by selecting a small central area that is a sub-bbox of the ground truth bbox. The positive samples are not distinguished according to their distance from the center point, which causes the model to converge slowly. These models show that different detectors select positive and negative samples in different ways. Therefore, adaptively selecting positive and negative samples according to the attributes of the ground truth bbox is an important way to improve the performance of the model.

2.3. Object tracking models

The main method of multi-target tracking is the tracking-by-detection paradigm. This method uses the result of target detection for data association in each frame. Data association relies on target features and matching strategies, such as appearance information, motion information, and shape features. Multi-target tracking algorithms mainly include simple online and realtime tracking (SORT) [1] and simple online and realtime tracking with a deep association metric (Deep-SORT) [26]. The SORT algorithm simply uses the center coordinates, area ratio, aspect ratio, and their rates of change to make predictions. It calculates the Intersection over Union (IOU) distance cost matrix between the predicted results and the actual detected results, and then uses a linear matching algorithm to match the detection and tracking bounding boxes through the cost matrix. Because SORT only uses the motion information of the target and ignores the appearance information, it can track objects successfully only when the detection confidence is high. The Deep-SORT algorithm improves on SORT by combining target motion and appearance feature information, and it uses cascade matching to alleviate the problem of target occlusion. The performance of the SORT and Deep-SORT algorithms depends heavily on the performance of the detectors used, such as those in JDE, Tracking objects as points (CenterTrack) [35], and FairMOT. The Deep-SORT algorithm separates detection and Re-ID feature extraction, and its biggest disadvantage is slow inference. Therefore, JDE learns the Re-ID feature of the target by adding one more output branch to the detector. A single network learns both location information and feature embeddings, which increases the inference speed of the model. But JDE is still a two-stage algorithm. It detects and extracts Re-ID features first, and then performs data association matching to achieve tracking.

CenterTrack merges the detection stage and the matching stage and realizes one-stage MOT by outputting the target offset. However, CenterTrack only considers the relationship between the previous and subsequent frames, so it is difficult to form a long-term relationship. Therefore, we provide an ID-switch-resistant, low-complexity tracking algorithm by modifying the Deep-SORT tracking algorithm.

3. Ship target tracking algorithm

The performance of a tracking-by-detection algorithm depends heavily on the performance of the detection algorithm. Therefore, to address the complexity of the maritime environment and the diversity of ship target sizes, we perform adaptive automatic data augmentation on the dataset. Given the flat shape of ships, we modify the selection method of positive and negative samples. We also modify the strategy of the data association module of the tracking algorithm. The overall flow chart of the model is shown in Fig. 1.

Fig. 1.

The overall framework of the target tracking model.

As shown in Fig. 1(a), the TA data augmentation strategy is used in the training stage. In the inference stage, after the encoder-decoder feature extraction network has learned the dataset features, target localization, classification, and Re-ID feature extraction are performed simultaneously. In the Detection part, Fig. 1(b), the heatmap and center offset are used for target localization and classification, and the bbox size is used for bbox regression. In Fig. 1(c), the Re-ID part generates the embedding feature vector.

3.1. Automatic data augmentation algorithm based on Trivial Augment

TA mainly studies data augmentation for image classification, so it does not directly meet our requirements for object detection, where data augmentation needs to consider both image and bounding box transformations. We therefore adopt only its effective search strategy. TA is defined by a set of data augmentation functions a and a corresponding intensity value m (some data augmentation functions do not use an intensity value). First, a collection of images and a set of data augmentation methods are taken as input. Second, a data augmentation method is sampled uniformly at random from a, and a value is sampled uniformly from 0, 1, 2, …, 30 as the intensity m. Third, the augmented image is returned. For the marine environment, we select data augmentation methods suited to the variable marine environment and to ship targets through prior experience. For the changing sea and sky background, pixel-level augmentation is applied, such as randomly changing the hue, saturation, lighting, and brightness of the image, adding Gaussian or salt-and-pepper noise, performing Gaussian blur, histogram equalization, and subtracting the mean. For ship targets, the augmentation methods need to account for the occlusion, deformation, and size of the ship, such as random cropping, scaling, rotation, translation, cutout, random erasing, copy-paste, mosaic, shadow enhancement [29], and rectangle training. The augmentation process is shown in Fig. 2.

Fig. 2.

Data augmentation strategy.

Figure 2 visualizes the TA process. For each image, TA uniformly samples one data augmentation function and one intensity value. Whereas previous methods tend to stack multiple data augmentations, TA applies only a single data augmentation to each image. The TA-augmented dataset can therefore be regarded as the result of augmenting every image separately with each data augmentation method and then sampling uniformly from the results.
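The sampling step of TA can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three augmentation functions, the `AUGMENTATIONS` pool, and the mapping from intensity to brightness gain are hypothetical placeholders, not the augmentation set used in this paper. Images are represented as lists of pixel rows so that the sketch runs without an imaging library, and each function returns the transformed bounding boxes together with the image, as object detection requires.

```python
import random

# An "image" here is a list of pixel rows and a box is (x1, y1, x2, y2);
# these stand in for real image tensors so the sketch needs no imaging library.

def adjust_brightness(image, boxes, m):
    gain = 1.0 + 0.5 * m / 30.0  # assumed mapping from intensity to gain
    bright = [[min(255, int(p * gain)) for p in row] for row in image]
    return bright, list(boxes)   # pixel-level change: boxes are untouched

def horizontal_flip(image, boxes, m):
    # The intensity m is ignored: some augmentations have no strength.
    width = len(image[0])
    flipped = [list(reversed(row)) for row in image]
    new_boxes = [(width - x2, y1, width - x1, y2) for (x1, y1, x2, y2) in boxes]
    return flipped, new_boxes

def identity(image, boxes, m):
    return [list(row) for row in image], list(boxes)

AUGMENTATIONS = [adjust_brightness, horizontal_flip, identity]  # hypothetical pool
MAX_INTENSITY = 30

def trivial_augment(image, boxes, rng=random):
    """Apply one uniformly sampled augmentation at a uniformly sampled intensity."""
    op = rng.choice(AUGMENTATIONS)
    m = rng.randint(0, MAX_INTENSITY)  # inclusive bounds: 0, 1, ..., 30
    new_image, new_boxes = op(image, boxes, m)
    return new_image, new_boxes, op.__name__, m
```

Calling `trivial_augment(image, boxes, random.Random(0))` returns the augmented image, the transformed boxes, and the sampled function name and intensity. Because only one operation is applied per image, there is no policy to search and no extra training cost.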

3.2. Adaptive ship target detection model

FairMOT supports various backbone networks for the detection network, such as Deep layer aggregation (DLA34) [30], Deep high-resolution representation learning for human pose estimation (HrNet) [22], and you only look once version 5 (YOLOv5) [12]. We use DLA34 as our baseline, which is based on Keypoint triplets for object detection (CenterNet) [9]. CenterNet divides object detection into two parts: center localization and bbox regression. For localization, it generates a Gaussian heatmap centered on the target center. For regression, it takes the pixel at the center of the object as a training sample and directly predicts the height and width of the object. The regression part also predicts the position offset introduced by down-sampling to correct the coordinates of the object center. In addition, to jointly train the embedding and the detection network, the detection network in FairMOT has an additional identity embedding branch for generating Re-ID features. CenterNet adopts a positive and negative sample selection method similar to that of CornerNet. The positive and negative samples are shown in Fig. 3.

Fig. 3.

Positive samples (coincident with the center point of the ground truth) and negative samples.

Detection bboxes whose centers coincide with the ground-truth center are regarded as positive samples, and the remaining detection bboxes are regarded as Gaussian-weighted negative samples. In CenterNet, each ground truth is mapped to a feature point according to the geometric center of the ground truth given by the training data. The heatmaps are set per category channel. Specifically, the value at the corresponding feature point in the channel of the corresponding category is set to 1. The values of that category at nearby feature points then decrease with distance according to the Gaussian distribution. The sample selection method is shown in Fig. 4(a).

Fig. 4.

Different positive and negative sample selection methods.

The two-dimensional Gaussian distribution used by CenterNet is the product of two one-dimensional Gaussian distributions of x and y. The standard deviation σ in the two dimensions is the same, and the center point is $(x_{c}, y_{c})$ . The expression of the two-dimensional Gaussian distribution is shown in Equation (1). $\begin{matrix} (1) & G (x, y) = \frac{1}{2 π σ^{2}} exp (- \frac{{(x - x_{c})}^{2} + {(y - y_{c})}^{2}}{2 σ^{2}}) \end{matrix}$

Its projection on the two-dimensional plane is a circle. However, ship targets are usually flat and come in a wide range of aspect ratios. Therefore, we use separate standard deviations for the two one-dimensional Gaussian distributions of x and y. The modified two-dimensional Gaussian distribution is shown in Equation (2). $\begin{matrix} (2) & G (x, y) = \frac{1}{2 π σ_{x} σ_{y}} exp (- \frac{{(x - x_{c})}^{2}}{2 σ_{x}^{2}} - \frac{{(y - y_{c})}^{2}}{2 σ_{y}^{2}}) \end{matrix}$

As shown in Fig. 4(c), its projection on the two-dimensional plane is an ellipse controlled by $σ_{x}$ and $σ_{y}$. In addition, the Gaussian distribution in CenterNet spreads over the entire feature map, whereas the heatmap should focus on the inside of the bbox. For example, as shown in Fig. 4(b), the FCOS model generates a sub-bbox by scaling the ground truth bbox and treats the pixels in the sub-bbox as positive samples. The problem is that the geometric center point is often not the feature center point, while the feature points near the geometric center point are very similar to the features of the geometric center point. Therefore, we describe this correspondence through the relationship between the length and width of the ground-truth bbox and the standard deviations $σ_{x}$ and $σ_{y}$. The Gaussian function is a bell-shaped curve. The area under the curve in the interval $(μ - σ, μ + σ)$ accounts for about 68% of the total area, the interval $(μ - 2 σ, μ + 2 σ)$ for about 95%, and the interval $(μ - 3 σ, μ + 3 σ)$ for about 99.7%. The value outside $3 σ$ is therefore close to 0, and a Gaussian window of radius $3 σ$ has size $6 σ \times 6 σ$. In OpenCV [3], the conversion between window size and standard deviation is given by Equation (3). $\begin{matrix} (3) & σ = 0.3 \times (({kernel}_{size} - 1) \times 0.5 - 1) + 0.8 \end{matrix}$

Therefore, the two-dimensional Gaussian kernel with different standard deviations $σ_{x}$ and $σ_{y}$ is calculated from the length and width $(w, h)$ of the ground truth bbox. The relationship between them is shown in Equations (4) and (5), which apply Equation (3) with kernel sizes $α w$ and $α h$. $\begin{array}{ll} (4) & σ_{x} = 0.3 \times ((α w - 1) \times 0.5 - 1) + 0.8 \\ (5) & σ_{y} = 0.3 \times ((α h - 1) \times 0.5 - 1) + 0.8 \end{array}$ Here α is an adjustable constant that constrains the ratio of the 2D Gaussian kernel boundary to the ground truth bbox. The feature point at the center of the Gaussian kernel is also the center point of the predicted bbox and is considered a positive sample. All other feature points are negative samples. The weight of the feature points outside the Gaussian kernel is 0. We follow the heatmap loss function from CenterNet. Given the predicted heatmap $\hat{H}$ and the ground truth heatmap H, the heatmap loss function is shown in Equation (6). $\begin{matrix} (6) & L_{cls} = - \frac{1}{N} \sum_{xyc} \begin{cases} {(1 - {\hat{H}}_{xyc})}^{α} \log ({\hat{H}}_{xyc}), & \text{if } H_{xyc} = 1 \\ {(1 - H_{xyc})}^{β} {\hat{H}}_{xyc}^{α} \log (1 - {\hat{H}}_{xyc}), & \text{otherwise} \end{cases} \end{matrix}$ In Equation (6), α and β are the hyperparameters of the focal loss (this α is distinct from the scaling constant in Equations (4) and (5)), and N is the number of ground-truth objects. The difference between this loss function and the focal loss is that the weight ${(1 - H_{xyc})}^{β}$ is added to the otherwise branch. The closer a feature point is to the geometric center of the target, the stronger its correlation with the center. Because $H_{xyc}$ is close to 1 at such points, the weight ${(1 - H_{xyc})}^{β}$ is small, so these points contribute little to the loss. In other words, the more strongly a feature point is correlated with the center, the more the network ignores it as a negative sample during training.
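The heatmap construction and loss described above can be illustrated with a minimal pure-Python sketch. It assumes that Equations (4) and (5) follow the OpenCV form of Equation (3) with kernel sizes αw and αh, that the Gaussian is left unnormalized so the peak at the box center equals 1, and that boxes are given in feature-map coordinates. The function names and the default exponents `a=2.0` and `b=4.0` (the values commonly used with CornerNet-style focal loss) are assumptions of this sketch, not settings reported in this paper.

```python
import math

def sigma_from_extent(extent, alpha):
    """OpenCV-style sigma for a kernel spanning alpha * extent pixels."""
    return 0.3 * ((alpha * extent - 1) * 0.5 - 1) + 0.8

def anisotropic_heatmap(height, width, box, alpha=1.0):
    """Ground-truth heatmap for one box (x1, y1, x2, y2) in feature-map pixels.

    The peak is 1 at the box center, values decay with separate sigma_x and
    sigma_y, and every point outside the box keeps the value 0.
    """
    x1, y1, x2, y2 = box
    xc, yc = (x1 + x2) // 2, (y1 + y2) // 2
    sx = sigma_from_extent(x2 - x1, alpha)
    sy = sigma_from_extent(y2 - y1, alpha)
    heat = [[0.0] * width for _ in range(height)]
    for y in range(max(0, y1), min(height, y2 + 1)):
        for x in range(max(0, x1), min(width, x2 + 1)):
            heat[y][x] = math.exp(-((x - xc) ** 2) / (2 * sx ** 2)
                                  - ((y - yc) ** 2) / (2 * sy ** 2))
    return heat

def heatmap_focal_loss(pred, target, a=2.0, b=4.0, eps=1e-12):
    """Penalty-reduced focal loss for one class channel.

    a and b play the roles of the focal-loss exponents alpha and beta.
    """
    total, num_pos = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, h in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # keep log() finite
            if h == 1.0:
                total += (1.0 - p) ** a * math.log(p)
                num_pos += 1
            else:
                total += (1.0 - h) ** b * p ** a * math.log(1.0 - p)
    return -total / max(num_pos, 1)
```

For a wide box such as `(4, 4, 24, 12)`, the heatmap decays more slowly along x than along y, which is the elliptical footprint intended for flat ship targets.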

3.3. Double-threshold target tracking strategy

The detection module generally decides whether a bbox is a target through a single threshold. However, simply ignoring the low-confidence target bboxes can reduce the precision of trajectory matching. The double-threshold method increases the probability that a low-scoring bounding box is matched. At the same time, the low-scoring detection bboxes are matched using only the motion model, which reduces the amount of computation spent on the Re-ID feature similarity measure. The tracking strategy is shown in Algorithm 1.

Algorithm 1

Data association algorithm

The Tracking-by-detection tracking architecture consists of the following components. (1) Track creation, retention time, and deletion. (2) Using Kalman Filter for state estimation and update. (3) The data association between the trajectory and the detection frame. We assume no camera shake and an online object tracking method. Our tracking framework and data association method are shown in Fig. 5.

Fig. 5.

Different forms of double threshold data association matching.

3.3.1. Track creation, retention time, and deletion

As shown in Fig. 5 detection 3, when a detection in the current frame is not associated with any existing track, a new track is created for it if its score is above the low threshold. As shown in Fig. 5 detection 2, detection bboxes with scores below the low threshold are discarded directly. Each track records the number of the last frame in which it was successfully matched. If the current frame number minus the last successfully matched frame number is greater than the preset maximum retention time, the track is deleted.
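The creation and deletion rules above can be sketched as follows. The `Track` class, the `update_lifecycle` function, and the default values `low_thresh=0.1` and `max_age=30` are hypothetical names and settings chosen for illustration only.

```python
from dataclasses import dataclass, field
from itertools import count

_next_id = count(1)  # simple global ID generator for the sketch

@dataclass
class Track:
    box: tuple
    last_matched_frame: int
    track_id: int = field(default_factory=lambda: next(_next_id))

def update_lifecycle(tracks, unmatched_detections, frame_id,
                     low_thresh=0.1, max_age=30):
    """Delete stale tracks and start new ones from unmatched detections.

    unmatched_detections is a list of (box, score) pairs that were not
    associated with any existing track in the current frame.
    """
    # Deletion: drop tracks not matched for more than max_age frames.
    alive = [t for t in tracks if frame_id - t.last_matched_frame <= max_age]
    # Creation: only detections scoring above the low threshold start a track.
    for box, score in unmatched_detections:
        if score > low_thresh:
            alive.append(Track(box=box, last_matched_frame=frame_id))
    return alive
```

A matched track would have its `last_matched_frame` set to the current frame by the association step, which is what keeps it alive.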

3.3.2. Using Kalman filter for state estimation and update

We define an 8-dimensional state space $(x, y, γ, h, \dot{x}, \dot{y}, \dot{γ}, \dot{h})$ for each trajectory, where $(x, y)$ is the object center, γ is the aspect ratio, h is the height, and the remaining components are the corresponding rates of change in image space. In each frame, the Kalman Filter is used to estimate and predict the state of each existing trajectory, and then the trajectories and the detection bboxes are associated. If a track is associated with a detection, its state is updated. The Kalman Filter is divided into two stages. (1) Predict the position of the trajectory at the next time step. (2) Update the predicted position based on the detection. The prediction step predicts the state of the trajectory at time t from the state of the trajectory at time $t - 1$ . $\begin{array}{ll} (7) & x^{'} = F x \\ (8) & P^{'} = F P F^{T} + Q \end{array}$

In Equation (7), x is the state of the trajectory at time step $t - 1$ and F is the state transition matrix. This equation predicts the mean vector $x^{'}$ at time step t. In Equation (8), P is the covariance matrix of the trajectory at time step $t - 1$ and Q is the noise matrix of the system. This equation predicts $P^{'}$ at time t. Based on the target box detected at time step t, the update step corrects the predicted trajectory state to give a more accurate estimate at the current time step. $\begin{array}{ll} (9) & y = z - H x^{'} \\ (10) & S = H P^{'} H^{T} + R \\ (11) & K = P^{'} H^{T} S^{- 1} \\ (12) & x = x^{'} + K y \\ (13) & P = (I - K H) P^{'} \end{array}$

In Equation (9), z is the observation vector, which does not contain the rate-of-change components. H is the measurement matrix, which maps the mean vector $x^{'}$ of the trajectory to the observation space. This equation calculates the position error between the detected target box and the trajectory. In Equation (10), R is the assumed observation noise matrix, and S is the predicted covariance matrix $P^{'}$ mapped to the observation space plus the observation noise. Equation (11) calculates the Kalman gain K. Equations (12) and (13) give the updated mean vector x and covariance matrix P at time t. In the Kalman Filter, the estimated bbox is essentially a weighted average of the prediction and the observation. In addition to the above prediction and update, the Kalman Filter calculates the $gating\_distance$ to filter out candidate targets that are relatively far away. $\begin{matrix} (14) & gating\_distance = \sqrt{{(z - H x^{'})}^{T} S^{- 1} (z - H x^{'})} \end{matrix}$
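To make the prediction, update, and gating steps concrete, the following sketch runs them on a deliberately reduced model: a two-dimensional state (position and velocity along one axis) with a scalar position measurement, rather than the 8-dimensional state used in this paper. The matrices `F`, `H`, `Q`, and `R` hold assumed illustrative values. Because the measurement is a scalar, S is a 1×1 matrix and no matrix inverse is needed.

```python
import math

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

F = [[1.0, 1.0], [0.0, 1.0]]    # constant-velocity transition, one frame per step
H = [[1.0, 0.0]]                # only the position is observed
Q = [[0.01, 0.0], [0.0, 0.01]]  # assumed process noise
R = [[1.0]]                     # assumed measurement noise

def predict(x, P):
    """Prediction step: x' = F x and P' = F P F^T + Q."""
    x_pred = mat_mul(F, x)
    P_pred = mat_add(mat_mul(mat_mul(F, P), transpose(F)), Q)
    return x_pred, P_pred

def innovation(x_pred, P_pred, z):
    """Residual y = z - H x' and its covariance S = H P' H^T + R (both scalars here)."""
    y = z - mat_mul(H, x_pred)[0][0]
    S = mat_add(mat_mul(mat_mul(H, P_pred), transpose(H)), R)[0][0]
    return y, S

def gating_distance(x_pred, P_pred, z):
    """Mahalanobis distance of the measurement from the predicted observation."""
    y, S = innovation(x_pred, P_pred, z)
    return math.sqrt(y * y / S)

def update(x_pred, P_pred, z):
    """Update step: Kalman gain, corrected mean, and corrected covariance."""
    y, S = innovation(x_pred, P_pred, z)
    K = [[row[0] / S] for row in mat_mul(P_pred, transpose(H))]  # 2x1 gain
    x_new = [[x_pred[i][0] + K[i][0] * y] for i in range(2)]
    KH = mat_mul(K, H)
    I_KH = [[(1.0 if i == j else 0.0) - KH[i][j] for j in range(2)]
            for i in range(2)]
    return x_new, mat_mul(I_KH, P_pred)
```

Starting from position 0 with velocity 1, the prediction moves the position to 1. A measurement at 3 pulls the updated position to a value between 1 and 3, which shows the weighted-average behaviour of the filter.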

Equation (14) calculates the Mahalanobis distance between the observation z and the predicted observation $H x^{'}$ . It serves as a gate for deciding whether a trajectory is a candidate match. Then the $gating\_distance$ and the embedding cost matrix are weighted and added to obtain the cost matrix fused with the motion model, as shown in Equation (15). $\begin{matrix} (15) & cost\_matrix = λ \times cost\_matrix + (1 - λ) \times gating\_distance \end{matrix}$ Here $cost\_matrix$ is the association cost matrix and λ is a constant in the interval $[0, 1]$ .
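The weighted fusion can be sketched as a small function over two matrices of the same shape (tracks by detections). The name `fuse_motion`, the default `lam=0.98`, and the gate value `gate=3.0` are assumptions made for illustration. Pairs whose gating distance exceeds the gate are given an infinite cost so that they can never be matched.

```python
import math

def fuse_motion(cost_matrix, gating_distances, lam=0.98, gate=3.0):
    """Blend appearance cost with Mahalanobis gating distance, entry by entry."""
    fused = []
    for cost_row, dist_row in zip(cost_matrix, gating_distances):
        fused_row = []
        for cost, dist in zip(cost_row, dist_row):
            if dist > gate:
                fused_row.append(math.inf)  # too far from the prediction
            else:
                fused_row.append(lam * cost + (1.0 - lam) * dist)
        fused.append(fused_row)
    return fused
```

With λ close to 1, the appearance cost dominates and the motion term acts mainly as a tie-breaker and as a hard gate.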

3.3.3. Data association method

The association between the detection bboxes of the current frame and the trajectories is based on an association matrix. The association matrix is generally defined by the IOU of the target bounding boxes, the Mahalanobis distance of the target centers, and the cosine distance of the Re-ID features. We adopt the target bbox IOU and the Re-ID feature cosine distance as the measures for two-stage data association. Usually, after the association matrix is established, the association between trajectories and detections is formulated as a linear assignment problem. The Jonker-Volgenant [13] assignment algorithm solves it with low time complexity. In addition, following ByteTrack [33], we apply a double threshold $(τ_{high}, τ_{low})$ to the detection bbox scores. Detection bboxes with scores above $τ_{high}$ form the first detection set $D_{high}$ . Those with scores between $τ_{low}$ and $τ_{high}$ form the second detection set $D_{low}$ . Those with scores below $τ_{low}$ are discarded and do not participate in matching. If a track τ was successfully matched in the previous frame, it is denoted as $τ_{tracked}$ in the current frame. Otherwise, it is denoted as $τ_{lost}$ .
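The double-threshold split itself is simple and can be sketched as below. The threshold values `tau_high=0.6` and `tau_low=0.1`, the dictionary layout of a detection, and the handling of scores exactly equal to a threshold are assumptions of this sketch.

```python
def split_detections(detections, tau_high=0.6, tau_low=0.1):
    """Split detections into a high-score set and a low-score set.

    Each detection is a dict with at least a "score" key. Detections scoring
    below tau_low are dropped and never reach the matching stages.
    """
    high = [d for d in detections if d["score"] >= tau_high]
    low = [d for d in detections if tau_low <= d["score"] < tau_high]
    return high, low
```

The first returned list feeds the Re-ID based first stage, and the second list is held back for the IOU based second stage.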

As shown in Fig. 5 detection 1, the first stage uses the embedding cost matrix to measure the degree of association between the high-scoring detection set $D_{high}$ and the tracks $τ_{tracked}$ , because high-quality detection bboxes are almost free of occlusion. They are then matched through a linear assignment algorithm. In the second stage, we merge the trajectories left unmatched in the first stage, $τ_{remain}$ , with the lost trajectories $τ_{lost}$ . At the same time, the detection bboxes left unmatched in the first stage, $D_{remain}$ , are merged with $D_{low}$ . After that, as shown in Fig. 5 detection 4, we use the IOU cost matrix to measure the degree of association between the detection bboxes $D_{remain} \cup D_{low}$ and the trajectories $τ_{remain} \cup τ_{lost}$ , and a linear assignment algorithm performs the second match. As described in Section 3.3.2, we weight the $gating\_distance$ from the Kalman Filter to obtain the IOU cost matrix fused with the motion model. The IOU cost is used in the second stage because low-scoring bboxes are often partially occluded and strongly deformed, so embedding metrics easily produce mismatches. When the frame rate of the video stream is high, however, the motion of a target between adjacent frames is approximately uniform, and a uniform linear motion model alleviates this problem well.

Trajectories that are still unmatched after the two-stage matching are added to $τ_{lost}$. An unmatched detection bbox in $D_{remain}$ whose score is greater than the threshold ε is initialized as a new track. Across the two stages, the detection score threshold goes from high to low, and the calculation of the association matrix goes from simple to complex. This reduces the overhead of data association and filters out high-quality trajectories. Most importantly, it retains more of the difficult-to-match trajectories.
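The two-stage procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: tracks and detections are plain dictionaries, the cost limits `max_emb_cost` and `max_iou_cost` and all function names are hypothetical, the Kalman gating of Section 3.3.2 is omitted, and a simple greedy matcher stands in for the Jonker-Volgenant solver.

```python
import math


def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u) * sum(y * y for y in v))
    return 1.0 - dot / norm


def greedy_match(tracks, dets, cost_fn, max_cost):
    """Greedy stand-in for the linear assignment solver (not Jonker-Volgenant)."""
    pairs = sorted(
        (cost_fn(t, d), ti, di)
        for ti, t in enumerate(tracks)
        for di, d in enumerate(dets)
    )
    used_t, used_d, matches = set(), set(), []
    for cost, ti, di in pairs:
        if cost > max_cost:
            break  # every remaining pair is at least as costly
        if ti not in used_t and di not in used_d:
            used_t.add(ti)
            used_d.add(di)
            matches.append((tracks[ti], dets[di]))
    rest_t = [t for i, t in enumerate(tracks) if i not in used_t]
    rest_d = [d for i, d in enumerate(dets) if i not in used_d]
    return matches, rest_t, rest_d


def two_stage_associate(tracked, lost, dets, tau_high, tau_low, eps,
                        max_emb_cost=0.4, max_iou_cost=0.5):
    """Return (matches, tracks that stay lost, detections that start new tracks)."""
    # Partition detections by score; boxes below tau_low are discarded.
    d_high = [d for d in dets if d["score"] > tau_high]
    d_low = [d for d in dets if tau_low <= d["score"] <= tau_high]

    # Stage 1: tracked tracks vs. high-scoring boxes, Re-ID embedding cost.
    m1, t_remain, d_remain = greedy_match(
        tracked, d_high,
        lambda t, d: cosine_distance(t["emb"], d["emb"]), max_emb_cost)

    # Stage 2: (t_remain + lost) vs. (d_remain + d_low), IOU cost.
    m2, still_lost, d_unmatched = greedy_match(
        t_remain + lost, d_remain + d_low,
        lambda t, d: 1.0 - iou(t["box"], d["box"]), max_iou_cost)

    # Unmatched boxes scoring above eps initialize new tracks.
    new_tracks = [d for d in d_unmatched if d["score"] > eps]
    return m1 + m2, still_lost, new_tracks


# Toy data (hypothetical): two tracked tracks, one lost track, four detections.
tracked = [
    {"id": 1, "box": (0, 0, 10, 10), "emb": (1.0, 0.0)},
    {"id": 2, "box": (20, 20, 30, 30), "emb": (0.0, 1.0)},
]
lost = [{"id": 3, "box": (50, 50, 60, 60), "emb": (1.0, 1.0)}]
dets = [
    {"box": (1, 0, 11, 10), "emb": (0.9, 0.1), "score": 0.9},
    {"box": (21, 20, 31, 30), "emb": (0.5, 0.5), "score": 0.4},
    {"box": (100, 100, 110, 110), "emb": (-1.0, 0.0), "score": 0.8},
    {"box": (70, 70, 80, 80), "emb": (0.0, 1.0), "score": 0.05},
]
matches, still_lost, new_tracks = two_stage_associate(
    tracked, lost, dets, tau_high=0.6, tau_low=0.2, eps=0.7)
print([(t["id"], d["score"]) for t, d in matches])
```

In this toy run, track 1 is matched in the first stage through its embedding, while track 2 is recovered in the second stage by the IOU of a low-scoring box. Track 3 stays lost, and the isolated high-scoring box starts a new track.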

4. Experiment analysis

In this section, we track ship targets to evaluate the effectiveness of the model. First, we introduce the marine target dataset used in the experiments. Second, we describe the implementation details. Finally, we carry out comparative and robustness experiments, and analyze and discuss the results to verify the specific effects of the model.

4.1. Dataset introduction and metrics

The public dataset used in this experiment is the SMD. All video frames were shot at $1920 \times 1080$ resolution. The dataset is divided into three parts, namely ship segmentation, ship detection, and ship tracking, which suit three different computer vision tasks. We selected 35 videos for the experiments. As shown in Table 1, our training and testing sequences include scenes with a large number of ships, viewing-angle changes, sea fog, and changes in illumination. Statistics are given in Table 2.

Table 1
Part of the dataset

Table 2
Dataset information statistics

Type	Video	GT Boxes	Trajectories	Frame number
Training	25	120453	241	15075
Inference	10	40661	83	6030

We used mean Average Precision ( $mAP$ ) [4] to measure the performance of the detector. The Re-ID features were evaluated using the True Positive Rate ( $TPR$ ) on the ROC curve at a False Positive Rate ( $FPR$ ) of 0.1. We used the CLEAR metric $MOTA$ and the identity metric $IDF1$ to evaluate tracking accuracy.

$\begin{array}{ll} (16) & mAP = \frac{\sum_{i = 1}^{K} {AP}_{i}}{K} \\ (17) & TPR = \frac{TP}{TP + FN} \\ (18) & FPR = \frac{FP}{FP + TN} \\ (19) & MOTA = 1 - \frac{\sum_{t} ({FN}_{t} + {FP}_{t} + {IDSW}_{t})}{\sum_{t} {GT}_{t}} \\ (20) & IDF1 = \frac{2 IDTP}{2 IDTP + IDFP + IDFN} \end{array}$

In Equation (16), $K$ is the number of categories and ${AP}_{i}$ is the average precision of the $i$-th category. In Equation (19), ${FN}_{t}$ is the number of false negatives in frame $t$, i.e., ground-truth targets that the model misses (missed detections). ${FP}_{t}$ is the number of false positives in frame $t$, i.e., predicted boxes that correspond to no ground-truth target (false alarms). ${IDSW}_{t}$ is the number of $ID$ switches (mismatches) in frame $t$, and ${GT}_{t}$ is the number of ground-truth boxes in frame $t$. $MOTA$ is an indicator of the accuracy of single-camera multi-target tracking. It pays more attention to whether a target is covered by a track and largely ignores the number of frames in which the track $ID$ is wrong. In Equation (20), $IDTP$, $IDFP$, and $IDFN$ are the numbers of true-positive, false-positive, and false-negative $ID$ assignments, respectively. The $IDF1$ metric therefore complements $MOTA$ by measuring how consistently the identities of multiple targets are preserved.
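As a worked illustration of Equations (16) to (20), the metrics can be computed from raw counts as follows. The function names and all counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the experiments.

```python
def mean_ap(ap_per_class):
    """Eq. (16): mean of the per-category average precisions."""
    return sum(ap_per_class) / len(ap_per_class)


def tpr(tp, fn):
    """Eq. (17): true positive rate."""
    return tp / (tp + fn)


def fpr(fp, tn):
    """Eq. (18): false positive rate."""
    return fp / (fp + tn)


def mota(fn, fp, idsw, gt):
    """Eq. (19): each argument is a list with one count per frame."""
    return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / sum(gt)


def idf1(idtp, idfp, idfn):
    """Eq. (20): F1 score over identity assignments."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


# Three frames with 10 ground-truth boxes each (made-up counts):
# 3 misses, 2 false alarms and 1 ID switch in total give 1 - 6/30.
print(mota(fn=[2, 1, 0], fp=[1, 0, 1], idsw=[0, 1, 0], gt=[10, 10, 10]))
print(idf1(idtp=40, idfp=10, idfn=10))
```

Note that $MOTA$ can become negative when the total number of errors exceeds the number of ground-truth boxes, whereas $IDF1$ always lies between 0 and 1.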

4.2. Experimental details

To verify the practicability of the model and the correctness of the experiment, we used Anaconda to build the experimental environment on the Ubuntu 18 operating system. The GPU was an NVIDIA GeForce RTX 3070 with 8 GB of graphics memory, and the programming language was Python. CenterNet was used as the basic detection module with DLA34 as the backbone. Common Objects in Context (COCO) [16] pretrained weights were used for initialization. The model was trained on the SMD for 70 epochs with a batch size of 4. Input training images were uniformly scaled to $1088 \times 608$ . We used the Adam optimizer with a step learning rate schedule: the learning rate was initialized to 0.34e−4 and divided by 10 every 20 epochs. In detection post-processing, we kept only the first 500 detection bboxes of each frame. The high-score threshold was set to the score of the 12th highest-scoring detection bbox, and the low-score threshold to the score of the 24th highest-scoring detection bbox. These thresholds can be adjusted according to the complexity of the scene.
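The step learning rate schedule and the rank-based score thresholds described above can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, and the fallback to the lowest available score when a frame has fewer than 12 or 24 boxes is our own choice, since the text does not say how that case is handled.

```python
def step_lr(epoch, base_lr=0.34e-4, step=20, gamma=0.1):
    """Learning rate at a 0-based epoch: divided by 10 every 20 epochs."""
    return base_lr * gamma ** (epoch // step)


def rank_thresholds(scores, high_rank=12, low_rank=24, max_dets=500):
    """Return (tau_high, tau_low) as the scores of the 12th and 24th best boxes.

    Assumption: with fewer boxes than a rank, the lowest kept score is used.
    """
    ranked = sorted(scores, reverse=True)[:max_dets]
    if not ranked:
        raise ValueError("no detections in this frame")
    tau_high = ranked[min(high_rank, len(ranked)) - 1]
    tau_low = ranked[min(low_rank, len(ranked)) - 1]
    return tau_high, tau_low


# Learning rates at the start of training and after each decay step.
for epoch in (0, 20, 40, 60):
    print(epoch, step_lr(epoch))

# Thirty made-up detection scores 0.01, 0.02, ..., 0.30.
scores = [i / 100 for i in range(1, 31)]
print(rank_thresholds(scores))
```

Because the thresholds are tied to ranks rather than fixed values, the number of boxes entering the first matching stage stays bounded even in crowded scenes.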

In the linear assignment step of tracking, a tracking bbox and a detection bbox were not matched if their IOU was less than 0.5. For the embedding cost matrix, the gating of the Kalman filter estimate relies on the chi-square distribution of the state vector. Therefore, the 0.95 quantile of the chi-square distribution with 4 degrees of freedom, 9.4877, was used as $gating\_threshold$ to filter $gating\_distance$ . The model computed $gating\_distance$ from the Kalman filter state and checked whether it was greater than $gating\_threshold$ (9.4877) to determine whether a pair participated in the second stage of matching. A track in the lost state was kept for up to 30 frames while waiting to reappear.
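The value 9.4877 can be checked numerically. For 4 degrees of freedom the chi-square CDF has the closed form $F(x) = 1 - e^{-x/2}(1 + x/2)$, so its 0.95 quantile can be found by bisection using only the standard library. The gate below is a sketch under assumptions: the function names are hypothetical, and the diagonal covariance is a simplification, since the Kalman filter in the tracker uses a full covariance matrix.

```python
import math


def chi2_cdf_4dof(x):
    """CDF of the chi-square distribution with 4 degrees of freedom."""
    return 1.0 - math.exp(-x / 2.0) * (1.0 + x / 2.0)


def chi2_quantile_4dof(p, lo=0.0, hi=100.0, iters=100):
    """Invert the CDF by bisection (the CDF is monotonically increasing)."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if chi2_cdf_4dof(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def squared_mahalanobis_diag(measurement, mean, variances):
    """Squared Mahalanobis distance for a diagonal covariance (simplification)."""
    return sum((z - m) ** 2 / v for z, m, v in zip(measurement, mean, variances))


gating_threshold = chi2_quantile_4dof(0.95)


def in_gate(gating_distance, threshold=gating_threshold):
    """A track-detection pair passes the gate if its distance is within the threshold."""
    return gating_distance <= threshold


print(round(gating_threshold, 4))
```

A pair whose four measured quantities each deviate by one standard deviation has a squared distance of 4 and passes the gate, while a squared distance of 10 is rejected.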

4.3. Model ablation experiment

We adopted the SMD for the model ablation experiments. With the same hyperparameters, all models were trained for 70 epochs for comparison. Our baseline model was FairMOT. The original FairMOT model adopts three data augmentation strategies: random Hue, Saturation, Value (HSV) color augmentation, random affine transformation, and random left-right flip. Starting from the FairMOT model (Base), we sequentially added automatic data augmentation (Auto), the positive and negative sample selection strategy (Select), and the data association matching strategy (Match). The results of the ablation experiments are shown in Table 3.

Table 3
Model ablation experiments

Model	MOTA	IDF1	mAP	FPS	TPR
Base	31.7	41.4	65.3	24.76	63.6
Base + Auto	34.3	43.9	67.1	24.55	65.1
Base + Auto + Select	35.2	43.6	68.4	24.64	65.5
Base + Auto + Select + Match	36.1	47.3	68.3	28.79	65.5

Table 3 shows that automatic data augmentation and the positive and negative sample selection strategy improve the model at almost no inference cost: $FPS$ stays nearly unchanged (24.76, 24.55, and 24.64) while $MOTA$ rises from 31.7 to 35.2. Both act only during the training phase and add no redundant weights to the model. The data association matching strategy mainly benefits speed: it reduces the amount of computation in the data association stage and raises $FPS$ to 28.79. The experimental results of our scheme are shown in Figs. 6, 7, 8, and 9.

Fig. 6.

Result of large number of ships.

Fig. 7.

Result of sea fog scenes.

Fig. 8.

Result of sea fog scenes.

Fig. 9.

Result of sea fog scenes.

We compared the tracking results of some specific frames under different models. Figure 6 shows the 1st, 3rd, and 392nd frames of the video $MVI_1625_VIS$ . The detector performance is close to saturation for ships without abnormal deformation, so the biggest difference between the models lies in the assignment of IDs. Because small-target features are difficult to extract, ID-switch is more serious. With our double-threshold strategy, this phenomenon is relieved step by step. Figure 7 shows the 0th, 491st, and 603rd frames of the video $MVI_1448_VIS_Haze$ . After using automatic data augmentation, ships with their bows pointing at the camera are also detected, and the detector still works under ordinary sea fog. Figure 8 shows the 0th, 20th, and 170th frames of the video $MVI_1451_VIS_Haze$ . When the target is continuously deformed, the tracker without automatic data augmentation cannot accurately locate the target. After adding automatic data augmentation and the improved positive and negative sample selection strategy, the model both suppresses false detection boxes and keeps the tracker stable. Figure 9 shows the 0th, 130th, and 291st frames of the video $MVI_1584_VIS$ . Here the Base model cannot detect occluded targets, whereas our method detects occluded objects well in dim scenes.

4.4. Comparison experiment of positive and negative sample selection methods

We compared the positive and negative sample selection strategies of CenterNet, FCOS, and our method. All models used DLA34 as the backbone and were post-processed in the same way, with the minimum confidence threshold set to 0.3 and the minimum bounding box area set to 60. CenterNet and our model used the improved focal loss, and FCOS used the focal loss. All models were trained on the SMD tracking training set for 70 epochs. The results are shown in Fig. 10.

Fig. 10.

Comparison of positive and negative sample selection methods for detection models.

It can be seen that different positive and negative sample selection strategies have a large impact on the same model. Both $Precision$ and $Recall$ of FCOS are lower because its post-processing does not suppress $FP$ bboxes well. Therefore, we borrow only the idea of its sample selection strategy. The sample selection strategies of CenterNet and our method are similar. The difference is that CenterNet uses a two-dimensional Gaussian distribution with the same variance in both directions, whereas ours (AdapTrack) uses an adaptive two-dimensional Gaussian distribution with different variances. As a result, our model performs slightly better than CenterNet.

4.5. Comparison test of different tracking models

We compared four state-of-the-art (SOTA) tracking models: DeepSORT, JDE, CenterTrack, and ours (AdapTrack). DeepSORT and JDE combine a motion model with Re-ID similarity matching. CenterTrack matches targets by finding the smallest predicted offset. We kept the model hyperparameters constant, and each model was trained on the SMD for 70 epochs. The results are shown in Fig. 11.

Fig. 11.

Comparison of tracking models.

It can be seen that our ship-tracking strategy achieves higher accuracy than the other models on the SMD and higher speed than all of them except CenterTrack. CenterTrack has a higher FPS value because its tracking strategy is very simple, but that strategy cannot relieve the ID-switch problem. Therefore, our model achieves higher MOTA and IDF1 values.

4.6. Dataset robustness experiment

We compared the robustness of our method (Base and Base + Auto + Select + Match) on two ship datasets: the SMD and the Vision Based Large-Scale Maritime Ship Tracking Benchmark for Autonomous Navigation Applications (LMD-TShip) [20]. To be fair, we used roughly the same number of frames for training and testing on both datasets. The tracking results are shown in Table 4.

Table 4
Model robustness comparison

Dataset	Method	MOTA	IDF1	FPS
SMD	Base	31.7	41.4	24.76
SMD	AdapTrack	36.1	47.3	28.79
LMD	Base	30.5	44.6	22.63
LMD	AdapTrack	38.4	50.9	26.98

The adaptively improved detector and the double-threshold tracker achieve better performance than the Base model on both datasets. This indicates that our model is robust across different datasets.

5. Conclusion

Based on FairMOT, we propose an effective multi-target tracking model for ship targets. The model achieves a MOTA of 36.1, an IDF1 of 47.3, and an FPS of 28.79 on the SMD. We use the TrivialAugment (TA) automatic data augmentation method during training, which greatly reduces the time spent on tuning parameters. However, tracking is still poor for dimly lit scenes. Our tracking model is a one-shot multi-object tracking model. In the detection part, we design an adaptive positive and negative sample selection strategy to improve the learning efficiency and feature expression of the model. In the tracking part, a double-threshold target tracking strategy improves the utilization of low-scoring bboxes and reduces the matching time cost while preserving the tracking quality for high-scoring bboxes. Compared with the original model, the overall model is improved in both tracking accuracy and inference time, but ID-switch still occurs in scenes with severe occlusion. Therefore, we hope to further improve tracking in difficult scenes in the future.

References

1. Bewley, Ge, Ott, Ramos and Upcroft, Simple online and realtime tracking, in: 2016 IEEE International Conference on Image Processing (ICIP), IEEE, 2016, pp. 3464–3468. doi:10.1109/ICIP.2016.7533003.

2. D.S. Bolme, J.R. Beveridge, B.A. Draper and Y.M. Lui, Visual object tracking using adaptive correlation filters, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, 2010, pp. 2544–2550. doi:10.1109/CVPR.2010.5539960.

3. Bradski and Kaehler, OpenCV, Dr. Dobb’s journal of software tools 3 (2000), 120.

4. Cartucho, Ventura and Veloso, Robust object recognition through symbiotic deep learning in mobile robots, in: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 2336–2341.

5. Comaniciu and Meer, Mean shift: A robust approach toward feature space analysis, IEEE Transactions on pattern analysis and machine intelligence 24(5) (2002), 603–619. doi:10.1109/34.1000236.

6. E.D. Cubuk, Zoph, Mane, Vasudevan and Q.V. Le, Autoaugment: Learning augmentation strategies from data, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 113–123.

7. E.D. Cubuk, Zoph, Shlens and Q.V. Le, Randaugment: Practical automated data augmentation with a reduced search space, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 702–703.

8. Danelljan, Häger, F.S. Khan and Felsberg, Discriminative scale space tracking, IEEE transactions on pattern analysis and machine intelligence 39(8) (2016), 1561–1575. doi:10.1109/TPAMI.2016.2609928.

9. Duan, Bai, Xie, Qi, Huang and Tian, Centernet: Keypoint triplets for object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6569–6578.

10. J.F. Henriques, Caseiro, Martins and Batista, High-speed tracking with kernelized correlation filters, IEEE transactions on pattern analysis and machine intelligence 37(3) (2014), 583–596. doi:10.1109/TPAMI.2014.2345390.

11. Huang, Yang, Deng and Yu, Densebox: Unifying landmark localization with end to end object detection, 2015, preprint arXiv:1509.04874.

12. Jocher, Stoken, Borovec, Changyu, Hogan, Diaconu, Ingham, Poznanski, Fang, Yu et al., ultralytics/yolov5: v3. 1-bug fixes and performance improvements, Version v3 1 (2020).

13. Jonker and Volgenant, A shortest augmenting path algorithm for dense and sparse linear assignment problems, Computing 38(4) (1987), 325–340. doi:10.1007/BF02278710.

14. Law and Deng, Cornernet: Detecting objects as paired keypoints, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 734–750.

15. Lim, Kim, Kim and Kim, Fast autoaugment, Advances in Neural Information Processing Systems 32 (2019).

16. T.-Y. Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár and C.L. Zitnick, Microsoft coco: Common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.

17. S.G. Müller and Hutter, Trivialaugment: Tuning-free yet state-of-the-art data augmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 774–782.

18. Okuma, Taleghani, N.d. Freitas, J.J. Little and D.G. Lowe, A boosted particle filter: Multitarget detection and tracking, in: European Conference on Computer Vision, Springer, 2004, pp. 28–39.

19. D.K. Prasad, C.K. Prasath, Rajan, Rachmawati, Rajabaly and Quek, Challenges in video based object detection in maritime scenario using computer vision, 2016, preprint arXiv:1608.01079.

20. Shan, Liu, Zhang, Jing and Xu, LMD-TShip: Vision based large-scale maritime ship tracking benchmark for autonomous navigation applications, IEEE Access 9 (2021), 74370–74384. doi:10.1109/ACCESS.2021.3079132.

21. Shuai, Berneshawi, Li, Modolo and Tighe, Siammot: Siamese multi-object tracking, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12372–12382.

22. Sun, Xiao, Liu and Wang, Deep high-resolution representation learning for human pose estimation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5693–5703.

23. Tian, Shen, Chen and He, Fcos: Fully convolutional one-stage object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9627–9636.

24. Wang, Zheng, Liu, Li and Wang, Towards real-time multi-object tracking, in: European Conference on Computer Vision, Springer, 2020, pp. 107–122.

25. Welch, Bishop et al., 1995, An introduction to the Kalman filter.

26. Wojke, Bewley and Paulus, Simple online and realtime tracking with a deep association metric, in: 2017 IEEE International Conference on Image Processing (ICIP), IEEE, 2017, pp. 3645–3649. doi:10.1109/ICIP.2017.8296962.

27. Xu, Zhang, Shao, Shi, Wei, Zhang and Zeng, A group-wise feature enhancement-and-fusion network with dual-polarization feature enrichment for SAR ship detection, Remote Sensing 14(20) (2022), 5276. doi:10.3390/rs14205276.

28. Xu, Zhang and Zhang, Lite-yolov5: A lightweight deep learning detector for on-board ship detection in large-scene sentinel-1 sar images, Remote Sensing 14(4) (2022), 1018. doi:10.3390/rs14041018.

29. Xu, Zhang, Yang, Shi and Zhan, Shadow-background-noise 3D spatial decomposition using sparse low-rank Gaussian properties for video-SAR moving target shadow enhancement, IEEE Geoscience and Remote Sensing Letters 19 (2022), 1–5. doi:10.1109/LGRS.2022.3226859.

30. Yu, Wang, Shelhamer and Darrell, Deep layer aggregation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2403–2412.

31. Zhang and Zhang, High-speed ship detection in SAR images based on a grid convolutional neural network, Remote Sensing 11(10) (2019), 1206. doi:10.3390/rs11101206.

32. Zhang, Shi and Wei, Depthwise separable convolution neural network for high-speed SAR ship detection, Remote Sensing 11(21) (2019), 2483. doi:10.3390/rs11212483.

33. Zhang, Sun, Jiang, Yu, Weng, Yuan, Luo, Liu and Wang, Bytetrack: Multi-object tracking by associating every detection box, in: European Conference on Computer Vision, Springer, 2022, pp. 1–21.

34. Zhang, Wang, Zeng and Liu, Fairmot: On the fairness of detection and re-identification in multiple object tracking, International Journal of Computer Vision 129(11) (2021), 3069–3087. doi:10.1007/s11263-021-01513-4.

35. Zhou, Koltun and Krähenbühl, Tracking objects as points, in: European Conference on Computer Vision, Springer, 2020, pp. 474–490.
