Abstract
Ship tracking at sea is hampered by complex sea conditions and by frequent occlusion between ships, both of which degrade tracker performance. We therefore propose AdapTrack, a method for marine targets built on On the Fairness of Detection and Re-Identification in Multiple Object Tracking (FairMOT). The search strategy of trivial augmentation is used to randomly select suitable data augmentation methods and strengths. Then, within the FairMOT tracking framework, we change the sampling of positive and negative samples from a two-dimensional Gaussian distribution with equal variances to a two-dimensional Gaussian distribution with different variances, limited by the ground-truth bounding box (bbox). This improves the fit of the detection algorithm to ship targets. At the same time, we use the double-threshold strategy of Multi-Object Tracking by Associating Every Detection Box (ByteTrack) to divide detection bboxes, which improves the matching and inference speed. In the first stage of data association, the cost matrix for high-scoring bboxes is computed with the Re-identification (Re-ID) model. In the second stage, the Intersection over Union (IOU) cost matrix is computed after merging the low-scoring detection bboxes with the detection bboxes left unmatched in the first stage. The method achieves a Multiple Object Tracking Accuracy (MOTA) of 36.1, an Identification F-Score (IDF1) of 47.3, and 28.79 Frames Per Second (FPS) on the Singapore-Marine-dataset. Experiments show that the method alleviates Identification Switch (ID-switch) and maintains real-time tracking of complex and changeable ship targets at sea.
Introduction
Object tracking has always been an important direction in the field of computer vision. It combines video processing, automatic system control, informatics, etc. Visual target tracking has broad application scenarios, such as intelligent monitoring, traffic supervision, and marine monitoring. Ship tracking at sea provides auxiliary information on ships, which enhances the safety of life at sea and improves the safety and efficiency of navigation. On the one hand, the difficulty of target tracking at sea lies in establishing a tracker model that can adapt to the complex and changeable sea environment, ship occlusion, and other difficult scenes. On the other hand, because of limited computing power and practical application requirements, the target must be tracked stably while maintaining real-time tracking speed. Therefore, the study of the moving target tracking problem has strong practical significance and application value.
Traditional target tracking algorithms include Mean Shift [5], Particle Filter [18], and Kalman Filter [25]. Because the offset of the target between two adjacent frames is very small, Mean Shift can start from the same target bbox position and infer the required offset of the target bounding box (bbox). It marks the new target bbox through an image similarity measure. Particle Filter is well suited to nonlinear and non-Gaussian motion models. It draws a large number of samples and assigns appropriate weights to the sampling points, so that tracking proceeds along the direction of increasing weight. Kalman Filter is mainly used for linear motion models. From the motion state of the current frame and the observation of the next frame, it can predict a target bbox with higher reliability. Because of their limitations in feature representation, these methods predict target trajectories poorly in some difficult scenarios.
Driven by the rapid development of the Convolutional Neural Network (CNN), visual object tracking has achieved unprecedented progress. Visual object trackers are broadly divided into two categories: detection-based trackers and template-matching-based trackers. Detection-based tracking divides the tracker into detection and data association; examples include Jointly learns the Detector and Embedding model (JDE) [24] and On the Fairness of Detection and Re-Identification in Multiple Object Tracking (FairMOT) [34]. Methods based on template matching are mainly represented by Siamese networks such as Siamese multi-object tracking (SiamMOT) [21]. The two categories differ in their matching metrics.
However, the above methods are all general target-tracking models that do not specifically exploit the attributes of target bboxes to improve performance. We take FairMOT as our baseline model. The sea and sky background in the Singapore-Marine-dataset (SMD) [19] is complex, the weather is diverse, and the illumination changes greatly. In the shore-based videos, the ship targets deform rapidly, occlusion is serious, and there are many small targets. Furthermore, the tracking algorithm in FairMOT makes little use of low-confidence target bboxes, which leads to a serious Identification Switch (ID-switch) phenomenon. Our contributions are therefore as follows. (1) We use the Trivial Augment (TA) [17] automatic data enhancement strategy and reorganize the data augmentation methods in TA according to the characteristics of the SMD. This enhances the expressive performance of the subsequent convolutional neural network, which helps detect difficult objects in complex situations. (2) Because anchor-based networks have difficulty alleviating the ID-switch phenomenon, we use an anchor-free network and propose an adaptive positive and negative sample selection strategy based on the target bbox to optimize model performance. (3) Because the low matching rate of low-confidence target bboxes leads to the ID-switch phenomenon, we propose a dual-threshold target tracking strategy. By combining the motion model and the appearance model, it reduces the inference time of the algorithm.
The rest of this article is organized as follows. Section 2 reviews related work. Section 3 gives a detailed description of the algorithm implementation. Section 4 is the experimental verification and analysis. Section 5 presents the conclusions and prospects of this paper.
Related work
Multi-target tracking algorithms can generally be divided into two categories: Tracking-by-Detection (TBD) and Detection-Free Tracking (DFT). The TBD algorithm does not need to initialize the target bbox, so it is more suitable for ship target tracking in bay areas with large traffic. On the one hand, because the maritime environment is complex and changeable and ship postures vary, automatic data enhancement algorithms can reduce the time cost. On the other hand, a new strategy that selects positive and negative samples according to the length and width of the ship target can fully exploit the expressive ability of the model. Finally, a real-time association algorithm placed after the high-performance detector both simplifies the computation of the tracker and improves its performance.
Automatic data augmentation methods
In the field of image recognition, common data augmentation methods have been manually designed and widely used. Different datasets require different augmentation methods to get the best results, and manual design can require significant time and expertise. Automatic data augmentation aims to free people from a large number of verification experiments by designing a search strategy that selects the optimal data augmentation method. Learning augmentation policies from data (AutoAugment) [6] pioneered automatic data augmentation search. It designs a complete augmentation policy composed of multiple sub-policies, and its search process consists of a controller and a policy optimization algorithm. The augmentation policy found by this algorithm is very useful, but the training time of the policy optimization algorithm is too long. Subsequent methods mainly address this long training time, such as Fast AutoAugment [15], Practical Automated Data Augmentation With a Reduced Search Space (RandAugment) [7], and TA. Fast AutoAugment employs a density-matching algorithm to find the best augmentation strategy. RandAugment shows that the best data augmentation strongly depends on the model and training set size. It simplifies the search space and adopts a fixed-strength strategy, so the only hyperparameters are the number of data augmentation methods applied and their strength. However, the search cost of the optimized NAS-based automatic data augmentation still cannot be ignored. Therefore, we adopt TA's almost cost-free data augmentation strategy and replace the data augmentation methods in it.
Object detection models
In the field of ship target detection, commonly used detection datasets consist of grayscale images generated from synthetic aperture radar (SAR) or images captured by cameras. SAR satellites have all-day, all-weather, and global coverage observation capabilities, and their resolution has reached the sub-meter level, at best 0.15 meters. With this remarkable advantage, they have great prospects in the fields of disaster prevention, emergency rescue, and land resource monitoring. For example, Depthwise separable convolution neural network for high-speed SAR ship detection [32] uses a depthwise separable convolution neural network (DS-CNN) to improve the inference speed of SAR ship image detection while reducing the number of model parameters. High-Speed Ship Detection in SAR Images Based on a Grid Convolutional Neural Network [31] uses a grid convolutional neural network (G-CNN) and has great application value in real-time maritime disaster rescue and emergency military strategy formulation. A Lightweight Deep Learning Detector for On-Board Ship Detection in Large-Scene Sentinel-1 SAR Images [28] (Lite-YOLOv5) integrates a histogram-based pure backgrounds classification (HPBC) module, a shape distance clustering (SDC) module, a channel and spatial attention (CSA) module, and a hybrid spatial pyramid pooling (H-SPP) module to improve detection performance. A Group-Wise Feature Enhancement-and-Fusion Network with Dual-Polarization Feature Enrichment for SAR Ship Detection [27] proposes a group-wise feature enhancement-and-fusion network with dual-polarization feature enrichment (GWFEF-Net) for better dual-polarization SAR ship detection. However, it is still difficult for SAR images to capture the shape of moving objects in real time. Therefore, at present, we mainly use RGB images taken by cameras for model training and testing. Detection models based on RGB images are roughly divided into two categories: anchor-based and anchor-free.
A bbox extracted by a preset anchor to predict the target information is likely to be misaligned with the target center. Therefore, an anchor-based detection network is not suitable for learning Re-ID features. Unifying landmark localization with end to end object detection (DenseBox) [11] is the first anchor-free detector. The algorithm builds the bbox from the corner positions. A circle at the center of the sample delineates the area of positive sample points, and the pixels on the boundary between positive and negative samples are removed. Detecting objects as paired keypoints (CornerNet) [14] gathers feature information at the corners through a corner max-pooling layer. Positive sample points are assigned by generating a Gaussian distribution over the ground-truth corners. However, a target box requires at least two corners to be determined, so CornerNet needs a time-consuming corner-pairing strategy to generate the final bbox. Fully convolutional one-stage object detection (FCOS) [23] selects positive samples from a small central area, which is a sub-bbox of the ground-truth bbox. The positive samples are not distinguished according to their distance from the center point, which causes the model to converge slowly. These models show that different detectors select positive and negative samples in different ways. Therefore, adaptively selecting positive and negative samples according to the attributes of the ground-truth bbox is an important way to improve model performance.
Object tracking models
The main method of multi-target tracking is the tracking-by-detection paradigm, which uses the target detection results in each frame for data association. Data association relies on target features and matching strategies, such as appearance information, motion information, and shape features. Multi-target tracking algorithms mainly include simple online and realtime tracking (SORT) [1] and simple online and realtime tracking with a deep association metric (DeepSORT) [26]. The SORT algorithm simply uses the center coordinates, area, aspect ratio, and their rates of change to make predictions. It calculates the Intersection over Union (IOU) distance cost matrix between the predicted results and the actual detected results, and then uses a linear assignment algorithm to match the detection and tracking bounding boxes through the cost matrix. Because SORT only uses the motion information of the target and ignores the appearance information, it can track objects successfully only when they are detected with high confidence. The DeepSORT algorithm improves on SORT by combining target motion and appearance feature information, and it uses cascade matching to alleviate the problem of target occlusion. The performance of the SORT and DeepSORT algorithms depends heavily on the performance of the detectors, such as JDE, Tracking objects as points (CenterTrack) [35], FairMOT, etc. The DeepSORT algorithm separates detection and Re-ID feature extraction, and its biggest disadvantage is slow inference. Therefore, JDE learns the Re-ID feature of the target by adding one more output branch to the detector, so that a single network learns both location information and feature embeddings, which increases inference speed. But JDE is still a two-stage algorithm: it first detects and extracts Re-ID features, and then performs data association matching to achieve tracking.
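The IOU distance cost used in this kind of association can be illustrated with a short sketch. It is a minimal illustration rather than the implementation of SORT or of this paper: boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates, and the function names `iou` and `iou_cost_matrix` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_cost_matrix(predicted_boxes, detected_boxes):
    """Cost[i][j] = 1 - IOU(prediction i, detection j); lower is a better match."""
    return [[1.0 - iou(p, d) for d in detected_boxes] for p in predicted_boxes]
```

A linear assignment solver is then run on this matrix, so that each predicted box is paired with at most one detection.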
CenterTrack merges the detection stage and the matching stage and realizes a one-stage MOT by outputting the target offset. However, CenterTrack only considers the relationship between the previous and subsequent frames. It is difficult to form a long-term relationship. Therefore, we provide an ID-switch-resistant, low-complexity tracking algorithm by changing the DeepSORT tracking algorithm.
Ship target tracking algorithm
The performance of a tracking-by-detection algorithm depends heavily on the performance of the detection algorithm. Therefore, to address the complexity of the maritime environment and the diversity of ship target sizes, we perform adaptive automatic data augmentation on the dataset. Given the flat shape of ships, we modify the selection method of positive and negative samples. At the same time, we modify the strategy of the data association module of the tracking algorithm. The overall flow chart of the model is shown in Fig. 1.

The overall framework of the target tracking model.
As shown in Fig. 1(a), the TA data enhancement strategy is used in the training stage. In the inference stage, after the encoder-decoder feature extraction network has learned the dataset features, target localization, classification, and Re-ID feature extraction are performed simultaneously. In the Detection part in Fig. 1(b), the heatmap and center offset are used for target positioning and classification, and the bbox size is used for bbox regression. In Fig. 1(c), the Re-ID part generates the embedding feature vector.
Since TA mainly studies data augmentation for image classification, it is not directly suitable for object detection, where data augmentation must consider both image and bounding box transformations. We therefore adopt only its effective search strategy. TA is defined by a set of data augmentation functions a and a corresponding intensity value m (some data augmentation functions do not use an intensity value). First, a collection of images and data augmentation methods is taken as input. Second, a data augmentation method is sampled uniformly at random from a, and a value is sampled uniformly from 0, 1, 2, …, 30 as the intensity m. Third, the enhanced image is returned. For the marine environment, we select data augmentation methods suited to the variable marine environment and ship targets through prior experience. First, for the changing sea and sky background, the image needs pixel-level enhancement, such as randomly changing the hue, saturation, lighting, and brightness of the image, adding Gaussian and salt-and-pepper noise, performing Gaussian blur, histogram equalization, and subtracting the mean. Second, the data enhancement methods for ship targets need to consider the occlusion, deformation, and size of the ship, such as random cropping, scaling, rotation, translation, cutout, random erasing, copy-paste, mosaic, shadow enhancement [29], and rectangle training. The enhancement process is shown in Fig. 2.

Data augmentation strategy.
Figure 2 visualizes the TA process. For each image, TA uniformly samples a data augmentation function and an intensity value. Whereas previous methods tend to stack multiple data augmentations, TA applies a single data augmentation to each image. The TA-enhanced dataset can therefore be regarded as the result of enhancing every picture separately with all data enhancement methods and then sampling uniformly from the results.
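The sampling step of TA can be sketched in a few lines. This is a toy illustration under stated assumptions, not the augmentation pipeline used in the experiments: the image is a plain list of pixel intensities in [0, 1], the three operations and all names (`brighten`, `horizontal_flip`, `identity`, `trivial_augment`) are hypothetical stand-ins, and the bounding box transformation needed for object detection is omitted.

```python
import random

NUM_LEVELS = 31  # intensity m is drawn uniformly from 0, 1, ..., 30


def identity(image, m):
    """Leave the image unchanged; the intensity m is ignored."""
    return list(image)


def brighten(image, m):
    """Raise brightness; the shift grows linearly with the intensity m."""
    delta = 0.5 * m / (NUM_LEVELS - 1)
    return [min(1.0, p + delta) for p in image]


def horizontal_flip(image, m):
    """Mirror the row of pixels; this operation has no intensity."""
    return image[::-1]


AUGMENTATIONS = [identity, brighten, horizontal_flip]


def trivial_augment(image, augmentations=AUGMENTATIONS, rng=random):
    """Apply exactly one uniformly sampled operation at a uniformly sampled intensity."""
    op = rng.choice(augmentations)
    m = rng.randrange(NUM_LEVELS)
    return op(image, m), op.__name__, m
```

Because only one operation is applied per image and nothing is searched or learned, the strategy adds essentially no tuning cost; adapting it to a new dataset amounts to editing the list of operations.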
In FairMOT, various backbone networks can be used for the detection network, such as Deep layer aggregation (DLA34) [30], Deep high-resolution representation learning for human pose estimation (HrNet) [22], and You Only Look Once version 5 (YOLOv5) [12]. We use DLA34 as our baseline, which is based on Keypoint triplets for object detection (CenterNet) [9]. CenterNet divides object detection into two parts: center localization and bbox regression. For localization, it generates a Gaussian distribution heatmap centered on the target center. For regression, it defines the pixel at the center of the object as a training sample and directly predicts the height and width of the object. This part also predicts the position offset introduced by down-sampling to correct the coordinates of the object center. At the same time, to jointly train the embedding and the detection network, the detection network in FairMOT has an additional identity embedding branch for generating Re-ID features. CenterNet adopts a positive and negative sample selection method similar to that of CornerNet. The positive and negative samples are shown in Fig. 3.

Positive samples (coincident with the center point of the ground truth) and negative samples.
Detection bboxes whose centers coincide with the ground-truth center are regarded as positive samples, and the rest are regarded as Gaussian-weighted negative samples. In CenterNet, the feature point corresponding to a ground truth is found from the geometric center of the ground truth given by the training data. The heatmaps are set in the category channel. Specifically, the value at the corresponding feature point of the corresponding category is set to 1. The values for that category at the feature points near this point then decrease according to the Gaussian distribution. The sample selection method is shown in Fig. 4(a).

Different positive and negative sample selection methods.
The two-dimensional Gaussian distribution used by CenterNet is the product of two one-dimensional Gaussian distributions of x and y. The standard deviation σ in the two dimensions is the same, and the center point is
Its projection on the two-dimensional surface is a circle. However, the shape of the ship target is usually flat, and it is rich in different aspect ratios. Therefore, we separate the standard deviations of the two one-dimensional Gaussian distributions of x and y. The expression of the modified two-dimensional Gaussian distribution is shown in Equation (2).
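The effect of separating the two standard deviations can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that σx and σy scale with the width and height of the ground-truth bbox through a hypothetical factor `ratio`, and that the heatmap is set to zero outside the bbox, which is only one possible reading of limiting the distribution by the ground-truth bbox. The function names are hypothetical.

```python
import math


def gaussian_heatmap(width, height, cx, cy, sigma_x, sigma_y):
    """Product of two 1-D Gaussians with separate standard deviations in x and y."""
    heat = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = x - cx, y - cy
            exponent = dx * dx / (2 * sigma_x ** 2) + dy * dy / (2 * sigma_y ** 2)
            row.append(math.exp(-exponent))
        heat.append(row)
    return heat


def ship_heatmap(width, height, bbox, ratio=1 / 6):
    """Elliptical heatmap for one ground-truth bbox (x1, y1, x2, y2).

    Assumption: sigma_x and sigma_y are proportional to the bbox width and
    height via `ratio`, and values outside the bbox are zeroed.
    """
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    sigma_x = max((x2 - x1) * ratio, 1e-6)
    sigma_y = max((y2 - y1) * ratio, 1e-6)
    heat = gaussian_heatmap(width, height, cx, cy, sigma_x, sigma_y)
    for y in range(height):
        for x in range(width):
            if not (x1 <= x <= x2 and y1 <= y <= y2):
                heat[y][x] = 0.0
    return heat
```

For a wide, flat box such as (1, 1, 7, 3), the response one pixel to the side of the center stays much higher than the response one pixel above it, so the soft negative-sample weights follow the elongated shape of the ship instead of a circle.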
As shown in Fig. 4(c), its projection on the two-dimensional plane is an ellipse controlled by
Therefore, the two-dimensional Gaussian kernel with a different standard deviation
The detection module generally determines whether a bbox is a target through a single threshold. However, simply ignoring the low-confidence target bboxes can reduce the precision of trajectory matching. The double-threshold method improves the probability that a low-scoring bounding box is matched. At the same time, the high-quality detection frame only uses the motion model for tracking, which reduces the calculation amount of the Re-ID feature similarity measure. The tracking strategy is shown in Algorithm 1.

Data association algorithm
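The two-stage idea can be sketched in code under stated assumptions. This is not a reproduction of Algorithm 1: it follows the description of a first stage that matches high-scoring detections with a Re-ID cost and a second stage that matches the remaining tracks with an IOU cost against the low-scoring detections merged with the unmatched high-scoring ones. A simple greedy matcher is swapped in for the linear assignment solver, the threshold values `high_thresh`, `low_thresh`, and `max_embed_cost` are hypothetical, and the cost functions are supplied by the caller.

```python
def greedy_match(cost, max_cost):
    """Greedy stand-in for linear assignment; returns (row, col) pairs."""
    entries = sorted(
        (cost[r][c], r, c) for r in range(len(cost)) for c in range(len(cost[r]))
    )
    used_rows, used_cols, matches = set(), set(), []
    for value, r, c in entries:
        if value > max_cost:
            break  # all remaining pairs are too costly to accept
        if r in used_rows or c in used_cols:
            continue
        used_rows.add(r)
        used_cols.add(c)
        matches.append((r, c))
    return matches


def associate(tracks, detections, embed_cost, iou_cost,
              high_thresh=0.6, low_thresh=0.1,
              max_embed_cost=0.4, max_iou_cost=0.5):
    """Two-stage association; each detection is a dict with a 'score' key."""
    high = [i for i, d in enumerate(detections) if d["score"] >= high_thresh]
    low = [i for i, d in enumerate(detections)
           if low_thresh <= d["score"] < high_thresh]

    # Stage 1: high-scoring detections, appearance (Re-ID) cost.
    cost1 = [[embed_cost(t, detections[i]) for i in high] for t in tracks]
    matches = [(r, high[c]) for r, c in greedy_match(cost1, max_embed_cost)]
    matched_tracks = {r for r, _ in matches}
    matched_dets = {d for _, d in matches}

    # Stage 2: leftover tracks vs. unmatched high + low detections, IOU cost.
    rest_tracks = [r for r in range(len(tracks)) if r not in matched_tracks]
    rest_dets = [i for i in high if i not in matched_dets] + low
    cost2 = [[iou_cost(tracks[r], detections[i]) for i in rest_dets]
             for r in rest_tracks]
    stage2 = [(rest_tracks[r], rest_dets[c])
              for r, c in greedy_match(cost2, max_iou_cost)]
    matches += stage2
    matched_tracks |= {r for r, _ in stage2}
    matched_dets |= {d for _, d in stage2}

    unmatched_tracks = [r for r in range(len(tracks)) if r not in matched_tracks]
    unmatched_dets = [i for i in high + low if i not in matched_dets]
    return matches, unmatched_tracks, unmatched_dets
```

Detections scoring below `low_thresh` never enter either stage. In a real tracker, `greedy_match` would be replaced by an optimal solver such as `scipy.optimize.linear_sum_assignment`.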
The Tracking-by-detection tracking architecture consists of the following components. (1) Track creation, retention time, and deletion. (2) Using Kalman Filter for state estimation and update. (3) The data association between the trajectory and the detection frame. We assume no camera shake and an online object tracking method. Our tracking framework and data association method are shown in Fig. 5.

Different forms of double threshold data association matching.
As shown in Fig. 5 detection 3, when a detection in the current frame is not associated with any existing track and its score is above the low threshold, a new track is created for it. As shown in Fig. 5 detection 2, detection bboxes whose scores are below the low threshold are discarded directly. Each track records the frame number of its last successful match. If the current frame number minus that frame number exceeds the preset maximum retention time, the track is deleted.
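These creation, retention, and deletion rules can be written as a small bookkeeping class. The sketch below is an illustration under assumptions: the class and attribute names are hypothetical, the default values of `low_thresh` and `max_age` are placeholders, and the matching itself is assumed to have been done elsewhere.

```python
class Track:
    def __init__(self, track_id, bbox, frame_id):
        self.track_id = track_id
        self.bbox = bbox
        self.last_matched_frame = frame_id  # frame of the last successful match


class TrackManager:
    def __init__(self, low_thresh=0.1, max_age=30):
        self.low_thresh = low_thresh  # hypothetical low score threshold
        self.max_age = max_age        # hypothetical maximum retention time (frames)
        self.tracks = []
        self._next_id = 1

    def step(self, frame_id, matched, unmatched_detections):
        """matched: list of (track, bbox); unmatched_detections: list of (bbox, score)."""
        # Matched tracks are refreshed with the new box and frame number.
        for track, bbox in matched:
            track.bbox = bbox
            track.last_matched_frame = frame_id
        # Unmatched detections above the low threshold start new tracks;
        # those below it are discarded.
        for bbox, score in unmatched_detections:
            if score > self.low_thresh:
                self.tracks.append(Track(self._next_id, bbox, frame_id))
                self._next_id += 1
        # Tracks unmatched for longer than the retention time are deleted.
        self.tracks = [
            t for t in self.tracks
            if frame_id - t.last_matched_frame <= self.max_age
        ]
        return self.tracks
```

Keeping a track alive for a few frames after its last match is what allows a ship to recover its identity after a short occlusion.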
Using Kalman filter for state estimation and update
We define an 8-dimensional state space
In Equation (7), x is the state value of the trajectory at time step
In Equation (9), z is the observation space vector, which does not contain differential variation. H is called the measurement matrix, which maps the mean vector
Equation (14) calculates the Mahalanobis distance between the sample observation space and the observation mean as a threshold for whether to choose a trajectory. Then the
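The predict, gate, and update cycle can be illustrated on a single coordinate. The sketch below is a deliberately reduced example, not the 8-dimensional filter of this section: it tracks one position with a constant-velocity model, writes the 2×2 matrix algebra out by hand, and uses hypothetical noise variances. The gate value 3.8415 is the 0.95 quantile of the chi-square distribution with 1 degree of freedom, chosen here because the toy measurement is one-dimensional.

```python
CHI2_95_1DOF = 3.8415  # 0.95 chi-square quantile for a 1-D measurement


class ConstantVelocityKalman1D:
    """State x = [position, velocity]; only the position is observed."""

    def __init__(self, position, pos_var=1.0, vel_var=1.0,
                 process_var=0.01, meas_var=0.1):
        self.x = [position, 0.0]
        self.P = [[pos_var, 0.0], [0.0, vel_var]]
        self.q = process_var  # hypothetical process noise variance
        self.r = meas_var     # hypothetical measurement noise variance

    def predict(self, dt=1.0):
        """x <- F x and P <- F P F^T + Q, with F = [[1, dt], [0, 1]]."""
        p, v = self.x
        (a, b), (c, d) = self.P
        self.x = [p + dt * v, v]
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]

    def gating_distance(self, z):
        """Squared Mahalanobis distance between measurement z and the prediction."""
        innovation = z - self.x[0]
        s = self.P[0][0] + self.r  # innovation variance H P H^T + R
        return innovation * innovation / s

    def update(self, z):
        """Correct the state with measurement z using the Kalman gain."""
        innovation = z - self.x[0]
        s = self.P[0][0] + self.r
        (a, b), (c, d) = self.P
        k0, k1 = a / s, c / s  # Kalman gain K = P H^T / S
        self.x = [self.x[0] + k0 * innovation, self.x[1] + k1 * innovation]
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]
```

A detection is considered for a track only if `gating_distance` is below the gate; otherwise the pair is excluded before assignment.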
Data association method
The association between the current-frame detection bboxes and the trajectories is based on the association matrix. The association matrix is generally defined by the target bounding box IOU, the Mahalanobis distance of the target center, and the cosine distance of the Re-ID feature. We adopt the target bbox IOU and the Re-ID feature cosine distance as the measures for two-stage data association. Usually, after establishing an association matrix, the association between trajectories and detections is formulated as a linear assignment problem. The Jonker-Volgenant [13] assignment algorithm solves it with low time complexity. In addition, based on the superior performance of ByteTrack [33], we apply a double threshold
As shown in Fig. 5 detection 1, the first detection set uses the Embedding cost association matrix to measure the degree of association between the detection frame
Trajectories τ that are still unmatched after two-stage matching will be added to
Experiment analysis
In this section, we track ship targets to evaluate the effectiveness of the model. First, we introduce the marine target dataset required for the experiment. Second, we describe the implementation details of the experiment. Finally, comparative and robustness experiments are carried out, and the experimental results are analyzed and discussed to verify the specific effects of the model.
Dataset introduction and metrics
The public dataset used in this experiment was the SMD. All video frames were shot at
Part of the dataset
Dataset information statistics
We used mean Average Precision (
To verify the practicability of the model and the correctness of the experiment, we used Anaconda to build the experimental environment on the Ubuntu 18 operating system. The GPU is an NVIDIA GeForce RTX 3070 with 8 GB of graphics memory. The programming language is Python. For the detection module, CenterNet was used as the basic detection module and DLA34 was used as the backbone. Common objects in context (COCO) [16] pretrained weights were used as initialization weights. The model was trained for 70 epochs on SMD with a batch size of 4. The training input images were uniformly scaled to
In the linear assignment step of tracking, a tracking bbox and a detection bbox were not matched if their IOU was less than 0.5. For the embedding cost matrix, the optimal estimation equation of the Kalman filter can be deduced from the chi-square distribution of the state vector. Therefore, 9.4877, the 0.95 quantile of the chi-square distribution with 4 degrees of freedom (the value of the inverse chi-square cumulative distribution function at 95% confidence), is used as
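The value 9.4877 can be checked numerically. For 4 degrees of freedom the chi-square cumulative distribution function has the closed form F(x) = 1 − e^(−x/2)(1 + x/2), so its 0.95 quantile can be found by bisection with the standard library alone. The function names below are hypothetical, and the sketch only verifies the constant; it is not part of the tracker.

```python
import math


def chi2_cdf_4dof(x):
    """CDF of the chi-square distribution with 4 degrees of freedom."""
    if x <= 0:
        return 0.0
    half = x / 2.0
    return 1.0 - math.exp(-half) * (1.0 + half)


def chi2_quantile_4dof(prob, lo=0.0, hi=100.0, iterations=100):
    """Invert the CDF by bisection; the CDF is monotonically increasing."""
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if chi2_cdf_4dof(mid) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


gate = chi2_quantile_4dof(0.95)
print(round(gate, 4))
```

Four degrees of freedom correspond to a four-dimensional measurement, so a squared Mahalanobis distance above this value would occur by chance for only about 5% of correct associations.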
Model ablation experiment
We adopted the SMD for model ablation experiments. With the same hyperparameters, all models were trained for 70 epochs for comparison. Our baseline model was FairMOT. The original FairMOT model adopted three data enhancement strategies: random Hue, Saturation, Value (HSV) color enhancement, random affine transformation, and random left-right flip. Based on the FairMOT model (base), we sequentially added automatic data augmentation (Auto), the positive and negative sample selection strategy (select), and the data association matching strategy (match). The results of the ablation experiments are shown in Table 3.
Model ablation experiments
The table shows that automatic data enhancement and the positive and negative sample selection strategy improve model performance at almost no cost. They act during the training phase and add no redundant weight information. The data association matching strategy pays more attention to the performance of

Result of large number of ships.

Result of sea fog scenes.

Result of sea fog scenes.

Result of sea fog scenes.
We compared the tracking results of some specific frames under different models. In Fig. 6, the performance of the detector is relatively saturated, and the main difference between models lies in ID assignment, where ID-switch is serious. Figure 6 shows the 1st, 3rd, and 392nd frames of the video
We chose the CenterNet, FCOS, and our own positive and negative sample selection strategies for comparison. All models used DLA34 as the backbone and were post-processed in the same way, with the minimum confidence threshold set to 0.3 and the minimum bounding box area set to 60. CenterNet and our model used the improved focal loss, and FCOS used the focal loss. Each model was trained on the SMD tracking training set for 70 epochs. The experimental results are shown in Fig. 10.

Comparison of positive and negative sample selection methods for detection models.
It can be seen that adopting different positive and negative sample selection strategies has a large impact on the same model. Both
We compared four state-of-the-art (SOTA) tracking models: DeepSORT, JDE, CenterTrack, and ours (AdapTrack). DeepSORT and JDE use a combination of a motion model and Re-ID similarity matching. CenterTrack matches by finding the smallest predicted offset. We kept the model hyperparameters constant, and each model was trained on the SMD for 70 epochs. The results are shown in Fig. 11.

Comparison of tracking models.
It can be seen that our ship-tracking strategy outperforms several other models on the SMD in both accuracy and speed. CenterTrack, however, has a higher FPS value because its tracking strategy is very simple. That strategy cannot relieve the ID-switch problem, so our model has higher MOTA and IDF1 values.
We compared the robustness of our method (Base, Base + Auto + Select + Match) on two ship datasets: SMD and Vision Based Large-Scale Maritime Ship Tracking Benchmark for Autonomous Navigation Applications (LMD-TShip) [20]. To be fair, we used roughly the same number of frames for training and testing. The tracking results are shown in Table 4.
Model robustness comparison
We find that the adaptively improved detector and the double-threshold tracker achieve better performance on both datasets, which indicates that our model is robust across datasets.
Based on FairMOT, we propose an effective multi-target tracking model for ship targets. The model achieves a MOTA of 36.1, an IDF1 of 47.3, and 28.79 FPS on the SMD. We use the TA automatic data augmentation method during training, which greatly reduces the time spent on tuning parameters. However, tracking is poor for dimly lit scenes. Our tracking model is a one-shot multi-object tracking model. In the detection part, we design an adaptive positive and negative sample selection strategy to improve the learning efficiency and feature expression performance of the model. In the tracking part, a dual-threshold target tracking strategy improves the utilization of low-scoring bboxes and reduces the matching time cost while preserving the tracking effect for high-scoring bboxes. Compared with the original model, the overall model is improved in both tracking accuracy and inference time, but ID-switch still occurs in severe occlusion scenes. Therefore, we hope to further improve tracking in difficult scenes in the future.
