Abstract
Rich image information is one of the most important means by which unmanned surface vehicles identify targets effectively and reliably during autonomous navigation. However, traditional hand-crafted (artificially designed) feature methods adapt poorly to target representation and discrimination because ship targets vary widely in type and scale and outdoor scenes are complex and dynamic. This study proposes a ship target recognition method based on single shot multi-box detector (SSD) deep learning. First, training and test sample sets were constructed by acquiring ship target images and background images of different types and scenes in an actual river environment. Subsequently, the sample set was used to train and optimize the SSD deep model to achieve adaptive extraction and recognition of target features. Lastly, ship identification experiments with different background environments and foreground targets were performed to test the effectiveness of the proposed method. A support vector machine method based on hand-crafted feature extraction was used for the comparative experiments. The experimental results show that the SSD-based deep learning method achieved better recall and precision rates than the hand-crafted feature method.
Introduction
Compared with traditional ships, unmanned surface vehicles (USVs) can complete navigation tasks automatically, without human intervention, through autonomous navigation systems. USVs can also be controlled from shore via remote control systems, thereby effectively reducing the impact of human factors on navigation safety. USVs have been receiving increasing attention and have been developing rapidly [1]. The automatic identification of ship targets in front of USVs is crucial to autonomous navigation and safe assisted driving [2]. Shipborne cameras are widely used in the sensing systems of USVs because they are inexpensive and provide rich information on the navigational environment [3]. Therefore, image-based recognition of ship targets in front of USVs has important application prospects and research value.
Image target recognition is one of the most applicable and highly targeted topics in image analysis and computer vision. In early research on the image recognition of environmental obstacles by USVs, most techniques in the literature inherited classical target detection algorithms, such as wavelet transform, gradient detection, and energy accumulation [4] or texture features [5], to detect weak infrared targets. Wei et al. [6] proposed a surface ship detection method based on wavelet transform. The method first detects the water-sky line via wavelet transform and then uses the principle of correlated wavelet energy synthesis to detect surface ships. Zhou et al. [7] combined an image morphology model with wavelet transform and proposed a sequence image fusion method for detecting surface ships. Llorca et al. [14] used histogram of oriented gradients (HOG) features combined with a support vector machine (SVM) classifier for target recognition; this approach has been widely used in vehicle sign detection. Although these methods have achieved varying degrees of success, they all adopt hand-crafted (artificially designed) features chosen in accordance with the characteristics of the application scene and image objects, and their adaptability and robustness are relatively limited. In particular, USVs operate in dynamic outdoor environments whose backgrounds are complex, variable, and far from uniform. Moreover, ship targets come in many types and sizes, and changes in the angle of view and relative distance during movement result in dramatic changes in image appearance. Therefore, in the application environment of USVs, conventional hand-crafted features can hardly describe diverse dynamic ship targets and distinguish them from complex background scenes.
Deep learning is a new pattern recognition technology that offers self-adaptive feature extraction and automatic target recognition. It has been widely used in speech and image recognition [8–10]. In the current study, a new ship target detection method based on single shot multi-box detector (SSD) deep learning was adopted. Ship target and background image sample sets covering different ship types and scenes were produced by acquiring a large number of images in real scenes. Then, the SSD deep model was trained and optimized to achieve the adaptive extraction and recognition of ship target features. The validity of the method was verified via grouped and ungrouped tests on the test sample sets in different background environments and with different types of foreground targets.
The remainder of this paper is organized as follows. Section 2 introduces the proposed SSD method for ship image recognition. Section 3 presents the experimental results and discussions. We conclude this paper in Section 4.
Ship image recognition method based on SSD
SSD model
The SSD algorithm is a multi-target detection deep learning algorithm that directly predicts target categories and bounding boxes. It achieves high recognition speed and accuracy among deep learning methods for image recognition [11–13]. The basic structure of the SSD model used in this study is shown in Fig. 1. The base network adopts the visual geometry group (VGG) structure.

SSD model structure.
First, the first five groups of convolution layers of VGG were used, and the 6th and 7th (fully connected) layers of VGG were converted into two convolution layers using the atrous algorithm. Subsequently, three convolution layers of different scales and one average pooling layer were added, and the different convolution layers were used to predict the offsets of the default boxes and the scores of the different categories. Lastly, the final detection result was obtained using the non-maximum suppression algorithm. The selection of default boxes and loss functions is crucial for the SSD model.
(1) Default box
Similar to the anchor boxes of the faster region-based convolutional neural network (Faster R-CNN), the SSD model uses a set of default boxes with fixed aspect ratios to match the ship image, and the position and size of each default box are then corrected by regression. However, unlike Faster R-CNN, SSD makes predictions on several feature layers, and thus, the default box scale should be set for each layer as shown in Equation (1):
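As a minimal sketch of such a scale rule, the code below assumes the linear spacing of the original SSD formulation, s_k = s_min + (s_max − s_min)(k − 1)/(m − 1) for k = 1, …, m, with box width s_k·sqrt(a) and height s_k/sqrt(a) for aspect ratio a. The values s_min = 0.2, s_max = 0.9, and m = 6 are assumptions borrowed from that formulation, not values reported in this study, and the function names are illustrative only.

```python
def default_box_scales(num_layers, s_min=0.2, s_max=0.9):
    """Scale s_k of the default boxes for each of the m prediction layers.

    s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1),  k = 1..m
    s_min and s_max are assumed values, not taken from this study.
    """
    if num_layers < 2:
        raise ValueError("need at least two prediction layers")
    step = (s_max - s_min) / (num_layers - 1)
    return [s_min + step * k for k in range(num_layers)]


def default_box_size(scale, aspect_ratio):
    """Width and height (relative to the image side) of one default box."""
    width = scale * aspect_ratio ** 0.5
    height = scale / aspect_ratio ** 0.5
    return width, height


# Assumed configuration: six prediction layers, one 2:1 box per layer.
scales = default_box_scales(6)
for k, s in enumerate(scales, start=1):
    w, h = default_box_size(s, 2.0)
    print(f"layer {k}: scale={s:.2f}, 2:1 box = {w:.3f} x {h:.3f}")
```

Shallow layers receive the small scales and deep layers the large ones, which is how a single network can cover both distant and nearby ships.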
(2) Loss function
During training, the ship’s image and corresponding labeling data are simultaneously adopted as input.
The formula for the overall loss function is as follows:
Correspondingly, the confidence loss formula is as follows:
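For illustration, the sketch below assumes the standard SSD formulation, in which the overall loss is L = (L_conf + α·L_loc)/N, N is the number of matched (positive) default boxes, L_conf is a softmax cross-entropy over class scores, and L_loc is a smooth L1 penalty on the box offsets of positive boxes. The weight α = 1 and all function names are assumptions, and the hard negative mining used in practical SSD training is omitted for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def smooth_l1(x):
    """Smooth L1 penalty used for the localization term."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def ssd_loss(class_scores, labels, pred_offsets, true_offsets, alpha=1.0):
    """Overall loss (L_conf + alpha * L_loc) / N for one image.

    class_scores: per default box, raw scores [background, ship]
    labels:       per default box, 0 = background (negative), 1 = ship
    pred_offsets, true_offsets: per default box, (cx, cy, w, h) offsets;
                  only boxes labeled 1 contribute to L_loc.
    """
    num_pos = sum(1 for y in labels if y > 0)
    if num_pos == 0:
        return 0.0  # convention: the loss is zero when no box is matched
    conf = 0.0
    loc = 0.0
    for scores, y, pred, true in zip(class_scores, labels, pred_offsets, true_offsets):
        conf -= math.log(softmax(scores)[y])  # cross-entropy of the true class
        if y > 0:
            loc += sum(smooth_l1(p - t) for p, t in zip(pred, true))
    return (conf + alpha * loc) / num_pos


# Made-up example: one positive and one negative default box.
loss = ssd_loss(
    class_scores=[[0.2, 2.0], [1.5, -0.5]],
    labels=[1, 0],
    pred_offsets=[(0.1, 0.0, 0.2, -0.1), (0.0, 0.0, 0.0, 0.0)],
    true_offsets=[(0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)],
)
print(f"loss = {loss:.4f}")
```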
Ship image recognition includes two stages: training and testing of the model. First, a large number of ship image samples must be labeled. Then, the correspondence between the ground-truth box and the default boxes of each labeled sample is established. Finally, the processed samples are divided into training and testing samples for model training and testing.
Ship image sample labeling
The training set required by the SSD model includes the ship images and a description file for each image. The description file contains the coordinates of the center point of the minimum bounding box of each ship, the length and width of the bounding box, and the class of the object in the bounding box (only one class, ship, is used). Ship image samples are annotated using the open-source annotation tool LabelImg, which is available on GitHub. This software is developed in Python with Qt; its interface is shown in Fig. 2.

Sample labeling software. (a) 8×8 characteristic diagram, (b) 4×4 characteristic diagram.
Figure 2 shows the interface for labeling ship images using LabelImg on Ubuntu. An image may contain objects of multiple categories that must be labeled, and the minimum bounding box of each object is assigned the label of its category. This experiment only detected ship targets, and thus, only one label was required. The lower right area shows the path information of the annotated image. A minimum circumscribed rectangle is drawn for each ship to be detected in the image. Then, a picture description file is generated as shown in Fig. 3.

Example of VOC2007 data labeling.
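A description file of this kind can be read with a few lines of code. The sketch below assumes the Pascal VOC XML layout written by LabelImg, in which each box is stored as corner coordinates (xmin, ymin, xmax, ymax), and converts them to the center, width, and height form described above. The file name and all numbers in the sample annotation are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the Pascal VOC layout; values are illustrative.
SAMPLE_XML = """
<annotation>
  <filename>ship_0001.jpg</filename>
  <size><width>1280</width><height>720</height><depth>3</depth></size>
  <object>
    <name>ship</name>
    <bndbox><xmin>400</xmin><ymin>300</ymin><xmax>800</xmax><ymax>420</ymax></bndbox>
  </object>
</annotation>
"""


def read_boxes(xml_text):
    """Return (label, cx, cy, w, h) for every object in a VOC-style annotation."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bb = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            float(bb.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        w, h = xmax - xmin, ymax - ymin
        # Convert corner coordinates to center point plus width and height.
        boxes.append((label, xmin + w / 2, ymin + h / 2, w, h))
    return boxes


boxes = read_boxes(SAMPLE_XML)
print(boxes)
```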
After dataset labeling, the target vessel in each training image acquires a ship type label. The SSD model assigns this label to a specific output of the fixed detector output set during the training process, calculates the loss function end-to-end and propagates backward, and gradually adjusts the network parameters using the stochastic gradient descent method.
During training, the ground-truth box of each positive sample must be associated with default boxes. Because default boxes differ in feature layer and aspect ratio, an overlap threshold is set when this correspondence is established, and any default box whose Jaccard overlap with a ground-truth box exceeds the threshold is matched to that box.
Figure 4 shows sample feature maps of different layers. Default boxes of different aspect ratios are matched to the ground-truth boxes during training. The red default boxes are those whose Jaccard overlap is greater than the threshold, so the match is successful. These boxes are treated as positive samples in training, whereas the remaining blue boxes are negative.

SSD default box and training method.
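The matching rule can be sketched as follows. The Jaccard overlap of two boxes is the area of their intersection divided by the area of their union, and a default box is marked positive when its best overlap with any ground-truth box exceeds the threshold. The threshold of 0.6 is the overlap threshold quoted for the experiments; that the same value governs training-time matching is an assumption, as are the function names and box coordinates. The additional SSD step that first assigns each ground-truth box to its single best default box is omitted.

```python
def jaccard(box_a, box_b):
    """Jaccard overlap (IoU) of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_default_boxes(default_boxes, truth_boxes, threshold=0.6):
    """For each default box, return the index of its matched ground-truth
    box, or -1 if no overlap exceeds the threshold (negative sample)."""
    labels = []
    for d in default_boxes:
        overlaps = [jaccard(d, g) for g in truth_boxes]
        best = max(range(len(overlaps)), key=overlaps.__getitem__, default=-1)
        labels.append(best if best >= 0 and overlaps[best] > threshold else -1)
    return labels


# Made-up boxes: two defaults near the ship, one far away.
defaults = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
truth = [(0, 0, 10, 10)]
labels = match_default_boxes(defaults, truth)
print(labels)
```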
When testing with the trained model, 3×3 convolution kernels are applied to the auxiliary classification layers to evaluate the confidence and offset compensation of the default boxes of different aspect ratios at each position. Then, the non-maximum suppression algorithm retains the prediction boxes with the highest confidence.
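A minimal sketch of greedy non-maximum suppression is given below: candidate boxes are visited in order of decreasing confidence, and a box is kept only if it does not overlap an already kept box too strongly. The score threshold of 0.5, the overlap threshold of 0.45, and the sample boxes are assumptions chosen for illustration, not settings reported in this study.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.45, score_threshold=0.5):
    """Return indices of the boxes kept by greedy NMS, best score first."""
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep box i only if it does not strongly overlap any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Made-up predictions: two overlapping boxes on one ship, one separate
# ship, and one low-confidence box.
pred_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 9, 9)]
pred_scores = [0.9, 0.8, 0.7, 0.3]
kept = non_max_suppression(pred_boxes, pred_scores)
print(kept)
```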
First, the appropriate location for the collection of ship image samples was selected. Subsequently, the ship recognition experiment based on SSD deep learning was performed for different scenes and ship types. Furthermore, the SVM algorithm, which is widely used in image recognition, was selected for the ship recognition experiment from different scenes and perspectives for comparison.
Ship image sample collection
After field investigation, the section from Zhonghua Road Pier to Wuhan Guanquan Pier of Wuhan Bridge along the Yangtze River Channel was selected as the experiment site for image collection, as shown in Fig. 5.

Experiment scene.
This area has heavy ship traffic, which includes not only freight ships that frequently travel along fixed routes but also ferries, cruise boats, and other special ships, thereby enabling the collection of numerous ship samples. During the experiment, the camera was mounted on a ferry 1.5 m above the water surface, and images of the scene ahead were collected continuously. Figure 6 shows sample images of ships collected under different weather conditions.

Ship images under different weather conditions. (a) Observation toward the light on sunny days, (b) Observation with backlight on sunny days, (c) Dusk, (d) Cloudy days.
After model training, image detection tests were conducted under different weather conditions. The overlap threshold was set to 0.6 in all the experiments.
Experimental results
(1) Experimental results for different scenes
The test samples were divided into four types of scenes: with backlight on sunny days, toward the light on sunny days, cloudy days, and at dusk, with 388, 289, 215, and 108 images, respectively. The ships in each category were tested and counted. The results of ship detection in different scenes are summarized in Table 1.
Ship detection results at different scenes
(2) Experimental results for different types of ships
The test samples were classified and tested according to ship type in four categories, namely, small yachts, large ferries, cargo ships, and special ships, with 315, 292, 314, and 93 images, respectively (an image that contains ships of more than one type is counted in each corresponding category). Given that various ships may appear in one picture, the number of predicted ships and the precision rate were not computed for this experiment.
The results are summarized in Table 2.
Detection results for different types of ships
(3) Experimental results for ungrouped ships
The ships in all the test samples were tested and counted. The results obtained from Experiment 1 are summarized in Table 3.
Ungrouped ship detection results
The P–R curve of ship detection was plotted at different thresholds as shown in Fig. 7 and the detection effect is shown in Fig. 8.

P–R curve of ship detection.

Examples of ship detection effect.
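For reference, a P–R curve of this kind can be computed as sketched below. Precision is the fraction of predicted ships that are correct, and recall is the fraction of ground-truth ships that are found; each confidence threshold yields one (precision, recall) point. The detections listed in the example are made up for illustration and are not results from this study.

```python
def precision_recall(num_true_positive, num_predicted, num_ground_truth):
    """Precision = TP / predictions, recall = TP / ground-truth ships."""
    precision = num_true_positive / num_predicted if num_predicted else 0.0
    recall = num_true_positive / num_ground_truth if num_ground_truth else 0.0
    return precision, recall


def pr_curve(detections, num_ground_truth, thresholds):
    """detections: (confidence, is_correct) pairs over the whole test set.
    Returns one (threshold, precision, recall) point per threshold."""
    curve = []
    for t in thresholds:
        kept = [ok for conf, ok in detections if conf >= t]
        curve.append((t, *precision_recall(sum(kept), len(kept), num_ground_truth)))
    return curve


# Made-up detections for four ground-truth ships.
sample_detections = [(0.95, True), (0.9, True), (0.8, False), (0.7, True), (0.4, False)]
curve = pr_curve(sample_detections, num_ground_truth=4, thresholds=[0.5, 0.85])
for t, p, r in curve:
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold discards uncertain detections, which tends to increase precision at the cost of recall; the curve traces this trade-off.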
From the preceding experimental results, the following conclusions can be drawn.
(1) Among the four weather conditions, the SSD model achieved the best detection effect in the backlight environment on sunny days. This is probably because texture details are clear under sufficient light, which allows the deep neural network to extract sufficient features to determine the category.
(2) The model was better at detecting small yachts and large ferries than cargo ships and special ships. Two possible reasons can be given. 1) The training samples were unbalanced. During sample collection, small cruise ships and large ferries moved quickly and made frequent round trips, so more samples of these ships were obtained than of cargo ships, which are slower and more diverse in type and appearance. 2) The default boxes of the model were a poor fit. Some types of cargo ships are low and long, with aspect ratios considerably larger than 4 : 1, which makes the regression from the default box to the ground-truth box a nonlinear problem. In this case, the linear regression model could not correctly recognize these ship images.
(3) In the ungrouped experiment, the overall detection effect of the SSD model was good, and the recall and the precision rates reached approximately 0.9.
Comparative experiment of SSD and SVM classifiers
To verify the relative effectiveness of the SSD algorithm for ship image recognition under different weather conditions and from different perspectives, the SVM classifier based on the HOG feature was used for the comparison of ship identification experiments in an open background.
SVM sample production and parameter setting
(1) SVM sample production
In contrast with the SSD model, the samples required by SVM are positive and negative ship images. To improve the efficiency of sample extraction, a ship image sample acquisition platform was built, as shown in Fig. 9. The platform consists of an image and video display module, an image and video acquisition module, a sample acquisition module and a classifier module.

Sample collection software.
The length–width ratio of the sample box was fixed for each collection. A ship image was selected by adjusting the position of the sample box; once the sample was saved, collection moved on to the next frame, as shown in Fig. 10. A positive sample is the boxed image of a ship to be detected, whereas a negative sample is an image of the specified aspect ratio captured at random from the background. All the samples were normalized to the specified pixel size, which considerably accelerated the production of samples. To reflect the actual river surface environment, the negative samples must include images of objects other than ships, such as bridges, piers, surface floats, and shore objects.

Example of sample collection.
During the experiment, 3856 images of ships were collected, of which 2856 were selected to produce positive samples of ships. The remaining 1000 were used as the test set, and 4054 negative samples were generated. In this experiment, the sample size was 128×64, and the dimension of the resulting feature vector was 3780. Two samples are presented in Fig. 11.

Samples of images.
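The feature dimension of 3780 can be checked with a short calculation. The sketch below assumes the commonly used HOG layout of 8×8-pixel cells, 16×16-pixel blocks, a block stride of 8 pixels, and 9 orientation bins; these values are assumptions, but they reproduce the stated dimension for a 128×64 sample (64 pixels wide, 128 pixels high).

```python
def hog_descriptor_length(win=(64, 128), block=(16, 16), stride=(8, 8),
                          cell=(8, 8), bins=9):
    """Length of a HOG descriptor for one detection window.

    All sizes are (width, height) in pixels. The defaults are assumed
    values, not settings confirmed by this study.
    """
    blocks_x = (win[0] - block[0]) // stride[0] + 1   # block positions per row
    blocks_y = (win[1] - block[1]) // stride[1] + 1   # block positions per column
    cells_per_block = (block[0] // cell[0]) * (block[1] // cell[1])
    return blocks_x * blocks_y * cells_per_block * bins


length = hog_descriptor_length()
print(length)  # 7 * 15 block positions, 4 cells per block, 9 bins each
```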
(2) SVM parameter setting
The relevant parameters of the SVM classifier must be set before training. Two groups of parameters are involved: those of the HOG descriptor and those of the SVM.
1) HOG parameters
The HOG descriptor has seven major parameters, which are set as shown in Table 4.
Setting of HOG parameters
2) SVM parameters
After the feature descriptor of a sample is generated, a one-dimensional class label must be appended to it, typically 1 for a positive sample and 0 for a negative sample. Four types of kernel functions, namely, linear, polynomial, radial basis (Gaussian), and sigmoid kernels, were tried in the experiment. Finally, the two kernels with the best performance were selected for further experimental comparison and analysis. The linear kernel function has no parameters to set, and the polynomial kernel function is set as follows:
Here, parameter d was set to 2.0 and γ was set to 8.0.
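As an illustration of these settings, the sketch below assumes the common polynomial kernel form K(x, y) = (γ·x·y + c0)^d, where x·y is the inner product of two feature vectors. The offset c0 is not stated in the text and is set to 0 here as an assumption; the function names and the two short vectors are illustrative only.

```python
def linear_kernel(x, y):
    """Linear kernel: the plain inner product, with no parameters."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, gamma=8.0, degree=2.0, coef0=0.0):
    """Polynomial kernel (gamma * <x, y> + coef0) ** degree.

    gamma and degree follow the values quoted in the text;
    coef0 = 0 is an assumed value.
    """
    return (gamma * linear_kernel(x, y) + coef0) ** degree


# Two made-up, very short feature vectors.
u = [0.1, 0.2]
v = [0.3, 0.4]
print(linear_kernel(u, v), polynomial_kernel(u, v))
```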
Figure 12 shows an example of a ship detection comparison of four different scenes. The green thick line frame in the figure shows the classification result of the HOG+SVM model, whereas the blue thin line frame presents the SSD classification result. The following information can be inferred from the figures.

Example of ship detection in different environments. (a) Toward light on sunny days, (b) Backlight on sunny days, (c) Cloudy days, (d) Dusk.
1) The pictures in Group (a) indicate that both models performed well in the backlight environment of sunny days, and the SSD model additionally found another ship that blended into the background.
2) The use of both models in Group (b) correctly indicated the position of the ship, but the detection frame of the HOG+SVM method did not completely cover the ship.
3) In Group (c), the HOG+SVM model presented a false detection phenomenon, whereas the SSD model exhibited a missed detection phenomenon, thereby highlighting the tendency of the two classifiers in poor light environment. When light was insufficient to extract sufficient features, the default frame of the SSD that matches the missed ship cannot obtain a feature confidence that is greater than the threshold. Thus, correct targets cannot be detected. The gradient direction feature extracted from the sliding window of the HOG+SVM model was similar to the feature in the sample set. Thus, non-ship targets were detected.
4) In Group (d), the detection effect of the two models was extremely poor because of the dim light and the angle of view.
These findings indicate that the features extracted from the deep learning model based on SSD are sufficient, including not only the texture details of a ship, but also other features, such as ship contour. The HOG feature mostly contains the characteristic information of gradient direction, and thus, can detect any image information that is consistent with the training sample in the gradient direction, thereby increasing the false detection rate.
The experimental results of the four groups of typical different angles are shown in Fig. 13. Similarly, the thick green line frame in the figure denotes the classification result of the HOG+SVM model, whereas the thin blue line frame represents the SSD classification result. The classification result of the SSD model is more accurate than that of the HOG+SVM model. The specific conclusions are as follows.

Typical problems of the detection results. (a) Toward the light on sunny days, (b) With backlight on sunny days, (c) Cloudy days, (d) Dusk.
(1) The performance of the SSD model was better than that of HOG+SVM as reflected by the comparison of Groups (a), (b), and (c). Given that the SVM classifier was limited by the features used and the size of the sliding window, its detection scale was only consistent with the HOG features in the window. By contrast, the SSD model could extract features from multiscale images.
(2) In the comparison of Group (d), the HOG+SVM model detected an extremely long and narrow ship as two ships, whereas the SSD model did not detect any ship. This result may be attributed to the ship being too low, so that SSD lost its information through several layers of downsampling during feature extraction. The distinct aspect ratio of the ship, which hindered the model from correctly regressing to the ground-truth box, may be another reason.
In accordance with the experimental results of the SVM and SSD classifiers, the performance of the SSD model was evaluated by adjusting different thresholds and determining the precision and recall rates of target detection. The comparison results are shown in Fig. 14.

Comparison of P–R curves in different models.
The curves in the figure indicate that the deep learning algorithm based on the SSD model achieves better overall performance in ship detection than the machine learning method based on HOG+SVM, and its recall and precision rates can reach or exceed 0.9.
In this study, an image recognition algorithm for ships in front of USVs was proposed based on the SSD deep learning model, and a large number of real ship image samples were collected in an actual river environment. Ship image recognition experiments based on SSD were performed. Finally, the results were compared with those of the HOG+SVM hand-crafted feature method in different scenarios and from various perspectives. The main findings are as follows.
1) The SSD model can maintain high recall and precision rates under different weather conditions and from various angles, and the detection effect on ships is good.
2) The comparison of the detection results of the HOG+SVM and SSD models shows that the SSD deep learning method has a lower false detection rate than the SVM algorithm and can detect ship targets at multiple scales.
In general, the ship image recognition algorithm based on SSD deep learning exhibits good adaptability to complex interference factors in an actual scene of USVs.
Acknowledgments
This study is supported by the Fujian Province Natural Science Foundation (No: 2018J01506), University-industry cooperation program of Department of Science and Technology of Fujian Province (No.2019H6018), Fuzhou Science and Technology Planning Project (No: 2018S113, 2018G92), the Educational Research Projects of Young Teachers of Fujian Province (No. JK2017038, JAT170439), and the 2017 Outstanding Young Scientist Training Program of Colleges in Fujian Province.
