Abstract
The large scale, high speed and growing number of vessels, together with busy sea routes, increase the complexity of marine traffic. Traffic controllers and mariners need to understand the traffic situation and pay more attention to high-complexity areas. Previous studies often used density-based clustering algorithms to discover high-density vessel clusters and thereby evaluate collision risk in a water area. However, those algorithms arguably ignore the encounter situation between ships. This paper focuses on modeling the complexity of two encountering ships and on clustering with data mining technology. A complexity model is proposed that employs intrinsic features to reflect pair-wise interactions between ships. Based on this complexity, a clustering method for ship-to-ship encounter risk is presented that introduces a new distance definition, so that the complexity of a large number of ships in an area can be calculated quickly.
Keywords
Introduction
The automatic identification system (AIS), which must be installed on vessels of 500 gross tonnage and above, has produced a large amount of dynamic ship data. However, massive traffic data do not always help traffic controllers or mariners understand the traffic situation. On the contrary, too much irrelevant data may impair the perception of other important information [1]. Therefore, how to extract useful information from massive data has attracted considerable attention in both academic research and practical applications.
Currently, there are primarily two methods to describe the traffic situation. The first focuses on the statistics and analysis of traffic characteristics from a macroscopic perspective by mining historical AIS data. The research waters are divided into grids, and the number of AIS objects in each cell is counted to obtain the ship density distribution [2]. Routes and mooring areas are estimated using AIS trajectory clustering [3]. The second studies ship behavior at the micro level using ship domain, traffic conflict and probability theories. Encounter situations and near misses are detected to evaluate ship collision risk on the basis of AIS data [4, 5, 6]. Ship behaviors such as fishing, barging, smuggling and other activities can also be determined by analyzing AIS data [7].
Nevertheless, few studies have applied both methods together. In previous AIS-based studies, for instance, density-based clustering algorithms were often used to discover high-density vessel clusters and thereby evaluate collision risk in a water area [8, 9]. It can be argued that those algorithms ignore the encounter situation. Clearly, well-organized high-density traffic within traffic separation schemes carries no higher risk. Therefore, the various encounter situations, such as approaching or receding and head-on or crossing, should be significant factors in clustering.
In general, there are three approaches to ranking encounter risk. The first uses the evolution of the distance at closest point of approach (DCPA) and the time to closest point of approach (TCPA) during encounters. The second estimates the ship domains that vessels normally maintain and determines whether these domains are entered [10]. The third proposes a framework to rank conflict severity or complexity [11, 12, 13]. However, because accounting for the spatio-temporal relations of multiple interacting traffic users is enormously complex, all of these methods mainly focus on pair-wise encounters between vessels.
Recently, traffic complexity has received increasing attention in academic research. It was originally developed to estimate the workload of air traffic controllers. As related research developed, traffic complexity was gradually applied to describing the traffic situation. Some researchers believe that the traffic situation is related only to intrinsic traffic features, such as location and motion [14]. A marine traffic complexity model was built on the basis of intrinsic features to describe the traffic situation [1]. Nonetheless, that model assumed only a standard length overall (LOA), and its utility in data mining has not yet been discussed.
Accordingly, this paper primarily focuses on modeling the complexity of two encountering ships and on clustering with data mining technology. Compared with previous work, its main contributions are threefold. Firstly, the research is divided into two steps, complexity modeling and density-based clustering, combining the microscopic analysis of ship encounter behavior with the macroscopic statistics of ship density distribution. Secondly, the complexity model is proposed by employing intrinsic features, different LOA and a novel analysis of the crossing angle factor. Thirdly, a clustering method for ship-to-ship encounter risk is presented on the basis of complexity and risk factor analysis by proposing a new distance definition.
Complexity model
Basic concept
Two encountering vessels at close range,
There is a distance boundary (
Complexity analysis
Relative distance, movement trend and crossing angle are selected as the factors that influence the complexity of a vessel couple [13].
The analysis follows the hypothesis that the complexity of a vessel couple is continuous, that is, the complexity value changes continuously as the factors change [1].
Diagram of movement trend analysis.
If both the movement trend and the crossing angle are invariant, the complexity increases as the relative distance decreases. When
Movement trend factor
For the same distance and crossing angle, there are two opposite movement trends, approaching or receding, depending on the relative position and orientation of the vessels. In most cases, receding vessels involve no risk [12]. The complexity of a receding vessel couple can therefore be defined as zero.
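The approaching/receding distinction can be illustrated with a short Python sketch. It is a standard kinematic check, not necessarily the formulation used in this paper: the range between two vessels is decreasing exactly when the dot product of their relative position and relative velocity is negative. The function names, the flat east/north coordinate frame (in nautical miles) and the use of SOG/COG as velocity inputs are assumptions made for illustration.

```python
import math


def velocity_xy(sog_knots, cog_deg):
    """Convert speed and course over ground into (east, north) components in knots."""
    theta = math.radians(cog_deg)
    return sog_knots * math.sin(theta), sog_knots * math.cos(theta)


def is_approaching(pos_a, pos_b, sog_a, cog_a, sog_b, cog_b):
    """Return True when the range between vessel A and vessel B is decreasing.

    pos_a and pos_b are (east, north) positions in a local planar frame
    (an assumption; real AIS positions would first need to be projected).
    """
    # Relative position of B as seen from A.
    rx, ry = pos_b[0] - pos_a[0], pos_b[1] - pos_a[1]
    # Relative velocity of B with respect to A.
    vax, vay = velocity_xy(sog_a, cog_a)
    vbx, vby = velocity_xy(sog_b, cog_b)
    vx, vy = vbx - vax, vby - vay
    # d/dt |r|^2 = 2 * (r . v): a negative value means the range is shrinking.
    return rx * vx + ry * vy < 0


def movement_trend_gate(approaching):
    """Gate factor: 0 for a receding couple (complexity defined as zero), else 1."""
    return 1.0 if approaching else 0.0
```

With such a gate, a receding couple contributes nothing to the complexity regardless of distance or crossing angle, which matches the definition above.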
The relative motion of two vessels can be determined as [16]
It can be deduced from Fig. 1,
Set
The diagram of ship encounter situation.
Two-ship encounter situations are shown in Fig. 2 [17], with reference to the Convention on the International Regulations for Preventing Collisions at Sea, 1972 (COLREGS). With
However, when numerous ships approach at the same time, they must search for an optimal set of safe trajectories for all ships involved in the encounter [18]. A power-driven vessel which takes action in a crossing situation in accordance with sub-paragraph 17.3 of the COLREGS to avoid collision with another power-driven vessel shall, if the circumstances of the case admit, not alter course to port for a vessel on her own port side. In contrast, the common practice of seafarers in a multi-ship encounter is to obey the regulations for each specific encounter situation. For instance, they tend to turn to starboard when avoiding a ship coming from ahead [19], which contradicts the give-way ship’s anti-collision behavior with
Therefore, we suppose that the complexity is maximized when
The complexity increases nonlinearly with decreasing
Angle complexity as a function of 
Besides, to ensure the assumption that the complexity is continuous, the following functions are constructed [20]:
As mentioned above, the encounter situation should be an influential factor in the density-based clustering of AIS data. Supposing that the complexity of a vessel couple is the sum of the complexity generated by the encounter situation and the complexity due to spatial distance, it can be expressed as follows:
Because vessels differ in LOA, their ship domains have different radii,
Typically, the higher the complexity, the more dangerous the vessel couple is. This paper proposes a new distance definition, the order distance: the smaller the order distance, the more hazardous the vessel couple.
Existing studies usually address how to compress ship trajectory big data to minimize storage while preserving data quality, for instance with the Douglas-Peucker algorithm [21], or focus on algorithms for detecting and pre-processing errors in AIS tracks, including physical integrity, spatial logical integrity and time accuracy [22]. This paper instead highlights the alignment of ship AIS data on the same time slices, because an encounter is the simultaneous presence of vessels in a limited area.
In order to calculate the complexity of vessel couples, the AIS data points should be obtained from the same time slice
For an area, a large number of AIS data points are to be extracted from vessel traffic service (VTS) system over a period of time. Specifically, each point has seven attributes that are time, MMSI, longitude, latitude, SOG, COG and LOA, denoted by
AIS data are reduced by the following steps:
1. Remove the AIS data in which the MMSI is substandard.
2. Remove the AIS data with a speed less than or equal to 2.0 knots. When SOG is larger than 2.0 knots, the AIS transmission interval is no more than 30 seconds for both Class A and Class B equipment [23], so every minute always contains sample points.
3. Remove all AIS data except the first report in each minute. For example, if a vessel
4. Calculate each vessel’s location at the exact minute. Assume that the speed and course of vessel
Diagram of AIS data reduction.
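The reduction steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a "substandard" MMSI is taken to mean one that is not nine digits long, the first report in each minute is projected back to the start of that minute (the text does not show which whole minute is used), and the projection uses simple dead reckoning with constant SOG and COG. The names `AisPoint`, `dead_reckon` and `reduce_ais` are hypothetical.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AisPoint:
    time_s: int   # timestamp in seconds
    mmsi: int
    lon: float    # degrees
    lat: float    # degrees
    sog: float    # knots
    cog: float    # degrees
    loa: float    # metres


def dead_reckon(p, target_time_s):
    """Project a report to target_time_s, assuming constant speed and course."""
    hours = (target_time_s - p.time_s) / 3600.0
    dist_nm = p.sog * hours                      # negative when projecting backwards
    theta = math.radians(p.cog)
    dlat = dist_nm * math.cos(theta) / 60.0      # 1 degree of latitude is about 60 nm
    dlon = dist_nm * math.sin(theta) / (60.0 * math.cos(math.radians(p.lat)))
    return replace(p, time_s=target_time_s, lon=p.lon + dlon, lat=p.lat + dlat)


def reduce_ais(points, min_sog=2.0):
    """Apply the four reduction steps and return one point per vessel per minute."""
    first_in_minute = {}
    for p in sorted(points, key=lambda q: q.time_s):
        # Step 1 (assumed rule): keep only nine-digit MMSI values.
        if not (100_000_000 <= p.mmsi <= 999_999_999):
            continue
        # Step 2: drop slow or stationary vessels.
        if p.sog <= min_sog:
            continue
        # Step 3: keep only the first report of each vessel in each minute.
        minute_start = p.time_s - p.time_s % 60
        first_in_minute.setdefault((p.mmsi, minute_start), p)
    # Step 4: move every kept report onto its exact minute (the time slice).
    ordered = sorted(first_in_minute.items(), key=lambda kv: kv[0])
    return [dead_reckon(p, minute) for (_, minute), p in ordered]
```

After this step, every remaining point carries a timestamp on a whole minute, so all vessels in one time slice can be compared directly.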
Ordering points to identify the clustering structure (OPTICS) is an algorithm for finding density-based clusters in spatial data [24]. It was originally built on spatial distance, and its basic form cannot handle spatio-temporal data. To obtain the complete complexity of every time slice in a period, this paper modifies the OPTICS definitions and procedure.
Algorithm redefinition
Actually, improved OPTICS algorithm still requires two parameters:
AIS points of
Procedure of improved OPTICS (DB, 
As illustrated in Fig. 5, the main procedure is OPTICS (DB,
Update (N,
Through the algorithm, AIS data points are linearly ordered so that the points closest in order distance become neighbors in the ordering. The reachability-distance diagram then clearly reveals troughs, which indicate greater risk among the vessels concerned and thus more accident-prone situations.
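How such an ordering and its reachability distances arise can be shown with a compact Python sketch of the generic OPTICS procedure [24]. It is not the improved algorithm of this paper: the distance is passed in as an arbitrary function `dist` (the order distance would be plugged in here), restriction to one time slice is left to the caller, and, as a simplifying assumption, a point's neighborhood excludes the point itself, so the core distance is the distance to its `min_pts`-th nearest neighbor within `eps`.

```python
import heapq


def optics(points, dist, eps, min_pts):
    """Return (order, reachability, core) for a generic OPTICS run.

    Undefined reachability and core distances are represented by None.
    """
    n = len(points)
    reach = [None] * n
    core = [None] * n
    processed = [False] * n
    order = []

    def neighbors(i):
        return [j for j in range(n) if j != i and dist(points[i], points[j]) <= eps]

    def core_distance(i, nbrs):
        if len(nbrs) < min_pts:
            return None                       # not a core point
        ds = sorted(dist(points[i], points[j]) for j in nbrs)
        return ds[min_pts - 1]

    def update(i, nbrs, seeds):
        for j in nbrs:
            if processed[j]:
                continue
            new_reach = max(core[i], dist(points[i], points[j]))
            if reach[j] is None or new_reach < reach[j]:
                reach[j] = new_reach
                heapq.heappush(seeds, (new_reach, j))

    for start in range(n):
        if processed[start]:
            continue
        seeds = [(0.0, start)]                # a new start keeps reach undefined
        while seeds:
            _, i = heapq.heappop(seeds)
            if processed[i]:
                continue                      # stale queue entry
            processed[i] = True
            order.append(i)
            nbrs = neighbors(i)
            core[i] = core_distance(i, nbrs)
            if core[i] is not None:
                update(i, nbrs, seeds)
    return order, reach, core
```

Plotting `reach` in the sequence given by `order` yields the reachability diagram; low values sit in the troughs.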
Complexity of marine traffic
At a time slice, there are numerous vessels in an area, and any two of them may form a vessel couple.
When the algorithm outputs some vessel objects whose reachability distances are defined, there will be complexity of marine traffic among the vessel objects. MinPts
Procedure of update (N, 
Diagram of research waters.
The complexity of marine traffic can be used to rank the risk level at a time slice and thus to reveal the traffic situation minute by minute in real time. It is crucial for VTS operators to grasp how the complexity varies over time and to focus on high-complexity areas in a timely manner.
The above-mentioned period can be divided into
The average complexity of marine traffic can not only reveal long-term traffic patterns, but also assess the current traffic status of a VTS area and the workload of the VTS operator. Through traffic organization, a VTS operator can reduce multi-ship encounters at close range, so the operator's performance can be measured by the average complexity of marine traffic during working hours.
AIS data collection and reduction
For the case study, a water area bounded by four points (A, B, C and D) in the Zhoushan North Sea of China is selected, where the East Route, West Route, Majishan Fairway and Yangshan Fairway converge, as shown in Fig. 7. The marine traffic flow here is relatively complicated.
A total of 2,525,478 AIS data points for one month (1 July to 31 July 2017) are extracted from the VTS system. After the reduction method explained in Section 3.1 is applied, 447,632 data records remain, as shown in Table 1.
Parameter settings
Experts are invited to set parameters. Given the rather extensive scope of the elicitation, it was preferred to select only a limited number of experts who were able to contribute their expertise over a longer time. The experts include one captain (10 years of experience), two first officers (10 and 5 years of experience) and two VTS operators (8 and 6 years of experience).
AIS data list after reduction (part)
Besides, more information needs to be provided by mariners and traffic controllers in order to obtain the value of
It can be calculated as
Illustration of OPTICS clustering (part).
The improved OPTICS algorithm is implemented in Java on the Waikato Environment for Knowledge Analysis (WEKA) platform, and the clustering result can be interpreted as follows.
The clustering result of improved OPTICS algorithm (part)
Setting
The AIS data object (MMSI: 41348300) at 03:22 on 2 July 2017 is denoted by column A (key: 100950); its core distance is 0.21542 and its reachability distance is undefined. It is density-connected with nine adjacent AIS data objects (in output order, their MMSI are 413702090, 482458290, 373134000, 566047000, 413370750, 412436360, 413489830, 412148000, 413556320, according to Table 2) to its right in Fig. 8, which together constitute a cluster. The same object (MMSI: 41348300) at 03:23 is denoted by column B (key: 100951); its core distance is 0.04267 and its reachability distance is also undefined. It is density-connected with twenty adjacent AIS data objects (in output order, their MMSI are 412436360, 412501270, 482458290, …, 413696320, according to Table 2) to its right in Fig. 8, forming another cluster. Different clusters are separated by columns whose reachability distance is undefined.
The data object (MMSI: 412762270) at 03:23 is denoted by column C (key: 53087); its core distance is 0.05716 and its reachability distance is 0.0457. Column C lies in a trough of the diagram, which represents a smaller order distance and a greater complexity with respect to the surrounding objects. A VTS operator can regard it as a more dangerous situation.
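Reading clusters and troughs off the ordered output can be sketched in Python. The sketch assumes the output is available as a list of indices in output order plus a list of reachability distances, with `None` standing for "undefined"; the function names and this data layout are illustrative assumptions, not the WEKA output format.

```python
def split_clusters(order, reachability):
    """Start a new cluster at every object whose reachability distance is undefined."""
    clusters = []
    for idx in order:
        if reachability[idx] is None or not clusters:
            clusters.append([idx])
        else:
            clusters[-1].append(idx)
    return clusters


def deepest_trough(cluster, reachability):
    """Return the object with the smallest defined reachability distance, if any."""
    defined = [idx for idx in cluster if reachability[idx] is not None]
    if not defined:
        return None
    return min(defined, key=lambda idx: reachability[idx])
```

In the terms used above, columns A and B would each open a cluster, and an object such as column C would be reported by `deepest_trough` if it had the smallest reachability distance in its cluster.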
Complexity of marine traffic
It can be calculated by the Eq. (9) and data from Table 2,
The complexity of marine traffic at 03:23 is larger than that at 03:22, indicating a process of complexity evolution. The complexity of marine traffic in the area at every exact minute (time slice) is calculated by Eq. (9). The statistics are given in Table 3.
Statistics of time slices with complexity of marine traffic
In total, there are 44,640 minutes or time slices (31 days × 24 hours × 60 minutes), of which 20,448 have a complexity of marine traffic greater than or equal to 0 and less than 1, a ratio of 45.8%. In the same way, 80.6% (45.8% plus 34.8%) of the time slices lie between 0 and 25, 16.3% (9.9% plus 6.4%) between 25 and 100, and 3.1% (2.8% plus 0.3%) above 100.
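The arithmetic behind these shares can be checked with a few lines of Python. Only figures quoted in the text are used; the six percentages are the bin shares cited above, and the variable names are illustrative.

```python
# Number of one-minute time slices in the 31-day study period.
total_slices = 31 * 24 * 60

# Share of slices whose complexity lies in [0, 1), in percent.
lowest_bin_share = round(100 * 20448 / total_slices, 1)

# The six bin shares quoted in the text, in percent, in the order cited.
bin_shares = [45.8, 34.8, 9.9, 6.4, 2.8, 0.3]

low_share = round(sum(bin_shares[0:2]), 1)      # complexity between 0 and 25
medium_share = round(sum(bin_shares[2:4]), 1)   # complexity between 25 and 100
high_share = round(sum(bin_shares[4:6]), 1)     # complexity above 100

print(total_slices, lowest_bin_share, low_share, medium_share, high_share)
```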
The notion of the “safety pyramid” was discussed as early as 1931 by Heinrich and was further developed by Bird based on his 1969 study of industrial accidents [27]. The bottom region of the pyramid was expanded to include events with no adverse effects, such as the observation of a condition that has the potential to cause an incident, and the lower portion of the pyramid was identified as the “near-miss” region, which from bottom to top comprises “positive illusions, unsafe condition and unawareness, ignorance, complacency”, “foreshadowing events and observations” and “minor-loss events” [28]. In this paper, the “near-miss” region is divided into three parts: “low complexity”, “medium complexity” and “high complexity”. Furthermore, it is assumed that at sea, in the long term, low complexity accounts for about 80%, medium complexity for about 16% and high complexity for about 4%. Accordingly, the complexity of marine traffic in the research area can be graded as follows: a value between 0 and 25 is relatively low (80.6% of time slices), a value between 25 and 100 is medium (16.3%), and a value larger than 100 is high (3.1%). Thus, if the computed value is 115, the situation is highly risky and requires extra attention from the VTS operator.
In accordance with the crew’s on-duty schedule, a day is divided into six equal periods. The average complexity of marine traffic in a period of time can be calculated by Eqs. (9) and (10). The statistics are given in Table 4.
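The three-level grading can be written as a small Python helper. The thresholds 25 and 100 come from the text; how the exact boundary values are assigned (25 to medium, 100 to medium) is an assumption, since the text only says "between" and "larger than", and the function name is illustrative.

```python
LOW_UPPER = 25.0      # upper bound of the low-complexity band (from the text)
MEDIUM_UPPER = 100.0  # upper bound of the medium-complexity band (from the text)


def complexity_level(value):
    """Grade a complexity of marine traffic value as 'low', 'medium' or 'high'."""
    if value < 0:
        raise ValueError("complexity of marine traffic is expected to be non-negative")
    if value < LOW_UPPER:
        return "low"
    if value <= MEDIUM_UPPER:   # boundary handling is an assumption
        return "medium"
    return "high"


# The value 115 mentioned in the text falls in the high band.
print(complexity_level(115))
```

A VTS display could use such a grade to highlight the time slices or areas that need the operator's attention first.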
Statistics of the average complexity of marine traffic within periods of time
The average complexity of marine traffic peaks between 1200 and 1600 and is lowest between 0800 and 1200 in these waters, indicating that risk is usually highest in the afternoon, when VTS operators should be especially cautious.
This research primarily focuses on modeling the complexity of a vessel couple and on clustering with data mining technology. A complexity model is proposed that employs intrinsic features to reflect pair-wise interactions between ships. A clustering method for ship-to-ship encounter risk is presented on the basis of complexity and risk factor analysis by proposing a new distance definition, so that the complexity of numerous ships in an area can be calculated efficiently.
Actual AIS data for one month from the Zhoushan North Sea of China are employed to demonstrate the model and the algorithm. The statistical results show that, in this area, low complexity lies between 0 and 25, medium complexity between 25 and 100, and high complexity above 100. Moreover, the average complexity of marine traffic peaks between 1200 and 1600 and is lowest between 0800 and 1200, indicating that the riskiest period in these waters is the afternoon and the least risky is the morning.
Because it is based on complexity modeling of vessel couples and clustering with data mining technology, the research methodology may benefit safe marine operations in other traffic waters.
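The per-period averaging can be sketched in Python. Eq. (10) is not reproduced in this text, so the sketch assumes the average is the plain arithmetic mean of the per-minute complexity values that fall in each of the six four-hour periods; the function names and the `(hour_of_day, complexity)` input layout are illustrative.

```python
from collections import defaultdict

WATCH_HOURS = 4  # a day split into six equal periods


def watch_label(hour):
    """Map an hour of day (0-23) to its four-hour period, e.g. 13 -> '1200-1600'."""
    start = (hour // WATCH_HOURS) * WATCH_HOURS
    return f"{start:02d}00-{start + WATCH_HOURS:02d}00"


def average_by_watch(samples):
    """Average complexity per period.

    samples: iterable of (hour_of_day, complexity) pairs, one per time slice.
    Assumes a simple arithmetic mean, which may differ from Eq. (10).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for hour, complexity in samples:
        label = watch_label(hour)
        sums[label] += complexity
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}
```

Comparing the resulting six averages shows which period carries the highest average complexity.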
