Abstract
Global Positionning System (GPS) trajectory is an ordered list of GPS points, which are approximate since they depend on the quality of the GPS sensor and the covering satellites. Finding common frequent sub-trajectories in a given trajectories database enables to detect what are the most used paths encapsulating the objects behaviours. Most trajectories mining algorithms proposed in the literature require a preprocessing discretization step where the plan is discretized into tile blocks, enabling to use classical sequential mining algorithms. However, this step is time consuming and improper for real time applications. In this paper, we propose an algorithm, named TrajGrowth, which directly works on the raw data, without any preprocessing step and without requiring a laborious parameter setting for its execution. Clearly, instead the costly discretization step of standard approaches, we used a precision parameter for which low values push down the mining process to find more precise patterns. The experimental results show that our proposed approach is more precise than the discretization based approaches with a better processing time and avoiding redundant patterns.
Keywords

Introduction
GPS row data has recently registred an impressive growth [1, 2], induced by the diffusion of GPS devices, particularly in smartphones. The number of applications that use this data has also grown, supported by the numerous needs that can be expressed on geolocalized data.
Problem statement
GPS sensors enable to determine user’s trajectory from a starting to a destination points. GPS navigation applications generate a very large amount of GPS trajectories, which encapsulate user behaviors [3]. Note that a GPS point has uncertainties related to the accuracy of the sensor and the quality of GPS coverage (number of reachable satellites). In fact, the GPS-equipped vehicles traffic sensor provides approximate tracking data. Every object (e.g., vehicle, person, …) provides a sequence of GPS points forming a GPS trajectory, and a set of GPS trajectories can be obtained from objects, called trajectories database. Mining in a trajectories database consists in finding a set of sub-trajectories, called also frequent patterns, appearing with at least some given percentage in the database. This problem of trajectories mining has numerous applications [4]:
In the context of urban trajectories, frequent sub-trajectories reveal the most used roads in the city. Such highly used sub-trajectories can be used in the urban decision making process towards a better fluent city traffic. Extracting information from urban trajectories is essential for better city management. In general, this management is categorized into three groups: improving driver experience [5, 6, 7, 8], taxi or bus services [9, 10], scheduling and control transport system [11, 12].
Mining trajectories of animals (e.g., bumble, bees, birds, …) enable to discover shared animal movements patterns.
In team sports events (e.g., soccer, baseball, …), frequent trajectories give insight about tactics used in the game.
Research issues
Searching frequent sub-trajectories on moving objects, raises many challenging research issues:
GPS points within trajectories have uncertainties related to the accuracy of the sensor. In the context of urban trajectories, there are efficient map-matching algorithms [13, 14, 15] that allow to find the path in the map that is as close as possible to the GPS trajectory. This solution is viable only in the context of urban data, and moreover it requires a complete map of the city. For all of these reasons, we raise the challenge to mine trajectories directly on the raw data, without requiring any additional information.
Since GPS points have uncertainties, trajectories data suffer from redundancies of GPS raw data, which should be considered to detect redundant data by taking into account the precision of GPS sensors.
In practice, GPS trajectories database is large. We need to design a mining algorithm, which should be as incremental as possible.
Current methods in the litterature introduce a preprocessing step before mining trajectories data. This step consists in discretizing the two-dimensional space encompassing all the trajectories in the form of a regular paving of elements. Afterwards it transforms each of the GPS trajectories into a series of blocks (e.g., hexagons, etc.) containing this track. Each trajectory becomes a simple sequence of items, allowing to transform the GPS trajectory mining problem to a sequential pattern mining problem that can be tackled with dedicated algorithms [16], such as PrefixSpan [17] algorithm. These methods require a pre-processing phase which has an immediate and significant quality and performance impact on the sequential mining that will follow. The finer the paving is, the better is the precision of the mining, and more is expensive the running time. Obviously if the paving is rough, the mining process will generate many redundant imprecise paths. This third challenge consists in designing a mining algorithm which does not require any preprocessing costly step, and handles the precision through a unique given numeric tolerance value, which takes into account the precision of the GPS sensors.
Moreover, these state of the art discretization based algorithms are improper for real time mining applications. This is due to the fact that data stream is difficult to process in a two steps approach.
Contributions
In order to tackle these issues, the contributions of this paper are as follows:
We have specified mathematically the main problem of frequent trajectory pattern mining problem, which enables to consider the challenging research issues of the problem. Such a formalization enables to show the correctness of the proposed mining algorithm. We have proposed an incremental algorithm named TrajGrowth, which directly works on the raw data, without any preprocessing step and without requiring a laborious parameter setting for its execution. Clearly, instead the costly discretization step of standard approaches, we used a precision parameter for which low values push down the mining process to find more precise patterns. We have experimented our approach on both synthetic and real benchmark trajectories to show its effectiveness.
The remainder of the paper is organized as follows: In Section 2 we give a literature review of some classical frequent trajectories mining methods. In Section 3, we specify formally the problem of mining frequent trajectory patterns. Then we detailed our TrajGrowth approach with an illustrative example showing the interest of the proposed approach. Section 5 reports experiments. Finally, we conclude and draw some perspectives.
Trajectory pattern mining (TPM) is an interesting research topic that has been studied in different context of the Internet of Things (IoT) and smart cities. Many technics have been proposed to mine frequent patterns and moving behaviors from spatio-temporal trajectories. They have been performed over many applications, including animal movement [18], trafic trajectory data [5, 6, 7, 8, 9, 10, 11, 12], and personal travel history [19].
These techniques generally involve two steps, which are: trajectory preprocessing and pattern mining.
Trajectory preprocessing
Preprocessing takes spatio-temporal trajectories and the road network as input, then performs the following three tasks:
Filtering
This step cleans the trajectory dataset by parsing the noisy GPS points using a heuristic-based detection method, such as proposed in [20].
Trajectory Map-Matching
In this step, the system tries to find a corresponding road segment for each GPS points. There are some efficient map-matching algorithms as proposed in [13, 14, 15] that allow to find the path in the map that is as close as possible to the GPS trajectory.
Spatial discretization
It is a symbolization process. More precisely, each trajectory is transformed into a sequence of symboles, where each symbol corresponds to a GPS point. Most methods in the literature discretrize the two-dimensional space including all the trajectories in the form of regular paving elements. Giannoti et al. [21] applied uniform grid partitioning technique to generate region of interest (RoI) that are locations from which trajectory patterns can be extracted. Wang et al. [16] proposed two trajectory pattern mining algorithms called VTPM-PrefixSpan and VTPM-GSP on the sequential database output of the spatial discretization step. In this step, workspace is divided into uniformly sized cells that are provided by the user as a parameter. However, Kang et al. [22] have already highlighted the weakness of these methods based on the uniform partitioning of cells. Their weakness is related to the fact that there is no accurate method for determining the exact size of the grid’s cell. Indeed, if the grid is very wide, objects with very different trajectories can be considered similar. Otherwise, if the grid is too small, objects with similar trajectories can be considered as belonging to different groups. Finally, in the case of unevenly distributed trajectories, the two previous cases occur simultaneously.
In order to overcome this drawback, Manta et al. [23] proposed to apply a RoI generation method that can adapt to irregular spatial data distribution, then include it to a trajectory pattern mining method.
Another drawback of spatial discretization step concerns the high temporal complexity required for the accomplishment of such task. Masciari et al. [24] overcome this limitation by proposing a fast algorithm for partitioning incoming streams of trajectories using a sliding windows approach combined with a counting algorithm. The proposed strategy was compared against the method described in [21] by measuring executions times and the number of extracted regions and patterns. Motivated by this last shortcoming, in this paper we propose a scalable algorithm called TrajGrowth, which operates directly on the raw data without any discretization step.
Trajectory pattern mining
Discovering frequent patterns from historical trajectories of object is a very challenging task. The prior work can be traced back to Chang et al. [25], where authors used sequential pattern mining (SPM) method, first introduced by Agrawal et al. [26]. Gidofalvi et al. [27] adopted also sequential pattern mining algorithm to mine long trajectories of moving objects. Bayir et al. [28] used association rules mining for discovering mobility profiles from cell phones of users. However, using sequential pattern mining algorithms for trajectories data, such as sequential pattern mining (e.g., PrefixSpan [17], SPADE [29]) and association rules mining (e.g., FP-tree [30], Appriori [31], …), can not be directly applied for many reasons [32]: these algorithms do not consider the spatial information, the contiguity between items, and the uncertainties due to GPS censors accuracy. To avoid these problems, Lv et al. [33] proposed an algorithm called SCPM (Spatial Continuity based Pattern Mining) which considers the spatial continuity property of elements in route to derive longer and more complete patterns from personal trajectory data. Fu et al. [32] proposed an algorithm based on the spatial and temporal adjacency relationship and constraint mechanism to mine frequent route pattern from personal trajectory. Other approaches used the distance threshold to apply clustering algorithms for dividing each trajectory into segments, then grouping these segments based on their geometrical features [34, 35]
Contrarily to approaches cited above, our work is novel in the following points:
Formal specification of frequent trajectory pattern mining problem directly on the raw data. Incremental algorithm named TrajGrowth, which works directly on the raw data, without any preprocessing step. Our method is more adapted for scalable environment, due to its one-step approach property.
In this section, we formally model and define the trajectory mining problem. Trajectories are expressed as sequences of GPS positions. We define below the basic definitions for trajectories mining. In the following, most of the definitions are based on an illustrative example of the trajectory database given in Fig. 1 and described in Table 1. Figure 1 shows an example of six distinct synthetic trajectories. The map used in this illustration is taken from Open Street Map repository.1
GPS Trajectoriy database TDB1
GPS Trajectoriy database TDB1
Illustrated trajectory database.
.
GPS Point. A GPS position is determined by:
Id is the identifier of the moving object,
Location is the spatial coordinates (longitude, latitude, altitude) of the object,
Timestamp is the time and date stamp of the data collection,
Velocity is measured in
Precision expresses the tolerance during the sampling of the location by the GPS sensor of the object’s location (which depends on the strength of the GPS signal (number of satellites covered)).
We define now the three fundamental objects of the trajectory mining process, respectively the segment connecting two GPS points, the trajectory, and finally the definition of a trajectory database.
.
Segment, Trajectory, Trajectory database. A Segment of a trajectory is defined as the direct link between two consecutive GPS points on the same trajectory. Trajectory
A TDB trajectory database is a set of trajectories
To compare trajectories, we need to define the distance between two segments. Let
which consists of assuming the greatest distance between the extremities. It conforms also with the standard Fréchet distance [36], the most adopted distance in the literature on trajectories [37, 38, 39]. We denote “dist” as the operator for the computation of geodesic distance.2
The main objective of our work is to discover all the frequent trajectories in a trajectories database. In order to accomplish such a task, we need to define when a trajectory is considered as sub-trajectory or contained in another trajectory.
.
Sub-trajectory, Relation
Let’s take the example of Fig. 1. Given the trajectory
In the following definition, we formulate the coverage and frequency functions, necessary to define a frequent trajectory (also called trajectory pattern).
.
Coverage, Support, Trajectory pattern. Let TDB be a trajectory database,
A trajectory
Consider the trajectories database given in Fig. 1 and trajectory
Now, we are able to define the problem of extracting frequent trajectories.
.
Extracting Trajectory Patterns (ETP). Given a minimum support threshold of
Solving ETP problem on
We point out that Definition 5 captures what is expected from mining trajectories database, by taking into account the precision related mainly on the censors accuracy, and the specificity of the sub-trajectory relation. Note that the definition is constructed on the raw data, without requiring any additional information.
From computational complexity point of view, EMT problem is at least as complex as the sequence mining problem, which is a special case where
Overall principle
TrajGrowth algorithm is based on the pattern-growth [40] approach, which is one of the most used strategies in data mining [41]. The particularity of the pattern-growth approach is the fact that it adopts the principle of divide and conquer to project and partition the database. The approach starts by searching the smallest frequent patterns
One of the most efficient algorithms adopting the pattern-growth algorithmic approach is the PrefixSpan algorithm [42] used for sequence mining. His general idea is to start by finding frequent items in the sequential database. Then, for each frequent item, it finds the trajectories of which it is a prefix, forming what is called the projected database. In fact, any frequent item in a projected database represents a valid candidate for the considered frequent item. The process follows the same strategy on the frequent patterns discovered. In other words, given a pattern
This is an incremental approach, by using the projected databases that take advantage of the current pattern to continue the search only on the portion that can be used to extend the pattern.
In our work, we define the smallest entity by a segment, which is frequent if it appears in more than minsup trajectories. We will follow the PrefixSpan approach, by adapting the definitions to our more semantically richer case, where we should mine on GPS points instead simple items.
.
Prefix, Projection, Suffix. Let us consider a given precision accuracy
Trajectory
Trajectory
There is no proper super trajectory
Trajectory
Consider
The next definition formulates the projected database that considers a portion of the initial database that could contain the GPS segments that will serve as extensions to the current pattern.
.
Trajectories projected database. Let TDB be a trajectory database and
TrajGrowth algorithm
We introduce below the TrajGrowth Algorithm 1, which iterates on the frequent segments
The frequent segments are computed with Algorithm 2 which consists of calling the sub-trajectory operator
Algorithm 3 shows how the projected database is generated from a given pattern. It simply considers the suffixes of the projections of each trajectory from the database to the input pattern.
Algorithm TrajGrowth finds progressively and incrementally frequent trajectory patterns, where each generated pattern is found by concatenating frequent segments (see Algorithm 2), for which contiguity is enforced by database projection operator ensured by Algorithm 3. Algorithm TrajGrowth is correct, since that it ensures the frequency constraint of the produced patterns regarding the main ETP problem (see Definition 5). Algorithm TrajGrowth is also complete since that it considers all of the possible frequent segments to extend the current frequent pattern Pat.
Complexity
Let TDB be the trajectory database (see Definition 2). We suppose that the number of trajectories is
Consequently, the number of patterns which can be produced is
Note that each frequent trajectory is found with a reasonable worst case complexity of
Illustrating example
Consider the trajectories database
In the following we consider the discretization-based approaches [21, 43, 44] introduced in Section 1, to be compared with our approach TrajGrowth. Figure 2 represents one of the methods of space partitioning. The result (see Table 2) of the discretization methods are given in the form of a sequence of tile identifiers. It consists of splitting the space into a set of hexagonal items. These methods allocate a point to an item or set of items according to its distance from the center of the hexagon.
Coordinates of the pavings centers
Results comparison of TrajGrowth with discretization based algorithms
VSP Discretization [16] illustrated in OSM.
In order to illustrate the behaviour of our approach and those of discretization, we give in Table 3 the frequent trajectories (
We made several experiments to compare and evaluate the mining algorithm TrajGrowth with the VSP method [16], which is an efficient representative of the state of the art discritization based approaches (see Section 2). We used several real and synthetic trajectory databases.
Synthetic benchmark trajectories
We propose a random path generation algorithm 4 on some region by using OpenStreetMap editable map of the world. Our synthetic path generation algorithm generates NbTraj trajectories, where each trajectory is generated as follows:
The algorithm begins by generating a random point
With a random direction from
The previous step is repeated until the whole path length Lmin is exceeded. At this time, the algorithm adds the new path to the trajectory database.
GPS databases trajectories
GPS databases trajectories
The parameters of this random generation are described in the first part of Table 4. We have generated three synthetic datasets, with respectively 100, 500 and 1 000 trajectories, having respectively 3 612, 10 354 and 22 845 segments.
The experimental results are based on 3 real databases of trajectories. The database “Go. Traj” is based on data provided by users of the GO! application in Brazil.5 The “Truck” database is a baseline generated by 50 trucks over several days.6 The “Porto” database is generated from data provided by Taxis across the city of Porto.7
Experimental protocol
The implementation of TrajGrowth was carried out in the java programming language. All experiments were conducted on an Intel Xeon E5-2680 @ 2.5 GHz with 128 Gb of RAM with a timeout “TO” of 14 Hours. We reported the number of frequent patterns and the overall running time.
We have implemented TrajGrowth algorithm as described in Section 4. The comparison has been done with VSP method [16] which is an efficient representative of the concurrent approaches. We have also implemented VSP in conformance with its algorithmic details. The parameter
Experimental result for different datasets with 
Experimental result for different datasets with 
Experimental result for different datasets with 
The experimental results are given in Figs 3–5, where we compared TrajGrowth algorithm with VSP [16]. NP and RT stand for respectively the number of patterns and the running time. Concerning the number of patterns, TrajGrowth finds usually much more patterns than VSP. This is an expected result, since that: (1) our approach finds patterns by matching directly with the trajectories; (2) VSP transforms the whole trajectories into tile blocks, and the patterns are consequently a sequence of tile blocks which is a rough approximation, which explains the lower number of patterns. Indeed, for the three values of radius: 30 meter, 60 meter and 100 meter, our algorithm TrajGrowth is significantly better than VSP, whatever the value of the minsup parameter of the algorithm: 2.5%, 5% and 10%. However, there are some cases where VSP founds more patterns. For instance, concerning the benchmark “Turk”, VSP algorithm discovers more patterns than TrajGrowth for the radius 30 meter, while the latter TrajGrowth is better for the radius 60 meter. For the synthetic benchmark “Traj 100”, VSP was also slightly better than TrajGrowth in term of NP. Whereas, TrajGrowth runs quickly. With the synthetic benchmark “Traj 1000”, the two algorithms are comparable. But, after analyzing the patterns found by VSP for radius 30 meter, we noticed that they are redundant and more precisely, there are numerous close patterns. TrajGrowth is less sensitive to redundancies, since that it performs mining directly on the input trajectories without no preprocessing step. Concerning the running time, TrajGrowth is faster than VSP for most of the results.
Conclusion
In this paper, we have proposed a new GPS trajectory mining approach that takes advantage of the incremental strategy of the well known Pattern-Growth algorithmic paradigm popular in itemset and sequence mining. We have adapted prefix, suffix, and projection operators on the trajectory database. We proposed TrajGrowth mining algorithm that uses the sub-trajectory operator and projected databases to find all the frequent trajectories in a trajectory database. This algorithm avoids the costly discretization step of standard approaches by using a precision parameter for which low value push down the mining process to find more precise patterns.
Experimenting TrajGrowth on synthetic and real databases, shows that our approach of trajectory mining is efficient by providing more precise and less redundant patterns than the state of the art VSP approach. The perspective is to implement our approach in parallel and exploit multicore architectures.We want also to integrate our approach in a constraint programming environment enabling to make our approch more declarative by taking into account various patterns constraints [45, 46].
Footnotes
Roughly speaking, in the worst case, we ignore the redundancies in the trajectories by assuming that each point GPS is a new point.
Acknowledgments
This research is supported by the Directorate General for Scientific Research and Technological Development (DGRSDT) of Algeria.
Author’s Bios
