A pattern-growth approach for mining trajectories

Abstract

A Global Positioning System (GPS) trajectory is an ordered list of GPS points, which are approximate since they depend on the quality of the GPS sensor and on the covering satellites. Finding frequent sub-trajectories in a trajectory database reveals the most used paths, which encapsulate the behaviour of the moving objects. Most trajectory mining algorithms proposed in the literature require a preprocessing discretization step in which the plane is partitioned into tiles, so that classical sequential mining algorithms can be applied. However, this step is time consuming and unsuitable for real-time applications. In this paper, we propose an algorithm, named TrajGrowth, which works directly on the raw data, without any preprocessing step and without requiring laborious parameter setting. Instead of the costly discretization step of standard approaches, we use a single precision parameter: lower values drive the mining process towards more precise patterns. The experimental results show that the proposed approach is more precise than the discretization-based approaches, has a better processing time, and avoids redundant patterns.

Keywords

Data mining, pattern-growth, GPS trajectory, urban trajectories, frequent trajectory pattern, algorithms for mining

1. Introduction

GPS raw data has recently registered an impressive growth [1, 2], driven by the spread of GPS devices, particularly in smartphones. The number of applications that use this data has also grown, supported by the numerous needs that can be expressed on geolocated data.

1.1 Problem statement

GPS sensors make it possible to determine a user’s trajectory from a starting point to a destination point. GPS navigation applications generate a very large amount of GPS trajectories, which encapsulate user behaviors [3]. Note that a GPS point carries an uncertainty related to the accuracy of the sensor and the quality of GPS coverage (number of reachable satellites), so GPS-equipped vehicles provide only approximate tracking data. Every object (e.g., vehicle, person, …) provides a sequence of GPS points forming a GPS trajectory, and the set of GPS trajectories obtained from the objects is called a trajectory database. Mining a trajectory database consists in finding the set of sub-trajectories, also called frequent patterns, that appear in at least a given percentage of the database. This trajectory mining problem has numerous applications [4]:

In the context of urban trajectories, frequent sub-trajectories reveal the most used roads in the city. Such heavily used sub-trajectories can support urban decision making aimed at smoother city traffic. Extracting information from urban trajectories is essential for better city management. In general, this management is categorized into three groups: improving driver experience [5, 6, 7, 8], taxi or bus services [9, 10], and scheduling and control of the transport system [11, 12].

Mining trajectories of animals (e.g., bumblebees, birds, …) makes it possible to discover shared animal movement patterns.

In team sports events (e.g., soccer, baseball, …), frequent trajectories give insight about tactics used in the game.

1.2 Research issues

Searching for frequent sub-trajectories of moving objects raises many challenging research issues:

GPS points within trajectories have uncertainties related to the accuracy of the sensor. In the context of urban trajectories, there are efficient map-matching algorithms [13, 14, 15] that find the path in the map that is as close as possible to the GPS trajectory. This solution is viable only in the context of urban data, and moreover it requires a complete map of the city. For all of these reasons, we take up the challenge of mining trajectories directly on the raw data, without requiring any additional information.

Since GPS points have uncertainties, trajectory data contains redundancies: distinct raw GPS points may represent the same location. Redundant data should therefore be detected by taking into account the precision of the GPS sensors.

In practice, GPS trajectory databases are large. We need to design a mining algorithm that is as incremental as possible.

Current methods in the literature introduce a preprocessing step before mining trajectory data. This step consists in discretizing the two-dimensional space encompassing all the trajectories into a regular paving. It then transforms each GPS trajectory into the series of blocks (e.g., hexagons) that contain the track. Each trajectory becomes a simple sequence of items, which turns the GPS trajectory mining problem into a sequential pattern mining problem that can be tackled with dedicated algorithms [16], such as the PrefixSpan algorithm [17]. These methods require a preprocessing phase that has an immediate and significant impact on the quality and performance of the sequential mining that follows. The finer the paving, the better the precision of the mining, but the longer the running time. Conversely, if the paving is coarse, the mining process generates many redundant, imprecise paths. The challenge is to design a mining algorithm that does not require any costly preprocessing step and that handles precision through a single numeric tolerance value, which takes into account the precision of the GPS sensors.

Moreover, these state-of-the-art discretization-based algorithms are unsuitable for real-time mining applications, because a data stream is difficult to process with a two-step approach.

1.3 Contributions

In order to tackle these issues, the contributions of this paper are as follows:

1) We have specified mathematically the problem of frequent trajectory pattern mining, which makes it possible to address the research issues listed above. This formalization also makes it possible to show the correctness of the proposed mining algorithm.
2) We have proposed an incremental algorithm named TrajGrowth, which works directly on the raw data, without any preprocessing step and without requiring laborious parameter setting. Instead of the costly discretization step of standard approaches, we use a precision parameter: lower values drive the mining process towards more precise patterns.
3) We have evaluated our approach on both synthetic and real benchmark trajectories to show its effectiveness.

The remainder of the paper is organized as follows. In Section 2, we review some classical frequent trajectory mining methods. In Section 3, we formally specify the problem of mining frequent trajectory patterns. Section 4 details our TrajGrowth approach with an illustrative example showing its interest. Section 5 reports experiments. Finally, we conclude and draw some perspectives.

2. Literature review

Trajectory pattern mining (TPM) is an active research topic that has been studied in different contexts of the Internet of Things (IoT) and smart cities. Many techniques have been proposed to mine frequent patterns and moving behaviors from spatio-temporal trajectories. They have been applied in many domains, including animal movement [18], traffic trajectory data [5, 6, 7, 8, 9, 10, 11, 12], and personal travel history [19].

These techniques generally involve two steps, which are: trajectory preprocessing and pattern mining.

2.1 Trajectory preprocessing

Preprocessing takes spatio-temporal trajectories and the road network as input, then performs the following three tasks:

Filtering

This step cleans the trajectory dataset by removing noisy GPS points using a heuristic-based detection method, such as the one proposed in [20].

Trajectory Map-Matching

In this step, the system tries to find a corresponding road segment for each GPS point. Efficient map-matching algorithms such as those proposed in [13, 14, 15] find the path in the map that is as close as possible to the GPS trajectory.

Spatial discretization

This is a symbolization process: each trajectory is transformed into a sequence of symbols, where each symbol corresponds to a GPS point. Most methods in the literature discretize the two-dimensional space including all the trajectories into a regular paving. Giannotti et al. [21] applied a uniform grid partitioning technique to generate regions of interest (RoI), which are locations from which trajectory patterns can be extracted. Wang et al. [16] proposed two trajectory pattern mining algorithms, called VTPM-PrefixSpan and VTPM-GSP, that operate on the sequential database output by the spatial discretization step. In this step, the workspace is divided into uniformly sized cells whose size is provided by the user as a parameter. However, Kang et al. [22] have already highlighted the weakness of methods based on uniform cell partitioning: there is no accurate method for determining the right size of the grid cells. Indeed, if the cells are too large, objects with very different trajectories can be considered similar. Conversely, if the cells are too small, objects with similar trajectories can be considered as belonging to different groups. Finally, in the case of unevenly distributed trajectories, the two previous cases occur simultaneously.
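The sensitivity to the cell size can be seen on a toy example. The sketch below is not the method of [16] or [21]: it assumes planar coordinates, a square grid, and a hypothetical helper `discretize` that maps each point to the index of the cell containing it and collapses consecutive repeats.

```python
import math

def discretize(traj, cell_size):
    """Map a planar trajectory to the sequence of square grid cells it visits.

    Hypothetical helper: a cell is identified by its (column, row) index and
    consecutive points falling in the same cell yield a single symbol.
    """
    cells = []
    for x, y in traj:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        if not cells or cells[-1] != cell:
            cells.append(cell)
    return cells

# t1 and t2 run side by side, 0.2 units apart; t3 is about 14 units away from t1.
t1 = [(1.0, 4.9), (6.0, 4.9), (12.0, 4.9)]
t2 = [(1.0, 5.1), (6.0, 5.1), (12.0, 5.1)]
t3 = [(1.0, 19.0), (12.0, 19.0)]

fine_t1, fine_t2 = discretize(t1, 5.0), discretize(t2, 5.0)
coarse_t1, coarse_t3 = discretize(t1, 20.0), discretize(t3, 20.0)
```

With a cell size of 5, the two neighbouring trajectories `t1` and `t2` fall on either side of a cell border and receive different symbol sequences, while with a cell size of 20 the distant trajectories `t1` and `t3` collapse into the same single cell.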

In order to overcome this drawback, Manta et al. [23] proposed an RoI generation method that can adapt to irregular spatial data distributions, and integrated it into a trajectory pattern mining method.

Another drawback of the spatial discretization step is the high time complexity of the task. Masciari et al. [24] overcome this limitation by proposing a fast algorithm for partitioning incoming streams of trajectories using a sliding-window approach combined with a counting algorithm. The proposed strategy was compared against the method described in [21] by measuring execution times and the number of extracted regions and patterns. Motivated by this last shortcoming, in this paper we propose a scalable algorithm called TrajGrowth, which operates directly on the raw data without any discretization step.

2.2 Trajectory pattern mining

Discovering frequent patterns from the historical trajectories of objects is a very challenging task. Prior work can be traced back to Chang et al. [25], where the authors used the sequential pattern mining (SPM) method first introduced by Agrawal et al. [26]. Gidofalvi et al. [27] also adopted a sequential pattern mining algorithm to mine long trajectories of moving objects. Bayir et al. [28] used association rule mining to discover mobility profiles from the cell phones of users. However, sequential pattern mining algorithms (e.g., PrefixSpan [17], SPADE [29]) and association rule mining algorithms (e.g., FP-tree [30], Apriori [31], …) cannot be directly applied to trajectory data for several reasons [32]: these algorithms do not consider the spatial information, the contiguity between items, or the uncertainties due to GPS sensor accuracy. To avoid these problems, Lv et al. [33] proposed an algorithm called SCPM (Spatial Continuity based Pattern Mining), which considers the spatial continuity of the elements of a route to derive longer and more complete patterns from personal trajectory data. Fu et al. [32] proposed an algorithm based on spatial and temporal adjacency relationships and a constraint mechanism to mine frequent route patterns from personal trajectories. Other approaches use a distance threshold to apply clustering algorithms that divide each trajectory into segments, and then group these segments based on their geometrical features [34, 35].

In contrast to the approaches cited above, our work is novel in the following points:

1) A formal specification of the frequent trajectory pattern mining problem directly on the raw data.
2) An incremental algorithm named TrajGrowth, which works directly on the raw data, without any preprocessing step.
3) A method better suited to scalable environments, owing to its one-step approach.

3. Problem statement

In this section, we formally model and define the trajectory mining problem. Trajectories are expressed as sequences of GPS positions. We define below the basic definitions for trajectories mining. In the following, most of the definitions are based on an illustrative example of the trajectory database given in Fig. 1 and described in Table 1. Figure 1 shows an example of six distinct synthetic trajectories. The map used in this illustration is taken from Open Street Map repository.1

Table 1
GPS trajectory database TDB1

Trajectories (tid)	Location points (Pt.id, latitude, longitude)
Traj 1	Pt.id	pt1	pt2	pt3	pt4	pt5	pt6	pt7	pt8	pt9	pt10
	Lat	48.917513	48.917783	48.918013	48.917918	48.917936	48.917875	48.917813	48.917961	48.918282	48.918487
	Lon	2.377394	2.377455	2.377174	2.376589	2.375879	2.375365	2.374947	2.374755	2.374661	2.374481
Traj 2	Pt.id	pt11	pt12	pt13	pt14	pt15	pt16	pt17
	Lat	48.917425	48.917750	48.918170	48.918434	48.918585	48.918813	48.919036	–	–	–
	Lon	2.377556	2.377643	2.377510	2.377804	2.377873	2.377802	2.378145	–	–	–
Traj 3	Pt.id	pt18	pt19	pt20	pt21	pt22
	Lat	48.917699	48.917772	48.917982	48.918268	48.918555	–	–	–	–	–
	Lon	2.374200	2.374662	2.374642	2.374545	2.374496	–	–	–	–	–
Traj 4	Pt.id	pt23	pt24	pt25	pt26	pt27	pt28	pt29	pt30	pt31	pt32
	Lat	48.918884	48.918644	48.918484	48.918312	48.918108	48.918033	48.918022	48.917973	48.917924	48.917857
	Lon	2.378002	2.377819	2.377703	2.377638	2.377491	2.377055	2.376520	2.375914	2.375424	2.374962
Traj 5	Pt.id	pt33	pt34	pt35	pt36	pt37	pt38	pt39
	Lat	48.917508	48.917776	48.917850	48.917843	48.918080	48.917913	48.918189	–	–	–
	Lon	2.374925	2.375075	2.375463	2.375921	2.376604	2.377171	2.377691	–	–	–
Traj 6	Pt.id	pt40	pt41	pt42	pt43	pt44	pt45	pt46	pt47
	Lat	48.917484	48.917768	48.917976	48.917894	48.917928	48.917771	48.917730	48.917664	–	–
	Lon	2.377506	2.377561	2.377196	2.376513	2.375979	2.375389	2.374994	2.374651	–	–
Traj 7	Pt.id	pt48	pt49	pt50	pt51
	Lat	48.917560	48.917983	48.918314	48.918558	–	–	–	–	–	–
	Lon	2.374918	2.374702	2.374617	2.374568	–	–	–	–	–	–
Traj 8	Pt.id	pt52	pt53	pt54	pt55	pt56	pt57
	Lat	48.917921	48.917832	48.917736	48.917768	48.917748	48.917689	–	–	–	–
	Lon	2.376474	2.375967	2.375381	2.374962	2.374519	2.374102	–	–	–	–

Figure 1.

Illustrated trajectory database.

Definition 1.

GPS Point. A GPS position is determined by:

Id is the identifier of the moving object,

Location is the spatial coordinates (longitude, latitude, altitude) of the object,

Timestamp is the time and date stamp of the data collection,

Velocity is measured in km/h,

Precision expresses the tolerance of the GPS sensor when sampling the object’s location, which depends on the strength of the GPS signal (number of satellites covered).

We define now the three fundamental objects of the trajectory mining process, respectively the segment connecting two GPS points, the trajectory, and finally the definition of a trajectory database.

Definition 2.

Segment, Trajectory, Trajectory database. A segment of a trajectory is defined as the direct link between two consecutive GPS points of the same trajectory. A trajectory $T$ is a temporal sequence of GPS points $T=[T_{1},T_{2},\ldots,T_{n(T)}]$ , where $T_{i}$ is a GPS point for all $i$ , and $n(T)$ is the length of the trajectory.

A trajectory database TDB is a set of tuples $(\textit{tid},T)$ , where tid is a unique identifier and $T$ a trajectory.

To compare trajectories, we need to define the distance between two segments. Let $[a,b]$ and $[c,d]$ be two segments, the distance $d([a,b],[c,d])$ is defined as follows:

$\displaystyle d([a,b],[c,d])=\text{Max}(\text{Min}(\textit{dist}(a,c),\textit{dist}(b,d)),\text{Min}(\textit{dist}(a,d),\textit{dist}(b,c)))$ (1)

which retains the greatest distance between the extremities. It is also consistent with the standard Fréchet distance [36], the most widely adopted distance in the literature on trajectories [37, 38, 39]. We denote by “dist” the operator that computes the geodesic distance.2
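As a minimal sketch, Eq. (1) can be transcribed term by term as follows. The function names are hypothetical, and the haversine formula with a mean Earth radius of 6 371 km is only one possible choice for the geodesic operator "dist"; the text does not say which formula is used.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, an assumption of this sketch

def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def segment_distance(s1, s2, dist=haversine_m):
    """Literal transcription of Eq. (1) for segments s1 = [a, b] and s2 = [c, d]."""
    (a, b), (c, d) = s1, s2
    return max(min(dist(a, c), dist(b, d)),
               min(dist(a, d), dist(b, c)))

# Planar toy input: two parallel unit segments one unit apart,
# compared with the Euclidean distance instead of the geodesic one.
toy = segment_distance(((0.0, 0.0), (1.0, 0.0)), ((0.0, 1.0), (1.0, 1.0)), dist=math.dist)
```

With these inputs the direct pairing (a with c, b with d) contributes 1 and the crossed pairing contributes √2, so the expression evaluates to √2.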

The main objective of our work is to discover all the frequent trajectories in a trajectories database. In order to accomplish such a task, we need to define when a trajectory is considered as sub-trajectory or contained in another trajectory.

Definition 3.

Sub-trajectory, Relation $\preceq_{\epsilon}$ . Let $\epsilon$ be a given accuracy threshold. A trajectory $\alpha=[\alpha_{1}\ldots\alpha_{n(\alpha)}]$ is a sub-trajectory of $T=[T_{1},T_{2},\ldots,T_{n(T)}]$ , denoted by $\alpha\preceq_{\epsilon}T$ , iff there exist integers $1\leqslant j_{1}\leqslant\ldots\leqslant j_{n(\alpha)}\leqslant n(T)$ such that $d(\alpha_{i},T_{j_{i}})\leqslant\epsilon$ for all $i$ (see Eq. (1)). We say that $\alpha$ is a sub-trajectory of $T$ , or that $T$ is a super trajectory of $\alpha$ . A tuple $(\textit{tid},T)$ is a super trajectory of $\alpha$ if $\alpha\preceq_{\epsilon}T$ ; in that case we also write $\alpha\preceq_{\epsilon}\textit{tid}$ .

Let us take the example of Fig. 1. Given the trajectory $tr=[pt3,pt4,pt5]$ , we have $tr\preceq_{0}\textit{traj}1$ , because $tr$ is a subsequence of $\textit{traj}1$ . For a second trajectory $tr1=[pt29,pt30]$ , we have $tr1\preceq_{30}\textit{traj}1$ .

In the following definition, we formulate the coverage and frequency functions, necessary to define a frequent trajectory (also called trajectory pattern).

Definition 4.

Coverage, Support, Trajectory pattern. Let TDB be a trajectory database, $t$ a trajectory, $\epsilon$ a given precision and minsup a coverage threshold. The coverage of $t$ in TDB is the set of all TDB tuples that contain $t$ : $\textit{cover}_{\textit{TDB}}^{\varepsilon}(t)=\{(\textit{tid},T^{\prime})\in\textit{TDB}\mid t\preceq_{\varepsilon}T^{\prime}\}$ , and its support is defined by $\textit{sup}_{\textit{TDB}}^{\varepsilon}(t)=|\textit{cover}_{\textit{TDB}}^{\varepsilon}(t)|$ .

A trajectory $t$ is said to be frequent in a database TDB if its support is at least minsup: $\textit{sup}_{\textit{TDB}}^{\varepsilon}(t)\geqslant\textit{minsup}$ . In that case, $t$ is called a trajectory pattern.

Consider the trajectory database given in Fig. 1 and the trajectory $tr=[pt43,pt44]$ . We have $\textit{cover}_{\textit{TDB}1}^{30}(tr)=\{(\textit{traj}1,T_{1}),(\textit{traj}4,T_{4}),(\textit{traj}6,T_{6}),(\textit{traj}8,T_{8})\}$ . If we consider $\textit{minsup}=4$ , then $tr$ is a frequent trajectory pattern since $\textit{sup}_{\textit{TDB}1}^{30}(tr)\geqslant 4$ .
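The sub-trajectory relation, the coverage and the support can be checked on the coordinates of Table 1 with a short sketch. It makes two simplifying assumptions that are not stated in the text: the elements $\alpha_{i}$ and $T_{j_{i}}$ are compared as individual GPS points with a haversine distance in metres, and the indices $j_{i}$ are found by a greedy earliest match. The names `haversine_m`, `is_subtrajectory`, `cover`, `support` and `traj` are hypothetical.

```python
import math

def haversine_m(p, q):
    """Great-circle distance in metres between (lat, lon) points (mean Earth radius assumed)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6_371_000.0 * math.asin(math.sqrt(a))

def is_subtrajectory(alpha, traj_points, eps):
    """Greedy search for non-decreasing indices j_1 <= ... <= j_n within eps metres."""
    j = 0
    for point in alpha:
        while j < len(traj_points) and haversine_m(point, traj_points[j]) > eps:
            j += 1
        if j == len(traj_points):
            return False
    return True

def cover(tdb, t, eps):
    """Identifiers of the trajectories of tdb that contain t."""
    return {tid for tid, points in tdb.items() if is_subtrajectory(t, points, eps)}

def support(tdb, t, eps):
    return len(cover(tdb, t, eps))

def traj(lats, lons):
    return list(zip(lats, lons))

# Coordinates copied from Table 1 (TDB1).
TDB1 = {
    1: traj([48.917513, 48.917783, 48.918013, 48.917918, 48.917936, 48.917875, 48.917813, 48.917961, 48.918282, 48.918487],
            [2.377394, 2.377455, 2.377174, 2.376589, 2.375879, 2.375365, 2.374947, 2.374755, 2.374661, 2.374481]),
    2: traj([48.917425, 48.917750, 48.918170, 48.918434, 48.918585, 48.918813, 48.919036],
            [2.377556, 2.377643, 2.377510, 2.377804, 2.377873, 2.377802, 2.378145]),
    3: traj([48.917699, 48.917772, 48.917982, 48.918268, 48.918555],
            [2.374200, 2.374662, 2.374642, 2.374545, 2.374496]),
    4: traj([48.918884, 48.918644, 48.918484, 48.918312, 48.918108, 48.918033, 48.918022, 48.917973, 48.917924, 48.917857],
            [2.378002, 2.377819, 2.377703, 2.377638, 2.377491, 2.377055, 2.376520, 2.375914, 2.375424, 2.374962]),
    5: traj([48.917508, 48.917776, 48.917850, 48.917843, 48.918080, 48.917913, 48.918189],
            [2.374925, 2.375075, 2.375463, 2.375921, 2.376604, 2.377171, 2.377691]),
    6: traj([48.917484, 48.917768, 48.917976, 48.917894, 48.917928, 48.917771, 48.917730, 48.917664],
            [2.377506, 2.377561, 2.377196, 2.376513, 2.375979, 2.375389, 2.374994, 2.374651]),
    7: traj([48.917560, 48.917983, 48.918314, 48.918558],
            [2.374918, 2.374702, 2.374617, 2.374568]),
    8: traj([48.917921, 48.917832, 48.917736, 48.917768, 48.917748, 48.917689],
            [2.376474, 2.375967, 2.375381, 2.374962, 2.374519, 2.374102]),
}

tr = TDB1[6][3:5]    # [pt43, pt44]
tr1 = TDB1[4][6:8]   # [pt29, pt30]
covered = cover(TDB1, tr, 30)
```

Under these assumptions `covered` contains the identifiers 1, 4, 6 and 8. Trajectory 5 passes close to the same points but in the opposite direction, so the ordering constraint on the indices excludes it.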

Now, we are able to define the problem of extracting frequent trajectories.

Definition 5.

Extracting Trajectory Patterns (ETP). Given a minimum support threshold $\textit{minsup}\geqslant 1$ , an accuracy $\varepsilon$ and a trajectory database TDB, the problem of extracting trajectory patterns is to find all trajectories $T$ such that $\textit{sup}_{\textit{TDB}}^{\varepsilon}(T)\geqslant\textit{minsup}$ .

Solving the ETP problem on the $\textit{TDB}1$ database of Fig. 1 with $\textit{minsup}=3$ generates three patterns: $\{pt1\rightarrow pt2\rightarrow pt3\rightarrow pt4\rightarrow pt5\rightarrow pt6\rightarrow pt7\}$ , $\{pt7\rightarrow pt8\rightarrow pt9\rightarrow pt10\}$ , and $\{pt11\rightarrow pt12\}$ .

We point out that Definition 5 captures what is expected from mining a trajectory database, by taking into account the precision, which is related mainly to the sensor accuracy, and the specificity of the sub-trajectory relation. Note that the definition is constructed on the raw data, without requiring any additional information.

From the computational complexity point of view, the ETP problem is at least as hard as the sequence mining problem, which is the special case where $\epsilon=0$ . It is well known that the problem of sequence mining is NP-complete. As a consequence, ETP is solved in exponential time in the worst case. The purpose of our work is to propose an algorithm that takes full advantage of the database structure to have a reasonable behavior on real GPS trajectories. In the following section, we propose a trajectory mining algorithm that takes advantage of the sequential structure of the trajectory database.

4. TrajGrowth approach

4.1 Overall principle

The TrajGrowth algorithm is based on the pattern-growth approach [40], which is one of the most used strategies in data mining [41]. The particularity of the pattern-growth approach is that it adopts the divide-and-conquer principle to project and partition the database. The approach starts by searching for the set $M$ of smallest frequent patterns in the database. Each pattern in $M$ is then extended to find new patterns, which are added to $M$ . The process stops when no pattern in $M$ can be extended any further.

One of the most efficient algorithms adopting the pattern-growth approach is the PrefixSpan algorithm [42], used for sequence mining. Its general idea is to start by finding the frequent items in the sequential database. Then, for each frequent item, it finds the sequences of which it is a prefix, forming what is called the projected database. Any frequent item in a projected database is a valid extension of the considered frequent item. The process applies the same strategy to the frequent patterns discovered. In other words, given a pattern $P$ , PrefixSpan finds the sequences where $P$ is a prefix; the suffixes of these sequences form the projected sequential database $BP$ , which is analysed to identify the frequent items that will serve as extensions of $P$ . This process continues recursively until it reaches projected databases that contain no frequent items.

This approach is incremental: the projected databases take advantage of the current pattern to continue the search only on the portion of the data that can be used to extend the pattern.

In our work, the smallest entity is the segment, which is frequent if it appears in at least minsup trajectories. We follow the PrefixSpan approach, adapting the definitions to our semantically richer case, where we mine GPS points instead of simple items.

Definition 6.

Prefix, Projection, Suffix. Let us consider a given precision accuracy $\epsilon$ . Let $\alpha=[\alpha_{1}\ldots\alpha_{m}]$ and $\beta=[\beta_{1}\ldots\beta_{n}]$ be two trajectories:

Trajectory $\alpha$ is the prefix of $\beta$ iff $\alpha\preceq_{\epsilon}\beta$ and $\alpha$ appears at the beginning of $\beta$ .

Trajectory $\beta=[\beta_{1}\ldots\beta_{n}]$ is a projection of the trajectory $T$ w.r.t. $\alpha$ , denoted $T|^{\epsilon}_{\alpha}$ , iff

$\beta\preceq_{\epsilon}T$ where $\alpha$ is a prefix of $\beta$ .

There is no proper super trajectory $\beta^{\prime}$ of $\beta$ such that $\beta^{\prime}\preceq_{\epsilon}T$ and $\alpha$ is prefix of $\beta^{\prime}$ .

Trajectory $\gamma=[\beta_{m+1},\ldots,\beta_{n}]$ is called a suffix of $T$ w.r.t. $\alpha$ . With the standard concatenation operator concat, we write $\beta=\textit{concat}(\alpha,\gamma)$ .

Consider $\textit{traj}8$ the eighth trajectory in $\textit{TDB}1$ (Table 1). For instance, trajectory $\alpha=[pt52,pt53]$ is the prefix of $\beta=[pt52,pt53,pt54,pt55,pt56,pt57]$ and $\gamma=[pt54,pt55,pt56,pt57]$ is its suffix. Trajectory $\beta$ is the projection of the trajectory $\textit{traj}8$ w.r.t. $\alpha$ .

The next definition formulates the projected database that considers a portion of the initial database that could contain the GPS segments that will serve as extensions to the current pattern.

Definition 7.

Trajectories projected database. Let TDB be a trajectory database and $\alpha=[\alpha_{1}\ldots\alpha_{m}]$ some trajectory. The projected database (or $\alpha$ -projection with a tolerance $\varepsilon$ ) of TDB w.r.t. $\alpha$ , noted $\textit{TDB}|_{\alpha}^{\varepsilon}$ , is the set of all suffixes of the projections of TDB trajectories w.r.t. $\alpha$ .
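The projected database can be sketched on three trajectories of Table 1. As assumptions of this sketch, points are compared individually with a haversine distance in metres, the prefix is located by a greedy earliest match, and the suffix starts right after the point matched to the end of the prefix. The names `match_end` and `project_db` are hypothetical.

```python
import math

def haversine_m(p, q):
    """Great-circle distance in metres between (lat, lon) points (mean Earth radius assumed)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6_371_000.0 * math.asin(math.sqrt(a))

def match_end(alpha, traj, eps):
    """Index in traj of the point matched to the last point of alpha, or None."""
    j = 0
    for point in alpha:
        while j < len(traj) and haversine_m(point, traj[j]) > eps:
            j += 1
        if j == len(traj):
            return None
    return j

def project_db(tdb, alpha, eps):
    """Suffixes of the projections of the trajectories of tdb w.r.t. a non-empty prefix alpha."""
    projected = {}
    for tid, traj in tdb.items():
        end = match_end(alpha, traj, eps)
        if end is not None:
            projected[tid] = traj[end + 1:]
    return projected

# Trajectories 2, 4 and 8 of Table 1, as (latitude, longitude) pairs.
traj2 = [(48.917425, 2.377556), (48.917750, 2.377643), (48.918170, 2.377510), (48.918434, 2.377804),
         (48.918585, 2.377873), (48.918813, 2.377802), (48.919036, 2.378145)]
traj4 = [(48.918884, 2.378002), (48.918644, 2.377819), (48.918484, 2.377703), (48.918312, 2.377638),
         (48.918108, 2.377491), (48.918033, 2.377055), (48.918022, 2.376520), (48.917973, 2.375914),
         (48.917924, 2.375424), (48.917857, 2.374962)]
traj8 = [(48.917921, 2.376474), (48.917832, 2.375967), (48.917736, 2.375381), (48.917768, 2.374962),
         (48.917748, 2.374519), (48.917689, 2.374102)]

alpha = traj8[:2]  # [pt52, pt53]
projected = project_db({2: traj2, 4: traj4, 8: traj8}, alpha, eps=30)
```

With a tolerance of 30 m, trajectory 2 does not contain the prefix and is dropped, trajectory 8 keeps the suffix [pt54, pt55, pt56, pt57], and trajectory 4 keeps [pt31, pt32] because pt52 and pt53 are matched to pt29 and pt30.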

4.2 TrajGrowth algorithm

We introduce below the TrajGrowth algorithm (Algorithm 1), which iterates over the frequent segments $s$ of the input database Base, projects Base on each $s$ and obtains the projected database. If this database contains at least minsup trajectories, the current pattern is extended with $s$ , and the algorithm continues incrementally on the projected database with the extended pattern.

Algorithm 1: TrajGrowth ( $<\textit{Base}>$ , Pat, $\varepsilon$ , minsup)
Require: $<\textit{Base}>$ : trajectory database
    Pat: current pattern
    $\varepsilon$ : precision
    minsup: minimum support threshold
Ensure: FPatterns: new frequent patterns
if ( $\|<\textit{Base}>\|\geqslant\textit{minsup}$ ) then
    for all ( $s\in\textit{FrequentSegments}(<\textit{Base}>,\varepsilon,\textit{minsup})$ ) do
        $<\textit{ProjectedBase}>\leftarrow\textit{Project}(<\textit{Base}>,s,\varepsilon)$ ;
        if ( $\|<\textit{ProjectedBase}>\|\geqslant\textit{minsup}$ ) then
            $\textit{NewPat}\leftarrow\textit{Pat}+s$ ;
            $\textit{FPatterns}\leftarrow\textit{FPatterns}\cup\{\textit{NewPat}\}$ ;
            TrajGrowth ( $<\textit{ProjectedBase}>$ , NewPat, $\varepsilon$ , minsup)
        end if
    end for
end if

The frequent segments are computed with Algorithm 2, which applies the sub-trajectory operator $\preceq_{\epsilon}$ to identify the frequent segments.

Algorithm 2: FrequentSegments ( $<\textit{Base}>$ , $\varepsilon$ , minsup)
Require: $<\textit{Base}>$ : trajectory database
    $\varepsilon$ : precision
    minsup: minimum support threshold
Ensure: $<\textit{FreqSegments}>$ : frequent segments
$<\textit{FreqSegments}>\leftarrow\emptyset$
for all ( $\textit{Seg}\in$ “segments in $\langle$ Base $\rangle$ ”) do
    $f\leftarrow 0$
    for all ( $(\textit{tid},T)\in<\textit{Base}>$ ) do
        if ( $\textit{Seg}\preceq_{\epsilon}T$ ) then
            $f\leftarrow f+1$
        end if
    end for
    if ( $f\geqslant\textit{minsup}$ ) then
        $<\textit{FreqSegments}>\leftarrow<\textit{FreqSegments}>\cup\{\textit{Seg}\}$ ;
    end if
end for
return $<\textit{FreqSegments}>$

Algorithm 3 shows how the projected database is generated from a given segment. It simply collects, for each trajectory of the database that contains the input segment, the suffix of its projection on that segment.

Algorithm 3: Project ( $<\textit{Base}>$ , $s$ , $\varepsilon$ )
Require: $<\textit{Base}>$ : trajectory database
    $s$ : segment
    $\varepsilon$ : precision
Ensure: $<\textit{ProjectedBase}>$
$<\textit{ProjectedBase}>\leftarrow\emptyset$
for all ( $(\textit{tid},T)\in<\textit{Base}>$ ) do
    if ( $s\preceq_{\epsilon}T$ ) then
        Let $pr$ be the suffix of the projection of $T$ on $s$ ;
        $<\textit{ProjectedBase}>\leftarrow<\textit{ProjectedBase}>\cup\{(\textit{tid},pr)\}$ ;
    end if
end for
return $<\textit{ProjectedBase}>$

Algorithm TrajGrowth finds frequent trajectory patterns progressively and incrementally: each generated pattern is obtained by concatenating frequent segments (see Algorithm 2), whose contiguity is enforced by the database projection operator of Algorithm 3. Algorithm TrajGrowth is correct, since it ensures the frequency constraint of the produced patterns with respect to the ETP problem (see Definition 5). Algorithm TrajGrowth is also complete, since it considers all of the possible frequent segments to extend the current frequent pattern Pat.
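The three algorithms can be put together in a small runnable sketch. It is an illustration under stated assumptions rather than the authors' implementation: coordinates are planar and compared with the Euclidean distance, a segment is matched point by point with strictly increasing indices (so that every projection shrinks and the recursion terminates), the projected suffix keeps the point matched to the end of the segment (as in the example of Section 4.4), and a candidate segment whose two end points lie within $\varepsilon$ of an already kept segment is skipped as redundant. All function names are hypothetical.

```python
import math

def match_end(seg, traj, eps):
    """Index of the point of traj matched to the end of seg, or None if seg is not contained."""
    j = 0
    for k, point in enumerate(seg):
        while j < len(traj) and math.dist(point, traj[j]) > eps:
            j += 1
        if j == len(traj):
            return None
        if k < len(seg) - 1:
            j += 1  # assumption: strictly increasing indices
    return j

def frequent_segments(base, eps, minsup):
    """Algorithm 2, plus a redundancy filter that is an assumption of this sketch."""
    kept = []
    for _, traj in base:
        for seg in zip(traj, traj[1:]):
            f = sum(match_end(seg, t, eps) is not None for _, t in base)
            redundant = any(math.dist(seg[0], k[0]) <= eps and math.dist(seg[1], k[1]) <= eps
                            for k in kept)
            if f >= minsup and not redundant:
                kept.append(seg)
    return kept

def project(base, seg, eps):
    """Algorithm 3: keep, for each trajectory containing seg, the part from the matched end point."""
    projected = []
    for tid, traj in base:
        end = match_end(seg, traj, eps)
        if end is not None:
            projected.append((tid, traj[end:]))
    return projected

def traj_growth(base, pat, eps, minsup, found):
    """Algorithm 1: grow pat with every frequent segment of the current base."""
    if len(base) >= minsup:
        for seg in frequent_segments(base, eps, minsup):
            projected = project(base, seg, eps)
            if len(projected) >= minsup:
                new_pat = pat + [seg]
                found.append(new_pat)
                traj_growth(projected, new_pat, eps, minsup, found)
    return found

# Toy database: two nearly identical paths and one unrelated path.
base = [
    (1, [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]),
    (2, [(0.0, 0.1), (1.0, 0.1), (2.0, 0.1)]),
    (3, [(5.0, 5.0), (6.0, 5.0)]),
]
patterns = traj_growth(base, [], eps=0.5, minsup=2, found=[])
```

On this toy database the sketch returns three patterns: the first segment of trajectory 1, that segment followed by the second one, and the second segment alone. The segments of trajectory 2 are recognised as redundant copies, and trajectory 3 contributes nothing because its only segment has support 1.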

4.3 Complexity

Let TDB be the trajectory database (see Definition 2). We suppose that the number of trajectories is $|\textit{TDB}|=m$ and the number of GPS points is $n$ . In the case of a trajectory containing $n$ GPS points,3 the number of possible sub-trajectories is $O(2^{n})$ in the worst case: respecting the order of the $n$ available points, each point either appears or not, which gives a product of $n$ factors $2\times\ldots\times 2$ .

Consequently, the number of patterns that can be produced is $O(2^{n})$ in the worst case, and the memory consumption is also exponential in the worst case. To generate a frequent pattern (i.e., the assignment $\textit{NewPat}=\textit{Pat}+s$ ), Algorithm 1 starts by enumerating all the frequent segments $\textit{FrequentSegments}(<\textit{Base}>,\varepsilon,\textit{minsup})$ , which takes $O(n^{2}m)$ in the worst case. The number of generated segments is $O(n)$ . The projection on a segment, $<\textit{ProjectedBase}>=\textit{Project}(<\textit{Base}>,s)$ , has complexity $O(nm)$ . Overall, the processing required to generate one pattern is $O(n^{2}m)$ . Since the number of patterns in the worst case is $2^{n}$ , the complexity of algorithm TrajGrowth is $O(n^{2}m2^{n})$ . The exponential complexity thus originates from the exponential number of possible patterns, as in classical sequence and itemset mining.

Note that each frequent trajectory is found with a reasonable worst-case complexity of $O(n^{2}m)$ , which explains the success of our approach (and of pattern-growth-based approaches in general) when the number of frequent patterns is small.
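The $2^{n}$ count can be checked by brute force on a very small trajectory. The following sketch enumerates every order-preserving selection of points; the helper name is hypothetical and the enumeration includes the empty selection.

```python
from itertools import combinations

def order_preserving_subsequences(points):
    """All selections of points that respect the original order (empty selection included)."""
    subsequences = []
    for size in range(len(points) + 1):
        # combinations yields index tuples in increasing order, so the order is preserved
        for indices in combinations(range(len(points)), size):
            subsequences.append([points[i] for i in indices])
    return subsequences

points = ["p1", "p2", "p3", "p4"]
subsequences = order_preserving_subsequences(points)
count = len(subsequences)  # 2 ** 4 selections for 4 points
```

Each additional point doubles the count, which is why the number of candidate patterns, and not the per-pattern cost, dominates the worst case.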

4.4 Illustrating example

Consider the trajectory database $\textit{TDB}1$ (Table 1), with the three prefixes A $=$ [pt29, pt30], B $=$ [pt30, pt31], and C $=$ [pt31, pt32]. We have:

• $\textit{TDB}_{1}|_{A}=\{(1,[pt5,pt6,pt7,pt8,pt9,pt10])$ , $(4,[pt30,pt31,pt32])$ , $(6,[pt44,pt45,pt46,pt47])$ , $(8,[pt53,pt54,pt55,pt56,pt57])\}$
• $\textit{TDB}_{1}|_{AB}=\{(1,[pt6,pt7,pt8,pt9,pt10])$ , $(4,[pt31,pt32])$ , $(6,[pt45,pt46,pt47])$ , $(8,[pt54,pt55,pt56,pt57])\}$
• $\textit{TDB}_{1}|_{\textit{ABC}}=\{(1,[pt7,pt8,pt9,pt10])$ , $(4,[\,])$ , $(6,[pt46,pt47])$ , $(8,[pt55,pt56,pt57])\}$

In the following, we consider the discretization-based approaches [21, 43, 44] introduced in Section 1 and compare them with our approach TrajGrowth. Figure 2 represents one of the space partitioning methods, which consists of splitting the space into a set of hexagonal cells. These methods allocate a point to a cell or a set of cells according to its distance from the center of the hexagon. The results of the discretization methods are given in the form of sequences of tile identifiers (see Table 2).

Table 2
Coordinates of the paving centers

Cell-id	Centre latitude	Centre longitude
A	48.91941	2.3732
B	48.91707	2.3732
C	48.91824	2.37627
D	48.91707	2.3773
E	48.92057	2.38038
F	48.91707	2.37935
G	48.91824	2.38243
H	48.9159	2.38243
I	48.9159	2.3773

Table 3
Comparison of the results of TrajGrowth with discretization-based algorithms

Approach	Results
Discretization-based algorithm (Crisp Space Partition)	$\{I\rightarrow F\rightarrow D\}$
VSP Prefix-Span (Vague Space Partition)	$\{I\}$ , $\{I\rightarrow F\rightarrow D\}$ , $\{I\rightarrow G\rightarrow D\}$ , $\{G\rightarrow D\}$ , $\{G\rightarrow F\rightarrow D\}$ , $\{D\rightarrow B\}$
TrajGrowth (tid–Pt.id)	$\{$ pt3 $\rightarrow$ pt4 $\rightarrow$ pt5 $\rightarrow$ pt6 $\rightarrow$ pt7 $\}$

Figure 2.
VSP Discretization [16] illustrated in OSM.

In order to illustrate the behaviour of our approach and that of the discretization-based approaches, we give in Table 3 the frequent trajectories ( $\textit{minsup}=4$ and $\varepsilon=30$ ) found in the database of Fig. 2. Our method directly provides the GPS coordinates of frequent trajectories, because it works directly on the raw data. Discretization methods, on the other hand, produce trajectories expressed as sequences of paving blocks, which are necessarily less precise than those of our approach.

5. Experimental results and analysis

We ran several experiments to compare the TrajGrowth mining algorithm with the VSP method [16], an efficient representative of the state-of-the-art discretization-based approaches (see Section 2). We used several real and synthetic trajectory databases.

5.1 Synthetic benchmark trajectories

We propose a random path generation algorithm (see footnote 4) that works on a given region of the OpenStreetMap editable map of the world. It generates NbTraj trajectories, where each trajectory is generated as follows:

The algorithm begins by generating a random point $P_{i}$ from the region.

With a random direction from $P_{i}$, the algorithm selects the next point $P_{j}$ and the next road $P_{i}\rightarrow P_{j}$.

The previous step is repeated until the length of the path exceeds the minimum length Lmin. The algorithm then adds the new path to the trajectory database.
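The steps above can be sketched as follows. This is only a minimal illustration under stated assumptions: the toy road network (`NODES`, `ROADS`), the planar coordinates in metres, and the function names are hypothetical stand-ins for the OpenStreetMap data, and no GPS noise is modelled.

```python
import math
import random

# Hypothetical toy road network standing in for an OpenStreetMap region:
# node id -> planar (x, y) position in metres, and node id -> adjacent nodes.
NODES = {0: (0.0, 0.0), 1: (100.0, 0.0), 2: (100.0, 100.0),
         3: (0.0, 100.0), 4: (200.0, 0.0)}
ROADS = {0: [1, 3], 1: [0, 2, 4], 2: [1, 3], 3: [0, 2], 4: [1]}


def generate_trajectory(l_min, rng):
    """Random walk on the road network until the path length exceeds l_min."""
    current = rng.choice(list(NODES))      # random starting point P_i
    path, length = [current], 0.0
    while length <= l_min:
        nxt = rng.choice(ROADS[current])   # random direction: next point P_j
        length += math.dist(NODES[current], NODES[nxt])
        path.append(nxt)                   # follow the road P_i -> P_j
        current = nxt
    return path, length


def generate_database(nb_traj, l_min, seed=0):
    """Build a database of nb_traj random trajectories (lists of node ids)."""
    rng = random.Random(seed)
    return [generate_trajectory(l_min, rng)[0] for _ in range(nb_traj)]
```

For example, `generate_database(5, 250.0)` returns five paths, each of which follows existing roads and is longer than 250 metres.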

Table 4
GPS trajectory databases

Database name	Type	Number of trajectories	Number of GPS points	Number of segments	Area ( $km^{2}$ )	Source	Size
Traj_Syn 100	Simulated	100	3712	3612	472.94	Generated with TrajGen from Oran city (Algeria) using Epsilon $=$ 15 m	4 MB
Traj_Syn 500	Simulated	500	10854	10354	472.94	Generated with TrajGen from Oran city (Algeria) using Epsilon $=$ 15 m	5 MB
Traj_Syn 1000	Simulated	1000	23845	22845	472.94	Generated with TrajGen from Oran city (Algeria) using Epsilon $=$ 30 m	18 MB
Go traj	Real	163	18102	17939	2553148	Provided by buses and cars in Brazil that use an Android app called Go!	2 MB
Truck	Real	1099	93967	92868	2482.35	1099 trajectories of 50 trucks delivering concrete to several construction sites around the Athens metropolitan area in Greece	6 MB
Taxi portugal	Real	7700	381844	374144	532	Trajectories of 442 taxis running in the city of Porto, Portugal	30 MB

The parameters of this random generation are described in the first part of Table 4. We have generated three synthetic datasets, with respectively 100, 500 and 1 000 trajectories, having respectively 3 612, 10 354 and 22 845 segments.
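The segment counts in Table 4 follow from the point counts: a trajectory with k points has k − 1 segments, so a database contains (number of GPS points) − (number of trajectories) segments. The short check below, whose names are chosen only for illustration, confirms that every row of Table 4 satisfies this relation.

```python
# Rows of Table 4: name -> (trajectories, GPS points, segments).
TABLE_4 = {
    "Traj_Syn 100": (100, 3712, 3612),
    "Traj_Syn 500": (500, 10854, 10354),
    "Traj_Syn 1000": (1000, 23845, 22845),
    "Go traj": (163, 18102, 17939),
    "Truck": (1099, 93967, 92868),
    "Taxi portugal": (7700, 381844, 374144),
}


def segment_count(nb_trajectories, nb_points):
    """Each trajectory with k points contributes k - 1 segments."""
    return nb_points - nb_trajectories


for name, (nb_traj, nb_points, nb_segments) in TABLE_4.items():
    # Raises AssertionError if a row of the table were inconsistent.
    assert segment_count(nb_traj, nb_points) == nb_segments, name
```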

5.2 Real benchmark trajectories

The experimental results are based on three real trajectory databases. The “Go traj” database is based on data provided by users of the GO! application in Brazil (see footnote 5). The “Truck” database is a baseline generated by 50 trucks over several days (see footnote 6). The “Taxi portugal” database is generated from data provided by taxis across the city of Porto (see footnote 7).

5.3 Experimental protocol

TrajGrowth was implemented in the Java programming language. All experiments were conducted on an Intel Xeon E5-2680 @ 2.5 GHz with 128 GB of RAM, with a timeout (“TO”) of 14 hours. We report the number of frequent patterns and the overall running time.

We implemented the TrajGrowth algorithm as described in Section 4. The comparison was made with the VSP method [16], an efficient representative of the competing approaches, which we also implemented in conformance with its algorithmic details. The parameter $\varepsilon$ was tested with the values 30, 60, and 100 meters. The case $\varepsilon=30$ enforces a small distance between the produced patterns and their covers, whereas $\varepsilon=100$ allows a rough matching distance. The minsup parameter was tested with the values 10%, 5%, and 2.5%. With $\textit{minsup}=$ 2.5%, the trajectory miners may find patterns covering only a small portion of the dataset, whereas with $\textit{minsup}=$ 10%, patterns must cover a significant portion of the dataset. We limited $\textit{minsup}\leqslant$ 10% because, with a larger minsup, there is practically no chance of finding patterns.
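The protocol above amounts to a grid of runs over datasets, $\varepsilon$ values, and minsup values. The sketch below enumerates that grid and converts each relative minsup into an absolute number of trajectories. Rounding up with `math.ceil` is an assumption made here for illustration, since the rounding convention is not stated; the function names are likewise hypothetical.

```python
import itertools
import math

EPSILONS_M = (30, 60, 100)          # matching distance, in meters
MINSUP_PERCENT = (10.0, 5.0, 2.5)   # relative minimum support
DATASET_SIZES = {                   # number of trajectories, from Table 4
    "Traj_Syn 100": 100, "Traj_Syn 500": 500, "Traj_Syn 1000": 1000,
    "Go traj": 163, "Truck": 1099, "Taxi portugal": 7700,
}


def absolute_minsup(nb_trajectories, minsup_percent):
    """Smallest number of trajectories reaching the relative support.

    Assumption: the threshold is rounded up to the next integer.
    """
    return math.ceil(nb_trajectories * minsup_percent / 100.0)


def experiment_grid():
    """One (dataset, epsilon, minsup %, absolute minsup) tuple per run."""
    return [
        (name, eps, pct, absolute_minsup(size, pct))
        for (name, size), eps, pct in itertools.product(
            DATASET_SIZES.items(), EPSILONS_M, MINSUP_PERCENT)
    ]
```

Under this assumption, a minsup of 2.5% on the smallest synthetic database corresponds to only 3 trajectories, while 10% on the “Truck” database corresponds to 110 trajectories.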

Figure 3.

Experimental result for different datasets with $\textit{minsup}=$ 2.5%.

Figure 4.

Experimental result for different datasets with $\textit{minsup}=$ 5%.

Figure 5.

Experimental result for different datasets with $\textit{minsup}=$ 10%.

5.4 Results and discussion

The experimental results are given in Figs 3–5, where we compare the TrajGrowth algorithm with VSP [16]. NP and RT stand for the number of patterns and the running time, respectively.

Concerning the number of patterns, TrajGrowth usually finds many more patterns than VSP. This is expected, since (1) our approach finds patterns by matching directly against the trajectories, whereas (2) VSP transforms the trajectories into tile blocks, so that its patterns are sequences of tile blocks, a rough approximation that explains the lower number of patterns. In most cases, for the three radius values (30, 60 and 100 meters) and the three minsup values (2.5%, 5% and 10%), TrajGrowth finds significantly more patterns than VSP. However, there are some cases where VSP finds more patterns. For instance, on the “Truck” benchmark, VSP discovers more patterns than TrajGrowth for the radius of 30 meters, while TrajGrowth finds more for the radius of 60 meters. On the synthetic benchmark “Traj_Syn 100”, VSP is also slightly better than TrajGrowth in terms of NP, although TrajGrowth runs quickly. On the synthetic benchmark “Traj_Syn 1000”, the two algorithms are comparable. However, after analyzing the patterns found by VSP for the radius of 30 meters, we noticed that they are redundant: many of them are close to one another. TrajGrowth is less sensitive to redundancy, since it mines the input trajectories directly, without any preprocessing step.

Concerning the running time, TrajGrowth is faster than VSP in most cases.

6. Conclusion

In this paper, we have proposed a new GPS trajectory mining approach that takes advantage of the incremental strategy of the well-known pattern-growth algorithmic paradigm, popular in itemset and sequence mining. We have adapted the prefix, suffix, and projection operators to trajectory databases. The proposed TrajGrowth mining algorithm uses the sub-trajectory operator and projected databases to find all the frequent trajectories in a trajectory database. It avoids the costly discretization step of standard approaches by using a precision parameter: the lower its value, the more precise the patterns found by the mining process.

Experiments with TrajGrowth on synthetic and real databases show that our approach to trajectory mining is efficient and provides more precise and less redundant patterns than the state-of-the-art VSP approach. As future work, we plan to parallelize our approach in order to exploit multicore architectures. We also want to integrate it into a constraint programming environment, making it more declarative by taking into account various pattern constraints [45, 46].

Footnotes

1. https://www.openstreetmap.org/.

2. https://en.wikipedia.org/wiki/Geodesic.

3. Roughly speaking, in the worst case, we ignore the redundancies in the trajectories by assuming that each GPS point is a new point.

4. https://drive.google.com/open?id=1GoK4NFzY6oCgzT1vu7q0cb7XyyHDhd_Q.

5. https://play.google.com/store/apps/details?id=com.go.router.

6. http://isl.cs.unipi.gr/db/projects/rtreeportal.

7. http://www.geolink.pt/ecmlpkdd2015-challenge/whoweare.html.

Acknowledgments

This research is supported by the Directorate General for Scientific Research and Technological Development (DGRSDT) of Algeria.

Authors’ Bios

Rachid Mohammed Khatir is a PhD student. Before beginning his PhD, he received his Master's degree in Knowledge Engineering and Web Technologies from the University of Oran1 Ahmed Ben Bella in 2015 and was then ranked first in the doctoral admission exam of the Litio laboratory. His research interests include geo-information and spatial trajectory mining, algorithms, and pattern recognition.
Yahia Lebbah received the PhD degree in computer science from Ecole des Mines de Nantes, France, in 1999. He is currently a Professor at University Oran1, Algeria. His current research interest is the use of constraint programming and optimization techniques in global optimization, data mining, software validation and multicriteria decision making.
Rachid Nourine is an associate professor and Director of the National Institute of Telecommunication and ICT of Oran in Algeria. He received his B.Eng and Ph.D degrees in Computer Science from the Department of Computer Science of Oran1 Ahmed Ben Bella University in Algeria. His research interests are in the areas of remote sensing, image processing and data mining.

References

1. Zheng, Zhang, Xie and Ma W.-Y., Mining interesting locations and travel sequences from gps trajectories, in: Proceedings of the 18th International Conference on World Wide Web, ACM, 2009, pp. 791–800.

2. Elragal and El-Gendy, Trajectory data mining: Integrating semantics, Journal of Enterprise Information Management 26(5) (2013), 516–535.

3. Feng and Zhu, A survey on trajectory data mining: Techniques and applications, IEEE Access 4 (2016), 2056–2067.

4. Zheng and Zhou, Computing with spatial trajectories, 2011, 06–08.

5. Yuan, Zheng, Xie and Sun, T-drive: Enhancing driving directions with taxi drivers’ intelligence, IEEE Transactions on Knowledge and Data Engineering 25(1) (2013), 220–232.

6. Shaw A.A. and Gopalan, Finding frequent trajectories by clustering and sequential pattern mining, Journal of Traffic and Transportation Engineering (English Edition) 1(6) (2014), 393–403.

7. Mazimpaka J.D. and Timpf, Trajectory data mining: A review of methods and applications, Journal of Spatial Information Science 2016(13) (2016), 61–99.

8. Yao, Zhang, Zhu, Huang and Bi, Trajectory clustering via deep representation learning, in: Neural Networks (IJCNN), 2017 International Joint Conference on, IEEE, 2017, pp. 3880–3887.

9. Pan, Zhang and Wang, Prediction of urban human mobility using large-scale taxi traces and its applications, Frontiers of Computer Science 6(1) (2012), 111–121.

10. Bermingham and Lee, A general methodology for n-dimensional trajectory clustering, Expert Systems with Applications 42(21) (2015), 7573–7581.

11. Trasarti, Pinelli, Nanni and Giannotti, Individual mobility profiles: Methods and application on vehicle sharing, in: SEBD, 2012, pp. 35–42.

12. Monteiro de Lira, Renso, Perego, Rinzivillo and Cesario Times, The comewithme system for searching and ranking activity-based carpooling rides, in: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2016, pp. 1145–1148.

13. Brakatsoulas, Pfoser, Salas and Wenk, On map-matching vehicle tracking data, in: Proceedings of the 31st International Conference on Very Large Data Bases, VLDB Endowment, 2005, pp. 853–864.

14. Lou, Zhang, Zheng, Xie, Wang and Huang, Map-matching for low-sampling-rate gps trajectories, in: Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2009, pp. 352–361.

15. Newson and Krumm, Hidden markov map matching through noise and sparseness, in: Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2009, pp. 336–343.

16. Wang and Yan, Mining frequent trajectory pattern based on vague space partition, Knowledge-based Systems 50 (2013), 100–111.

17. Han, Pei, Mortazavi-Asl, Pinto, Chen, Dayal and Hsu, Prefixspan: Mining sequential patterns efficiently by prefix-projected pattern growth, in: Proceedings of the 17th International Conference on Data Engineering, 2001, pp. 215–224.

18. de Weerd, van Langevelde, van Oeveren, Nolet B.A., Kölzsch, Prins H.H. and de Boer W.F., Deriving animal behaviour from high-frequency gps: Tracking cows in open and forested habitat, Plos One 10(6) (2015), 29–30.

19. Lee J.-G., Han and Whang K.-Y., Trajectory clustering: A partition-and-group framework, in: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, ACM, 2007, pp. 593–604.

20. Bao, Ruan and Zheng, Interactive bike lane planning using sharing bikes’ trajectories, in: IEEE Transactions on Knowledge and Data Engineering, 2019, pp. 04–06.

21. Giannotti, Nanni, Pinelli and Pedreschi, Trajectory pattern mining, in: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2007, pp. 330–339.

22. Kang J.-Y. and Yong H.-S., Mining spatio-temporal patterns in trajectory data, Journal of Information Processing Systems 6(4) (2010), 521–536.

23. Sanni and Akbar, Trajectory pattern mining with multistage spatial partitioning, International Journal on Electrical Engineering & Informatics 9(2) (2017), 382–393.

24. Masciari, Shi and Zaniolo, Sequential pattern mining from trajectory data, in: Proceedings of the 17th International Database Engineering & Applications Symposium, ACM, 2013, pp. 162–167.

25. Chung J.D., Paek O.H., Lee J.W. and Ryu K.H., Temporal pattern mining of moving objects for location-based service, in: Proceedings of the 13th International Conference on Database and Expert Systems Applications, Springer-Verlag, 2002, pp. 331–340.

26. Agrawal, Srikant et al., Mining sequential patterns, in: ICDE, Vol. 95, 1995, 3–14.

27. Gidófalvi and Pedersen T.B., Mining long, sharable patterns in trajectories of moving objects, Geoinformatica 13(1) (2009), 27–55.

28. Bayir M.A., Demirbas and Eagle, Mobility profiler: A framework for discovering mobility profiles of cell phone users, Pervasive and Mobile Computing 6(4) (2010), 435–454.

29. Zaki M.J., SPADE: An efficient algorithm for mining frequent sequences, Machine Learning 42(1/2) (2001), 31–60. [Online]. Available: doi: 10.1023/A:1007652502315.

30. Han, Pei, Yin and Mao, Mining frequent patterns without candidate generation: A frequent-pattern tree approach, Data Mining and Knowledge Discovery 8(1) (2004), 53–87.

31. Agrawal, Srikant et al., Fast algorithms for mining association rules, in: Proc. 20th Int. Conf. Very Large Data Bases, VLDB, Vol. 1215, 1994, pp. 487–499.

32. Tian and Zhou, Mining frequent route patterns based on personal trajectory abstraction, IEEE Access 5 (2017), 11352–11363.

33. Yuan and Wang, Route pattern mining from personal trajectory data, J. Inf. Sci. Eng. 31(1) (2015), 147–164.

34. Zhang and Yuan, The gps trajectory data research based on the intelligent traffic big data analysis platform, Journal of Computational Methods in Sciences and Engineering 17(3) (2017), 423–430.

35. and Zhou, A gps location data clustering approach based on a niche genetic algorithm and hybrid k-means, Intelligent Data Analysis 23(S1) (2019), 175–198.

36. Alt and Godau, Computing the fréchet distance between two polygonal curves, International Journal of Computational Geometry & Applications 5(01–02) (1995), 75–91.

37. Zhu, Luo, Yin, Zhou, Huang J.Z. and Zhan F.B., Mining trajectory corridors using fréchet distance and meshing grids, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2010, pp. 228–237.

38. Buchin, Dodge and Speckmann, Similarity of trajectories taking into account geographic context, Journal of Spatial Information Science 2014(9) (2014), 101–124.

39. Yuan, Sun, Zhao and Wang, A review of moving object trajectory clustering algorithms, Artificial Intelligence Review 47(1) (2017), 123–144.

40. Han and Pei, Mining frequent patterns by pattern-growth: Methodology and implications, ACM SIGKDD Explorations Newsletter 2(2) (2000), 14–20.

41. Liu and Özsu M.T., in: Encyclopedia of Database Systems, Springer New York, NY, USA, Vol. 6, 2009.

42. Pei, Seqpatternminer: Mining sequential patterns by prefix-projected growth, in: Proc. Int. Conf. on Data Engineering, 2001, pp. 2–6.

43. Zhang, Han, Shou and La Porta, Splitter: Mining fine-grained sequential patterns in semantic trajectories, Proceedings of the VLDB Endowment 7(9) (2014), 769–780.

44. Khoshahval, Farnaghi and Taleai, Spatio-temporal pattern mining on trajectory data using arm, International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences 42 (2017), 395–399.

45. Kemmar, Lebbah, Loudni, Boizumault and Charnois, Prefix-projection global constraint and top-k approach for sequential pattern mining, Constraints 22(2) (2017), 265–306. [Online]. Available: doi: 10.1007/s10601-016-9252-z.

46. Feng, Zhang Y.-R., Jin and Zhang, A compromise-negotiation framework based on game theory for eliminating requirements inconsistency, Tehnički Vjesnik 22(5) (2015), 1085–1092.
