Abstract
With the pervasiveness of GPS-enabled devices, a considerable number of GPS traces are accumulating continuously and unobtrusively in online communities. However, almost all current applications directly use raw GPS data, such as coordinates and time stamps, without interpreting these data. Thus far, online communities cannot offer much support to users in terms of recommending geospatial locations. Furthermore, because the data sets involved are large, users cannot browse each GPS trajectory individually. Therefore, users’ GPS trajectories must be mined and then classified as positive or negative. When the number of ratings for a place exceeds a certain threshold, the place is considered suitable for the user. By contrast, when the ratings for a place are mostly negative, this place is considered unsuitable for the user. When a user searches for the best place, the recommender system determines the user’s location (latitude, longitude) and then sends the best-rated destinations and the shortest routes between the user’s location and the destination to the user’s mobile device. Experiments were conducted in this study to determine the requisite similarity for GPS data points, the user’s information, and the best route for the user.
Keywords
Introduction
With developments in cellular communication and 4G network technology and the ever-increasing popularity of the Internet, the volume of data has exponentially increased. Recommender systems have attracted increasing attention for application in online advertising. These systems can help users to efficiently select their preferred products from numerous candidate products by analyzing the associated user-item matrix. To this end, recommender systems must accurately model user ratings using three variables (user, item, and rating) [10,29,55].
Traditional recommender systems, such as those that use content-based methods or collaborative filtering (CF), predict users’ preferences – often represented as numerical ratings – for new items based on their past ratings for other items. Personalization techniques are mainly used to generate customized recommendations according to users’ preferences and interests [23]. The recommender system is designed to filter unwanted information and provide specific results for a particular user [1]. In travel recommender systems [53,56], the proposed model learns the user’s preferences and generates attractive destinations according to the user’s interests. This study focuses on a travel recommender system that uses artificial intelligence techniques.
Most travel recommender systems focus on analyzing and comparing tourist destinations to help the user to select the most appropriate one [6,24,41]. However, users must know what attractions they can visit when they arrive at their chosen tourist destination. Unfortunately, information regarding points of interest and cultural and leisure activities is very dynamic and is often difficult to retrieve, analyze, and filter. A travel recommender system is a valuable and unique location-based social network (LBSN) service that recommends an activity and the location where this activity can be hold. The commonest types of travel recommender systems are content-based and CF systems. LSBN systems offer a particular user a set of venues (e.g., restaurants, shopping malls) within a geospatial range in consideration of this user’s personal preferences and the social opinions of other users. The location dimension bridges the gap between the physical world and the online social network service. Location is one of the most important components of the user context and provides considerable knowledge about this user’s interests and behaviors. It provides an opportunity to better understand the user in a social context based not only on their online behavior but also their mobility and activities in the physical world [9,11,68].
A travel recommender system provides a user with key points (KPs) from GPS trajectory data [60]. KPs refer to places that have been visited often by many similar people. They are identified by analyzing visitors’ GPS logs of these places. Common examples of KPs include offices, universities, schools, historical sites, museums, restaurants, parks, shopping malls, and places of worship. KPs can be mined to create a prediction model for a user’s future movements [3]. To use the predicted rating, GPS logs are collected from many similar users to determine liked places. Travelers can seek the recommendations of other visitors in the region through the travel application. Furthermore, inexperienced travelers can prepare a better travel plan by learning from other active visitors in the region.
With the pervasiveness of GPS-enabled devices, a considerable number of GPS traces has been accumulating unobtrusively and continuously in these online communities. However, almost all of these applications still directly use raw GPS data, such as coordinates and time stamps, without understanding these data. Therefore, these communities have thus far been unable to offer useful support to users in terms of interesting information regarding geospatial locations. Furthermore, a user cannot separately browse each GPS trajectory in a large data set of a city.
Therefore, the user’s GPS trajectories must be mined and then classified as positive or negative. When the number of ratings for a place exceeds a threshold, this place is considered suitable for the user. By contrast, when the ratings for a place are mostly negative, this place is considered unsuitable for the user.
When a user searches for the best place, the recommender system first determines the user’s location (latitude, longitude) and then sends the best-rated destination and the shortest routes between the user’s location and the destination to the user’s mobile device.
The present study proposes the use of location filtering to model multiple users’ travel sequences at various geospatial scales based on GPS trajectories. Furthermore, route filtering is used to provide users with a route they have never traveled before based on GPS trajectories. The proposed system explores multiple users’ preferences by mining their GPS logs. Specifically, classical travel sequences from multiple users’ location histories are mined in consideration of individuals users’ travel experiences, location interests, and transition probabilities between locations. The proposed system is evaluated using a large GPS data set collected from 107 users over a 1-year period in Beijing, China. Specifically, more than 5 million GPS points and GPS trajectories with a total distance exceeding 160,000 km were collected.
The remainder of this paper is organized as follows. Section 2 reviews related studies. Section 3 describes the methodology, architecture, and operation of the proposed system, and Section 4 reports and discusses the experimental results. Finally, Section 5 presents the conclusions of this study and discusses future research directions.
Literature review
Location recommenders
Many studies have investigated the mining of GPS logs and extraction of user behaviors from their movement history. Several studies have used check-in data from LBSNs as a useful source of information for understanding human mobility patterns [12,13,15,17,19,21,33,46,57]. Although these studies are not directly related to travel recommender systems, we use their insights to devise new and useful location-aware recommendations. Most studies on location recommendation have used some combination of CF and multiple recommenders that exploit users’ geographic preferences [18,20,22].
GeoLife is an LBSN service on Microsoft Virtual Earth (Microsoft Research Asia). GeoLife has a data set of users’ travel experiences, obtained through their GPS trajectories [63,65–67], and finds the most interesting locations and classical travel sequences in a given geospatial region. It also measures the similarity of a destination to the users’ location history. For example, to obtain exact locations at multiple scales over an extended period, a system must be developed to classify GPS data recorded over these locations [34]. Subsequently, a Markov model that uses multiple applications for a single user and collaborative scenarios must be provided. A list of interesting locations is mined through analysis of multiple users’ GPS information, and relational algebra operations and statistical operations are then used to rank these locations [4].
Mining location history
Text mining is an important method that has been applied in various domains, such as medical informatics, bioinformatics, and opinion mining from social media or product reviews. Text mining can be used for different purposes such as classifying a set of documents according to their type [28], analyzing the headlines of financial news [36], and finding similarities among short messages [27]. Searching for similar sequences of data is important in many applications, such as gene similarity determination; speech recognition applications; database and/or Internet search engines; handwriting recognition; spell checkers; and other biology, genomics, and text-processing applications [44,51]. Therefore, algorithms that can efficiently manipulate data sequences (in terms of time and/or space), even with modest approximation guarantees, are highly desirable.
Traditional term frequency–inverse document frequency
Term frequency–inverse document frequency (TF-IDF) is a technique used to eliminate the most common terms and extract only the most relevant terms from a corpus [44]. It removes noise and unrelated data during a corpus preprocessing step. In TF-IDF, a higher term frequency (TF, also called word frequency) for a certain vocabulary entry in the document set corresponds to a lower separate ability of the document property and a smaller weight value; conversely, a lower TF corresponds to a higher separate ability and larger weight value. The TF is expressed as
TF-IDF is expressed by the following equation:
As mentioned, a larger number of vocabulary entries in the document set indicates a lower separate ability of the document property. Vocabulary entry
Edit distance approach
The edit distance (also called the Levenshtein distance) [38] is an easy dynamic programming algorithm that addresses the problem of sequence matching based on the concept of a primitive edit operation. Primitive edit operations include the substitution of a symbol by another symbol, the deletion of a symbol, and the insertion of another symbol. Clearly, by using these three operations, an initial string A can always be transformed into a target string B, provided that both strings are constructed using the same alphabet. Mathematically, the Levenshtein distance between two strings a and b (of length
The edit distance may be computed in
Distances based on correlation coefficients
In recent years, studies have noted that when most biologists compare two objects, they think in terms of similarities rather than distances or dissimilarities [58]. It is intuitively easier to consider similarities rather than multidimensional points. Fortunately, the measurement of similarity is simple, and studies have proposed numerous similarity coefficients for this purpose. Two completely similar (i.e., identical) objects have the maximum similarity (i.e.,
Equation (3) gives a measure of the degree of correlation between random variables X and Y, and the correlation coefficient is in the range of
In statistics and probability theory, the distance correlation is a measure of statistical dependence between two random variables or two random vectors of arbitrary dimension that are not necessarily equal. It is zero if and only if the random variables are statistically independent as given by Eq. (4); by contrast, Pearson’s correlation coefficient can be zero for dependent random variables.
However, the abovementioned methods do not provide fine-grained solutions for mining user trajectories. This study uses comparisons built from users’ historical data to provide recommendations. Unlike previous approaches, the proposed system determines interesting locations by mining multiple users’ GPS data.
Recreational itinerary planning
Previously proposed methods calculate travel routes by considering historical statistical rules and used data mining technology to calculate paths [30,62]. However, they do not consider users’ personal preferences. A heuristic search algorithm has been used to solve the path-planning problem based on network nodes, such as through the provision of personalized path recommendations based on multiple users’ online GPS data [25]. A personalized path recommendation system based on a complete graph uses matrix transformation analysis to select the path [49]. A tourist interactive system evaluates the urban route considering the shortest distance and interesting tourist areas on mobile devices [5]. Another study integrated the Hungarian method and the branch-and-bound-method in operations research, the nearest neighbor in data mining, and rule-based inference to find the route with approximately the shortest distance [31].
The shortest path problem is a fundamental problem in research on graph theory and network optimization. Solving this problem mainly involves determining the shortest path between two nodes in a network. As a basic problem in optimal selection, the shortest path problem plays an important role for research in many fields, such as green navigation, computer networks, biological medicine, urban road planning, social networks, logistics planning, transportation systems, and emergency services.
For example, with the increasing urbanization in recent years, people are increasingly using vehicles for travel because of their convenience. Therefore, traffic and congestion have become unavoidable problems in daily life. In this context, the shortest path problem is becoming prevalent in traffic network analysis and has practical significance. For people to travel quickly and conveniently, the shortest path problem between two nodes must be solved. The present study extends a travel route planning algorithm by considering personal preferences.
System design
System architecture
Figure 1 shows the architecture of the proposed system. It consists of five modules: Data Collection, Feature Preprocessing, Location Filtering, Route Filtering, and Recommender System (RS). In modules 1–3, data are modeled, knowledge is extracted for use as an input to train a recommendation system, and data are processed offline. In modules 4 and 5, users can access the recommender system through the Internet in real time using a laptop, personal computer, or mobile device and submit a query (i.e., user features, category of route). The system will return a ranked list of locations or routes based on the user’s preference or route category. Herein, CF-based route recommendation is first discussed, followed by the proposed location filtering and route filtering modules for wayfinding.

Overall system design.
This system contains five modules: 1. Data Collection (GPS logs, user data, locations data), 2. Feature Preprocessing, 3. Location Filtering, 4. Route Filtering, and 5. Recommender System.
The GPS trajectory (log) data set is established using Microsoft’s GeoLife project [26]. This data set contains data of 182 users, collected over a roughly 5-year period (from April 2007 to August 2012). In this data set, a GPS trajectory is represented by a sequence of time-stamped points, each of which contains latitude, longitude, and altitude information. This data set contains 17,621 trajectories with a total distance of 1,292,951 km and a total duration of 50,176 h. Of these trajectories, 91% are logged in a dense representation (e.g., every 1–5 s or every 5–10 m per point). Table 1 shows sample GPS trajectories in the data set. The data fields are as follows:
Fields 1 and 2: latitude and longitude in decimal degrees, respectively.
Field 3: altitude in feet (−777 if not valid).
Field 4: number of days (with fractional part) that have passed since 12/30/1899.
Fields 5 and 6: date and time, respectively, as a string.
Field 7: ID of each user as an integer.
Sample GPS trajectories in the data set
Sample GPS trajectories in the data set
Each record in this data set represents a GPS point presented as
The GPS trajectory data set creates a weighted user-feature matrix by classifying all users in proportion from the 2016 Statista Survey [2,16]. The survey results show the number of Facebook users in the United States in 2016, sorted by age group and gender, as shown in Fig. 2 and Fig. 3. The age and gender matrix models are given by Eqs (5) and (6), respectively.

Weighted user-feature matrix (age group and gender).

Number of Facebook users in the United States (sorted by age group and gender).
The last text data set contains considerable location data for destinations such as restaurants and entertainment venues. Next, a location-feature matrix is constructed from the China and Beijing point-of-interest (POI) data sets in the China survey [7,14]. The survey results contain the locations of 49,941 stores.
This section discusses how data are modeled to obtain the location filtering and route filtering matrices as inputs for training the recommender system.
User feature
Text content tokenization and continuous feature discretization (e.g., ages) are performed, and all user features are encapsulated into a sparse user-feature matrix
The slope (or gradient) is a measure of the steepness or inclination of a road with respect to the horizontal. Equation (7) works for a single climb in the road. H is the height above elevation acquired from the GPS recorder, and D is the horizontal distance derived from the longitude and latitude acquired using the GPS recorder.
Bergwertung values and route difficulty classification
Bergwertung values and route difficulty classification
Weighted user-feature matrix (route difficulty classification)

Sample weighted user-feature matrices.
However, for a long road segment, other metrics are used for classifying climbs. This study uses the very commonly used Bergwertung “mountain classification” measure [48], given by Eq. (8). This measure is used to classify the difficulty of a mountain biking path. Mountain biking is usually performed on very rough and challenging off-road paths. The Bergwertung measure is used in competitions such as the Tour de France to categorize mountains in terms of their steepness and path distance. Table 2 lists climb classifications in terms of Bergwertung ranges. G1–G3 are suitable for amateur bikers. G4 and HC are suitable for experienced bikers who possess the skill and focus required to avoid dangerous situations.
Equation (9) is used to translate GPS log data into the matrix
Here, route difficulty classification is used to obtain the “similar rating” of a user. Resnick’s standard prediction formula is a commonly used algorithm for this purpose [45]. The similar rating of route

Merge sort algorithm
In the location feature, fields with the most related GPS data (e.g., lng, lat, name) are selected, and location data are translated to the matrix
Location filtering
Location filtering aims to obtain a similar location matrix by using users’ features and location features. This filtering provides a user with the top-N destinations in terms of similar rating and attractiveness in a given geospatial region. Top-N locations correspond to KPs visited most often by users in their historical path. Next, locations that would be interesting to a user are identified; for example, if
Location/content-aware collaborative filtering
The LCCF framework contains two modules: the Key Point Module (KPM) and Similar Location Module (SLM) (user-based).
Key Point Module
The KPM is used for detecting the geographical locations preferred by each user, that is, the KPs; this is because the place where a user likes to go is more important than other places. Examples of KPs include hotels, restaurants, historic places, homes, and schools. The input to this module is GPS points, and the output is KPs for each user. A KP is represented by KP (lat, long) as a virtual location computed from a sequence of subtrajectories (
To estimate the critical points distance, the average of GPS points around specific locations is used, as given by Eq. (11).
Similar Location Module (user-based)
The user-based SLM extracts similar locations from the geospatial region by using the Levenshtein distance. This distance is a string metric used for measuring the difference between two location coordinates. The input of this module is KPs obtained from the KPM, and the output is a set of similar locations. Similar locations are obtained through the following steps.
Sorting key points
In this stage, merge sort is used to sort KPs because it reads items sequentially, typically making
Finding similar users
The output of the merge sort algorithm is a sorted list containing the top-N KPs of different users. Similar users are then extracted through computation of the Levenshtein distances of all latitude and longitude points for each user, as given by Eq. (12).
Mathematically, the Levenshtein distance is the distance between two strings. In Eq. (12),
Calculating explicit rating of place
The explicit rating of a place is calculated to identify whether this place is where a user is most likely to go. The product of the user distance rating value and the weighted user-feature matrix is used to obtain the place rating matrix, as shown in Fig. 5. This is an

Place rating matrix.
The location-based SLM is different from the user-based one. This step focuses on locations that a user may be interested in. It uses the result of the previous step, which predicts a user’s next location coordinates. Each user’s KP is obtained from the KPM based on similar users in the SLM. The previous step uses the merge sort algorithm to find the top-N similar users from the place rating matrix. This is used as the input to this module. Similar locations are obtained through the following steps.

Recommended attractions matrix.
Finding similar locations
After the top-N similar users have been identified, similar locations are extracted through computation of the Levenshtein distance from the top-N key coordinates around users and local attraction coordinates, as given by Eq. (14).
Here, the Levenshtein distance is also used to calculate the two strings. In Eq. (14),
Extracting explicit coordinates of attractions
This step involves finding sightseeing spots that the user is most likely to visit. In this study, the location distance rating value was used to obtain a recommended attractions matrix, as shown in Fig. 6. This is an

Greedy route algorithm
Computing the similarity between items is an essential step in this recommendation system because the recommendation must be a suitable route for users and is based on users’ preferences. The basic idea of CF-based route recommendation is to recommend a route between the user’s current coordinate and the recommended attraction’s coordinate. After location filtering, the recommended attractions matrix is obtained, and suggested routes are generated in one of two ways. The steps involved in the method used by the proposed recommendation system are detailed as follows.
Finding the path point
In this process, the Top-K attractions in the recommended attractions matrix are selected for user recommendations. These Top-K attractions are ranked using a merge sort algorithm. Users can choose their preferred category.
Using a greedy routing algorithm to schedule a path
Greedy routing algorithms constitute an efficient solution to the traveling salesman problem (TSP), a well-known NP-hard problem. It asks the following question: How can a traveling salesman visit all N cities at least once and return to his starting location most efficiently? In this context, efficiency can be defined in terms of distance, time, or money. The TSP is important in real-world scenarios such as delivery networks and supply chains; therefore, it has been studied extensively, and many optimization algorithms have been proposed thus far [32,37,40]. Nonetheless, humans have repeatedly been shown to perform better than computer algorithms on certain problems [39,52].
The goal is to see how the result of a simple greedy algorithm compares to the optimal solution in realistic scenarios. The algorithm of interest is based on the following travel strategy: use the Cartesian coordinate system, with the user’s current position as the original point. Next, select a similar attraction to the user’s current position. This attraction will usually be visited by users because doing so is very easy and does not require much planning. To find the optimal solution, an exhaustive search algorithm is used. It considers all possible permutations of attractions and is thus guaranteed to find the optimal solution. Because the number of permutations is proportional to the factorial of the number of attractions (
Recommender system
The RS recommends the route with the highest rating and arranges routes in descending order of their ratings for each user in the resulting matrix. The route with the highest rating for the most users is considered the best route that can be recommended. Figure 1 shows that RS takes an input from the Route Filtering module and obtains users’ current location (latitude, longitude) from their mobile device as an input. The RS computes the distances between the user and all similar users to obtain the attractive route nearest to the user. By using the abovementioned information, the user can quickly gain an understanding of an unfamiliar city and plan their journeys with minimal effort. The resulting route is shown on the user’s mobile device on a map with directions from the user’s current location to the recommended location.
Description of the combined GPS user trajectory data set
Description of the combined GPS user trajectory data set

Example matrix of attraction recommendations.
In this section, the aforementioned processes are validated through experiments using actual GPS trajectory data, and the implementation and results of the experiments are discussed.
Data sets
To detect and obtain common daily patterns from large raw GPS data sets, this study used Microsoft’s GeoLife data set [26,59,64]. This data set was ideal for the present study because it contained the trajectories of 17,621 users in Beijing. Specifically, it included daily behavioral patterns and high-frequency events using the computation framework discussed in Section 3. However, because this data set did not contain user features, it was combined with feature data from the 2016 Statista Survey. Table 4 describes the combined data sets.
LCCF results
LCCF results
The combined data set was divided into a training set and a validation set using the random sampling technique. The sample data set contained data of 150 users; the validation set contained 30% of all data, and the training set contained the remaining data. The users had a normal distribution in terms of gender and age.
To recommend attractions to the user, the China and the Beijing POI data sets from the China survey [7,14] were used in this study. All experiments were performed on a computer with an Intel Core 3.40 GHz quad-core processor with a cache of 8 MB using the Windows operating system. The experiments were conducted using a program written in JAVA programming language. Figure 7 shows the weighted user-feature matrix with

Greedy routing algorithm.
During LCCF, the similarity is set to more than 60%, and the system makes a recommendation to the user, as detailed in Table 5. The Outdoor Recommendation System approach [59] was compared with LCCF. The Outdoor Recommendation System can only obtain the current user’s preference and cannot obtain a place that is not on their list. For example, user 001 likes route A and always goes by route A. The Outdoor Recommendation System can only recommend a shopping place along this route to the user. In the data set, when user 001 is at point
As described, the LCCF recommendation technique is based on TF-IDF and the Levenshtein distance. Furthermore, the weighted user-feature matrix is used for predicting a user’s preference for a particular place. The similar location recommended by LCCF with the GPS trajectory data set (Microsoft’s GeoLife) was the best (87.5%).
Route filtering
A greedy algorithm is an undirected graph (Fig. 8) that only estimates the minimum route between each point. The proposed system provides a more user-friendly path that is closer to the optimal route.
In a greedy routing algorithm, the user’s coordinate is used to determine the quadrant that has the most points, and then all the quadrant route numbers are summed up. These numbers are arranged in descending order, and the quadrant sequence is followed to make a recommendation when the numbers are the same. This step can reduce the time required to find the optimal route for the user. Finally, the result indicates the nearest locations (Fig. 8).
The result obtained in the previous stage is used to provide CF-based route recommendations and predictions for users. The verification rating was the best (71%). Figure 9 shows examples of paths recommended to user 40. Figure 10 shows a comparison of the rates obtained using the GF and greedy routing methods.

Experimental results of the greedy routing algorithm (Sample ID: User_40).

Experimental results obtained using GF and greedy routing methods.
This study developed a recommender system to recommend routes based on the mining of GPS traces of multiple users. A location recommendation system was proposed and evaluated using Microsoft’s GeoLife data set. In this system, the ratings of the user’s preferences are determined using location filtering and route filtering. The system ranks places according to multiple users’ preferences for different attractions. The location filtering module predicts unknown rates for unvisited attractions. Such information can help in understanding the correlation between users and locations and can be used to provide travel recommendations and guidance to tourists on their mobile devices. This study considered a user’s visit to a location as a link from this user to this location and weighted these links in terms of the user’s travel experiences in various regions. The Route Filtering module plays a crucial role by recommending a suitable route to users.
In real transportation networks, many stochastic factors can delay vehicles, including signals at intersections, traffic accidents, and weather changes. Therefore, future studies should investigate attainment of the path expected to require the least amount of time in a stochastic time-varying transportation network.
Several problems were not considered in this study. First, the focus of this research was more on the mining of GPS logs and the path-planning algorithm. It did not address how the efficiency of travel time can be improved. Second, the proposed system combines the surroundings of recommended locations. Third, the proposed system can only use social network data accessible from presently available mobile apps. In the travel recommendation framework, a multiple context model is considered for constructing customized route arrangements. The extracted context data are sorted, and features related to climate, traffic, security (e.g., parking, mobile connectivity, and health care), service facilities (e.g., vehicle service stations), and vacation spots (e.g., beaches, entertainment venues, and fishing zones) are identified. To solve the abovementioned problems and develop a better recommendation system, location recommendation methods should be further studied.
Conflict of interest
None to report.
