Abstract
Traditional route generation algorithms may result in many routes that few drivers choose in reality, whereas other route generation algorithms need to determine thresholds for route set generation but lack data support. To avoid invalid route generation, reduce computation time, and provide a scientific basis for the generation of navigation routes and traffic assignment, this paper confines the route set size by mining the spatial distribution range of route sets and the threshold of factors affecting route set generation. Global positioning system data are used to determine hotspots based on a hotspot origin–destination identification method. The route set spatial distribution range is mined by the standard deviational ellipse. Finally, the factors affecting the generation of route sets are selected, and the classification and regression trees algorithm is used to mine their thresholds. The results show that the spatial distribution range of route sets is elliptical, and the threshold values of the number of turns per kilometer for medium and long travel distances are 1.794 and 2.508, respectively. The maximum travel time per kilometer for long travel distance is 4.773 min. The maximum numbers of road intersections for short, medium, and long travel distance per kilometer are 1.648, 0.984, and 0.592, respectively. The implications of results on reducing the search range and time of the route set and their applications to traffic network design and route navigation are also discussed.
A route set contains all alternative routes that can meet the needs of travelers. The main purpose of the route set generation algorithm is to obtain a set of feasible routes between the origin and destination (OD), which may not be chosen by travelers in practice. To identify the trajectory between two specific points or areas, it is necessary to set a constraint to filter out the trajectories and avoid the chaotic trajectories. The historical global positioning system (GPS) trajectory data carry rich information on travel behavior and traffic conditions and have been used for a variety of research topics, such as travel time estimation ( 1 , 2 ), driving behavior analysis ( 3 ), taxi driver behavior modeling ( 4 – 6 ), and fuel efficiency and safety modeling ( 7 ). Such trajectory data between OD pairs are the embodiment of drivers’ experience, making it possible to extract a reasonable route set from many generalized routes, considering travel habits, experience, and preferences. Mining the temporal and spatial characteristics of the route set can reduce computation time and improve the efficiency of route planning and route navigation.
At present, existing studies mainly focus on the modeling of route choice behavior, and the data used mainly include stated preference (SP) data and GPS data. Using SP data, scholars have proposed many route choice models, trying to estimate the influence of different factors on route choice behavior ( 8 , 9 ). However, such SP data need to be surveyed carefully. Moreover, the information obtained from surveys is usually subjective and not reliable enough. The use of GPS equipment to collect travelers’ trajectories and analyze drivers’ route choice behaviors takes less effort and is more reliable than the use of traditional surveys. Gong et al. ( 10 ) used GPS location data recorded by 4,900 GPS devices in Guangzhou to study drivers’ route choice behaviors in the urban road network and found that distance, road category, intersections, turns, and traffic signal control were the main factors that affected route choice. Yuan et al. ( 11 ) used a variance entropy-based clustering approach to discuss the distribution of vehicle travel time and proposed a two-step path algorithm to mine the optimal route. Pan et al. ( 12 ) extracted conventional driving route patterns from historical travel trajectories and focused on exploring anomaly route behaviors that were significantly different from conventional patterns. Papinski and Scott ( 13 ) proposed a geographic information system (GIS)-based route choice analysis toolkit, which considered more than 40 variables describing route characteristics. The 237 observed routes were compared with the shortest routes and it was found that the distance and travel time of the observed routes that drivers selected were much longer than those of the shortest routes. Vacca and Meloni ( 14 ) used portable GPS devices to track participants, studied the route choice behavior between the same OD pair, and found the main factors affecting drivers’ route choice, including the number of traffic lights (per kilometer), percentage of highways, and time perception.
Travelers choose a route in the route set based on their travel preferences. Routes that are not in the route set are not selected. The conceptual rationale contemplates that choice set formation and the choices from considered options are distinct mental processes that require separate modeling, as the first is usually non-compensatory and the second is largely compensatory ( 15 ). Therefore, the determination of route sets is the premise of route choice behavior.
Popular route set generation methods include k-shortest path, link penalty, branch and bound ( 16 ), stochastic simulation, and random walk. When generating a route set, some generation methods require constraints. For example, when the branch and bound method is used to generate the route set, it needs directional, temporal, loop, similarity, and left-turn constraints. The thresholds of these behavior constraints are usually assumed and lack data support. With the wide use of GPS data, it is possible to mine the factor thresholds of the route set.
The continuous improvement of GPS accuracy also provides a promising prospect to directly use GPS historical trajectory data as the route set. For example, Qin et al. ( 17 ) directly composed the route set with the route generated by GPS data, calculated the route chosen probability, and identified the anomaly route. However, the huge GPS historical trajectory data may enormously increase the calculation time. Therefore, it is necessary to study the choice set size and route set spatial distribution range, control the number of routes in the route set, and define the research area within a reasonable range. Bekhor et al. ( 18 ) used the Sioux Falls network and the Winnipeg network as examples to illustrate the influence of route choice set size on equilibrium solution and algorithmic performance. Dhakar and Srinivasan ( 19 ) modeled the route choice decision with choice set sizes of 5, 10, and 15 and concluded that the choice set size does not affect model fitting. Huang and Levinson ( 20 ) calculated the weighted root-mean squared error value for route sets of different sizes and concluded that the appropriate route set size is 60. Zhang et al. ( 21 ) constructed a metro passengers’ semi-compensated multinominal logit route choice model, which endogenously calibrated utility function parameters and thresholds. Three routes were chosen based on the standards of the shortest travel time and the minimum number of transfers; these constituted a route set. The model calculated the value of each route in the choice set and determined the choice probability of each route in the route set. Hassan et al. ( 22 ) tested the multinominal logit and nested logit models with different route set sizes under different parameter conditions. The results indicated that the goodness of fit of the model decreased with the increase of the route set, whereas the fitting time increased until the size of the route set hit the peak at 25 or 30.
Route set generation is a prerequisite for transportation network planning. A reasonable route set can reduce route search time in the traffic distribution process at the road network planning stage and can provide a scientific basis for the generation of the navigation route, by mining the thresholds of the travel route set between OD pairs. Although some research has studied the contributing factors of route choice behavior, studies on the factors affecting route set generation are limited. Moreover, the thresholds of the affecting factors, which are important for research on route set generation method, are barely studied. In the traffic assignment stage, it is impossible to enumerate all routes, so the route set needs to be confined to a reasonable size. The computation time needs to be as low as possible, which also brings requirements on the route set size. Existing research on the route set size mostly focuses on its impact on model computation time and goodness of fit, but the research on how to confine the route set is very limited. Although GPS data contain rich information, the data have been mainly used as a supplement of the generated route set or data support for route choice behavior, whereas previous studies (23, 24) have not noticed that the GPS data contain a large amount of route set information. Therefore, based on the GPS data, this paper extracts the actual travel routes to analyze the temporal and spatial distribution characteristics of the travel route set, to discover route set spatial distribution range, and to mine the threshold of each factor affecting the generation of route sets.
This article is divided into five sections. The second section introduces the data processing in detail; specifically, how to determines hotspots based on the hotspot OD identification method. The third section uses a standard deviational ellipse (SDE) to fit the routes between OD pairs and determine the route set size. The fourth section selects the factors affecting the generation of route sets, mines its threshold through the classification and regression trees (CART) algorithm, and ranks the importance of factors. Finally, the fifth section provides the conclusions of the study.
Data
The GPS data used in this article were collected from 11,602 taxis in Xi’an, China, from September 10 to September 17, 2019. Most taxi drivers in Xi’an are familiar with the city and do not use software to guide their routes, which provides an opportunity to analyze the route choice in reality. The GPS data attributes included taxi operating time, license plate number, location coordinates, driving angle, instantaneous speed, and loading state that indicated whether the taxi was carrying passengers or not. The recording time interval of GPS data was 30 s, which resulted in more than 50 million data records each day. The geographic coverage of this study was the main urban area of Xi’an, of which the coordinates range is 34.14°–34.46°N and 108.7234°–109.1025°E. GPS data outside of this range were excluded.
After confirming the study area, pick-up and drop-off hotspots were identified and the frequent OD pairs between these hotspots were extracted. The goals of this step were to identify the areas with high density of pick-up or drop-off events, lay the foundation for hotspot OD matching, and secure a sufficient number of passenger-carrying trips between the same OD pair (from pick-up to drop-off) ( 25 ).
Based on the change of loading state between two adjacent GPS data records, the pick-up and drop-off points can be identified. Taking the taxi GPS data of September 10, 2019 in Xi’an as an example, from 50 million trajectory data generated by taxis in the whole city, after deleting outliers and the data when the boarding time is not on the same day, more than 499,000 pairs of pick-up and drop-off points in the research region were obtained. Then, we used the density-based spatial clustering of applications with noise or DBSCAN algorithm to determine the pick-up and drop-off hotspots ( 26 ). Two parameters of the algorithm need to be determined: cluster neighborhood radius (Eps) and minimum density threshold (MinPts). In this paper, the K-distance method was used to determine reasonable Eps.
To determine the value of Eps, the Euclidean distance of each point in the dataset and its nearest point was calculated. Then, K-distance was plotted, in which y-axis represented the distance and x-axis the data sorted in ascending order. The inflection point of the curve was determined as Eps of the dataset.
Taking the point dataset on Tuesday, September 10, 2019 as an example, the K-distance changed significantly around 0.00211. Therefore, 0.00211 was selected as the Eps.
MinPts indicates the density of pick-up or drop-off points in each cluster. In this paper, with the given Eps and assuming MinPts, clustering results of pick-up (drop-off) points were obtained. According to the clustering results under different MinPts, reasonable MinPts can be determined. With different MinPts, the clustering results of the pick-up (drop-off) points are shown in Table 1.
Clustering Results of the Pick-Up (Drop-Off) Points
Note: MinPts = minimum density threshold.
To obtain as many clusters as possible and to ensure each cluster has relatively little noise data, the pick-up points were analyzed. In Table 1, the proportion of noise data indicate that the value of MinPts should be chosen from among values 100, 200, and 300. To nail down which MinPts should be chosen, the spatial distribution of different MinPts values is shown in Figure 1. It is clear that when the value of MinPts was set to 300, the spatial distribution appeared to be more reasonable. Then, the points of the whole day were clustered into 148 clusters (Table 1).

DBSCAN clustering results: (a) DBSCAN clustering results when the value of MinPts is 100, (b) DBSCAN clustering results when the value of MinPts is 200, (c) DBSCAN clustering results when the value of MinPts is 300, and (d) DBSCAN re-clustering results when the value of MinPts is 300.
A hotspot OD identification method was proposed, with the goal of providing a sufficient sample size for the study. It consisted of the following two steps: (1) For each pick-up event, the corresponding drop-off point and the trajectory data in-between are searched; (2) the drop-off points are re-clustered. The DBSCAN algorithm was used for the re-clustering of pick-up points ( 25 ). After re-clustering drop-off points, more than 40 OD pairs including 6,000 routes were found, as shown in Figure 1.
Spatial Distribution Range of Route Sets
Before using GPS trajectory data to generate the route set, the spatial distribution range of the route set needs to be set and the data between OD pairs or two specific areas need to be filtered out to ensure that the trajectory is not messy. It is necessary to study the spatial distribution range of the route set, which is an important index to confine the route set in space, as well as to limit the scale of the route set and the detour distance to a certain extent. The hotspots identified before were used as the origins and destinations. The effective trajectory must pass through the origin (or the area near the origin) and the destination (or the area near the destination) in sequence. The entire range between the origin and destination is defined as the OD area. The trajectory shape of the route set between the 40 OD pairs chosen for the study presents a similar “ellipse” distribution trend. The trajectory distribution near the origin and destination presents a phenomenon similar to the focus in an ellipse, and the trajectory in the area is distributed as the boundary of the ellipse.
To further understand the spatial distribution range of the trajectory data, SDE was used to fit the travel route distribution characteristics. SDE is a spatial statistical method that can accurately describe the spatial distribution characteristics of data with four basic parameters: mean center, the angle of rotation, semi-major axis, and semi-minor axis (
27
). The mean center represents the center position of the entire data. The angle of rotation refers to the angle from true north to the major axis of the ellipse, which reflects the main direction of the distribution. The semi-minor axis of the ellipse represents the data distribution range, and the semi-major axis represents the data distribution direction. The coordinates of the points in the dataset are (
Among them, xi and yi are the coordinates of features i,
The calculation method of the angle of rotation is shown in Equation 3.
Among them,
The calculation methods of the standard deviations of the x-axis and y-axis are shown in Equations 4 and 5, respectively.
SDE fitting was performed in ArcGIS. Eight OD pairs were chosen as examples and the fitting results are shown in Table 2 and Figure 2. By fitting SDE to the trajectory data, the spatial route distribution range between the OD pairs is found to be elliptical. If one segment composing a route is not inside the ellipse, the route is not inside the ellipse.
Standard Deviational Ellipse Fitting Results

Standard deviational ellipse fitting results of different origin–destination (OD) pairs: (a) Xi’an Institute of Physical Education (D)–Xiaozhai Station (O), (b) Xi’an North Station (O)–Xi’an Station (D), (c) Huaban Life Zone 2 (O)–Xiaozhai Station (D), (d) The First Affiliated Hospital of Jiaotong University (O)–Xiaozhai Station (D), (e) North Street (O)–Xi’an Station (D), (f) The Bell Tower (O)–Xi’an Station (D), (g) Daminggong Primary School (O)–City Library (D), and (h) Xi’an Railway Station (O)–Xijing Hospital (D).
Mining the Threshold of Factors Affecting the Route Set Generation
Selection and Coding of Factors Affecting the Route Set Generation
As the data were from GPS devices, the individual and household attributes of travelers were excluded in this study and the road route attributes were examined. Based on previous studies ( 19 , 28 ), eight attributes, including the number of turns (both left and right turns) per kilometer, the number of road intersections per kilometer, travel time per kilometer, proportion of expressway in a route (EP), proportion of arterial road in a route (AP), proportion of secondary road/branches in a route (SBP), loop route distance, and travel distance type, were selected to mine travel route set characteristics. The classification of each factor and its corresponding coding are shown in Table 3. The classification of different variables was determined according to their distribution. Taking travel distance as an example, the first quartile is 1.783 and the third quartile is 4.702.
Encoding Factors Affecting the Generation of Route Set
Note: na = not applicable.
Establishment of the Decision Tree Model
The number of turns per kilometer (both left and right turns), the number of road intersections per kilometer, travel time per kilometer, EP, AP, SBP, and loop route distance were set as the independent variables, and travel distance type was set as the dependent variable. In this modeling process, only the routes in the specific ellipse studied before were selected as a sample, which filtered out more than 6,000 pieces of data. Of the sample data, 80% were randomly selected as training samples for decision trees and the remaining 20% were used as test samples to test the model results. The tree’s maximum depth was set to six layers. The decision tree will stop growing when the number of objects in the parent node is less than 2% of the total number of objects or the number of objects in the child node is less than 1% of the total number of objects. The decision tree results are shown in Figure 3. The classification results of training samples and test samples are shown in Table 4.
Classification Results of the Decision Tree
Note: na = not applicable.

Results of decision tree construction.
As seen in Table 4, in the decision tree model, the classification accuracy of the training dataset is greater than 75% and the classification accuracy of the test dataset is greater than 73%, indicating that the results obtained under the existing parameter conditions are acceptable. After building the decision tree, overfitting may occur when the model memorizes the training data so well that the model is learning noise on top of the signal. To avoid overfitting, pruning was used to reduce the size of the decision trees. Figure 3 shows that the branches of the decision tree are too large, indicating overfitting of the decision tree. Therefore, the cost complexity pruning (CCP) algorithm was selected for pruning this model, and the results are shown in Figure 4.

Results of pruning the decision tree.
Results
The results show that there are 49 nodes in the decision tree after pruning: 0 nodes are denoted as root nodes and the depth of the tree model is 6. To improve the model, the calculation of the Gini index of all factors affecting the generation of route sets is the first step. Then, they are sorted in ascending order. If the Gini index is the smallest, the corresponding factors are selected as the optimal feature of the current dataset. According to the calculation results of the seven characteristics of the Gini index, the smallest value is the travel time per kilometer. So, the factor is set as the root Node 0, and it is also the one that has the greatest influence on drivers’ route choice among all the factors.
Next, two values of the root node are used to divide the tree into two child nodes, and the child nodes divide the original data into two subsets. The other factors are selected with the minimum value of the Gini index, except for the travel time per kilometer: loop route distance is selected as Node 1, which is divided into two child nodes named “loop route distance ≤ 1.214” and “loop route distance > 1.214”, respectively. The above steps are repeated for each new node until all nodes in the decision tree meet any of the following conditions: all samples in the node belong to the same category and are set as leaf nodes; all samples in the node have the same remaining attribute values but belong to different categories; all sample attributes in the node are processed, and there are no remaining attributes that can be used to further divide the sample.
Comparing Figures 3 and 4 shows that the decision tree is different after pruning. Several nodes in the decision tree are cut, and the pruning process is as follows: taking Node 3 and Node 4 as examples, the cost complexity parameter α of Node 3 and Node 4 are first calculated. The computation results indicate that the value of Node 4 is less than the value of Node 3. Therefore, the branches below Node 4 are cut off. Similarly, the CCP algorithm continues to calculate the value of each other non-leaf node and then snips out the subtree with the minimum each time. Through pruning optimization of the decision tree, the classification results of training samples and test samples are shown in Table 5.
Classification Results of the Optimized Decision Tree Model
Note: na = not applicable.
As can be seen in Table 5, in the optimized decision tree model, the classification accuracy of the training dataset is over 75% and that of the test dataset is about 74%, which is about the same as that of the initial decision tree model. The accuracy of the training set after pruning is lowered because the decision tree will not learn the optimal parameters as well for the training set. However, without pruning the tree, the model with improper parameters will fail to generalize. This means that the model has learned an overly complex function, which perfectly predicts for the training sample, but may not be able to generalize with new data. To sum up, the model after pruning is more reliable for classification and prediction.
Discussion
Threshold of Factors Affecting the Generation of Route Sets
The root node and each internal node are marked with the basic variable and its branch threshold, which lead to the results of dividing the threshold values of each attribute under different travel distance types. The results show that for long travel distance, the maximum number of turns per kilometer that a driver can accept is 2.508, the maximum travel time per kilometer is 4.773 min, and the threshold of loop route distance and number of intersections per kilometer are 1.214 and 0.592, respectively. For medium travel distance, the threshold of the number of turns per kilometer that the driver can accept is 1.794, the maximum loop route distance is 1.027, and the maximum number of intersections per kilometer is 0.984. For short travel distance, the threshold number of intersections per kilometer is 1.648. The rule set of factors generated by the final decision tree model is shown in Table 6, in which Modes 1, 2, and 3 represent short, medium, and long travel distances, respectively.
Rule Set of Factors Affecting the Generation
Note: AP = proportion of arterial road in a route; EP = expressway in a route; SBP = proportion of secondary road/branches in a route.
Influence of Network Structure
The result of the decision tree model ranks the factors affecting drivers’ route choice from high to low as follows: travel time per kilometer, the number of turns (both left and right turns) per kilometer, AP, loop route distance, the number of road intersections per kilometer, EP, and SBP. The result is shown in Figure 5.

The importance of factors in descending order.
The reason that travel time per kilometer has the greatest influence may be that travelers are prone to get to the destination as quick as possible, which is in line with common sense. When the vehicle is moving in the opposite directions, left-turn vehicles need to wait for a long time. Cars not in the right-turn lane are often delayed by the cars that go straight through the intersection. It accounts for drivers who think highly of the number of turns per kilometer in route choices. It is worth noting that after optimization, EP and SBP are the least important factors when making route choices. Figure 1 shows the road network structure of Xi’an: it has a main urban road network that belongs to the grid-type road network; arterial roads are the backbone of the road network, and expressways serve in the surrounding areas, presenting a form of mixed road network. Drivers are inclined to use wider roads and are not willing to drive on the low service level roads.
Comparison and Application of Branch and Bound Method
Prato ( 16 ) used the branch and bound method to generate the route set, and simply set the number of left turns threshold to 4, ignoring the interaction between the numbers of left turns and other influencing factors. The threshold of the number of turns per kilometer calculated by the decision tree for the long travel distance is 2.508, which is longer than the medium and short travel distance threshold values of 1.794 and 1.648, respectively. The reason may be that when the travel distance is short, travelers are more sensitive to the variables that affect travel efficiency. This result shows that the left-turn threshold value of 4 is inappropriate. Although the loop route distance is considered as the primary consideration for drivers’ route choice, Prato set 1.2 as the threshold for a route, which is not combined with the actual situation. If the branch and bound method is used to generate the route, the threshold obtained in this paper can be used as a constraint.
Conclusions
This paper presents the findings of the route set spatial distribution range and the thresholds of factors affecting the generation of route sets. The route set is generated directly from the GPS trajectory data of taxis in Xi’an, China. In this study, 40 representative hotspots were selected as the origin and destination. It was found that the trajectory shapes between the OD pairs presented a distribution trend that is similar to an ellipse shape. Moreover, the SDE was used to fit the route set spatial distribution, and the spatial distribution range between OD was determined as an ellipse.
The number of left turns per kilometer, the number of road intersections per kilometer, the travel time per kilometer, the loop route distance, EP, AP, SBP, and the travel distance were selected as the set of factors affecting the generation of route sets. The CART decision tree model was established and the accuracy of the model was improved by the CCP method. The results show that for long travel distance, the maximum number of turns per kilometer that a driver can accept is 2.508, the maximum travel time per kilometer is 4.773 min, and the threshold of loop route distance and number of intersections per kilometer are 1.214 and 0.592, respectively. For medium travel distance, the threshold of the number of turns per kilometer that the driver can accept is 1.794, the maximum loop route distance is 1.027, and the maximum number of intersections per kilometer is 0.984. For short travel distance, the threshold number of intersections per kilometer is 1.648. The factors that influence travel choices most significantly are travel time per kilometer, the number of turns per kilometer, and AP.
It is recommended that the algorithm used to extract OD pairs should be optimized for less computation time. Also, as the data in this article are from the GPS dataset, when selecting the factors affecting the generation of route sets and mining their threshold, the research only focuses on the inherent attributes of the route. Factors such as individual driver attributes, household attributes, traffic conditions, weather conditions, and others could be added in the future. Finally, when coding decision tree variables, only three kinds of travel distances are divided. It is recognized that the value has a certain subjectivity. In future research, variables like EP, AP, and SBP could be used as continuous variables to better understand the road utilization conditions considered by drivers for different travel distances.
Footnotes
Author Contributions
The authors confirm contribution to the paper as follows: study conception and design: X. Hu, Y. Deng; data collection: S. Yan; analysis and interpretation of results: Y. Deng, S. Yan, P. Zhang; draft manuscript preparation: S. Yan. All authors reviewed the results and approved the final version of the manuscript.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research is supported by the National Key Research and Development Program of China (Grant No. 2018YFB1600900), the Shaanxi Provincial Science and Technological Project (Grant No. 2020JM-244), the Science and Technological Project of Shaanxi Provincial Transport Department (Grant No. 19-24X), and in part by the Fundamental Research Funds for the Central Universities (Grant No. 300102210204).
