Abstract
This paper presents a method for enhancing purpose imputation from global positioning system data without using geographic information system data via relevant feature selection from six groups: (1) activity time; (2) user characteristics; (3) predicted travel modes; (4) actual travel modes; (5) estimated home location; and (6) estimated location of the most frequently visited non-home place (MFVP). Two datasets were collected in 2019 using TRavelVU, a smartphone application. The first one (the Hanoi dataset) comprised 652 days’ worth of data collected from 63 users in Hanoi, Vietnam, whereas the second one (the Donate dataset) comprised 932 days’ worth of information collected from 65 individuals in Denmark, Sweden, and Norway. The hyperparameters of the random forest models were tuned carefully in accordance with selected features, thereby facilitating a thorough evaluation of the improvement in prediction models. The findings of this study revealed that the addition of either actual or predicted modes resulted in improved imputation performance, albeit the former exhibited a stronger positive effect. This demonstrated the potential benefits of integrating mode detection and purpose identification into a continuous process. The newly adopted MFVP feature contributed to enhanced prediction results (around 2%). The proposed purpose-imputation models, which benefited from all features, demonstrated accuracies of the order of 75% and 85% for the Hanoi and Donate datasets, respectively. The imputation of home and work/education activities demonstrated high success, whereas reasonable prediction results with nearly all F-score levels ranging between 50% and 83% were observed for pick-up/drop-off, shopping/eating, visit/leisure, and business activities.
In the past, mobility data collection depended entirely on self-reported surveys ( 1 ), whose data tend to be unreliable owing to the limitations of human memory and people's habit of rounding off times ( 2 – 4 ). The widespread deployment of the global positioning system (GPS) has brought about a technological revolution in travel-related investigations ( 1 , 5 ) because GPS facilitates continuous and passive collection of large volumes of accurate spatiotemporal data over long durations ( 6 – 9 ). Currently, an increasing number of GPS-based surveys are being performed owing to the ubiquitous spread of smartphones ( 4 , 7 , 10 ). However, positioning data do not comprise purpose information, and this has motivated researchers to develop numerous purpose-imputation models ( 6 – 17 ), as presented in Table 1.
Existing Studies in Relation to GPS-Data-Based Purpose Imputing
Note: GPS = global positioning system; GIS = geographic information system; POI = points of interest.
The different prediction methods developed can be categorized into three groups: rule-based, probabilistic, and learning-based ( 5 ). Pre-defined ad hoc rules were created and performed plausibly on a 13-person test with an accuracy of 79.5% ( 11 ) but poorly on larger-scale datasets of 66 respondents ( 13 ) and 1,104 participants ( 6 ) with accuracies of 60% and 43%, respectively. It has been realized that probabilistic methods yield better imputation performance compared with rule-based approaches. Chen et al. ( 14 ) developed a multinomial logit model that demonstrated an accuracy of nearly 80%, whereas the two-level nested logit model described in ( 9 ) successfully inferred 60% of activities considered. Since 2013, supervised learning models have attracted increased research attention. Tree-based algorithms are the most preferred for use in purpose-imputation applications. Decision-tree models have demonstrated successful imputation of 65% to 75% of considered activities ( 9 , 12 , 15 ). The random forest (RF) approach has been successfully employed in ( 7 , 16 , 17 ) and demonstrated the highest accuracy level (96.8%) in available literature ( 17 ). Other powerful classifiers have been developed by employing enhanced neural networks with particle swarm optimization ( 10 ) and determining parameters between nodes of different layers based on the Bayesian probability distribution ( 8 ); these approaches demonstrated accuracy levels exceeding 90%. Other methods, such as those based on the Bayesian network, support vector machine, and K-nearest neighbors, have been developed to generate baseline levels to emphasize the superiority of the main models proposed ( 8 , 10 , 17 ). 
To make a model generalize adequately from a training dataset and then successfully detect purposes in an unseen (i.e., test) dataset, researchers have tailored the values of hyperparameters (HPs) such as the number of trees in a forest or the number of hidden neurons ( 8 , 10 , 16 ). HPs denote parameters involved in the configuration of an algorithm, and their values are specified before commencement of a training process. They cannot be estimated directly from available data ( 18 ).
With regard to input features, geographic information system (GIS) data (land use and points of interest) hold the key to inferring purposes successfully ( 19 ), and the same have been considered for use in (6–14, 17). Without consideration of GIS data, accuracy levels of 76% and 84.4% have been reported in ( 15 ) and ( 16 ), respectively. In both these studies, regardless of the mode transfer purpose, the predictions of non-home and non-work activities demonstrated low accuracy levels. For example, the precision levels of shopping, social, and bring-get activities remained low at approximately 40% ( 15 ). In extant research, the use of GIS data is limited by the geographical scope of research. As the quality of and means to store GIS data vary from one place to another ( 19 ), employing GIS data requires intensive computation. Additionally, including GIS datasets of more than one area into a single experiment is a time-consuming and labor-intensive task ( 20 ). In fact, all prior studies concerning purpose detection have analyzed data collected in a specific region of a developed country or China, thereby resulting in a limited understanding of the derivation of trip purposes from GPS-based travel surveys in developing countries.
Transportation modes are useful features for purpose imputation ( 7 , 9 , 10 , 16 , 17 ). In these studies, the modes used were actual modes extracted directly from the ground truth. Notably, purpose imputation is the subsequent step of mode detection in GPS data processing ( 1 ). Thus, predicted modes, instead of actual modes, should be used to identify trip purposes.
Additionally, participant characteristics are considered worthy features (1, 6–12, 16). While several individual attributes, such as gender, age, and occupation, are easily accessible, others, such as the locations of frequently visited places (home, workplace, and school), comprise sensitive information that participants might not be willing to provide readily ( 17 ).
Time profiles of activities are also important variables. The most frequently used time-related indicators include the start time and duration of activities (7, 9, 11–17). Interestingly, Reumers et al. ( 15 ) developed a model that realized 76% prediction accuracy while exclusively considering these variables.
A striking advantage of collecting mobility data via GPS is the inclusion of multiple consecutive days. Because travel diaries of individuals may mention several repetitions of activities ( 21 ), the use of historical travel patterns extracted from a GPS dataset may help promote purpose imputation. However, in ( 12 , 14 ), the frequency of visiting locations could only be estimated from single-day data, and it was therefore not an important consideration in those prediction models.
Based on the above-described gaps in extant research, this study aims to enhance the accuracy of GPS-data-based purpose detection without using GIS data by selecting relevant features. The specific objectives are to:
test predicted mode-based features to facilitate integration of mode imputation and purpose imputation into a continuous process;
test features associated with frequently visited places extracted from GPS data to benefit more from GPS data of multiple days and thereby eliminate the need for participants to reveal sensitive location information;
improve prediction accuracy of non-home and non-work purposes; and
use datasets collected in developed and developing countries to facilitate comprehensive evaluation of feature-related findings and deepen the knowledge in relation to purpose imputation in developing countries.
This paper continues by describing the two datasets used in this research. Next, the methods to optimize the RF model HPs and feature definitions are shown. The following section contains the results of prediction models obtained using the two datasets along with a discussion of major findings. Finally, conclusions and future research directions are presented.
Data
Hanoi Dataset
Data Collection
The first dataset was collected in Hanoi, the capital of Vietnam, during the months of March and April of 2019. TRavelVU, a smartphone application developed by Trivector (Sweden) for both Android and iOS platforms, was used. Participants were recruited via an invitation post on Facebook; additionally, direct invitations were sent to the authors’ colleagues at the University of Transport and Communications in Hanoi. The selected participants were asked to answer a questionnaire in relation to their personal information (e.g., gender, age, occupation). Owing to privacy concerns, the participants’ residential addresses could not be gathered.
After completing the questionnaire, all participants were provided with a password to set up the TRavelVU application on their smartphones and participate in the Hanoi survey. During the survey period, the app recorded points every 1–3 s when smartphones were in motion. GPS data-processing algorithms developed by Trivector transform points into segments that annotate the activity or travel being undertaken before showing these segments sequentially on the app interface. All participants were requested to check and provide mode information to travel segments and purpose information to activity segments. In addition, each participant was encouraged to provide validated data for at least 7 days. The final dataset comprised data collected from 63 people with confirmed travel diaries and living in Hanoi during the survey period (Figure 1a). A more detailed description of the recruitment process has been reported in ( 22 ).

Estimated home locations of participants as captured in Hanoi and Donate datasets.
Data Preparation and Purpose-List Determination
This study used daily travel diaries confirmed by participants. Figure 2 depicts an example. Because the TRavelVU app is sensitive to changes between the movement and non-movement of the participants, an activity performed at a given location could be split into several segments, as in the example depicted in Figure 3. The authors carefully visualized all consecutive and identically annotated activity segments to decide whether to connect them manually or not.

Daily travel diary of a participant from Hanoi.

Three “work” activities at P1, P2, and P3 (blue circles) were reported within the premises of a university between 14:50 and 17:20. P1, P2, and P3 can, therefore, be merged into one activity P (red star) lasting from 14:50 to 17:20 with position coordinates considered as the average of those of P1, P2, and P3.
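The merging illustrated in Figure 3 can be sketched in Python as follows. This is only a minimal sketch: the authors decided on merges manually after visual inspection, so the `Segment` class, its field names, and the `merge_segments` helper are hypothetical, and the coordinates and intermediate times in the usage lines are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Segment:
    """One annotated activity segment (hypothetical structure)."""
    purpose: str
    start: datetime
    end: datetime
    lat: float
    lon: float


def merge_segments(segments):
    """Merge segments already judged to be one activity at one place.

    The merged activity spans the earliest start to the latest end, and its
    position is the plain average of the segment coordinates.
    """
    if not segments:
        raise ValueError("nothing to merge")
    if len({s.purpose for s in segments}) != 1:
        raise ValueError("segments must share the same purpose label")
    return Segment(
        purpose=segments[0].purpose,
        start=min(s.start for s in segments),
        end=max(s.end for s in segments),
        lat=mean(s.lat for s in segments),
        lon=mean(s.lon for s in segments),
    )


# Illustrative values only: three "work" segments between 14:50 and 17:20.
p1 = Segment("work", datetime(2019, 3, 20, 14, 50), datetime(2019, 3, 20, 15, 30), 21.0280, 105.8030)
p2 = Segment("work", datetime(2019, 3, 20, 15, 35), datetime(2019, 3, 20, 16, 10), 21.0284, 105.8034)
p3 = Segment("work", datetime(2019, 3, 20, 16, 15), datetime(2019, 3, 20, 17, 20), 21.0282, 105.8032)
merged = merge_segments([p1, p2, p3])
```

The helper deliberately leaves the decision of which segments belong together to the caller, mirroring the manual judgment described above.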
The purpose list considered comprised: (1) home, (2) work/education, (3) shopping/eating, (4) pick-up/drop-off, (5) visit/leisure, and (6) business.
Work and educational activities were similar with significantly longer durations and high frequency of occurrence during morning hours compared with other non-home purposes (Figure 4, a and b).
Shopping and eating out were combined into a single category, because eating breakfast at street-vendor stalls and diners as well as going daily shopping in the morning to purchase food supplies for the entire day is a common habit among people in Hanoi. Further, some people spend a part of their day to shop and have lunch or drink coffee at department stores and malls.
Pick-up and drop-off were grouped together as they usually last for only a short period and occur within specific time windows.
Business was considered a separate purpose type because some participants, being occupied as sales staff or repairers, usually left their offices to serve customers, or they were required to go to places to maintain machines, repair machines, or both. Business activities usually occur within normal working hours, albeit with shorter durations compared with work/education activities.
Visit and leisure activities are characterized by different start times and duration profiles.
Activities not belonging to the six types mentioned above, as well as those performed outside Hanoi or lasting less than a minute each, were discarded. Eventually, the Hanoi dataset comprised 652 days’ worth of collected data (10.4 days per user on average) with 2,596 activities: home, 935 (36.0%); work/education, 401 (15.4%); shopping/eating, 454 (17.5%); pick-up/drop-off, 157 (6.1%); visit/leisure, 303 (11.7%); and business, 346 (13.3%).
Donate Dataset
Data Collection
The TRavelVU application affords its users two survey options. The first involves performing dedicated surveys, where a user requires a password to take part (for example, the Hanoi survey). The second survey type, such as that performed for collection of the Donate dataset, can be accessed by anyone after installing the TRavelVU application on their smartphones. To validate the findings from the Hanoi dataset, the authors asked Trivector to share a part of the Donate dataset collected during September and October 2019. The Donate dataset was chosen because it provides the same list of variables as the Hanoi dataset. Additionally, this dataset encompassed data of citizens of several Scandinavian countries, thus partly representing data from developed countries.
Data Preparation and Purpose-List Determination
The Donate dataset comprised data collected from 138 participants. Data collected from users whose activities were tracked for only a single day and those failing to declare their gender, age, occupation, or presence of a child in the household were excluded. The final dataset comprised 932 days’ worth of data collected from 65 users (an average of 14.3 days per user) living in Sweden, Denmark, and Norway (Figure 1b).
A purpose list identical to that used for the Hanoi dataset was used for defining activity segments in the Donate dataset. However, unlike the Hanoi dataset, the authors retained all confirmed segments corresponding to the six purposes. That is, the authors: (1) retained all short activities that lasted less than 1 min; (2) retained all activities outside the living areas of participants; and (3) did not merge consecutive activities with identical labels occurring at the same place. Although this strategy may reduce the accuracy of purpose detection, it facilitates a comprehensive evaluation of the proposed method.
Finally, 3,469 segments were eligible for further purpose identification: home, 1,256 (36.3%); work/education, 722 (20.8%); shopping/eating, 860 (24.8%); pick-up/drop-off, 184 (5.3%); visit/leisure, 324 (9.3%); and business, 121 (3.5%). As can be seen, the home and pick-up/drop-off activities demonstrated a nearly equal share in both datasets. Overall, compared with the Hanoi dataset, the Donate dataset comprised much smaller percentages of visit/leisure and business activities and larger percentages of work/education and shopping/eating activities. The time distributions of purposes in the two datasets were observed to be nearly identical (Figure 4).

Start time and duration distributions of Hanoi and Donate datasets split across different purposes.
Method
Random Forest Algorithm
The RF technique was used in this study owing to it being one of the most pragmatic approaches to detecting purposes ( 7 , 16 , 17 ). Compared with other machine-learning approaches, such as artificial neural networks and support vector machines, optimization of RF models is well documented ( 23 , 24 ), and the same can be performed faster and more effectively based on the authors’ tests.
RF, a non-parametric prediction tool introduced in ( 25 ), comprises an ensemble of decision trees. All trees learn from random samples selected from the original data with replacement. To split each decision-tree node, a subset of input features randomly drawn from all features is employed. Subsequently, all decision-tree votes are aggregated to determine the final RF prediction. The randomness described above and the voting mechanism of RF help to avoid overfitting.
To facilitate an objective and comprehensive evaluation of the feature contributions to the effectiveness of prediction models, the authors optimized the model settings based on the corresponding features used. The Python programming language and Scikit-Learn library were deployed to tailor the values of the following HPs:
number of features used for each split;
number of trees in a forest;
maximum number of splits until a leaf;
minimum number of observations in a leaf; and
minimum number of observations in a node to cause further splitting.
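The five HPs listed above plausibly correspond to arguments of Scikit-Learn's `RandomForestClassifier`, as sketched below. The mapping is an assumption (the paper does not name the arguments), and the values are placeholders rather than the settings used in the study.

```python
from sklearn.ensemble import RandomForestClassifier

# Assumed correspondence between the tuned HPs and scikit-learn arguments.
# The values are placeholders, not the settings used in the study.
hp_set = {
    "max_features": "sqrt",   # number of features used for each split
    "n_estimators": 200,      # number of trees in a forest
    "max_depth": 10,          # maximum number of splits until a leaf
    "min_samples_leaf": 2,    # minimum number of observations in a leaf
    "min_samples_split": 5,   # minimum observations in a node to split further
}

model = RandomForestClassifier(random_state=0, **hp_set)
```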
To select an appropriate HP set, the data collected for each purpose were first divided into training and test datasets in a 3:1 ratio (Figure 5).

Flowchart of tuning hyperparameters.
Subsequently, a ten-fold cross-validation random grid search was performed on the training set. For the random grid search, a range of values for each HP was determined based on the authors’ experience and ( 23 ). A grid was formed from the combinations of these values, and HP sets were randomly chosen from this grid. Then, ten-fold cross-validation was carried out for the models defined by the randomly selected HP sets. The model with the highest accuracy after this search was kept, and the HP set defining this model was considered the best HP set found by the random grid search.
Next, this model was trained on the entire training set before being applied to the test set. If the accuracy of the said model when applied to the test set exceeded the baseline level, which was produced by the model using the default HPs available within the Scikit-Learn library, a ten-fold cross-validation exhaustive grid search was performed. Here, several candidate values for each HP were chosen around the best value found by the random search, and the combinations of these candidate values formed a new grid. The search ran exhaustively across all HP sets in this grid. The best HP set was the one whose model demonstrated the highest average accuracy over the ten-fold cross-validation.
Subsequently, the model based on the best HP set was trained on the entire training set before assessing its performance on the test set. The model was considered to generalize when its accuracy on the test set exceeded the baseline level. If the accuracy of the model when applied to the training set was much higher than that when applied to the test set, the corresponding HP set was discarded owing to severe overfitting. While some overfitting is unavoidable when handling actual data, it is important that it be limited. For example, ( 26 ) reported a 5.1% difference between training-data and test-data accuracy levels. In this study, the RF model was considered to generalize sufficiently, provided the difference between the accuracy levels observed on the training and test datasets remained less than 6%.
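A minimal sketch of the 3:1 split and the ten-fold cross-validation random grid search is given below, assuming Scikit-Learn's `train_test_split` and `RandomizedSearchCV` are suitable stand-ins for the procedure described. The synthetic data, the candidate HP values, and the number of sampled HP sets (`n_iter=8`) are illustrative assumptions, not the study's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the activity table: six classes play the role of purposes.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=6, n_redundant=0,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)

# 3:1 split, stratified so that every purpose is divided in the same ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Candidate values per HP (illustrative ranges only).
param_grid = {
    "n_estimators": [20, 40, 60],
    "max_features": [2, 3, 4],
    "max_depth": [4, 8, 12],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
}

# Randomly sample HP sets from the grid and score each by ten-fold CV accuracy.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_grid,
    n_iter=8,
    cv=10,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)

best_hp_set = search.best_params_             # best HP set from the random search
test_accuracy = search.score(X_test, y_test)  # refitted model applied to the test set
```

With the default `refit=True`, the best model is retrained on the whole training set before `score` is called, which matches the order of steps in the text.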
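The exhaustive step can be sketched as follows, assuming Scikit-Learn's `GridSearchCV`. The synthetic data, the `best_random` values (standing in for the output of a preceding random search), and the offsets used to build the narrow grid are all hypothetical; only two HPs are varied to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with six classes, split 3:1 as in the text.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=6, n_redundant=0,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline level: a forest with the default scikit-learn HPs.
baseline = RandomForestClassifier(random_state=0).fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)

# Hypothetical best HP set returned by a preceding random search.
best_random = {"n_estimators": 40, "max_depth": 8}

# New, narrow grid: a few candidate values around each best value.
narrow_grid = {
    "n_estimators": [best_random["n_estimators"] + d for d in (-10, 0, 10)],
    "max_depth": [best_random["max_depth"] + d for d in (-2, 0, 2)],
}

# Exhaustive ten-fold cross-validation search over every HP set in the grid.
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=narrow_grid,
    cv=10,
    scoring="accuracy",
)
grid_search.fit(X_train, y_train)

best_hp_set = grid_search.best_params_
tuned_accuracy = grid_search.score(X_test, y_test)
```

Comparing `tuned_accuracy` with `baseline_accuracy` then corresponds to the baseline check described in the text.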
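The two acceptance criteria (test accuracy above the baseline, and a training-test gap below 6%) can be expressed as a small check. The function name `generalizes` and the baseline value in the usage lines are hypothetical; the 80.9% and 75% figures are the Model_6 accuracies reported for the Hanoi dataset.

```python
def generalizes(train_acc, test_acc, baseline_test_acc, max_gap=0.06):
    """Return True if an HP set is kept under the two criteria in the text.

    Accuracies are fractions in [0, 1]. `max_gap` encodes the 6% limit on the
    difference between training and test accuracy.
    """
    beats_baseline = test_acc > baseline_test_acc
    limited_overfitting = (train_acc - test_acc) < max_gap
    return beats_baseline and limited_overfitting


# Model_6 on the Hanoi data: 80.9% (training) and 75% (test).
# The baseline of 0.72 is a made-up value for illustration.
keep_model_6 = generalizes(0.809, 0.75, baseline_test_acc=0.72)
```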
The grid was modified once a search (random or exhaustive) result failed to determine an HP set that led to sufficient generalization, better accuracy, or both, compared with the baseline level when employed in a model.
Feature Selection
Feature groups considered include: activity time, user characteristics, predicted travel modes, actual travel modes, estimated home location, and estimated location of the most frequently visited non-home place (MFVP).
Time-Related Features
The duration and commencement of activities were considered continuous variables (Table 2) expected to be useful for the imputation model. This is because these parameters represent unique time characteristics of certain purpose types (for example, home and work/education).
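As a minimal sketch, the two time-related features can be derived from the start and end timestamps of an activity. The units are assumptions (start as a decimal hour of the day, duration in minutes), and `time_features` is a hypothetical helper name.

```python
from datetime import datetime


def time_features(start, end):
    """Return (start_hour, duration_min) for one activity.

    Units are assumptions: start as decimal hour of day, duration in minutes.
    """
    start_hour = start.hour + start.minute / 60 + start.second / 3600
    duration_min = (end - start).total_seconds() / 60
    return start_hour, duration_min


# The merged "work" activity from Figure 3: 14:50 to 17:20 (date is arbitrary).
start_hour, duration_min = time_features(
    datetime(2019, 3, 20, 14, 50), datetime(2019, 3, 20, 17, 20)
)
```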
Description of Features
refers to the feature group used exclusively for Hanoi data.
refers to the feature group used exclusively for Donate data.
refers to the feature group used for both Hanoi and Donate data.
MFVP refers to the most frequently visited (non-home) place.
User-Related Features
The participants’ gender, job, age, and family-related information were considered binary variables to limit model complexity. The variable named “dynamic job” was typically used to determine whether participants frequently worked outside their main offices. This was expected to be beneficial with regard to identifying business activities.
Transportation-Mode-Related Features
Most people in Hanoi rely heavily on motorcycles for their daily commute; there is only moderate use of cars. Bus services (the only means of public transport) and bicycles find the lowest use ( 27 – 29 ). The mode information used in this study corresponded to the main trip mode, and accordingly, the prioritized list included use of public transport, car, motorcycle, bike, and walking. For example, if a trip comprised three travel segments, including commute by foot, bus, and motorcycle, its main mode would correspond to public transport. Two types of travel-mode information were considered in this study. The first information type can be directly extracted from the ground truth. The second type has never been employed in a practical application. It corresponds to the result of a hierarchical process of mode identification, presented in ( 4 ).
With regard to the Donate dataset, the mode list included travel by train, public transport (buses, trams, and metro rail), car, bike/e-bike, and walking in the decreasing order of priority. Because the coordinates of places visited during a trip were not available, mode detection could not be performed, and only actual modes were used for purpose detection when employing the Donate data.
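The main-mode rule for both datasets can be sketched as a lookup over a priority list. The mode labels are taken from the text, whereas the function name `main_mode` and the assumption that a bus segment is recorded under the label "public transport" are illustrative.

```python
# Priority lists taken from the text, highest priority first.
HANOI_PRIORITY = ["public transport", "car", "motorcycle", "bike", "walking"]
DONATE_PRIORITY = ["train", "public transport", "car", "bike/e-bike", "walking"]


def main_mode(segment_modes, priority):
    """Return the highest-priority mode among a trip's travel segments."""
    present = set(segment_modes)
    for mode in priority:
        if mode in present:
            return mode
    raise ValueError(f"no known mode in {segment_modes!r}")


# Example from the text: a trip made on foot, by bus, and by motorcycle.
# The bus segment is assumed to carry the label "public transport".
trip = ["walking", "public transport", "motorcycle"]
hanoi_main = main_mode(trip, HANOI_PRIORITY)
```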
Home-Related Feature
Binary values of the home-related feature were determined based on the spatial relationship between the location of performing an activity and the corresponding home location. The home location of a user was estimated by considering the location coordinates of the origin of the first trip and the destination of the last trip on each survey day. A survey day corresponds to the period between 03:00 on one day and 02:59 on the next day. The first trip on the first day and the last trip on the last day were ignored because users may install or uninstall the TRavelVU application while being away from home. The frequency of a participant’s stay at each origin or destination location was evaluated by counting the other origin/destination points located up to 100 m away from that location. As expected, a participant’s home location demonstrated the highest frequency. The coordinates of the home location were calculated as the average of the coordinates of all locations up to 100 m away from the home location.
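The home-estimation steps above can be sketched as follows. This is one possible reading, not the authors' implementation: the trip tuple layout, the haversine distance, the assignment of a trip to the survey day of its start time, and all coordinates in the usage lines are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt
from statistics import mean


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))


def survey_day(ts):
    """Map a timestamp to its survey day, which runs from 03:00 to 02:59."""
    return (ts - timedelta(hours=3)).date()


def candidate_points(trips):
    """Origin of the first trip and destination of the last trip of each day.

    `trips` holds (start_time, origin, end_time, destination) tuples. The first
    trip of the first day and the last trip of the last day are skipped.
    """
    by_day = defaultdict(list)
    for trip in sorted(trips, key=lambda t: t[0]):
        by_day[survey_day(trip[0])].append(trip)
    days = sorted(by_day)
    points = []
    for day in days:
        if day != days[0]:
            points.append(by_day[day][0][1])   # origin of the first trip
        if day != days[-1]:
            points.append(by_day[day][-1][3])  # destination of the last trip
    return points


def most_frequent_place(points, radius_m=100):
    """Average the points within `radius_m` of the best-supported point."""
    if not points:
        raise ValueError("no candidate points")

    def neighbours(p):
        # Includes p itself, which shifts every count by one and so does
        # not change which point has the most neighbours.
        return [q for q in points if haversine_m(p, q) <= radius_m]

    centre = max(points, key=lambda p: len(neighbours(p)))
    cluster = neighbours(centre)
    return mean(q[0] for q in cluster), mean(q[1] for q in cluster)


# Made-up four-day diary: home, work, and one night at a friend's place.
home, work, friend = (21.0300, 105.8000), (21.0400, 105.8200), (21.0500, 105.8500)
home_b, home_c = (21.03002, 105.80002), (21.02998, 105.79998)
trips = [
    (datetime(2019, 3, 18, 8), home, datetime(2019, 3, 18, 9), work),
    (datetime(2019, 3, 18, 18), work, datetime(2019, 3, 18, 19), home),
    (datetime(2019, 3, 19, 8), home_b, datetime(2019, 3, 19, 9), work),
    (datetime(2019, 3, 19, 18), work, datetime(2019, 3, 19, 19), home_c),
    (datetime(2019, 3, 20, 8), home, datetime(2019, 3, 20, 9), work),
    (datetime(2019, 3, 20, 18), work, datetime(2019, 3, 20, 19), friend),
    (datetime(2019, 3, 21, 9), friend, datetime(2019, 3, 21, 10), home),
]
home_estimate = most_frequent_place(candidate_points(trips))
```

In this toy diary, four of the six candidate points fall within 100 m of the true home, so the estimate is the average of those four points.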
Most Frequently Visited Non-Home Place (MFVP)-Related Feature
Binary values of the MFVP-related features were determined based on the spatial relationship between the location at which an activity was performed and that of a non-home place most frequently visited by a participant. To estimate the MFVP coordinates, the authors removed all home points before determining the frequency of visiting activity locations in a manner identical to determining the frequency of home activities. Accordingly, the location corresponding to the highest frequency was considered the MFVP, and its corresponding location coordinates were evaluated by averaging those of all points located up to 100 m away from the MFVP. Because people have a tendency to return to previously visited locations at regular intervals (21), the use of MFVP was expected to enhance prediction performance.
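A sketch of the MFVP estimate and the resulting binary feature is given below. Several details are assumptions: home points are taken to be activity points within 100 m of the estimated home, the binary feature is taken to be 1 when an activity lies within a threshold distance of the place (the paper only refers to a "spatial relationship"), and the helper names and coordinates are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt
from statistics import mean


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))


def estimate_mfvp(activity_points, home, radius_m=100):
    """Most frequently visited non-home place, or None if there is none."""
    # Assumption: "home points" are those within radius_m of the estimated home.
    non_home = [p for p in activity_points if haversine_m(p, home) > radius_m]
    if not non_home:
        return None

    def neighbours(p):
        return [q for q in non_home if haversine_m(p, q) <= radius_m]

    centre = max(non_home, key=lambda p: len(neighbours(p)))
    cluster = neighbours(centre)
    return mean(q[0] for q in cluster), mean(q[1] for q in cluster)


def near(place, activity, threshold_m=100):
    """Binary feature: 1 if the activity lies within threshold_m of the place."""
    return int(haversine_m(place, activity) <= threshold_m)


# Made-up activity locations: two at home, three at an office, one at a cafe.
home = (21.0300, 105.8000)
office = [(21.0400, 105.8200), (21.04002, 105.82002), (21.03998, 105.81998)]
cafe = (21.0350, 105.8100)
mfvp = estimate_mfvp([home, home, *office, cafe], home)
features = [near(mfvp, p) for p in (office[0], cafe, home)]
```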
Results and Discussion
For Hanoi Dataset
Table 3 presents the results obtained using six different models, which were carefully tuned and employed different features. For all models, the difference between the accuracy levels realized when applied to the training and test datasets lies in the 4% to 6% range. This implies that overfitting was contained within acceptable limits. Table 4 presents a case where Model_6, which considers inputs from all features except predicted modes, demonstrated accuracies of the order of 80.9% and 75% when applied to the training and test datasets, respectively.
Comparison of Models using Different Features
Note: MFVP = most frequently visited (non-home) place.
Results Obtained for Model_6 When Applied to Hanoi and Donate Datasets
1 refers to the total value; 2 refers to the average value; na = not applicable.
As can be seen from Table 3, 60.1% of all activities were successfully identified based on their start time and duration (Model_1), to which the addition of user characteristics (Model_2) afforded a 2.3% improvement. Interestingly, the addition of either the actual (Model_3) or predicted (Model_4) mode-related features resulted in enhanced performance, albeit modes extracted from the ground truth demonstrated a relatively stronger effect (64.3% versus 63.6%). Addition of the closeness-to-home feature (Model_5) produced a dramatic increase in accuracy (by 8.7 percentage points) to 73%. The use of Model_6 revealed that a further 2% (i.e., an aggregate of 75%) of all activities could be correctly detected via incorporation of the MFVP feature, thereby justifying its consideration.
These observed changes in model performance emphasize that features related to time and home location assume the highest importance. Moreover, the incorporation of user- and actual-mode-based features demonstrates a nearly similar effect with regard to improving the overall accuracy (approximately 2% for each) of the prediction model. This justifies their heavy usage in extant approaches (1, 8, 10, 12, 16, 17). By contrast, the predicted-mode information, possibly owing to inclusion of wrong imputation, does not significantly enhance the model performance when compared with actual modes.
Figure 6, a and b, reveal observed changes in the recall and precision values, respectively, corresponding to the different purposes when applying the six models described above to the Hanoi dataset. These trends explain how a given model provides better prediction results on addition of certain features. Compared with Model_4, the recall and precision values for home increase significantly when employing Model_5, thereby emphasizing the appropriate functioning of the home-detection algorithm. Accordingly, corresponding parameter values for the visit/leisure and work/education activities also demonstrated considerable increases. As several instances of visit/leisure and work/education activities were characterized by ambiguous start times and duration profiles similar to home (Figure 4, a and b), there exists a possibility of them being wrongly labeled as home when applying Model_4. However, Model_5 ensures their accurate characterization owing to them being performed at locations far from participant homes. The observed decline in recall values for pick-up/drop-off from 69.2% (Model_4) to 53.9% (Model_5) could possibly be because of the proximity of kindergartens, schools, or both, from participant homes.
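For reference, per-purpose precision, recall, and F-score can be computed from true and predicted labels as sketched below. The paper does not state its F-score formula, so the balanced form (harmonic mean of precision and recall) is an assumption, and the labels in the usage lines are toy data, not results from the study.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f_score)} for every label seen."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Assumed balanced F-score: harmonic mean of precision and recall.
        f_score = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        scores[label] = (precision, recall, f_score)
    return scores


# Toy labels only.
y_true = ["home", "home", "work", "shop", "shop", "shop"]
y_pred = ["home", "work", "work", "shop", "shop", "home"]
scores = per_class_scores(y_true, y_pred)
```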

Observed trends in precision and recall of different purposes when employing different models on Hanoi and Donate datasets.
In Model_6, the MFVP consideration causes a significant increase in the recall values of shopping/eating, visit/leisure, and pick-up/drop-off. This is because, apart from their homes, the places that participants visit most regularly are restaurants and markets rather than workplaces. In fact, going shopping daily is a common practice in Vietnam, whereas people rarely visit workplaces over weekends. Therefore, the MFVP consideration substantially improves the prediction of shopping/eating activities. Further, the MFVP consideration prevented the pick-up/drop-off, visit/leisure, and business activities performed at locations far from the MFVP from being misclassified as shopping/eating.
The analyses mentioned above demonstrate that the incorporation of the home- and MFVP-location-prediction algorithms facilitates accurate detection of not only major activities (home, work/education, and shopping/eating) but also minor ones (visit/leisure and pick-up/drop-off). Hereafter, in this paper, “major purposes” refers to those with a considerably large number of activities compared with “minor purposes.”
For Donate Dataset
All models applied to the Donate dataset were tuned to ensure the difference between observed accuracy levels when applied to the training and test datasets remained less than 6%. As mentioned above, predicted-mode variables could not be applied to the Donate dataset; therefore, Model_3 (Table 3) was not created. Generally, the influence of the incorporation of features was observed to be in agreement with results obtained for the Hanoi dataset. That is, the incorporation of user attributes, actual modes, and MFVP facilitated a substantial improvement in model accuracy (by up to 2% each). The consideration of participants’ home locations increased the prediction accuracy from 72.4% to 83.1%. Likewise, the incorporation of the home- and MFVP-location-prediction algorithms enhanced the classification accuracy of both major and minor purposes (Figure 6, c and d).
Model_6, when applied to the Donate dataset, demonstrated a 10% higher accuracy (85.1%) compared with that realized on its application to the Hanoi dataset. Notably, data preparation in relation to filtering and merging activities was performed for the Hanoi dataset, but not for the Donate dataset. Inclusion of long-distance trips with activities outside a specific geographical research scope resulted in lower prediction accuracies ( 20 ). Therefore, if the data-preparation procedures were implemented the same for both datasets, the observed difference in accuracy levels when using the Hanoi and Donate datasets is expected to exceed 10%.
The realization of considerably higher accuracy when the Donate dataset is used may be attributed to the comparatively longer survey duration. With 14.3 days per person compared with 10.4 days in the Hanoi dataset, home and MFVP locations in the Donate dataset can be estimated more accurately, thereby facilitating better classification of major purposes. In particular, F-scores pertaining to the home, working/education, and shopping/eating activities equaled 97.1%, 90.8%, and 83.3%, respectively, when using the Donate dataset compared with 92.1%, 80.4%, and 59.6%, respectively, when using the Hanoi dataset. Moreover, shares of working/education and shopping/eating activities (20.8% and 24.8%, respectively) in the Donate dataset are larger than corresponding values (15.4% and 17.5%, respectively) in the Hanoi dataset (Table 4). Therefore, the application of Model_6 to the Donate dataset facilitates realization of a much higher prediction accuracy compared with the Hanoi dataset owing mainly to the superior imputation of major purposes. Nevertheless, the longer survey duration of the Donate dataset does not facilitate the detection of minor purposes owing to their small shares. Thus, sufficient data corresponding to minor purposes are not available to facilitate Model_6 learning and, consequently, it fails to detect such purposes very well. Both datasets demonstrated nearly similar F-scores for the pick-up/drop-off and visit/leisure activities in reasonable range of 51% to 63% (Table 4). The clearly low F-score pertaining to business activities (36.4%) in the Donate dataset could similarly be attributed to its low share (3.5%).
Despite the low F-scores of business and pick-up/drop-off activities, a slightly higher F-score of Model_6 was observed for the Donate dataset (70%) compared with its Hanoi counterpart (68.9%). Owing to the higher prediction accuracy and F-score of Model_6, purpose imputation was better for the Donate dataset than for the Hanoi dataset. In addition, the 75% accuracy observed for the Hanoi dataset was lower than that reported in several prior studies (8, 10, 11, 15–17). Accordingly, extracting purpose information from GPS data appears to be less accurate in a developing country than in developed countries.
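The interplay between overall accuracy and per-class F-scores discussed above can be illustrated with a minimal sketch. The confusion-matrix counts and the three class labels below are hypothetical and are not taken from Table 4; the sketch also assumes that an overall F-score is the unweighted (macro) mean of the per-class F-scores, which the text does not state explicitly.

```python
from typing import Dict, List


def per_class_f1(confusion: List[List[int]], labels: List[str]) -> Dict[str, float]:
    """F1-score per class; rows are true classes, columns are predicted classes."""
    scores = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]                          # correctly predicted class k
        fn = sum(confusion[k]) - tp                   # class k predicted as something else
        fp = sum(row[k] for row in confusion) - tp    # other classes predicted as k
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def accuracy(confusion: List[List[int]]) -> float:
    """Share of all activities whose purpose was predicted correctly."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total if total else 0.0


def macro_f1(scores: Dict[str, float]) -> float:
    """Unweighted mean of per-class F1-scores (assumed definition)."""
    return sum(scores.values()) / len(scores)


# Hypothetical counts: two major purposes and one minor purpose with a small share.
labels = ["home", "work", "business"]
confusion = [
    [90, 8, 2],   # true home
    [10, 85, 5],  # true work
    [3, 3, 4],    # true business (only 10 of 210 activities)
]

scores = per_class_f1(confusion, labels)
overall_accuracy = accuracy(confusion)
overall_macro_f1 = macro_f1(scores)
```

In this toy case the minor class barely affects accuracy because of its small share, but it pulls the macro F-score well below the accuracy, which mirrors the pattern described for the business activity.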
Interestingly, the 85% accuracy observed for the Donate dataset exceeded the accuracies reported in prior studies that used GIS data (6, 7, 9, 11, 13, 14) as well as in those that did not consider GIS information ( 15 , 16 ). Compared with ( 15 , 16 ), the proposed approach demonstrated better prediction of non-home and non-work/education activities in both the Donate and Hanoi datasets. Therefore, the proposed method (Model_6) substantially enhances the accuracy of purpose imputation based on GPS data without using GIS information.
Conclusion
This paper presents details concerning the authors’ proposed method of feature selection and HP tuning of RF models to enhance trip purpose imputation from GPS data without using GIS data. The proposed approach is unique because the development and testing of imputation models were performed on datasets collected from both developed and developing countries. Moreover, the Donate dataset included participants living in three different European countries, instead of a specific city or region of a country, as has been the case in the existing research. The major findings and contributions of this study to the existing literature are as follows.
(1) Time- and home-related features are the biggest contributors to the prediction ability of a given model. Consequently, they should be considered in the further development of purpose-prediction algorithms.
(2) The predicted transportation mode increases model accuracy, thereby encouraging the integration of mode detection and purpose imputation into a continuous process.
(3) The repetition of travel patterns should be considered to enhance the purpose-prediction accuracy by estimating the home and MFVP locations. Notably, the proposed method duly considers predicted-mode- and MFVP-related features, which have hitherto been disregarded in prior research.
(4) A longer survey duration allows the prediction model to better classify major purposes, thereby resulting in a higher overall accuracy. However, the same may not be true for minor purposes.
(5) The proposed method does not require participants to provide personal locations, which may be unavailable owing to privacy concerns. Furthermore, the method completely depends on internal features obtained from collected data samples. Therefore, the proposed method is highly transferable and well suited to performing comparable studies of purpose imputation across regions. The predictive performance of the proposed method may be considered as an appropriate baseline, against which the performance of other candidate models based on external data (e.g., GIS data) can be compared.
(6) Purpose imputation using GPS data is a novel concept in developing countries, and the proposed model demonstrates lower performance there compared with developed countries.
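The idea of exploiting repeated travel patterns to estimate home and MFVP locations from the collected data alone can be sketched as follows. This is only an illustrative heuristic under stated assumptions, not the authors' estimation procedure: the grid size, the night-hour window, and the names `to_cell` and `estimate_home_and_mfvp` are all hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple

Stop = Tuple[float, float, int]   # (latitude, longitude, hour of day at the stop)
Cell = Tuple[int, int]

# Assumed night window during which a person is most likely at home.
NIGHT_HOURS = frozenset({22, 23, 0, 1, 2, 3, 4, 5})


def to_cell(lat: float, lon: float, size_deg: float = 0.002) -> Cell:
    """Snap a coordinate to a grid cell so that nearby stops are grouped together."""
    return (round(lat / size_deg), round(lon / size_deg))


def estimate_home_and_mfvp(stops: List[Stop]) -> Tuple[Optional[Cell], Optional[Cell]]:
    """Return (home cell, most frequently visited non-home cell) for one user."""
    cells = [to_cell(lat, lon) for lat, lon, _ in stops]
    all_counts = Counter(cells)
    night_counts = Counter(
        cell for cell, (_, _, hour) in zip(cells, stops) if hour in NIGHT_HOURS
    )
    # Home: most frequent night-time cell; fall back to the most visited cell overall.
    home = (night_counts or all_counts).most_common(1)[0][0] if cells else None
    # MFVP: most frequent cell once the home cell is excluded.
    non_home = Counter(cell for cell in cells if cell != home)
    mfvp = non_home.most_common(1)[0][0] if non_home else None
    return home, mfvp


# Hypothetical multi-day record for one user.
stops = [
    (21.0280, 105.8340, 23),
    (21.0400, 105.8500, 9),
    (21.0280, 105.8340, 1),
    (21.0400, 105.8500, 14),
    (21.0100, 105.8000, 18),
    (21.0400, 105.8500, 10),
]
home_cell, mfvp_cell = estimate_home_and_mfvp(stops)
```

Because both estimates rely on visit counts, they become more stable as more survey days are recorded, which is consistent with finding (4) on survey duration.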
The findings mentioned above demonstrate that purpose imputation can achieve reasonable prediction results without the support of GIS data in both developing and developed countries by exploiting an advantage of GPS data (i.e., the recording of multiple days). This study has several limitations. First, the proposed method is based solely on the RF technique, without consideration of other powerful learning-based classifiers (e.g., support vector machine and maximum entropy [30]). Second, the detection accuracy of minor purposes is only reasonable and requires significant improvement. Third, the datasets used were relatively small and subject to some self-selection bias, which limits the reliability of the findings to some extent. Fourth, conclusion (2) above in relation to predicted modes is based exclusively on tests with the Hanoi dataset; the Donate dataset was not considered for this purpose. Last, GIS data were not considered. These limitations emphasize the need to extend the scope and findings of this research by developing more sophisticated supervised learning algorithms that consider GIS data and facilitate the analysis of large datasets.
Footnotes
Acknowledgements
The authors sincerely appreciate the constructive comments of the editor and six anonymous reviewers.
Author Contributions
The authors confirm their contribution to this manuscript as follows: study conception and design: M.H. Nguyen, J. Armoogum; data collection: M.H. Nguyen, E. Adell; analysis and result interpretation: M.H. Nguyen; preparation of draft manuscript: M.H. Nguyen, J. Armoogum, E. Adell. All authors reviewed the results and approved the final version of the manuscript for publication.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
