Abstract
Prior studies have shown that Internet search query data have great potential to improve tourism forecasting. As such, selecting the most relevant information from large amounts of search query data is crucial to enhancing forecasting accuracy and reducing overfitting; however, such feature selection methods have not been considered in the tourism forecasting literature. This study employs four machine learning–based feature selection methods to extract useful search query data and construct relevant econometric models. We examined the proposed methods based on monthly forecasting of tourist arrivals in Beijing, China, along with weekly forecasting of hotel occupancy in the city of Charleston, South Carolina, USA. Our findings indicate that the forecasting model with the selected search keywords outperformed the benchmark ARMAX model without feature selection in forecasting tourism demand and hotel occupancy. Therefore, machine learning methods can identify the most useful search query data to significantly improve forecasting accuracy in tourism and hospitality.
Introduction
Tourism demand forecasting plays a crucial role in the tourism and hospitality industry. Recently, accurate prediction of tourism demand has become an increasingly important research topic. Many studies to date have focused on forecasting tourism demand using diverse quantitative methods, including time series, econometric, artificial intelligence, and integrated models (Chen et al. 2019; Gunter and Önder 2015; Song, Qiu, and Park 2019; Song and Li 2008; Song and Witt 2000; X. Sun et al. 2016). Recent forecasting methods include deep learning (Law et al. 2019), spatial temporal models (Y. Yang and Zhang 2019), ensemble empirical mode decomposition (X. Li and Law 2020), Bayesian global vector autoregressive model (Assaf et al. 2019), Markov switching models (Valadkhani and O’Mahony 2018), and pooling (Long, Liu, and Song 2019). These developments reflect researchers’ efforts to propose a more effective forecasting method to significantly improve the forecasting accuracy of tourism demand.
Scholars have found that Internet big data, such as search query data, can be used to predict tourism demand accurately. In particular, search query data can reflect tourist behavior and supplement traditional data sources to predict tourism demand (Choi and Varian 2012; X. Yang et al. 2015). To incorporate search query data into forecasting models, researchers have proposed combining them into one or several representative indices that serve as explanatory variables, thereby reducing model complexity. Keyword selection from search queries largely determines tourism forecasting performance. In the existing literature, many studies have defined keywords on the basis of prior domain knowledge and then collected search query data from Baidu or Google to represent tourists’ interests (Brynjolfsson, Geva, and Reichman 2016; S. Sun et al. 2019). Common approaches to index aggregation include shift and summation, principal component analysis, and the generalized dynamic factor model. For example, S. Sun et al. (2019) aggregated 22 Baidu search query series into one index using the shift and sum method for tourism forecasting. X. Li et al. (2017) extracted a composite index from 45 search query series as an explanatory variable via a generalized dynamic factor model. However, these studies did not determine which search query data are most helpful for improving tourism forecasting accuracy.
When selecting search keywords, the tradeoff between search query data coverage and accuracy has been deemed pivotal (Geva et al. 2017). Incorporating a large group of search query data into a forecasting model may reveal relevant information, but it may also introduce irrelevant noise and cause problems such as spurious correlation and overfitting (Brynjolfsson, Geva, and Reichman 2016; Geva et al. 2017; Song and Liu 2017). Researchers therefore need to determine which search keywords should be retained to maintain forecasting accuracy. The question of how to automatically select, from search engine data, the combinations of search queries related to tourism demand that most effectively improve forecasting accuracy remains unanswered.
This study aims to investigate whether machine learning methods can help select the most useful search query data to improve predictions of tourism demand and hotel occupancy compared to a benchmark ARMAX model without machine learning. Moreover, if machine learning methods are well suited to tourism demand forecasting, which method realizes the best forecasting performance? To achieve these research objectives, we conducted empirical studies that include (1) the monthly tourism demand forecasting in Beijing using Baidu search data and (2) the weekly hotel occupancy forecasting using Google data. In addition, several machine learning–based feature selection methods (i.e., filter-based feature selection, recursive feature selection, genetic algorithm feature selection, and random forest feature selection) were applied to extract appropriate search query data subsets for incorporation into our forecasting models. Forecasting results reveal the effectiveness of machine learning feature selection methods in improving the forecasting of tourism demand and hotel occupancy.
The remainder of this article is organized as follows. The next section presents a literature review. The third section introduces our four feature selection methods and describes the methodology. The fourth section outlines our empirical results, including the monthly Beijing tourism forecasting and the robustness of weekly hotel occupancy forecasting. The last section provides concluding remarks.
Literature Review
Three streams of literature are relevant to our work: (1) tourism forecasting models with search query data; (2) Internet search query data selection; and (3) the current state of research and application of machine learning–based feature selection.
Tourism Forecasting Models with Search Query Data
Tourism forecasting has become increasingly important due to an urgent need for timeliness and accuracy in forecasting tasks (Frechtling 2012; Guizzardi and Stacchini 2015; Hassani et al. 2017; Pai, Hung, and Lin 2014; Peng, Song, and Crouch 2014; Shen, Li, and Song 2011; Zhou-Grundy and Turner 2014). Existing literature in tourism and hospitality demonstrates that Internet search query data have become an important variable for increasing tourism forecasting accuracy (J. Li et al. 2018; Song, Qiu, and Park 2019). Pan, Wu, and Song (2012) were among the first to demonstrate how Google search query data could improve the forecasting accuracy of hotel room demand. Scholars have continued to adopt different models to analyze search query data effectively and to improve forecasting accuracy. Table 1 presents an overview of selected work in the tourism literature.
Overview of Selected Papers with Search Query Data.
Note: ARIMA = autoregressive integrated moving average; AR-MIDAS = autoregressive-mixed data sampling; ADE = adaptive differential evolution; ADL = autoregressive distributed lag; BA = Bat algorithm; BPNN = back-propagation neural networks; DFA = dynamic factor approach; DL = deep learning; DLM = dynamic linear model; EEMD = ensemble empirical mode decomposition; GDFM = generalized dynamic factor model; KELM = kernel extreme learning machine; KPCA = kernel principal component analysis; SVR = support vector regression; TVP = time varying parameter; VAR = vector autoregressive.
When tourists plan trips, they may use Internet search engines to retrieve information using keywords (Fesenmaier et al. 2011; X. Yang et al. 2015). Search query data from Google and Baidu are the most widely used across forecasting contexts. X. Yang et al. (2015) indicated that Baidu search data perform better when forecasting tourist arrivals in China, while Google search data have been found suitable for forecasting in countries and cities where English is the main language (Önder 2017; Pan and Yang 2017). Search query data have been used to forecast international tourism demand for a destination from multiple source markets (e.g., X. Li and Law 2020). Scholars have also examined the predictive ability of search query data in forecasting a single time series, such as the total number of tourist arrivals to one destination (e.g., Hu and Song 2019; X. Li et al. 2017; Pan and Yang 2017; S. Sun et al. 2019; X. Yang et al. 2015). Both types of research have revealed the usefulness of search query data in tourism demand prediction.
In terms of methodologies, time series, econometric, artificial intelligence, and hybrid models have been adopted in the existing forecasting literature. Time series and econometric models account for the larger proportion of studies. Search query data are incorporated as explanatory variables into models such as AR (autoregressive), ARMA (autoregressive moving average), ARIMA (autoregressive integrated moving average), seasonal ARIMA, ADL (autoregressive distributed lag), TVP (time varying parameter), and VAR (vector autoregressive) (Bokelmann and Lessmann 2019; X. Huang, Zhang, and Ding 2017; X. Li and Law 2020; Önder 2017; Pan, Wu, and Song 2012; Park, Lee, and Song 2017). Advanced econometric models have also been developed for modeling search query data. For example, Bangwayo-Skeete and Skeete (2015) adopted an autoregressive-mixed data sampling model that predicted monthly tourist arrivals with weekly Google search data. Camacho and Pacce (2018) showed that a dynamic factor model can improve tourism forecasting accuracy compared to an AR model. X. Li et al. (2017) proposed a generalized dynamic factor model to analyze search query data and improve forecasting performance.
Artificial intelligence models have been applied to forecast tourism demand with search query data. For example, Hu and Song (2019) demonstrated that a BPNN model can outperform ARIMA and ADL models in forecasting tourist arrivals from Hong Kong to Macau. S. Sun et al. (2019) suggested that a kernel extreme learning machine method can improve the forecasting accuracy of Beijing tourism demand compared to ARIMA models. Law et al. (2019) demonstrated the ability of a deep learning method to forecast Macau tourist arrivals. Moreover, hybrid models have recently been adopted in tourism forecasting with search query data. S. Li et al. (2018) combined adaptive differential evolution with a BPNN to further enhance forecasting. Wen, Liu, and Song (2019) proposed a hybrid model that combines a linear ARIMA model and a nonlinear AR model to improve forecasting accuracy. Zhang et al. (2017) combined the Bat algorithm and support vector regression to improve forecasting performance.
Different models have their own advantages, and no single method outperforms the others in all forecasting situations (Song, Qiu, and Park 2019). Search query data are treated as explanatory variables that can influence forecasting performance. Therefore, the selection of search query data has become important, since the selection process determines not only which data are incorporated into the model but also whether forecasting accuracy can be improved.
Internet Search Query Data Selection
When forecasting using search query data, concerns about the selection of search query keywords have become unavoidable as search data selection greatly influences forecasting quality. Table 1 suggests that search query data are mainly selected based on two approaches: intuition and prior domain knowledge (Brynjolfsson, Geva, and Reichman 2016) and Google’s search query index.
The first keyword selection approach has been widely applied in the existing literature. X. Yang et al. (2015) and X. Li et al. (2017) defined several aspects of tourist activities, including transportation, dining, lodging, shopping, recreation, and tours. Studies such as S. Li et al. (2018) and Law et al. (2019) followed their keyword selection frameworks but made minor adjustments by adding one aspect related to clothing. Hu and Song (2019) excluded search terms related to shopping and tours, since these activities do not motivate visitors traveling from Hong Kong to Macau. As discussed by Geva et al. (2017), the advantage of such search data selection is high coverage of search keywords; however, as more search query data are selected, overfitting resulting from irrelevant data becomes unavoidable, which may reduce the accuracy of forecasting models.
Some researchers have obtained search query data directly from the Google category index (Camacho and Pacce 2018; Önder 2017). For example, Bangwayo-Skeete and Skeete (2015) used a search index for “hotels and flights” in Google to predict tourist arrivals in the Caribbean. X. Li and Law (2020) used the Google query index from Hong Kong travel subcategories to improve out-of-sample forecasting accuracy for Hong Kong tourist arrivals. However, researchers cannot obtain the specific combination of keywords included in the Google index, because it is not transparent to users (Choi and Varian 2012).
Therefore, selecting the most relevant data that reflect both the coverage and accuracy of search keywords is important for tourism forecasting with search query data. Existing studies have not yet considered an optimal search query data selection method to obtain an improved subset; instead, they have extracted one or several representative indices as explanatory variables in forecasting models. For instance, X. Li et al. (2017) extracted a composite index using the generalized dynamic factor model. Xie et al. (2020) selected principal components using a kernel principal components analysis to reflect information contained in search query data. X. Yang et al. (2015) aggregated search query data into an index using shift and summation. An extracted representative index can reflect useful information at an aggregated level, but it is still difficult to determine which search keywords are most relevant for improving tourism forecasting.
Machine Learning–Based Feature Selection
Feature selection is an essential procedure that removes irrelevant information to improve algorithm performance (Chandrashekar and Sahin 2014). When considering collected search query data as a feature set, researchers can obtain a subset of data that can achieve higher performance (Domingos 2012; Guyon et al. 2002). One advantage of feature selection is avoidance of overfitting and enhanced prediction performance (Guyon and Elisseeff 2003). In this study, a machine learning–based feature selection method is adopted to remove redundant and extraneous information from search query data and to retain the most relevant subset of keywords that helps achieve a considerable coverage and an improved forecasting accuracy.
Machine learning–based feature selection can be applied to supervised learning in a forecasting context (Cai et al. 2018; Kodratoff and Michalski 2014). The criterion for keeping or removing a feature depends on whether it can improve forecasting performance based on different machine learning algorithms (Cui et al. 2017; Kursa and Rudnicki 2010). Each feature selection method has its own advantages and disadvantages, and we focus here on four classical methods: filter-based feature selection, recursive feature selection, genetic algorithm feature selection, and random forest feature selection. Our review provides a brief introduction to each method.
The filter-based feature selection method is a relatively fast algorithm that is particularly useful for overcoming overfitting (Yu and Liu 2004). Features are often ranked using scores based on specific criteria, such as the correlation coefficient, chi-squared test, and information gain (Saeys, Inza, and Larrañaga 2007). Features with low scores, as calculated by an algorithm, are removed (Chandrashekar and Sahin 2014).
Recursive feature selection selects a subset of features by recursively removing features to achieve the required maximum performance and minimum number of features (Yan and Zhang 2015). First, the importance of each feature is obtained after training an initial set of features. Second, the least important features are eliminated from the set. This selection procedure is recursively repeated until the desired subset of features is assembled (Guyon et al. 2002).
Genetic algorithm feature selection obtains the most appropriate subset of data based on a genetic algorithm that eliminates and maintains features (Saeys, Inza, and Larrañaga 2007). The key elements of a genetic algorithm are selection, crossover, and mutation. This approach first selects individual features with high fitness values from a current generation. Crossover recombines the chromosomes of two parents to generate new individuals in the next generation. Mutation is the process of changing genes randomly selected in the current chromosome (C. L. Huang and Wang 2006). The method is computationally feasible for generating suitable results (Chandrashekar and Sahin 2014).
Random forest feature selection is a popular and efficient machine learning algorithm constructed based on regression trees. It combines many binary decision trees, built using several bootstrap samples, from a learning sample and selects a subset of explanatory variables randomly at each node (Genuer, Poggi, and Tuleau-Malot 2010; Sylvester et al. 2018). The random forest method then determines the optimal subset of features based on model aggregation of classification and regression (Breiman 2001). This approach has been adopted for variable selection in several fields, such as operations and supply chain management (Cui et al. 2017).
Research Gap
In summary, prior studies have indicated that Internet search query data can enhance tourism demand forecasting performance. However, several research gaps related to the use of search query data in tourism forecasting should be addressed. Scholars have tended to either choose keywords from search engines based on intuition and prior knowledge or use the Google query index without a definitive selection process; no prior studies have explored which keyword combinations comprehensively reflect tourism demand. One open question is whether including more search keywords necessarily improves the forecasting accuracy of tourism demand. How to balance search keyword coverage against forecasting accuracy also remains unresolved. In other words, a rigorous procedure for selecting search query data has not yet been proposed.
Specifically, machine learning–based feature selection has not yet been applied to the selection of search query data for tourism forecasting despite being an effective method for analyzing large volumes of data. Feature selection methods can choose the most appropriate combinations of search query data to obtain better forecasting performance with higher efficiency. Feature selection has been widely incorporated to process large volumes of data in fields such as bioinformatics to realize faster and more effective models (Cai et al. 2018). In tourism, however, the application of feature selection to improve forecasting performance using search query data remains limited.
Methodology
The four machine learning–based feature selection methods used in this study include filter-based feature selection, recursive feature selection, genetic algorithm feature selection, and random forest feature selection.
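Filter-based feature selection is used to select the most relevant search query data based on criteria such as the correlation coefficient [R(i)] and information gain [IG(Xi, Y)], which in their standard forms are

$$R(i) = \frac{\operatorname{cov}(X_i, Y)}{\sqrt{\operatorname{var}(X_i)\,\operatorname{var}(Y)}}, \qquad IG(X_i, Y) = H(Y) - H(Y \mid X_i),$$

where $X_i$ is the $i$th search query series, $Y$ is the target variable, $\operatorname{cov}(\cdot)$ and $\operatorname{var}(\cdot)$ denote covariance and variance, $H(Y)$ is the entropy of $Y$, and $H(Y \mid X_i)$ is the conditional entropy of $Y$ given $X_i$.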
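As an illustration only, correlation-based filtering can be sketched in a few lines of Python. The study itself reports using R, so this is a stand-in rather than the authors' implementation; the keyword names, the toy series, and the helper names `pearson` and `filter_select` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def filter_select(features, target, top_k):
    """Score each feature by |R(i)| and keep the top_k highest-scoring names."""
    scores = {name: abs(pearson(series, target))
              for name, series in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores

# Hypothetical toy data: arrivals and three search query series.
arrivals = [10, 12, 15, 14, 18, 21, 20, 24]
queries = {
    "beijing tourism": [5, 6, 8, 7, 9, 11, 10, 12],
    "beijing weather": [3, 9, 2, 8, 4, 7, 1, 6],
    "beijing hotel": [20, 19, 17, 18, 15, 13, 14, 11],
}
selected, scores = filter_select(queries, arrivals, top_k=2)
print(selected)
```

Ranking by the absolute value of the coefficient keeps strongly negative relationships as well as positive ones; whether that is desirable depends on how the selected series are later aggregated.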
Recursive feature selection is an iterative procedure used to eliminate irrelevant features and retain the most appropriate ones. Each feature is ranked based on its calculated importance. At each iteration, the importance of features is measured and the top-ranked features are retained. The recursive feature selection algorithm is described in Table A1 of the appendix.
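The iterative elimination described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm in Table A1: feature importance is taken to be the absolute standardized coefficient from an ordinary least squares fit, the data are synthetic, and all function names are hypothetical.

```python
import random

def standardize(col):
    """Rescale a sequence to zero mean and unit variance."""
    n = len(col)
    mean = sum(col) / n
    sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5 or 1.0
    return [(v - mean) / sd for v in col]

def ols_coefficients(columns, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(columns)
    # Augmented matrix [X'X | X'y].
    a = [[sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
         + [sum(p * q for p, q in zip(columns[i], y))] for i in range(k)]
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[pivot] = a[pivot], a[i]
        for r in range(k):
            if r != i:
                factor = a[r][i] / a[i][i]
                a[r] = [u - factor * v for u, v in zip(a[r], a[i])]
    return [a[i][k] / a[i][i] for i in range(k)]

def recursive_feature_elimination(features, y, n_keep):
    """Refit, drop the least important feature, and repeat until n_keep remain."""
    names = list(features)
    y_std = standardize(y)
    while len(names) > n_keep:
        cols = [standardize(features[name]) for name in names]
        coefs = ols_coefficients(cols, y_std)
        weakest = min(range(len(names)), key=lambda i: abs(coefs[i]))
        names.pop(weakest)
    return names

# Synthetic data (assumption): only x0 and x1 drive the target.
rng = random.Random(42)
n_obs = 60
features = {f"x{i}": [rng.gauss(0, 1) for _ in range(n_obs)] for i in range(5)}
y = [3 * a - 2 * b + rng.gauss(0, 0.1)
     for a, b in zip(features["x0"], features["x1"])]
kept = recursive_feature_elimination(features, y, n_keep=2)
print(kept)
```

Refitting after every removal is what distinguishes this procedure from a one-pass ranking: the importance of the remaining features is re-estimated once a competitor has been dropped.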
Genetic algorithm feature selection is a general adaptive optimization search method (Chandrashekar and Sahin 2014). To select the best search data subset, we first need to set the maximum generations, population per generation, crossover, and mutation probability. Subsets of search data can then be generated through crossover and mutation, and algorithm performance is evaluated using the fitness function (Xue, Yao, and Wu 2018). The most appropriate subset is obtained after K-fold cross-validation. For details about the genetic algorithm, see Das, Das, and Ghosh (2017).
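A minimal sketch of the selection, crossover, and mutation loop is given below. It is illustrative only: the fitness function (absolute correlation between the target and the average of the selected series) is a hypothetical stand-in for the cross-validated model performance used in the study, and the parameter defaults, data, and function names are assumptions.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def index_fitness(mask, columns, y):
    """Hypothetical fitness: |corr| between y and the mean of selected columns."""
    chosen = [col for bit, col in zip(mask, columns) if bit]
    if not chosen:
        return 0.0
    index = [sum(vals) / len(chosen) for vals in zip(*chosen)]
    return abs(pearson(index, y))

def genetic_select(columns, y, pop_size=20, generations=30,
                   crossover_prob=0.8, mutation_prob=0.05, seed=0):
    rng = random.Random(seed)
    n = len(columns)

    def score(mask):
        return index_fitness(mask, columns, y)

    # Initial population: the all-features mask plus random masks.
    population = [[1] * n] + [[int(rng.random() < 0.5) for _ in range(n)]
                              for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        next_gen = [ranked[0][:]]            # elitism: carry over the best mask
        parents = ranked[: pop_size // 2]    # truncation selection
        while len(next_gen) < pop_size:
            a = parents[int(rng.random() * len(parents))]
            b = parents[int(rng.random() * len(parents))]
            child = a[:]
            if rng.random() < crossover_prob:            # one-point crossover
                cut = 1 + int(rng.random() * (n - 1))
                child = a[:cut] + b[cut:]
            # Bit-flip mutation on each gene.
            child = [1 - g if rng.random() < mutation_prob else g for g in child]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=score)
    return best, score(best)

# Synthetic data (assumption): two informative series and four noise series.
data_rng = random.Random(1)
y = [data_rng.gauss(0, 1) for _ in range(80)]
columns = [[v + data_rng.gauss(0, 0.3) for v in y] for _ in range(2)]
columns += [[data_rng.gauss(0, 1) for _ in y] for _ in range(4)]
best_mask, best_score = genetic_select(columns, y)
print(best_mask, round(best_score, 3))
```

Because the best mask is carried over unchanged each generation and the initial population contains the all-features mask, the returned subset can never score worse than using every series.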
Random forest feature selection is an efficient machine learning method that applies bootstrap samples to combine decision trees and randomly selects a subset of variables at each node (Cui et al. 2017). The subset is selected based on its importance, which is computed using the error in the out-of-bag (OOB) sample. The OOB sample refers to the set of observations not used to build the current trees. The random forest selection procedure of Guyon et al. (2002) is described in Table A2 of the appendix.
Accordingly, a forecasting framework is proposed in this article to answer the research questions by performing four major steps: (1) search keywords selection, (2) machine learning–based feature selection, (3) econometric modeling, and (4) forecasting evaluation. Figure 1 depicts the forecasting framework.

Proposed forecasting framework.
1. In the first step, we collected search query volume data from search engines such as Google and Baidu. We proposed extracting tourism demand–related keywords to reflect tourists’ attention to various activities including dining, lodging, traffic, recreation, shopping, and tourism. These keywords reflected tourists’ decisions about various aspects of a trip during their travel planning process (X. Yang et al. 2015). The selected keywords should be comprehensive and reflect various dimensions of tourism demand related to tourist activities. Tourists who express their interest in a certain travel destination will likely search for “special attractions” or “special food” through a search engine. They also search for keywords such as “travel guides” or “travel plans” to obtain more information about potential destinations. It should be noted that not all keywords associated with tourism concepts can be included; search engines do not return keyword data series containing few results (S. Sun et al. 2019). For example, data are available for the keyword phrase “Beijing tourism,” but a query for “How to travel in Beijing” returns no results. For the purposes of this article, a group of search query data was generated based on selected keywords.
2. In the second step, the machine learning–based feature selection methods were used to extract subsets of search query data from the data set collected above. After the feature selection procedure, the dimension of the search query data is reduced, while forecasting accuracy can improve because irrelevant data are eliminated. Since we introduced different machine learning–based methods, we can evaluate which method deals most effectively with search query data. It should be noted that the particular focus is automatic selection from the perspective of machine learning techniques. Although the methods use different criteria to retain search query data, the fundamental idea is to estimate whether the addition of data improves or reduces forecasting accuracy. Therefore, the “black-box” nature of machine learning methods simplifies the data selection step, with no further manual examination needed. Results of the above four feature selection methods can be obtained through R software using the “caret” and “Boruta” packages (Kursa and Rudnicki 2010).
3. In the third step, econometric models were constructed to incorporate the search query data and to predict tourism demand. We focus on the improvement of forecasting accuracy with feature selection compared to that without selection. A benchmark ARMAX model was obtained by adding all search query data on the basis of the ARMA model, without any feature selection method. To evaluate whether the machine learning–based feature selection method improves the tourism forecasting performance, four econometric models incorporating the subsets of search query data extracted from the aforementioned four feature selection methods were then constructed, and their respective forecasting performances were compared with the benchmark ARMAX model above.
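The ARMAX model is described as

$$y_t = c + \sum_{i=1}^{p} \phi_i y_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j} + \sum_{k=0}^{r} \beta_k x_{t-k} + \varepsilon_t,$$

where $y_t$ is tourism demand at time $t$; $c$ is a constant; $\phi_i$ and $\theta_j$ are the autoregressive and moving-average coefficients with lag orders $p$ and $q$; $x_{t-k}$ is the explanatory variable built from all search query data at lag $k$, with coefficient $\beta_k$; and $\varepsilon_t$ is the error term.
The econometric models with the selected search query data obtained from machine learning–based feature selection can be written as

$$y_t = c + \sum_{i=1}^{p} \phi_i y_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j} + \sum_{k=0}^{r} \beta_k x^{(m)}_{t-k} + \varepsilon_t, \qquad m = 1, \dots, 4,$$

where the only difference between this equation and the above ARMAX model is that $x^{(m)}_{t}$ is built from the subset of search query data selected by the $m$th feature selection method rather than from all search query data.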
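4. In the last step, to evaluate whether our proposed forecasting models based on the four machine learning methods outperformed the benchmark ARMAX model, we compared the dynamic forecasting results of the different models using the root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and improvement ratio (IR) based on each measure. In their standard forms, the evaluation measures are

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(y_t - \hat{y}_t\right)^2}, \qquad \mathrm{MAE} = \frac{1}{N}\sum_{t=1}^{N}\left|y_t - \hat{y}_t\right|, \qquad \mathrm{MAPE} = \frac{100\%}{N}\sum_{t=1}^{N}\left|\frac{y_t - \hat{y}_t}{y_t}\right|,$$

$$\mathrm{IR} = \frac{E_{b} - E_{m}}{E_{b}} \times 100\%,$$

where $y_t$ and $\hat{y}_t$ are the actual and forecast values at time $t$, $N$ is the number of forecast observations, and $E_b$ and $E_m$ are the values of a given error measure (RMSE, MAE, or MAPE) for the comparison model and the evaluated model, respectively.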
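For concreteness, the four measures can be computed as in the following sketch, which assumes the standard definitions of RMSE, MAE, and MAPE and defines IR as the percentage reduction in error relative to a comparison model. The numbers are hypothetical and serve only to show the arithmetic.

```python
import math

def rmse(actual, forecast):
    """Root mean square error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

def improvement_ratio(benchmark_error, model_error):
    """Percentage reduction in error relative to the comparison model."""
    return 100.0 * (benchmark_error - model_error) / benchmark_error

# Hypothetical values for illustration only.
actual = [100, 200, 300, 400]
forecast = [110, 190, 330, 380]
print(rmse(actual, forecast))          # sqrt(375)
print(mae(actual, forecast))           # 17.5
print(mape(actual, forecast))          # 7.5
print(improvement_ratio(20.0, 15.0))   # 25.0
```

A positive IR means the evaluated model has a smaller error than the comparison model; a negative IR means it performs worse.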
Empirical Results
We conducted an empirical study of forecasting monthly domestic tourist arrivals to Beijing, China, to examine the performance of the proposed methodology in selecting the most appropriate search query data and forecasting tourism demand.
First, Internet search query data from Baidu were collected for various aspects of tourism activities such as dining, lodging, traffic, recreation, shopping, and relevant tourist attractions. Second, four feature selection methods were applied to select the most appropriate subset of search query data. Third, we constructed forecasting models using the different sets of selected search data as explanatory variables and compared their forecasting accuracy with that of the benchmark model. Furthermore, we conducted an empirical experiment on the weekly forecasting of hotel occupancy in Charleston, SC, as a robustness check of our methodology.
Data Description
We selected Beijing as the target destination. As the capital of China, Beijing has developed into a famous tourist city, and the accurate forecasting of its tourist arrivals has attracted increasing attention in studies such as X. Li et al. (2017) and S. Sun et al. (2019).
Two kinds of data, monthly Beijing tourist arrivals and search query data, were collected from January 2011 to August 2019, yielding 104 data points. Baidu search data were used since Baidu has the largest market share among search engines in China (X. Yang et al. 2015). Our collected data sets reflect the most recent trends in both tourist arrivals data and search query data. The tourist arrivals data were obtained from the Wind database (http://www.wind.com.cn/), and the search query data were collected from Baidu; this search engine’s data apparently perform better than Google’s when forecasting domestic tourist arrivals in China (X. Yang et al. 2015).
The average number of tourist arrivals in Beijing (in units of 10,000) was 2,379, with a standard deviation of 746.32. Figure 2 shows the actual tourist arrival data for Beijing, which exhibit distinct cyclical characteristics.

Monthly tourist arrivals in Beijing (2011.1-2019.8).
Baidu search query data were collected from Baidu’s search engine following the keyword selection procedure shown in Figure 1. Numerous keywords related to tourism activities were considered, covering tourists’ dining, lodging, traffic, recreation, shopping, choices of attraction, and other relevant activities. Several search query series were not collected because of their low search volume on Baidu. Furthermore, keywords for famous attractions such as “The Great Wall” and “The Palace Museum” were included to reflect tourists’ attention to traveling in Beijing. In total, we obtained 59 search query data series. Figure 3 shows the search keywords grouped by category.

Baidu search keywords for Beijing tourism.
Selection of Search Query Data
Four feature selection methods including filter-based feature selection, recursive feature selection, genetic algorithm feature selection, and random forest feature selection were used to select the appropriate subsets of search query data. All search query data were ranked, and less important data were eliminated.
We obtained four subsets of search query data, one for each method, according to the specific algorithms introduced in the Methodology section using R software. A common criterion when selecting subsets is that a feature is eliminated if it fails to improve forecasting accuracy. The number of selected search query series was not necessarily the same across methods because the algorithms differ. Therefore, for the convenience of modeling and evaluation, only the top five search query data series were selected to represent the information contained in the original data set for each feature selection method. Table 2 lists the search query data selected by the four machine learning–based methods.
Selected Beijing Search Data with Four Feature Selection Methods.
After obtaining the four groups of search query data, we computed a composite index for each group using the shift and summation process of S. Sun et al. (2019) to represent the linear combination of the selected search query data. Further analysis and modeling were conducted on the basis of the constructed indexes. Here, Index0 represents the index computed from all search query data without feature selection, and Index1, Index2, Index3, and Index4 represent the indexes obtained from the four feature selection methods. Figure 4 shows the tourist arrivals data and the five indexes. All data were standardized for ease of comparison in one figure.

Tourist arrivals data and the five indexes.
As shown in Figure 4, the pattern of Index0 differed considerably from that of the tourist arrivals data, while the other four indexes tracked the tourist arrivals data closely. To further analyze the relationships between the selected indexes and tourist arrivals data, we conducted Pearson correlation analysis, Granger causality tests, and cointegration tests.
The Pearson correlation coefficients and relevant statistics are provided in Table 3. The correlation coefficient between the number of tourist arrivals and the index constructed from all search query data (Index0) without feature selection was the smallest among the five indexes. The correlation coefficient between Index4 and the tourist arrivals data was 0.76. In particular, the indexes constructed from the four feature selection methods have stronger correlations with the tourist arrivals data. The result suggests that all four feature selection methods select more relevant search query data than the approach without feature selection.
Correlation Analysis between Tourist Arrivals and Five Indexes.
Significance level at 1%.
Estimation Results of Forecasting Models
To investigate whether and which machine learning–based feature selection method could best improve the forecasting accuracy of tourism demand, the following forecasting models were constructed based on different subsets of search query data. First, a classical ARMA model without search query data or machine learning methods was constructed for tourism forecasting (X. Li et al. 2017; Pan, Wu, and Song 2012). We chose the ARMA model because we compared the performances among AR, ARMA, and ARIMA models and found that an ARMA model can achieve the highest accuracy. Therefore, other models incorporating search query data were built on the ARMA model.
Second, a benchmark ARMAX model incorporating a composite index of all search query data was constructed to explore whether machine learning–based feature selection could outperform the model without feature selection. Third, we proposed four models using selected search data sets based on the following feature selection methods: filter-based feature selection, recursive feature selection, genetic algorithm feature selection, and random forest feature selection, which were noted as models 1–4. We conducted a logarithmic transformation of tourist arrival data to reduce the impact of outliers. The lag orders of the auto-regression term and the moving-averaging term were specified on the basis of the AIC.
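To illustrate AIC-based lag selection, the sketch below chooses the order of a pure AR model by minimizing AIC, with coefficients estimated by the Levinson-Durbin recursion on sample autocovariances. This is a simplification under stated assumptions: the study selected both AR and MA orders, whereas the sketch handles the AR order only, uses AIC up to an additive constant, and runs on a simulated series standing in for log-transformed arrivals. All names are hypothetical.

```python
import math
import random

def autocovariances(x, max_lag):
    """Biased sample autocovariances r[0..max_lag] of a series."""
    n = len(x)
    mean = sum(x) / n
    return [sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n)) / n
            for k in range(max_lag + 1)]

def ar_order_by_aic(x, max_order):
    """Return (order, coefficients) of the AR model with the smallest AIC."""
    n = len(x)
    r = autocovariances(x, max_order)
    phi, var = [], r[0]
    best_aic, best_order, best_phi = n * math.log(var), 0, []
    for k in range(1, max_order + 1):
        # Levinson-Durbin step: reflection coefficient, then coefficient update.
        kappa = (r[k] - sum(phi[j] * r[k - 1 - j] for j in range(k - 1))) / var
        phi = [phi[j] - kappa * phi[k - 2 - j] for j in range(k - 1)] + [kappa]
        var *= 1 - kappa ** 2            # innovation variance at order k
        aic = n * math.log(var) + 2 * k  # AIC up to an additive constant
        if aic < best_aic:
            best_aic, best_order, best_phi = aic, k, phi[:]
    return best_order, best_phi

# Simulated stand-in for a log-transformed series: AR(1) with coefficient 0.7.
rng = random.Random(7)
series, prev = [], 0.0
for _ in range(500):
    prev = 0.7 * prev + rng.gauss(0, 1)
    series.append(7.5 + prev)

order, coefs = ar_order_by_aic(series, max_order=6)
print(order, [round(c, 3) for c in coefs])
```

The penalty term `2 * k` is what stops the criterion from always preferring the longest lag structure, since the estimated innovation variance can only shrink as lags are added.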
Tables 4–6 show the estimation results of the six constructed models. The models differ in the explanatory variables they incorporate. For the ARMA model, only the AR and MA terms were used to predict the tourist arrivals data. The ARMAX model incorporated Index0, which contains the information from all search query data, as the explanatory variable. In contrast, models 1–4 included the variables Index1, Index2, Index3, and Index4, respectively, to reflect the information extracted by the different feature selection methods.
Estimation Results of ARMA and ARMAX Models.
Note: *, **, and *** indicate significance at the 10%, 5%, and 1% level, respectively.
Estimation Results of Models 1 and 2.
Note: ** and *** indicate significance at the 5% and 1% level, respectively.
Estimation Results of Models 3 and 4.
Note: *, **, and *** indicate significance at the 10%, 5%, and 1% level, respectively.
As indicated by the above estimation results for the six models, the constructed indexes are significant at the 5% or 1% level. The lag orders of the explanatory variables were selected according to the AIC. In addition, models 1–4 improve the adjusted R-squared compared to the ARMA and ARMAX models. Overall, the search query indexes built with feature selection methods significantly predict changes in the tourist arrival data.
Comparisons of Forecasting Performance
To provide a robust and reliable evaluation of forecasting performance, Table 7 reports one-, two-, and four-step-ahead forecasting results for Beijing tourism demand in terms of RMSE, MAE, MAPE, and IR. We focus in particular on the improvement in forecasting accuracy obtained by incorporating the search query data. First, the values of RMSE, MAE, and MAPE showed that the ARMA model without search query data or feature selection, as well as the ARMAX model without feature selection, exhibited relatively larger forecasting errors than models 1–4. In terms of RMSE, models 1–4 with the four feature selection methods reduced forecasting errors relative to the ARMA model by 19.78%, 26.74%, 22.56%, and 21.87%, respectively, in one-step-ahead forecasting. Similar findings hold for the two- and four-step-ahead forecasts measured by MAE and MAPE when compared to the ARMA model.
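The evaluation criteria named above can be sketched as follows. The improvement ratio is assumed here to be the percentage reduction in an error measure relative to a benchmark model, which is consistent with how the reductions are reported in the text; the function names and the toy numbers are hypothetical.

```python
import math

def rmse(actual, forecast):
    """Root mean squared error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mape(actual, forecast):
    """Mean absolute percentage error, in percent; actual values must be nonzero."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

def improvement_ratio(benchmark_error, model_error):
    """Percentage reduction in error relative to the benchmark (assumed IR definition)."""
    return 100.0 * (benchmark_error - model_error) / benchmark_error

# Toy out-of-sample values (illustrative only).
actual = [10.0, 12.0, 11.0, 13.0]
benchmark = [11.0, 11.0, 12.0, 12.0]   # every forecast is off by 1.0
candidate = [10.5, 11.5, 11.5, 12.5]   # every forecast is off by 0.5

ir = improvement_ratio(rmse(actual, benchmark), rmse(actual, candidate))
print(f"RMSE benchmark = {rmse(actual, benchmark):.4f}")
print(f"RMSE candidate = {rmse(actual, candidate):.4f}")
print(f"IR = {ir:.2f}%")
```

A positive IR means the candidate model has a smaller error than the benchmark; a negative IR means the benchmark performs better.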
Evaluation of Beijing Tourism Demand Forecasting.
Second, when compared to the ARMAX model, the best forecasting model varied across forecasting horizons. For one- and two-step-ahead forecasting, models 1–4 always outperformed the ARMAX model, as indicated by lower forecasting errors in terms of RMSE, MAE, and MAPE. The average IRs (measured by RMSE) of the four models using different feature selection methods were 16.04% and 30.33% for one- and two-step-ahead forecasting, respectively. However, for four-step-ahead forecasting measured by RMSE, models 1, 3, and 4 still outperformed the ARMAX model, whereas the ARMAX model performed better than model 2, suggesting that feature selection methods do not necessarily improve forecasting accuracy over longer horizons. This result further indicates that the proposed method is particularly effective for short-term forecasting with search query data (Park, Lee, and Song 2017).
In general, the out-of-sample forecasting results indicated that models with feature selection methods significantly improved the forecasting accuracy of tourism demand in Beijing. To further answer the research question about which feature selection method can improve forecasting performance to the greatest extent, we computed forecasting errors of RMSE, MAE, MAPE, and IR for each feature selection method.
Figure 5 depicts detailed forecasting errors from the four feature selection methods based on dynamic one-step-ahead out-of-sample forecasting. The overall performance of the methods was not significantly different; the MAPEs of the four models were 0.6295, 0.5410, 0.5939, and 0.5972, respectively. The RMSEs of the four models ranged from 0.0551 to 0.0604, with a mean of 0.0582 and a standard deviation of 0.0022. MAEs ranged from 0.0415 to 0.0484, with a mean of 0.0453 and a standard deviation of 0.0028. MAPEs ranged from 0.5410 to 0.6295, with a mean of 0.5904 and a standard deviation of 0.0366. Furthermore, the improvement ratios of the four models ranged from 12.83% to 20.39%. In summary, these findings suggest that the proposed feature selection methods can significantly improve forecasting accuracy, but we did not observe a significant difference in forecasting performance among these methods.
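The summary statistics for the MAPEs can be reproduced from the four values quoted above. The short check below uses the sample standard deviation (denominator n - 1); that this is the variant behind the reported figure is an inference from the numbers, not something stated in the text.

```python
import statistics

# MAPEs of models 1-4 as quoted in the text.
mapes = [0.6295, 0.5410, 0.5939, 0.5972]

mean_mape = statistics.mean(mapes)
sd_mape = statistics.stdev(mapes)  # sample standard deviation (n - 1 denominator)

print(f"mean = {mean_mape:.4f}, sd = {sd_mape:.4f}")
```

With only four models, the standard deviation is a rough descriptive measure of spread rather than a formal test of differences among the methods.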

RMSE, MAE, MAPE, and IR among models 1-4.
Robustness with Respect to Hotel Forecasting
The above empirical results have demonstrated the effectiveness of machine learning–based feature selection in tourism demand forecasting. We also examined whether machine learning–based feature selection methods could improve forecasting in a specific dimension of tourism demand: hotel occupancy. We chose Charleston, SC, as the target destination because its hotel occupancy data are accessible; the data were gathered weekly from January 2006 to February 2014, yielding 426 data points (Pan and Yang 2017). Here, we briefly describe the data, the feature selection results, the estimation results of the forecasting models, and the forecasting evaluation.
Consistent with Pan and Yang (2017), search query data were gathered from Google, including 45 search keywords related to hotel occupancy in Charleston. Figure 6 illustrates weekly hotel occupancy data in the city; the average hotel occupancy rate was 0.7028. These data demonstrated significant cyclical characteristics.

Weekly hotel occupancy in Charleston, SC.
Following the proposed methodology, we selected weekly Google search data using the four feature selection methods and incorporated these data into forecasting models: one benchmark ARMAX model and four machine learning–based models. The forecasting models were constructed for weekly forecasting of hotel occupancy in Charleston. To be consistent with Pan and Yang (2017), we constructed the benchmark ARMAX model using the search query data for the keyword “hotel Charleston” as the explanatory variable. For Hotel-Models 1–4, we incorporated the aggregated index produced by each feature selection method as the explanatory variable, consistent with our practice in the Beijing forecasting study above.
Table 8 displays the estimation results of five econometric models on forecasting weekly hotel occupancy. The dependent variable was the weekly hotel occupancy rate in Charleston. The lag orders of autoregressive and moving average terms were decided based on AIC. The explanatory variables including AR(1), MA(4), and the selected search query data are significant at the 1% significance level. All four forecasting models with feature selection methods showed an increase in the adjusted R-squared.
Estimation Results for Weekly Charleston Occupancy Forecasting.
Significance at the 1% level.
Table 9 shows the one-, two-, and four-step-ahead hotel occupancy forecasting evaluation in terms of RMSE, MAE, MAPE, and IR values. Forecasting models with feature selection methods exhibited lower values of RMSE, MAE, and MAPE than the benchmark model. Compared to the benchmark ARMAX model, the four forecasting models reduced the RMSEs by 18.4%, 16.41%, 17.04%, and 20.13%, respectively, in one-step-ahead forecasting; the average improvement in forecasting accuracy was 18.00%. These results suggest that machine learning methods can select useful subsets of search query data to significantly improve hotel occupancy forecasting. Figure 7 depicts the improvement in forecasting accuracy of the four weekly forecasting models compared with the benchmark ARMAX model. These findings are in line with those of the monthly tourism forecasting study: all four proposed feature selection methods improved forecasting accuracy, with no significant differences among them.
Evaluation of Hotel Occupancy Forecasting Models.

IR based on RMSE among four weekly hotel forecasting models.
Conclusions
Accurate and timely forecasting presents a crucial challenge for industries and academia in tourism and hospitality (Song, Qiu, and Park 2019). This study investigated whether machine learning–based methods can select the most useful combinations of search keywords to improve tourism forecasting accuracy. We applied four machine learning–based feature selection methods (i.e., filter-based feature selection, recursive feature selection, genetic algorithm feature selection, and random forest feature selection). The forecasting performance of the proposed methods was evaluated on tourism demand in Beijing as well as hotel occupancy rates in Charleston, SC.
Our findings indicate that a useful subset of search query data can be obtained via machine learning–based methods, which can in turn significantly improve forecasting accuracy. The empirical studies based on forecasting monthly tourist arrivals and weekly hotel occupancy indicate superior performance of the proposed feature selection methods. The results of our forecasting evaluation did not reveal a significant difference among the four feature selection methods in terms of reduced RMSE, MAE, and MAPE. We found that these methods could obtain optimal subsets of search query data to improve forecasting accuracy.
Our study contributes to the relevant literature by implementing feature selection to balance the coverage of search query data and forecasting accuracy. To the best of our knowledge, this study is the first to apply different feature selection methods to extract useful information from Internet search query data. Selected subsets of search query data from all feature selection methods outperformed the ARMAX model without feature selection. These findings further confirm that the selected keywords related to dining, lodging, traffic, recreation, shopping, tourism, and attractions are effective in tourism forecasting. Furthermore, the value of search query data increases with the adoption of machine learning–based feature selection methods. In particular, one significant advantage of the proposed feature selection methods is the automatic selection of useful search query data for effectively predicting tourism demand. The overfitting problems that are usually caused by a large data set can be avoided, which contributes to the improvement of forecasting accuracy. Such a “bottom-up” feature selection strategy for search query data selects keywords automatically and depends less on intuition or prior knowledge.
This study has several limitations. First, although we explored the performance of four feature selection methods in selecting search query data, other methods (e.g., least absolute shrinkage and selection operator and principal components analysis) could be used to select and consolidate large volumes of search query data based on a linear regression framework (Song and Liu 2017; Song, Qiu, and Park 2019). Therefore, future studies could compare these approaches with our proposed machine learning–based methods in selecting search query data. Second, because of the data availability issue, the forecasting of tourist arrivals from different source markets was not conducted in this study, which could be addressed in future studies to provide more support for destination management. Furthermore, this study only considered search query data for tourism and hospitality forecasting. Other big data, such as from social media, should be considered to improve forecasting accuracy. Therefore, subsequent research could extend feature selection methods by integrating more user-generated big data from multiple Internet platforms to improve tourism and hospitality forecasting accuracy.
Appendix
Random forest selection procedure.
| 1. Compute the OOB error_0 for each tree in the random forest; |
| 2. Randomly select one feature and add noise to it, then compute the OOB error_1; |
| 3. Compute the importance of the selected feature from the difference between the two errors, error_1 - error_0. If adding the noise substantially increases the OOB error (i.e., OOB accuracy drops sharply), the feature is important for the overall performance; |
| 4. Rank the features based on the computed importance; |
| 5. Repeat the above steps until a subset of features is selected. |
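
Steps 1–4 of the procedure above can be sketched as follows. This is a simplified illustration rather than the study's implementation: a fixed prediction function stands in for a fitted random forest, a single evaluation set stands in for the out-of-bag samples, and noise is added by shuffling the values of one feature. All names and data are hypothetical.

```python
import random

def mean_squared_error(model, rows, targets):
    """Average squared prediction error of `model` on the given rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, feature, rng):
    """Error increase after shuffling one feature column (error_1 - error_0)."""
    error_0 = mean_squared_error(model, rows, targets)
    shuffled = [r[feature] for r in rows]
    rng.shuffle(shuffled)  # "add noise" by permuting the feature's values
    noisy_rows = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, shuffled)]
    error_1 = mean_squared_error(model, noisy_rows, targets)
    return error_1 - error_0

# Stand-in for a fitted forest (assumption): predictions depend only on feature 0.
def model(row):
    return 2.0 * row[0]

rng = random.Random(0)
rows = [[rng.random(), rng.random()] for _ in range(200)]
targets = [2.0 * r[0] for r in rows]

# Step 3: importance of each feature; step 4: rank features by importance.
importances = {f: permutation_importance(model, rows, targets, f, rng) for f in range(2)}
ranking = sorted(importances, key=importances.get, reverse=True)
print(ranking, importances)
```

Feature 1 is ignored by the stand-in model, so shuffling it leaves the error unchanged and its importance is zero, whereas shuffling feature 0 raises the error and places it first in the ranking.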
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclose receipt of the following financial support for the research, authorship, and/or publication of this article: Research funds from the National Natural Science Foundation of China (No. 71601021), Fundamental Research Funds for the Central Universities (No. FRF-TP-19-067A1), and Hospitality and Tourism Research Centre (HTRC Grant) of the School of Hotel and Tourism Management, The Hong Kong Polytechnic University (Project Account Code: 5-ZJLT).
