Abstract
Search engine data are of considerable interest to researchers for their utility in predicting human behaviour. Recently, search engine data have also been used to predict tourism demand (TD). Models developed based on such data generate more accurate forecasts of TD than pure time-series models. The aim of this article is to examine whether combining causal variables with search engine data can further improve the forecasting performance of search engine data models. Based on an artificial neural network framework, 168 observations during 2005–2018 for short-haul travel from Hong Kong to Macau are involved in the test, and the empirical results suggest that search engine data models with causal variables outperform models without causal variables and other benchmark models.
Introduction
Due to the impossibility of stockpiling unused hotel rooms and unoccupied airline seats (Chu, 2004; Law, 2000), accurate tourism demand (TD) forecasts can provide practitioners and policymakers with useful information for formulating effective tourism marketing and development strategies/policies (Dergiades et al., 2018). Accurately forecasting the demand for tourism services is a difficult task for practitioners and academics (Goh and Law, 2011). Reviewing the current TD forecasting literature, Wu et al. (2017) divided the quantitative forecasting methods used by practitioners into three main categories: non-causal time-series, econometric and artificial neural network (ANN)-based methods. The data sources used for these methods are historical TD series, causal variables or a combination of both.
With the increasing use of the Internet and smartphones, search engines have become efficient channels for people to obtain information. Online search histories reflecting users’ interests are recorded by these search engines and provide a rich source of data. Google, a multinational technology company that provides the most convenient and frequently used search engine services, has published search query data since 2004. The value of such data has been widely recognised; for instance, it has been successfully used to detect influenza epidemics (Ginsberg et al., 2009). Recently, search engine data have been used, sometimes in combination with historical TD data, to predict TD (Bangwayo-Skeete and Skeete, 2015; Li et al., 2017b; Önder and Gunter, 2016; Pan et al., 2012; Rivera, 2016; Volchek et al., 2018; Yang et al., 2015). Empirical evidence shows that the use of search engine data improves tourism forecasting performance (Bangwayo-Skeete and Skeete, 2015; Pan et al., 2012). However, search query-based models do not normally include causal variables. Therefore, the question is whether introducing causal economic variables such as tourist income (TI), tourism prices (TPs) and exchange rate (ER) data into search query-based models can further improve their forecasting performance. In other words, are causal variables still useful in search query-based TD forecasting? This study addresses that question by developing a conceptual framework to interpret the roles of different data sources in TD and quantitatively evaluating the usefulness of causal variables in artificial intelligence (AI)-based and econometric models incorporating search engine data.
This study investigates monthly TD for short-haul travel from Hong Kong to Macau during 2005–2018; in total, 168 observations are involved. Macau was chosen for three reasons. First, it is well known for its gaming industry (Lu, 2011) and fine dining (Law et al., 2019), which attract short-term visitors from surrounding regions for leisure and vacation purposes (Fong, 2017). Second, as a significant proportion of visitors are short-haul travellers, they are more likely to be influenced by economic factors and online information. Third, the completion of the Hong Kong–Zhuhai–Macao Bridge has strengthened the connections between Hong Kong and Macau. As a result, Macau has become a popular short-haul destination for Chinese tourists from neighbouring regions.
This article makes three main contributions. First, a conceptual framework is proposed for combining data sources; it clarifies the roles of historical TD series, causal variables and search engine data in predicting TD. Second, an ANN model is used to combine causal variables, search engine data and historical TD, with multiple lags of each variable included in the model. Third, the role of causal variables is tested empirically, and the results confirm the usefulness of causal time series in improving the accuracy of the ANN model in forecasting TD.
Literature review
In TD forecasting, causal variables, historical series and search query data are frequently used. Causal variables and lagged dependent and explanatory variables are commonly used in traditional TD forecasting methods, such as pure time-series methods, econometric methods and AI-based methods. When search query data are involved, we name these models ‘search query-based methods’. This literature review focuses on these two kinds of approaches.
Traditional TD forecasting methods
The TD forecasting literature classifies forecasting methods into the following categories: non-causal time-series, econometric and AI-based methods (Song et al., 2019; Wu et al., 2017). Non-causal time-series methods extrapolate historical TD series to generate forecasts (Burger et al., 2001; Chang and Liao, 2010; Chu, 2009; Ramos and Rodrigues, 2014; Tsui et al., 2014; Wu et al., 2017). According to Ramos and Rodrigues (2014) and Wu et al. (2017), the most frequently used non-causal time-series models are no-change models (Naïve I), constant growth rate models (Naïve II), exponential smoothing models, autoregressive moving average (ARMA) models (such as the autoregressive integrated moving average model (ARIMA) and the seasonal ARIMA model) and structural time-series (STS) models.
Econometric models (Onafowora and Owoye, 2012; Ramos and Rodrigues, 2014; Song and Li, 2008; Song and Witt, 2012; Song et al., 2019; Wu et al., 2017) incorporate causal variables into pure time-series models. In addition to forecasting, econometric models explore the relationship between TD and causal variables. Li et al. (2005) and Song and Li (2008) point out that the main causal variables of TD include TI, TPs at the destination relative to those in the country of origin, TPs at competing destinations (substitute prices (SPs)) and ERs. Transportation costs (Dritsakis, 2004; Lim, 1999), marketing expenses (Law, 2000, 2001; Law and Au, 1999) and climate (Goh, 2012; Lise and Tol, 2002; Li et al., 2017a, 2018) are also recognised as important factors. Examples of widely used econometric methods include error correction models (ECMs) (Kulendran and Wilson, 2000; Song and Witt, 2006), the autoregressive distributed lag model (ADLM, a general form of the ECM) (Song et al., 2003b, 2003c), vector autoregressive models (Gunter and Önder, 2015), time-varying parameter (TVP) models (Song et al., 2003a) and the mixed-data sampling (MIDAS) approach (Bangwayo-Skeete and Skeete, 2015). To improve forecast accuracy, these models are often combined; examples include the ECM-LAIDS model (Mangion et al., 2005), the TVP-LR-AIDS model (Li et al., 2006), the TVP-STS model (Song et al., 2011) and the TVP-EC-AIDS model (Wu et al., 2012a).
AI-based methods (Claveria et al., 2015; Kon and Turner, 2005; Law and Au, 1999; Law, 2000; Palmer et al., 2006) aim to establish non-linear connections between TD and its lagged values or explanatory variables. ANN models (Law, 2000; Law and Au, 1999), support vector regression models (Chen and Wang, 2007), Gaussian process regression models (Wu et al., 2012b) and deep learning approaches (Law et al., 2019) are also used to predict TD. AI-based methods are particularly accurate in forecasting TD (Song and Li, 2008; Wu et al., 2017). The data sources used for these AI-based methods are historical TD series, causal variables or a combination of the two.
Search query-based TD forecasting
Due to the wide use of the Internet and smartphones, search engines have become an important platform for searching for information around the world. Google is one of the most powerful search engines, with 91.4% of the global market share of all search engines in 2018 (www.gs.statcounter.com). In addition, it has published search intensity data since 2004. Search query data have been used successfully to detect influenza epidemics (Ginsberg et al., 2009), predict abnormal stock returns and trading volumes (Joseph et al., 2011) and identify housing market trends (Wu and Brynjolfsson, 2015). In the tourism context, search engines help tourists obtain useful information on restaurants, hotels, transportation, attractions and retail stores at their planned destination. As a result, search engines’ histories reflect tourists’ preferences in terms of destinations, cuisine and accommodations.
Pan et al. (2012) demonstrated that search engine data can be used to accurately predict demand for hotel rooms. Yang et al. (2015) used search query volume to predict the number of visitors to Hainan Province and compare the predictive power of forecasting models based on two search engines, Google and Baidu. The results show that both types of search engine data significantly improve forecast accuracy and that Baidu query data perform better due to its larger market share in China. Similarly, Bangwayo-Skeete and Skeete (2015) conducted a composite search for ‘hotels and flights’ from source countries to popular destinations in the Caribbean to test the performance of autoregressive MIDAS models using search query data. The results show that search engine data have significant benefits for forecasting TD. Önder and Gunter (2016) evaluated the predictive power of Google Trends by focusing on Vienna as a destination and using seasonal and seasonally adjusted data. The results confirm that forecast error is reduced when Google Trends data are used. Rivera (2016) treated search query volume data as a representation of an unobservable process and used a dynamic linear model to forecast TD in Puerto Rico, taking non-resident hotel registrations as a proxy variable. The results suggest that the search query-based model only outperforms its competitors when the forecast horizon is greater than 6 months. Li et al. (2017b) proposed a composite search index using the generalised dynamic factor model (GDFM) to forecast TD and compare its forecast performance with two benchmark models. The results show that the GDFM outperforms competing benchmark models. Volchek et al. (2018) used time-series, econometric and ANN models with the Google Trends index to forecast the number of visits to five London museums. 
They found that including this index in pure time-series models generates the most accurate forecasts, but that no single model outperforms its competitors in all situations. All of these studies confirm that search engine data can improve forecast accuracy if they are properly integrated into forecasting models.
Traditional TD forecasting methods link TD with its causal variables, whereas search engine data methods examine the association of search engine data with TD. To the best of our knowledge, the question of whether causal variables can be useful if they are introduced into modern search engine data methods in tourism forecasting remains unanswered.
As an AI-based method, ANN models highlight the non-linear relationship between TD and input variables. They are widely used to predict TD with causal variables as inputs and are known to produce accurate forecasts of TD (Law and Au, 1999; Pai and Hong, 2005; Uysal and El Roubi, 1999). Because an ANN can capture both linear and non-linear relationships, and can therefore achieve high forecast accuracy, it is the model used in this research.
Methodology
Conceptual framework
Researchers generally consider three data sources to forecast TD: historical TD series, historical series of causal variables and search query series. Researchers recognise that TD series can be short-memory series or long-memory series, reflecting cyclical or seasonal changes in tourist behaviour (Gil-Alana, 2005; Odaki, 1993). Forecasting using lagged demand series depends on the intensity of short or long memories. Morley (2009) argued that a simple lagged demand term is not sufficient to account for the dynamics of TD models and found that causal variables help to specify the demand model. From a socio-psychological and economic perspective, causal variables determine tourists’ demand and search motivation when they search for travel information online (Heung et al., 2001). Search volume data generally record tourists’ behaviour by indicating their search frequency. Bangwayo-Skeete and Skeete (2015) provided evidence that search query information offers significant benefits in forecasting. As a result, causal variables are dynamic factors that form unobservable TD, and both historical series and search queries help to evaluate potential TD. Thus, we propose a conceptual framework (Figure 1) to describe the process of TD realisation, which can help us specify the behavioural models in the forecasting exercise. The primary influencing factors (causal variables) of TD include TI, TPs at the destination relative to those in the source markets, TPs at competing destinations, ERs, transportation costs and marketing promotion expenditures. These determine the type of holiday tourists are interested in, and tourists then search online for information that matches their demand. The search frequency reveals their preference. Search query data, together with historical series (continuing historical patterns), are two dimensions that account for the dynamics of TD. Thus, both contribute to accurately forecasting the demand for tourism.

Figure 1. TD realisation – A conceptual framework. TD: tourism demand.
Variables
To specify the TD model, we include the following key causal variables based on other studies: TI, TPs in Macau relative to those of Hong Kong, TPs in competing destinations (SPs) and ERs (Li et al., 2005; Song and Li, 2008). As the ER between the Hong Kong dollar (HK$) and the Macau pataca remained roughly the same between January 2001 and December 2018, ER is not considered in this study. When the ADLM is specified, lagged (L) TD is included (Song et al., 2003b, 2003c). Li et al. (2017b) developed a forecasting framework using search engine data, with search query keywords related to Dining (QD), Lodging (QL), Shopping (QS), Transportation (QTR), Tours (QT) and Recreation (QR). They found that these search queries provide useful information for forecasting TD. However, Macau is a well-known international gaming (Lu, 2011) and fine-dining destination (Law et al., 2019). Therefore, shopping and sightseeing are not the main motivations for visitors from Hong Kong. For this reason, QS and QT are excluded from the search query list.
In addition, a public holiday variable (HOLIDAY) is included as a causal variable. Based on a survey of 406 Japanese leisure travellers in Hong Kong, Heung et al. (2001) found that ‘enjoying their holidays’ is one of the most important motives for holidaying.
In this study, the TD function is written as follows:
where
(i) TD is visitor arrivals (Arr) from Hong Kong to Macau between January 2005 and December 2018. Data are collected from the Department of Statistics and Census Service of Macau (see Figure 2).
(ii) TI is Hong Kong’s TI, as represented by Hong Kong’s gross domestic product (GDP). Hong Kong’s quarterly GDP data between 2005Q1 and 2018Q4 are collected from the Hong Kong Census and Statistics Department. Monthly TD is affected by lagged quarterly GDP, which can be directly included in the ANN.
(iii) TP corresponds to TPs in Macau relative to those in Hong Kong, as measured by the ratio of Macau’s consumer price index (CPI) to that of Hong Kong. Macau’s monthly CPI is collected from the Department of Statistics and Census Service of Macau (January 2005 to December 2018). Hong Kong’s monthly CPI (January 2005 to December 2018) is obtained from the Hong Kong Census and Statistics Department. Therefore, TPs in Macau relative to those of Hong Kong are calculated as follows:
(iv) SP is the SP at competing destinations. We choose the Chinese mainland (CM), Chinese Taipei (T), Korea (K) and Japan (J) as competing destinations for Macau. Indeed, the departures of Hong Kong residents to these destinations accounted for 87% of Hong Kong’s total departures in 2017. Therefore, the SP index is calculated as follows:
(v) QD, QL, QTR and QR are search query data related to Dining, Lodging, Transportation and Recreation, respectively. As Hong Kong is the tourists’ place of origin, monthly search queries are obtained from Google Trends between January 2005 and December 2018 (https://trends.google.com/trends/). The languages used are English and traditional Chinese. The search query keywords are presented in Table 1 and the monthly search query data in Figure 3.
(vi) HOLIDAY is a dummy variable representing public holidays in Hong Kong. Monthly holiday data (January 2005 to December 2018) are collected from the Hong Kong online calendar (http://m.calendar411.com).
(vii) (L) is a lag operator. The number of lags for each variable is determined by the Akaike Information Criterion (AIC) index (Sakamoto et al., 1986).
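The displayed equations for this subsection are not reproduced in the text above. As a hedged sketch only, the demand function and the relative price in (iii) could be written as below. The generic function f, the subscripts and the choice to leave HOLIDAY unlagged are assumptions made here for illustration, not the authors' original notation. The SP index is omitted because its weighting scheme is not stated in the surviving text.

```latex
\begin{align}
TD_t &= f\bigl(L(TD_t),\, L(TI_t),\, L(TP_t),\, L(SP_t),\,
        L(QD_t),\, L(QL_t),\, L(QTR_t),\, L(QR_t),\, HOLIDAY_t\bigr) \\
TP_t &= \frac{CPI_{\mathrm{Macau},\,t}}{CPI_{\mathrm{HK},\,t}}
\end{align}
```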

Figure 2. Tourist arrivals from Hong Kong to Macau.
Table 1. Keywords of search queries.
Note: QD: Dining; QL: Lodging; QTR: Transportation; QR: Recreation. A: keywords in English; B: keywords in traditional Chinese.

Figure 3. Search query data.
AI models
As previously mentioned, AI-based models are frequently used to predict TD (Claveria et al., 2015; Kon and Turner, 2005; Law, 2000; Law and Au, 1999; Palmer et al., 2006). They are known for their greater accuracy in tourism forecasting than regression and time-series models. However, their main disadvantage is that the relationship between input and output variables is unknown (they are ‘black boxes’), so they cannot be used to inform decisions (Li and Song, 2008). In contrast, econometric models require a careful selection of explanatory variables to avoid the problem of collinearity (Dormann et al., 2013).
One AI-based method involves back-propagation neural networks (Law, 2000; Wang et al., 2015), which aim to connect input variables and output variables. A given network (Figure 4) consists of an input layer, an output layer and one or more hidden layers. Each layer contains artificial neurons (nodes) connected to the nodes of the adjacent layer(s). Each connection between a pair of neurons, like a synapse in a biological brain, transmits a signal from one neuron to the other. The strength of a connection is expressed by its weight, which is adjusted automatically according to the error between the network outputs and the actual values in the training set. Once the model is trained, it captures the non-linear relationships between the input and output variables, and the trained network outputs forecast values when new values of the input variables are fed into it.

Figure 4. Back-propagation neural network.
The training process is carried out according to the following steps, taking a three-layer neural network as an example.
1. Initialise the network by assigning random numbers to the weights of the connections between the input layer and hidden layer.
2. Transfer values forward.
(2.1) Transfer input values from the input variables (Ii) to the neurons (nodes) (yj) on the hidden layer.
(2.2) Transfer the neurons’ values (yj) on the hidden layer to the output neuron O.
3. Estimate the error, transfer the error back and adjust the weights of the connections.
(3.1) Estimate the error (E) between the TD forecast (O) and the actual demand (Arr) on the training data set.
(3.2) Update the connection weights of each layer using the learning rate η, whose value is between 0 and 1.
4. Repeat steps 2 and 3 until the error E is within an acceptable range.
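The equations that accompany these steps are not reproduced in the text. For orientation, a standard textbook form of the same procedure is sketched below. It assumes a sigmoid activation g, a linear output neuron and a squared-error loss, and the weight symbols w_{ij} (input to hidden) and v_j (hidden to output) are names chosen here, so the authors' exact expressions may differ.

```latex
\begin{align}
y_j &= g\Bigl(\sum_i w_{ij} I_i\Bigr), \qquad g(x) = \frac{1}{1 + e^{-x}} \\
O &= \sum_j v_j\, y_j \\
E &= \frac{1}{2} \sum_t \bigl(O_t - Arr_t\bigr)^2 \\
v_j &\leftarrow v_j - \eta\, \frac{\partial E}{\partial v_j}, \qquad
w_{ij} \leftarrow w_{ij} - \eta\, \frac{\partial E}{\partial w_{ij}}
\end{align}
```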
With a trained neural network, a prediction can be made by inputting the input variable values.
A neural network toolbox in MATLAB (Version R2018b) is available for conducting these processes. newff() helps to create a neural network with specified layers and numbers of nodes in each layer, train() is used to train the network on the training data set and net() can be used to apply the trained network by inputting the input variable values and outputting the prediction value.
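For readers without access to MATLAB, the create, train and apply workflow can be illustrated with a minimal three-layer network written in plain Python. This is a sketch under stated assumptions, not the authors' implementation: the function names (init_network, forward, mse, train), the bias terms, the sigmoid hidden layer, the linear output and the per-sample weight updates are all choices made here for illustration.

```python
import math
import random


def init_network(n_inputs, n_hidden, seed=0):
    """Step 1: assign small random numbers to all connection weights."""
    rng = random.Random(seed)
    return {
        # w[j][i]: weight from input i to hidden node j (last entry is a bias)
        "w": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)],
        # v[j]: weight from hidden node j to the output (last entry is a bias)
        "v": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)],
    }


def forward(net, x):
    """Step 2: propagate one input vector to the hidden layer and the output."""
    xb = list(x) + [1.0]
    hidden = [1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(wj, xb))))
              for wj in net["w"]]
    output = sum(vj * hj for vj, hj in zip(net["v"], hidden + [1.0]))
    return hidden, output


def mse(net, X, y):
    """Mean squared error of the network on a data set."""
    return sum((forward(net, x)[1] - t) ** 2 for x, t in zip(X, y)) / len(y)


def train(net, X, y, lr=0.1, max_epochs=5000, tol=1e-3):
    """Steps 3 and 4: back-propagate the error until MSE <= tol or epochs run out."""
    for _ in range(max_epochs):
        if mse(net, X, y) <= tol:
            break
        for x, target in zip(X, y):           # one weight update per sample
            hidden, output = forward(net, x)
            delta_out = output - target        # dE/dO for E = 0.5 * (O - target)^2
            xb = list(x) + [1.0]
            for j, hj in enumerate(hidden):
                # error reaching hidden node j, computed before v[j] is updated
                delta_hidden = delta_out * net["v"][j] * hj * (1.0 - hj)
                net["v"][j] -= lr * delta_out * hj
                for i, xi in enumerate(xb):
                    net["w"][j][i] -= lr * delta_hidden * xi
            net["v"][-1] -= lr * delta_out     # output bias
    return mse(net, X, y)
```

In this sketch, init_network plays the role of creating the network, train fits it on the training set, and forward(net, x)[1] returns the prediction for a new input vector. Inputs and targets would normally be rescaled to a small range before training, which the sketch leaves to the caller.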
Benchmark forecasting models
A univariate ARIMA model and multivariate ADL model are estimated as benchmark models. The ARIMA model generates TD forecasts based solely on TD series, while the ADL model includes lagged output, causal variables and the search query variable to predict TD.
ARIMA model
The ARIMA model, a univariate model proposed by Box and Jenkins (1970), is the latest generation of models in the ARMA family. It integrates the autoregressive model and the moving average model. Its distinguishing feature is that it depends only on the historical values of the series being forecast. It has become extremely popular in recent years (Song and Li, 2008). The ARIMA (p, d, q) model can be written as follows:
where yt is TD at time t (i.e. the number of tourist arrivals in month t) and Δ is the difference function (i.e. Δyt = yt − yt−1).
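The displayed ARIMA equation is missing from the text. The textbook form of an ARIMA (p, d, q) model is given below as a reference point. The symbols φ, θ, c and ε are the conventional ones and are an assumption about notation, since the authors' own typesetting is not available.

```latex
\begin{equation}
\Bigl(1 - \sum_{i=1}^{p} \phi_i L^i\Bigr) \Delta^d y_t
  = c + \Bigl(1 + \sum_{j=1}^{q} \theta_j L^j\Bigr) \varepsilon_t,
\qquad \Delta y_t = y_t - y_{t-1}
\end{equation}
```

Here L is the lag operator, d is the number of times the series is differenced and ε_t is a white-noise error term.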
ADL model
The ADL model (Pesaran et al., 1996; Pesaran and Shin, 1998) is one of the main econometric forecasting methods (Song and Li, 2008). In the model, current TD is regressed on lagged values of TD and on current and lagged values of one or more explanatory variables. These explanatory variables are normally economic variables, such as TI, TPs at the destination relative to those in the countries/regions of origin, TPs at competing destinations (SPs) and ERs (Gunter and Önder, 2015; Song and Li, 2008). In addition, Pan et al. (2012) used search engine data to predict hotel room demand and incorporated online big data as explanatory variables in econometric models. Online big data are now widely used as explanatory variables (Bangwayo-Skeete and Skeete, 2015; Huang et al., 2017; Li et al., 2017b; Rivera, 2016; Yang et al., 2015). The general ADL model can be written as follows:
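The displayed ADL equation is likewise missing. A generic form consistent with the verbal description above (current TD regressed on its own lags and on current and lagged explanatory variables) is sketched here. The coefficient names and lag orders p and q_j are conventional choices assumed for illustration.

```latex
\begin{equation}
y_t = \alpha + \sum_{i=1}^{p} \phi_i\, y_{t-i}
      + \sum_{j=1}^{k} \sum_{i=0}^{q_j} \beta_{ji}\, x_{j,\,t-i}
      + \varepsilon_t
\end{equation}
```

In this form y_t is TD in month t, x_{j,t} is the j-th explanatory variable (a causal variable or a search query series) and ε_t is the error term.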
Accuracy measure
To verify the forecasting accuracy of the proposed models, we adopt the mean absolute deviation (MAD), mean squared error (MSE), mean absolute percentage error (MAPE), root mean square error (RMSE) and root mean square percentage error (RMSPE).
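The five measures can be computed from paired actual and forecast values as in the sketch below. The function name accuracy_measures is hypothetical, the definitions are the commonly used ones, and the percentage measures assume that no actual value is zero. The article does not print its formulas, so minor differences in scaling are possible.

```python
import math


def accuracy_measures(actual, forecast):
    """Return MAD, MSE, MAPE, RMSE and RMSPE for paired actual/forecast values."""
    errors = [a - f for a, f in zip(actual, forecast)]
    # Percentage errors are relative to the actual value (assumes actual != 0).
    pct_errors = [e / a for e, a in zip(errors, actual)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    return {
        "MAD": sum(abs(e) for e in errors) / n,
        "MSE": mse,
        "MAPE": 100.0 * sum(abs(p) for p in pct_errors) / n,
        "RMSE": math.sqrt(mse),
        "RMSPE": 100.0 * math.sqrt(sum(p * p for p in pct_errors) / n),
    }
```

MAD, MSE and RMSE are expressed in the units of the series (or its square), whereas MAPE and RMSPE are scale-free percentages, which is why both kinds are usually reported together.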
Results and implications
Preparation
The data sample covers the period from January 2005 to December 2018. The frequency of the data varies by variable: Hong Kong’s GDP is a quarterly series, whereas all of the other variables are monthly series. The 156 observations from January 2005 to December 2017, together with the lags of the variables that may affect demand, are used to train the ANN model and to estimate the parameters of the ADL and ARIMA models. The period from January 2018 to December 2018 is reserved for evaluating the one-step-ahead forecasts of these models. The variables included in the different models are defined as follows.
Historic tourism demand series (HTDS): Visitor arrivals to Macau from Hong Kong.
Causal variables (CV): Hong Kong’s GDP, Macau’s CPI relative to that of Hong Kong, a CPI index of substitute destinations and HOLIDAY, a dummy variable for public holidays in Hong Kong.
Search engine data (SED): QD, QL, QTR and QR related to Dining, Lodging, Transportation and Recreation, respectively.
The AIC (Sakamoto et al., 1986) is used to determine the number of lags of each input variable used to explain current TD.
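The hold-out design described in this subsection can be sketched as an expanding-window loop in which each test month is forecast using only the observations that precede it. The sketch assumes that "recursive" means refitting on all data available before each forecast. The function names and the seasonal naive stand-in model are hypothetical and are used only to keep the example self-contained.

```python
def recursive_one_step_forecasts(series, n_test, fit_and_predict):
    """Expanding-window evaluation: one forecast per test month."""
    forecasts = []
    first_test = len(series) - n_test
    for t in range(first_test, len(series)):
        history = series[:t]                 # only data observed before month t
        forecasts.append(fit_and_predict(history))
    return forecasts


def seasonal_naive(history, season=12):
    """Hypothetical stand-in model: repeat the value from 12 months earlier."""
    return history[-season]
```

With 168 monthly observations and n_test=12, the first forecast is based on 156 observations and the last on 167. Any model that maps a history to a one-step forecast, such as a trained ANN or an estimated ADL model, could be passed in place of seasonal_naive.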
Specifying competing models
To test the value of the causal variables, two groups of data are used. The first group includes HTDS, CV and SED. The second group includes only HTDS and SED, for comparison purposes. A three-layer ANN model is trained on each group of data, and the performance is compared. According to Wanas et al. (1998), the best performance of a neural network occurs when the number of hidden nodes is equal to
Model 1:
Model 2:
Empirical results
We set 8 nodes in the hidden layer, the maximum training epochs to 5000 and the error tolerance to 0.001. MSE is used as an error measurement during the training process. We train the model recursively to generate 1-month-ahead forecasts. After training the ANN, the goodness of fit for model 1 and model 2 according to MSE is
Comparison of average performance.
Note: MAD: mean absolute deviation; MAPE: mean absolute percentage error; MSE: mean squared error; RMSE: root mean square error; RMSPE: root mean square percentage error. (**%) Indicates the percentage of accuracy improvement compared with model 2.
Two questions still need to be addressed. The first is whether forecast error can be reduced when causal variables are introduced into econometric models with HTDS and SED. The second concerns the forecasting performance of ANN-based models relative to other forecasting models, including causal econometric models and time-series models. To answer these questions, an ADL model with HTDS, CV and SED, an ADL model with HTDS and SED, and an ARIMA model with HTDS are estimated to generate monthly rolling forecasts for 2018. The goodness of fit for these three models is measured by MSE and their values are
Forecasting values and accuracy among benchmark models.
Note: ANN: artificial neural network; HTDS: historic tourism demand series; CV: causal variables; SED: search engine data; ADL: autoregressive distributed lag; ARIMA: autoregressive integrated moving average model; MAD: mean absolute deviation; MAPE: mean absolute percentage error; MSE: mean squared error; RMSE: root mean square error; RMSPE: root mean square percentage error. (**%) Indicates the percentage of accuracy improvement of ANN1 (HTDS & CV & SED).
A comparison of ADL (HTDS & CV & SED) with ADL (HTDS & SED) shows that the accuracy of the ADL model improves by 5.35% (MAD), 3.50% (MAPE), 12.05% (MSE), 6.22% (RMSE) and 3.57% (RMSPE) after the causal variables are added to the model. With causal variables, both the fit and forecasting performance are improved. These improvements show that causal variables can provide useful information and enhance the explanatory power of econometric models of TD.
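The percentage improvements quoted here are relative reductions in an error measure. A minimal sketch of that calculation follows. It assumes the improvement is measured against the model without causal variables as the baseline, which is the usual convention but is not spelled out in the text, and the function name is hypothetical.

```python
def improvement_pct(error_baseline, error_candidate):
    """Percentage reduction in an error measure relative to the baseline model."""
    return 100.0 * (error_baseline - error_candidate) / error_baseline
```

For example, with purely hypothetical values, a baseline error of 200 and a candidate error of 150 would correspond to an improvement of 25%.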
A comparison of ANN1 (HTDS & CV & SED) with ADL (HTDS & CV & SED) using the same data sources gives the same results, which confirm the findings of Song and Li (2008) and Wu et al. (2017) that the AI-based method achieves excellent results when data for causal variables are lacking.
Lastly, compared with ADL (HTDS & CV & SED), ADL (HTDS & SED) and ARIMA (HTDS), ANN1 (HTDS & CV & SED) improves accuracy by between 10.41% and 20.39%, between 15.65% and 29.41%, and between 3.25% and 25.08%, respectively. These results show that the ANN1 (HTDS & CV & SED) model outperforms the other models. There are two reasons for this. First, ANN1 (HTDS & CV & SED) is an AI-based model, a class of models that Song and Li (2008) and Wu et al. (2017) have shown to perform particularly well. Second, ANN1 (HTDS & CV & SED) incorporates HTDS, CV and SED.
Implications
This study has several implications for destination tourism management. To formulate effective tourism marketing and development strategies/policies, practitioners and policymakers should consider using three types of data in forecasting TD. (1) Causal variables such as the income of tourists, prices at the destination, prices at the substitute destination, marketing expenditure, ERs and transportation costs are still very useful in explaining the determinants of TD. (2) Search query data from sites such as Baidu and Google for keywords related to dining, lodging, shopping, transportation, tours and recreation also contain useful information for forecasting. (3) Lagged TD data contain important information on the dynamics of TD from the historic perspective, which can enhance forecasting accuracy if used properly.
However, including too many data series could lead to model overfitting. Thus, the AIC or Bayesian information criterion index should be used to decide upon the inclusion of the variables and their lags.
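As an illustration of lag selection by information criterion, the sketch below fits autoregressions of increasing order by least squares and keeps the order with the lowest AIC. It is a simplified univariate stand-in for the multivariate selection used in the article. The function names are hypothetical, and the Gaussian form of the AIC, n ln(RSS/n) + 2k, is assumed.

```python
import math


def solve_linear_system(A, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def ar_aic(series, p, max_lag):
    """Fit AR(p) with an intercept by least squares and return its AIC."""
    # Every candidate order uses the same sample (t >= max_lag) so AICs are comparable.
    rows = [[1.0] + [series[t - k] for k in range(1, p + 1)]
            for t in range(max_lag, len(series))]
    y = series[max_lag:]
    k = p + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    n = len(y)
    return n * math.log(rss / n) + 2 * k


def select_lag(series, max_lag):
    """Return the AR order between 1 and max_lag with the lowest AIC."""
    return min(range(1, max_lag + 1), key=lambda p: ar_aic(series, p, max_lag))
```

The penalty term 2k is what guards against overfitting: an extra lag is kept only if it reduces the residual sum of squares by enough to offset the added parameter. Replacing 2k with k ln(n) would give the Bayesian information criterion instead.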
Using selected variables and lags, AI-based models or econometric models can be utilised to forecast TD. The results of this article and those of Song and Li (2008) and Wu et al. (2017) show that an AI-based model can outperform other models if used properly, especially in situations where the sample size is small.
The AI-based model is also more flexible than econometric models in forecasting TD in different frequencies. This is useful given that mixed frequency data can provide useful information for decision-making at different time intervals.
Conclusion
The ability to accurately forecast TD is important for practitioners and policymakers. With the growing use of the Internet and smartphones, search engines have become a globally important platform for users searching for information. Because search queries indicate the interests of users, search query data can contribute to TD forecasting and improve the accuracy of pure time-series models. This study investigates whether causal variables can further improve accuracy. Theoretically, a conceptual framework for TD realisation is proposed to support the integration of historical TD, causal variables and search queries. Based on this conceptual framework, we find that causal variables inform the decision-making process and improve forecasting performance. To quantitatively analyse the role of causal variables, we specify two competing models: model 1 contains the causal variables as inputs to the ANN model, whereas model 2 does not include any causal variables. The training sample consists of 156 observations of TD for short-haul trips from Hong Kong to Macau between January 2005 and December 2017, and 12 further observations are used for testing. We train the models and generate 1-month-ahead forecasts recursively from January 2018 to December 2018. A comparison of the models shows that causal variables improve the forecasting accuracy of both the ANN and ADL models. To test the performance of the ANN model with causal variables, we compare it with two ADL models and one ARIMA model. The comparison confirms that the ANN model with causal variables outperforms these benchmark models, for two reasons. The first is the model itself: AI-based models can generate particularly accurate forecasts, as shown both in this article and in the existing literature (Song and Li, 2008; Wu et al., 2017). The second is the use of multiple data sources: the ANN model with causal variables incorporates more information to explain TD. Bangwayo-Skeete and Skeete (2015) showed that search query data help to improve the accuracy of TD forecasts, Odaki (1993) and Gil-Alana (2005) showed that historical time series have short or long memories, and this article shows quantitatively that causal variables also help to improve forecasting performance.
In the context of research combining historical TD series, causal variables and search query data to predict TD, the main contribution of this study is a conceptual framework that supports the integration of causal variables, search queries and historical data. In addition, we demonstrate quantitatively the superiority of the AI-based model with causal variables in TD forecasting.
This study has several limitations, some of which could be addressed in future research. First, TI, TPs at the destination and SPs are used as causal economic variables; however, the populations, ERs, transportation costs, marketing promotion and climates of various markets may also affect tourist arrivals. Future researchers could include these variables in forecasting models to further improve their accuracy. Second, this study focuses on short-haul travel from Hong Kong to Macau. Future researchers could further explore the performance of the models in long-haul TD forecasting.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Hong Kong Scholars Program and the National Natural Science Foundation of China (no. 71761001).
