Abstract
This study explores how to select the optimal number of lagged inputs (NLIs) in international tourism demand forecasting. With international tourist arrivals at 10 European countries, the performances of eight machine learning models are evaluated using different NLIs. The results show that: (1) as NLIs increases, the error of most machine learning models first decreases rapidly and then tends to be stable (or fluctuates around a certain value) when NLIs reaches a certain cutoff point. The cutoff point is related to 12 and its multiples. This trend is not affected by the size of the test set; (2) for nonlinear and ensemble models, it is better to select one cycle of the data as the NLIs, while for linear models, multiple cycles are a better choice; (3) significantly different prediction results are obtained by different categories of models when the optimal NLIs are used.
Introduction
Due to the increasing importance of international tourism to the world economy, both the public and the private sectors have channeled a lot of resources and investment to this industry (Jiao and Chen, 2019; Peng et al., 2014). Accordingly, international tourism demand forecasting is of great interest to both tourism researchers and practitioners (Crouch, 1994; Gunter and Önder, 2015; Jorge-González et al., 2019; Kim et al., 2018; Song and Witt, 2006). Accurate forecasts are of enormous value to government, organizations, and tourism practitioners and support good decision-making (Bi et al., 2019; Palmer et al., 2006; Song and Li, 2008). For example, accurate demand forecasts can help government agencies design infrastructure for tourism, such as transportation systems and accommodation, can help organizations design marketing and other materials, such as tour brochures (Li et al., 2018), and can help practitioners make reasonable operational decisions, such as scheduling and staffing (Bi et al., 2020a). Therefore, to support international tourism-related decision-making, it is necessary to study international tourism demand forecasting.
Many models can be used to forecast international tourism demand (Song et al., 2019). These can be divided into three main categories: time series models, econometric models, and machine learning models (Bi et al., 2020b; Song and Li, 2008; Song et al., 2019). A detailed introduction to these three categories is given in the “Tourism demand forecasting models” section. Among the three categories, machine learning models are relatively new and have shown excellent predictive ability. Therefore, this category of model has been widely studied and applied in tourism demand forecasting (Hassani et al., 2015, 2017; Li et al., 2018; Silva et al., 2019). For machine learning models to forecast tourism demand, the tourism volume time series must first be converted into a sequence of input–output pairs. An important step in this process is to determine the number of lagged inputs (NLIs), that is, the number of lagged terms of the time series used as inputs. Selecting the optimal NLIs value is very time-consuming, and the NLIs value significantly affects the accuracy of the forecasts (Peng et al., 2014). Nevertheless, there is no published research on how to determine the optimal NLIs, and consequently there is a lack of guidance on how to select the NLIs reasonably and quickly in international tourism demand forecasting based on machine learning models.
To fill this research gap, this study explores how to select the NLIs in international tourism demand forecasting based on machine learning models. Specifically, we explore the following four questions.
Studies have shown that the NLIs can significantly affect the accuracy of international tourism demand forecasting (Peng et al., 2014). However, it is still unknown how the prediction errors change as NLIs increases. Therefore, the first research question is:
In practice, the number of samples in the test set may differ from one application to another. However, it is unclear whether the optimal NLIs for a particular machine learning model is affected by the size of the test set. In other words, for a given time series data set of tourist volume and a given machine learning model, it is still unknown whether the optimal NLIs will change with the number of samples in the test set. Therefore, the second research question is:
As a type of time series data, an important characteristic of international tourist volume data is seasonality. Intuitively, the optimal NLIs may be related to the seasonality of international tourist volume data. However, there is still no direct evidence showing the relationship between optimal NLIs and the seasonality of tourism time series. Whether the optimal NLIs of different models can be directly determined through the characteristic of international tourist volume data is still unclear. Consequently, there is no consensus about the selection of the optimal NLIs. That is why different NLIs are used in different studies for monthly international tourism demand forecasting (Álvarez-Díaz and Rosselló-Nadal, 2010; Burger et al., 2001; Hassani et al., 2017; Hong et al., 2011). Therefore, the third research question is:
Comparing the performance of different models in tourism demand forecasting is very helpful for developing new models. Although several studies have attempted to address this issue, their conclusions are not based on the optimal NLIs for different machine learning models; indeed, some of the machine learning models do not use the optimal NLIs in these studies. As a result, the generalizability and reliability of the conclusions obtained in these studies may be limited. Moreover, it is not known whether there is a significant difference in the predicted results of different machine learning models when each of those models uses its optimal NLIs. Therefore, the fourth research question is:
To answer the above research questions, a comprehensive and systematic experimental analysis is carried out in this study. A framework for tourism demand forecasting based on machine learning models is first proposed, which includes two parts: data conversion and model training. Then, the experimental design is given. On this basis, 19,200 experiments were conducted on 10 data sets and 8 machine learning models to explore the relationship between the NLIs and the forecasting results. Answering the above questions will provide scientific guidance and help for practitioners to determine the optimal NLIs when using machine learning models to forecast international tourism demand and provide valuable information that can be used to make policy and business decisions more effectively.
Literature review
Tourism demand forecasting models
A variety of tourism demand forecasting models have been developed. These models can largely be classified into three categories, according to their associated theory and technology: time series analysis models, econometric models, and machine learning models (Pan and Yang, 2017; Song and Li, 2008). A brief literature review concerning these is presented below.
(1) Time series analysis models
Time series analysis models are the traditional models for noncausal time series tourism forecasting. In this category, the autoregressive integrated moving average (ARIMA) model and its improved versions (e.g. seasonal ARIMA) are the most commonly used models and perform well (Athanasopoulos et al., 2011; Cho, 2001; Du Preez and Witt, 2003; Goh and Law, 2002; Lim and McAleer, 2002; Shahrabi et al., 2013). For example, Cho (2001) applied three time series forecasting models, namely exponential smoothing, ARIMA, and improved ARIMA, to predict travel demand for Hong Kong. ARIMA and improved ARIMA outperformed exponential smoothing and were more suitable for forecasting fluctuating travel demand. Goh and Law (2002) compared the performance of 10 time series forecasting models based on data for tourist arrivals at Hong Kong and found that improved versions of ARIMA (i.e. seasonal ARIMA and multivariate ARIMA) had the highest prediction accuracy. Other time series models used in tourism forecasting have included state space models (Athanasopoulos and Hyndman, 2008; Beneki et al., 2012), generalized autoregressive conditional heteroskedastic models (Chan et al., 2005; Liang, 2014), and exponential smoothing (Fildes et al., 2011).
(2) Econometric models
Compared with time series models, econometric models are more suitable for causal time series tourism forecasting since they can analyze the causal relationship between tourism demand and factors that affect it. Those commonly used for tourism demand forecasting include error correction models (Shen et al., 2009; Wong et al., 2007), the vector autoregressive model (Assaf et al., 2018; Cao et al., 2017; Shan and Wilson, 2001; Song and Witt, 2006), the autoregressive distributed lag model (Song et al., 2003a, 2003b), the almost ideal demand system (De Mello and Fortuna, 2005; Li et al., 2006), and the structural equation model (Turner and Witt, 2001).
(3) Machine learning models
In recent years, machine learning models have emerged in the field of tourism forecasting, such as artificial neural networks (NNs) (Burger et al., 2001; Kon and Turner, 2005), support vector machines (Pai et al., 2014; Sencheong and Turner, 2005), and rough set approaches (Au and Law, 2000; Law and Au, 2000). The main advantage of machine learning models is that they do not require any assumptions regarding the data (e.g. distribution and probability); additionally, their adaptability and nonlinearity make them especially suitable for nonlinear prediction (Li et al., 2018; Song and Li, 2008). Machine learning models have therefore been increasingly applied in the field of tourism demand forecasting (Li et al., 2018), and they are the focus of this study.
NLIs in tourism demand forecasting
To use machine learning models to forecast tourism demand requires the time series data to be converted into a form that can be used in those models, that is, an input–output pair sequence. In this process, an important step is to determine the NLIs. The NLIs values used in previous studies with respect to different machine learning models and data sets are summarized in Table 1.
Table 1. The NLIs used in previous studies that apply machine learning models to forecast tourism demand.
As shown in Table 1, different NLIs are used in different studies. Selecting and determining the NLIs can be very time-consuming, and most studies do not explain why a particular NLIs value was chosen, even though previous work has shown that the NLIs can significantly affect the accuracy of forecasts (Peng et al., 2014). Moreover, no research has been reported on how best to determine which NLIs to use. Therefore, a comprehensive and systematic experimental study exploring the relationship between NLIs and the accuracy of forecasts is necessary.
Framework for tourism demand forecasting with machine learning and related models
A framework for tourism demand forecasting with machine learning
The proposed framework for tourism demand forecasting based on machine learning is shown in Figure 1. It has two parts: data conversion and model training. Detailed descriptions of these two parts are given below.

Figure 1. Framework for tourism demand forecasting based on machine learning.
(1) Data conversion
Tourist volume is a type of time series data, which cannot be directly used by machine learning models. Therefore, it is first necessary to convert the tourist volume data in the form of time series into the data form that can be used by machine learning models, namely input–output pair sequences.
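Let $\{x_1, x_2, \ldots, x_T\}$ be a time series of tourist volume with $T$ observation points, and let $n$ be the NLIs. The time series can then be converted into the following input–output pairs:

$$
\begin{bmatrix}
x_1 & x_2 & \cdots & x_n \\
x_2 & x_3 & \cdots & x_{n+1} \\
\vdots & \vdots & \ddots & \vdots \\
x_{T-n} & x_{T-n+1} & \cdots & x_{T-1}
\end{bmatrix}
\rightarrow
\begin{bmatrix}
x_{n+1} \\ x_{n+2} \\ \vdots \\ x_T
\end{bmatrix}
$$

where the matrix on the left-hand side is the input of machine learning models (each row contains $n$ consecutive lagged observations) and the vector on the right-hand side is the corresponding output (the observation that immediately follows each row).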
To illustrate the above conversion process more clearly, an example is given here. Table 2 presents a time series of tourist volume data with 10 observation points. In the case of n = 3, the data in Table 2 can be converted into the data shown in Table 3.
Table 2. An example of a time series of tourist volume data with 10 observation points.
Table 3. The converted data corresponding to the data in Table 2 in the case of n = 3.
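The conversion described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `make_lagged_pairs` is hypothetical, and the 10 values are placeholders, not the data in Table 2.

```python
def make_lagged_pairs(series, n):
    """Convert a time series into (input, output) pairs using n lagged inputs."""
    if n < 1 or n >= len(series):
        raise ValueError("n must satisfy 1 <= n < len(series)")
    inputs, outputs = [], []
    for t in range(n, len(series)):
        inputs.append(list(series[t - n:t]))  # the n observations before time t
        outputs.append(series[t])             # the observation at time t
    return inputs, outputs


# Hypothetical 10-point series (placeholder values, not the data in Table 2).
series = [10, 12, 15, 11, 13, 16, 12, 14, 17, 13]
X, y = make_lagged_pairs(series, n=3)
print(len(X))      # 7 input-output pairs
print(X[0], y[0])  # [10, 12, 15] 11
```

A series with 10 observation points and n = 3 yields 10 − 3 = 7 pairs, so larger NLIs values leave fewer samples for training and testing.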
(2) Model training
After the data have been converted, machine learning models can be trained, and a tourist volume predictor can be obtained. Here, the “tourist volume predictor” refers to the machine learning model with actual prediction ability, which is trained on the tourist volume data. Based on the obtained predictor, future tourist volumes can be forecast.
Machine learning models for tourism demand forecasting
At present, machine learning models that can be used to forecast tourism demand fall mainly into three categories: linear machine learning models, nonlinear machine learning models, and ensemble machine learning models. The most commonly used linear machine learning models are linear regression (LR) and ridge regression (RR); the most commonly used nonlinear machine learning models include the decision tree regression (DTR) model, the K nearest neighbor regression (KNNR) model, the support vector regression (SVR), and NNs; the most commonly used ensemble machine learning models are bootstrap aggregating (BA) and adaptive boosting (AB). To analyze the effect of NLIs values on forecasting results more comprehensively, all eight of these models are tested in this study. A brief introduction to the eight models is given below.
(1) Linear regression
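LR is used to analyze the relationship between a dependent variable and one or more independent variables. If there is only one independent variable, the model is called simple LR; if there are two or more independent variables, the model is called multiple LR. Let $y$ be the dependent variable and $x_1, x_2, \ldots, x_n$ be the independent variables. The LR model can be written as

$$ \hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n \tag{1} $$

where $\hat{y}$ is the predicted value of $y$, $w_0$ is the intercept, and $w_1, \ldots, w_n$ are the regression coefficients.
To determine the parameters in equation (1), the least squares method can be used, where the loss function and objective function are given in equations (2) and (3), respectively

$$ J(\mathbf{w}) = \sum_{i=1}^{M} \left( y_i - \hat{y}_i \right)^2 \tag{2} $$

$$ \min_{\mathbf{w}} J(\mathbf{w}) \tag{3} $$

where $M$ is the number of samples, $y_i$ is the actual value of the $i$th sample, and $\hat{y}_i$ is the value predicted by equation (1) for the $i$th sample.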
(2) Ridge regression
RR, also known as Tikhonov regularization and weight decay, is a biased estimation regression method specially used for analysis of collinear data. In essence, it is an improved least squares method. By abandoning the unbiasedness of the least squares method, more reasonable regression coefficients can be obtained at the cost of losing part of the information and a reduction in accuracy. RR is more suitable for ill-posed problems than the least squares method.
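RR and LR have the same functional form, but they use different loss functions to estimate the parameters. The main difference is that a regularization term is added to the loss function of RR to avoid problems that LR may encounter when estimating the parameters, such as over-fitting. The loss function of RR is shown in equation (4)

$$ J_{\mathrm{RR}}(\mathbf{w}) = \sum_{i=1}^{M} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{n} w_j^2 \tag{4} $$

where $M$ is the number of samples, $y_i$ and $\hat{y}_i$ are the actual and predicted values of the $i$th sample, $w_j$ is the $j$th regression coefficient, and $\lambda \geq 0$ is the regularization coefficient that controls the strength of the penalty.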
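To make the difference between the LR and RR losses concrete, the sketch below fits both with the closed-form normal equations in plain Python. It is an illustration under stated assumptions, not the configuration used in the experiments: the helper names are hypothetical, the intercept is assumed to be unpenalised, and `lam` stands for the regularization coefficient (`lam = 0` reduces to ordinary least squares).

```python
def solve_linear_system(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    size = len(A)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    aug = [list(A[i]) + [b[i]] for i in range(size)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, size):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, size + 1):
                aug[row][k] -= factor * aug[col][k]
    w = [0.0] * size
    for row in range(size - 1, -1, -1):
        tail = sum(aug[row][k] * w[k] for k in range(row + 1, size))
        w[row] = (aug[row][size] - tail) / aug[row][row]
    return w


def fit_ridge(X, y, lam=0.0):
    """Return [intercept, w_1, ..., w_n] minimising squared error + lam * sum(w_j^2)."""
    rows = [[1.0] + list(x) for x in X]  # prepend a constant for the intercept
    p = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    rhs = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    for i in range(1, p):                # assumption: the intercept is not penalised
        gram[i][i] += lam
    return solve_linear_system(gram, rhs)


def predict(weights, x):
    """Evaluate the linear model for one input vector."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))


# Toy data generated from y = 2x + 1 (hypothetical, for illustration only).
X = [[1.0], [2.0], [3.0], [4.0]]
y = [3.0, 5.0, 7.0, 9.0]
w_ls = fit_ridge(X, y, lam=0.0)      # ordinary least squares
w_ridge = fit_ridge(X, y, lam=10.0)  # the slope is shrunk towards zero
print(w_ls, w_ridge)
```

With the penalty switched on, the slope shrinks from 2 towards 0, which is the bias that RR accepts in exchange for more stable coefficients on collinear inputs such as adjacent lags.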
(3) Decision tree regression
DTR is a type of nonparametric supervised learning model. The goal of DTR is to create a model that predicts the value of target variables using decision rules inferred from data features. Several algorithms can be used to construct the DTR, such as M5P and classification and regression tree (CART). Among these algorithms, CART, proposed by Breiman et al. (1984), is regarded as one of the most powerful nonparametric algorithms for generalization operations. Compared with parametric algorithms, CART is less affected by abnormal data. CART generates an optimized binary tree by calculating the Gini index of each node and using binary recursive partitioning to achieve regression. The Gini index at node t can be calculated by

$$ \mathrm{Gini}(t) = 1 - \sum_{k=1}^{K} p(k \mid t)^2 \tag{5} $$

where K is the number of output categories and $p(k \mid t)$ is the proportion of samples at node t that belong to category k.
The difference loss of node t under branch condition
where
The conditions for choosing branches are computed as
(4) K nearest neighbor regression
The K nearest neighbor (KNN) model is one of the simplest and most commonly used algorithms in the field of data mining. The KNN model can be used not only for classification but also for regression. In this study, the KNN used for regression (i.e. KNNR) is employed. The basic idea of KNNR is to predict the value of a sample based on its K nearest neighbors. The process has four steps: (a) calculate the Euclidean distances between the sample to be predicted and the other samples; (b) rank the samples in ascending order of the obtained Euclidean distances; (c) determine the optimal value of K; and (d) determine the predicted value by calculating the weighted average of the values of the K nearest samples.
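The four KNNR steps can be sketched as follows. This is a minimal illustration under assumptions the text does not fix: the function name `knn_predict` is hypothetical, K is passed in rather than tuned, and inverse-distance weights are one possible choice for the weighted average.

```python
import math


def knn_predict(train_X, train_y, query, k=3):
    """Predict a value for `query` from its k nearest training samples."""
    # (a) Euclidean distance from the query to every training sample.
    distances = [(math.dist(x, query), target)
                 for x, target in zip(train_X, train_y)]
    # (b) Rank the samples from nearest to farthest.
    distances.sort(key=lambda pair: pair[0])
    # (c) K is supplied by the caller; in practice it would be tuned,
    #     for example on a validation set.
    nearest = distances[:k]
    # (d) Inverse-distance weighted average (an assumed weighting scheme);
    #     an exact match simply returns its own target.
    for d, target in nearest:
        if d == 0.0:
            return target
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * t for w, (_, t) in zip(weights, nearest)) / sum(weights)


# Toy lagged inputs with n = 3 (hypothetical values).
train_X = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [10, 11, 12]]
train_y = [4, 5, 6, 13]
exact = knn_predict(train_X, train_y, [2, 3, 4], k=2)
between = knn_predict(train_X, train_y, [2.5, 3.5, 4.5], k=2)
print(exact, between)
```

Because every lag enters the Euclidean distance with equal weight, a larger NLIs value adds dimensions to the distance computation without necessarily adding information.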
(5) Support vector regression
SVR is a type of support vector machine for regression (Drucker et al., 1997). Its core idea is to find a hyperplane satisfying the condition that all the data in the training set are as close as possible to the hyperplane. The process of finding the hyperplane can be transformed into solving the following optimization problems
where
(6) Neural network
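NN is an algorithmic mathematical model that imitates the behavioral characteristics of biological neural networks and carries out distributed parallel information processing. NN processes information mainly by adjusting the interconnections between a large number of internal nodes. Let $v_1, v_2, \ldots, v_d$ be the inputs of a node. The output of the node can be expressed as

$$ o = f\left( \sum_{k=1}^{d} w_k v_k + b \right) $$

where $w_k$ is the weight of input $v_k$, $b$ is the bias of the node, and $f(\cdot)$ is the activation function.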
(7) Adaptive boosting
AB is a kind of ensemble learning algorithm. AB can be used for both classification and regression. For regression problems, its core idea is to train different predictors (weak predictors) based on the same training set using different sample weights and then aggregate these weak predictors to form a stronger final predictor (strong predictor). The training process of AB has six steps: (a) initialize distribution weights of the training samples; (b) train weak predictors based on training samples and their corresponding weights; (c) calculate the prediction error of the trained weak predictor; (d) calculate the weight of the trained weak predictor and update the weight of each training sample according to the obtained prediction error; (e) repeat the above process T times, and T weak predictors and their corresponding weights can be obtained; and (f) integrate the obtained T weak predictors and their corresponding weights, and the final strong predictor can be obtained.
(8) Bootstrap aggregating
BA, also called bagging, is another kind of ensemble learning algorithm, which aims to improve the stability and accuracy of machine learning algorithms for classification and regression. For regression tasks, its core idea is to train different predictors (weak predictors) using different training data sampled from the training set uniformly and with replacement, and then the final prediction results can be obtained by calculating the average value of the outputs obtained by the weak predictors. The training process of BA has four steps: (a) generate a bootstrap sample from the training set; (b) train a weak predictor using the bootstrap sample; (c) repeat the above process T times, and T weak predictors can be obtained; (d) calculate the average value of the outputs obtained by the T weak predictors, and the final prediction results can be obtained.
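The four BA training steps can be sketched in plain Python. This is an illustration only: the weak learner `fit_mean_stump` (a one-split predictor on the most recent lag) is a hypothetical stand-in, not the base model used in the experiments, and the function names and seed are assumptions.

```python
import random


def fit_mean_stump(X, y):
    """Hypothetical weak learner: split on the last lag at its median, predict side means."""
    values = sorted(x[-1] for x in X)
    threshold = values[len(values) // 2]
    left = [t for x, t in zip(X, y) if x[-1] < threshold]
    right = [t for x, t in zip(X, y) if x[-1] >= threshold]
    overall = sum(y) / len(y)
    left_mean = sum(left) / len(left) if left else overall
    right_mean = sum(right) / len(right) if right else overall
    return lambda x: left_mean if x[-1] < threshold else right_mean


def fit_bagging(X, y, n_estimators=10, seed=0):
    """Train n_estimators weak predictors on bootstrap samples and average them."""
    rng = random.Random(seed)
    m = len(X)
    predictors = []
    for _ in range(n_estimators):                   # (c) repeat T times
        idx = [rng.randrange(m) for _ in range(m)]  # (a) sample uniformly with replacement
        sample_X = [X[i] for i in idx]
        sample_y = [y[i] for i in idx]
        predictors.append(fit_mean_stump(sample_X, sample_y))  # (b) train a weak predictor
    # (d) the final prediction is the average of the weak predictors' outputs
    return lambda x: sum(p(x) for p in predictors) / len(predictors)


# Toy lagged inputs with n = 2 (hypothetical values).
X = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7]]
y = [3, 4, 5, 6, 7, 8]
model = fit_bagging(X, y, n_estimators=25, seed=1)
pred = model([3, 4])
print(pred)
```

Averaging over bootstrap samples smooths out the abrupt jump of any single stump, which is the variance reduction that BA is designed to provide.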
Experimental design
Experimental data sets
The experimental data sets used in this study are monthly numbers of international tourist arrivals at 10 European countries: Austria, Belgium, Finland, Germany, Greece, Italy, Luxembourg, the Netherlands, Portugal, and Sweden. The data used in the experiment are obtained from the Eurostat database (https://ec.europa.eu/eurostat/data/database). The period spans from January 1995 to April 2018, giving 280 observation points for each country.
The experimental data are shown in Figure 2. In Figure 2, the horizontal axis and vertical axis of each sub-figure represent time and number of international tourist arrivals, respectively. As can be seen from Figure 2, the international tourist arrivals in these 10 countries show a distinct periodicity, with a 12-month cycle. Since the most suitable months for tourism differ between countries, the months in which international tourist arrivals reach an extreme value also differ. In addition, international tourist arrivals in these 10 countries show an overall upward trend over the study period.

Figure 2. The experimental data sets.
Experimental procedure
The experimental procedure is shown in Figure 3. It consists of three stages: (1) data conversion, (2) machine learning model training, and (3) comparison and evaluation of experimental results. Detailed descriptions of these three stages are given below.

Figure 3. The experimental procedure.
(1) Data conversion
As described above, the machine learning models require the tourist volume data in the form of time series to be converted into input–output pair sequences. Specifically, the international tourist arrivals of each of the 10 countries with respect to the 280 months mentioned above are transformed into the data form with different NLIs required by the machine learning models. In this experiment, the range of NLIs is set to 1–60. Therefore, for each country, 60 sub-data sets with different NLIs can be obtained. In this stage, a total of 60 × 10 = 600 sub-data sets were obtained.
(2) Machine learning model training
To train tourist demand predictors and evaluate their performance, it is necessary to divide each sub-data set into a training set and a test set. To verify whether the optimal NLIs value is affected by the size of the test set, each sub-data set is divided in four ways, to give test sets of 3, 12, 24, and 36 samples. Therefore, for a sub-data set, four training sets and four test sets are obtained. Since there are 600 sub-data sets, 600 × 4 = 2400 training sets and 600 × 4 = 2400 test sets are obtained. Based on the obtained training sets, the above eight machine learning models are trained, and the corresponding predictors are obtained. On this basis, the corresponding prediction result of each predictor on each test set is obtained. Since there are eight machine learning models and 2400 test sets, 2400 × 8 = 19,200 prediction results are obtained in this stage.
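The bookkeeping of the experimental design can be checked with a short script. The counts come directly from the text; the `split_last` helper is a hypothetical sketch that assumes a chronological split in which the final samples form the test set, which the text does not state explicitly.

```python
def split_last(inputs, outputs, test_size):
    """Hold out the final `test_size` pairs as the test set (assumed chronological split)."""
    if not 0 < test_size < len(inputs):
        raise ValueError("test_size must be between 1 and len(inputs) - 1")
    cut = len(inputs) - test_size
    return (inputs[:cut], outputs[:cut]), (inputs[cut:], outputs[cut:])


n_countries = 10
nli_values = range(1, 61)      # NLIs from 1 to 60
test_sizes = [3, 12, 24, 36]
n_models = 8

n_sub_datasets = n_countries * len(nli_values)
n_test_sets = n_sub_datasets * len(test_sizes)
n_results = n_test_sets * n_models
print(n_sub_datasets, n_test_sets, n_results)  # 600 2400 19200

# Example split of ten dummy pairs with a test set of three samples.
train, test = split_last(list(range(10)), list(range(10)), 3)
print(len(train[0]), len(test[0]))  # 7 3
```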
(3) Comparison and evaluation of experimental results
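To compare the performance of different models with different NLIs values, the root-mean-square error (RMSE) is adopted in this study because RMSE is the most commonly used indicator to measure the performance of models in tourism forecasting (Hassani et al., 2017). The RMSE can be calculated by equation (11)

$$ \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \left( y_n - \hat{y}_n \right)^2} \tag{11} $$

where $y_n$ and $\hat{y}_n$ are the actual and predicted values of the $n$th sample in the test set, respectively, and $N$ is the number of samples in the test set.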
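As a quick check of the RMSE definition in equation (11), the following helper computes it for two equal-length sequences. The function name and the toy values are illustrative assumptions.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


perfect = rmse([1, 2, 3], [1, 2, 3])
off = rmse([0, 0], [3, 4])   # sqrt((9 + 16) / 2)
print(perfect, off)
```

Because RMSE is expressed in the units of the series, values are comparable across models on one data set but not across countries with very different arrival volumes.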
To verify whether there is a significant difference between the prediction results with respect to different NLIs or models, the Wilcoxon signed-rank test is used in this study. The Wilcoxon test is a nonparametric statistical hypothesis test that compares two related samples to assess whether their population mean ranks are significantly different. The method is briefly described as follows.
Let
Let
Let
Let
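For readers who want to reproduce the comparison, a standard form of the Wilcoxon signed-rank statistic can be sketched as below. This follows the textbook procedure and is not necessarily identical to the formulation used in this study: zero differences are dropped, tied absolute differences receive average ranks, and the p-value uses a normal approximation without tie or continuity correction, which is only reasonable for moderately large samples. The function name and toy data are hypothetical.

```python
import math


def wilcoxon_signed_rank(a, b):
    """Return (W_plus, W_minus, z, p) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1   # average rank for tied |differences|
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation
    return w_plus, w_minus, z, p


# Toy paired RMSE values for two settings (hypothetical numbers).
a = [5, 7, 9, 4, 8]
b = [3, 8, 5, 4, 2]
w_plus, w_minus, z, p = wilcoxon_signed_rank(a, b)
print(w_plus, w_minus)  # 9.0 1.0
```

In the setting of this study, the two paired samples would be the RMSE values obtained with two different NLIs values (or two different models) on the same data sets.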
Experimental setup
The experimental study is performed on a PC with 64 GB of RAM and a 3.10 GHz Intel i7-7920HQ 8-Core CPU, running the Windows 7 operating system. Anaconda (https://www.anaconda.com/), an open-source distribution of the Python language for data science and machine learning applications, is used in the experimental study. The main parameters of the above eight models and their settings in the experiment are shown in Table 4.
Table 4. The main parameters for the eight models and their settings in the experiment.
Note: LR: linear regression; RR: ridge regression; DTR: decision tree regression; KNN: K nearest neighbor; AB: adaptive boosting; BA: bootstrap aggregating; SVR: support vector regression; NN: neural network.
Experimental results and analysis
How the prediction errors of different models vary as NLIs increases
The RMSE curves of different models with respect to different sub-data sets as NLIs increases are shown in Figures 4 to 13. To explain Figures 4 to 13 more clearly, we take Figure 4 as an example. Figure 4 shows the RMSE curves of different models for international tourist arrivals in Austria, where parts (a) to (d) represent the RMSE curves on the sub-data sets with 3, 12, 24, and 36 test samples, respectively. The horizontal axis represents the NLIs values, from 1 to 60, and the vertical axis represents the RMSE.
Figure 4. The RMSE curves of different models for international tourist arrivals in Austria.
Figure 5. The RMSE curves of different models for international tourist arrivals in Belgium.
Figure 6. The RMSE curves of different models for international tourist arrivals in Finland.
Figure 7. The RMSE curves of different models for international tourist arrivals in Germany.
Figure 8. The RMSE curves of different models for international tourist arrivals in Greece.
Figure 9. The RMSE curves of different models for international tourist arrivals in Italy.
Figure 10. The RMSE curves of different models for international tourist arrivals in Luxembourg.
Figure 11. The RMSE curves of different models for international tourist arrivals in the Netherlands.
Figure 12. The RMSE curves of different models for international tourist arrivals in Portugal.
Figure 13. The RMSE curves of different models for international tourist arrivals in Sweden.
As can be seen from Figures 4 to 13, the RMSE of each model is affected by the NLIs value. Specifically, as NLIs increases, the RMSE obtained by most of the eight machine learning models first decreases and then stabilizes or fluctuates around a certain value. This trend is not affected by the size of the test set. Therefore, for research questions Q1 and Q2, we can draw the following conclusions:
Exploring the relationship between NLIs and tourist volume data
As can be seen from Figures 4 to 13, the optimal NLIs values, that is, those corresponding to the minimum RMSE obtained by different models, are related to the cycle of the data (i.e. 12 months). In other words, the optimal NLIs values for different models are 12, 24, 36, 48, or 60. However, some models, such as NNs, have large fluctuations in the prediction results with some sub-data sets. To determine the optimal NLIs more intuitively and accurately, the standard deviation of each set of 12 consecutive prediction results is calculated by equation (16)
where
According to equation (16), the standard deviation of the predicted results obtained by the eight models with different NLIs on the 10 data sets can be determined. To comprehensively consider the case of test sets with different numbers of samples, the average value of RMSE of each model on each data set with different sample sizes for the test sets is calculated. The results are shown in Table 5. To clarify the meaning of the data in Table 5, we take the “154,034” in the upper left corner of the table as an example. The “154,034” denotes the average value of RMSE obtained by the LR model on the data set “Austria” with different numbers of samples in the test set in the case of NLIs = 12. In other words, “154,034” is the average value of the RMSE obtained by LR with respect to NLIs = 12 corresponding to the RMSE values in Figure 4, where parts (a) to (d) show the results with different numbers of samples in the test set. It can be seen from Table 5 that the optimal NLIs value can differ both across models and across the different national data sets. For example, LR achieves the best prediction result with NLIs = 60 with the data set “Austria,” while LR achieves the best prediction result with NLIs = 24 with the data set “Finland.” To verify whether there are significant differences among the results obtained by each model with different NLIs values, the Wilcoxon test is used. The results are shown in Table 6. For DTR, KNNR, AB, BA, SVR, and NN, in most cases there is no significant difference between the results obtained by these models with one cycle (i.e. 12 months) as the NLIs or multiple cycles (24, 36, 48, and 60 months) as the NLIs. Since the complexity of these models increases as NLIs increases, but the prediction results are generally not significantly improved, it is better to select one cycle as the NLIs in tourism demand forecasting.
Table 5. The average value of RMSE of each model on each data set with different numbers of samples in the test sets.
Note: LR: linear regression; RR: ridge regression; DTR: decision tree regression; KNNR: K nearest neighbor regression; AB: adaptive boosting; BA: bootstrap aggregating; SVR: support vector regression; NN: neural network; NLIs: number of lagged inputs; RMSE: root-mean-square error.
Table 6. Wilcoxon test results for the differences in the RMSE obtained by different models with different NLIs.
Note: LR: linear regression; RR: ridge regression; DTR: decision tree regression; KNNR: K nearest neighbor regression; AB: adaptive boosting; BA: bootstrap aggregating; SVR: support vector regression; NN: neural network; NLIs: number of lagged inputs; RMSE: root-mean-square error.
* At the significance level of 0.05.
** At the significance level of 0.01.
It should be noted that for LR, there is a significant difference between the results from using one cycle (i.e. 12 months) as the NLIs and those from using multiple cycles (24, 36, 48, and 60 months) as the NLIs. However, there is no significant difference among the results from using two cycles (i.e. 24 months), three cycles (i.e. 36 months), four cycles (i.e. 48 months), and five cycles (i.e. 60 months) as the NLIs. Since the complexity of the model increases as NLIs increases, but the prediction results are not significantly improved, it is better for LR to select two cycles as the NLIs in tourism demand forecasting.
For RR, there is a significant difference between the results from using one cycle as the NLIs and those from using multiple cycles as the NLIs. In addition, there is a significant difference between using two cycles as the NLIs and using three cycles (i.e. 36 months) or four cycles (i.e. 48 months) as the NLIs. However, there is no significant difference between the results from using three cycles (i.e. 36 months) and those from using four cycles (i.e. 48 months). It is therefore better for RR to select three cycles as the NLIs in tourism demand forecasting. Based on the above analysis, for research question Q3, we can draw the following conclusions:
Verifying whether there are significant differences among the predicted results from different models
The last line of Table 5 shows the average RMSE of the eight models with the optimal NLIs values on the 10 data sets. The ranking of the eight models in tourism demand forecasting according to this average is
Table 7. Wilcoxon test results for the differences in the RMSE obtained by different models.
Note: LR: linear regression; RR: ridge regression; DTR: decision tree regression; KNNR: K nearest neighbor regression; AB: adaptive boosting; BA: bootstrap aggregating; SVR: support vector regression; NN: neural network; RMSE: root-mean-square error.
Based on the above analysis, for the research question Q4, we can draw the following conclusions:
Additionally, to better evaluate the results of different models, it is necessary to compare the performance of these machine learning models with the baseline models (i.e. ARIMA and the Naïve model) commonly used in tourism demand forecasting. For this purpose, the average RMSE of each machine learning model with the optimal NLIs on each data set is calculated, as shown in columns 2 to 9 of Table 8. The average RMSEs of ARIMA and the Naïve model on each data set are also calculated, as shown in the last two columns of Table 8. The results in Table 8 show that when each machine learning model uses its optimal NLIs, it significantly outperforms the baseline models (i.e. ARIMA and the Naïve model) on these data sets. However, if the optimal NLIs are not used, the performance of these models may not be as good as that of ARIMA in some cases. This provides evidence that the NLIs value significantly affects the accuracy of machine learning models in tourism demand forecasting.
Table 8. The average RMSEs of different machine learning models and baseline models on each data set.
Note: LR: linear regression; RR: ridge regression; DTR: decision tree regression; KNNR: K nearest neighbor regression; AB: adaptive boosting; BA: bootstrap aggregating; SVR: support vector regression; NN: neural network; RMSE: root-mean-square error; ARIMA: autoregressive integrated moving average.
Conclusions
This study experimentally explores how to select optimal NLIs values in international tourism demand forecasting based on machine learning models. A framework for tourism demand forecasting based on machine learning models is proposed, which has two parts: data conversion and model training. Based on the proposed framework, an experimental design is presented in which the international tourist arrivals in 10 European countries (Austria, Belgium, Finland, Germany, Greece, Italy, Luxembourg, the Netherlands, Portugal, and Sweden) from January 1995 to April 2018 are selected as the data sets; the performance of eight commonly used machine learning models (LR, RR, DTR, KNNR, AB, BA, SVR, and NN) is evaluated using NLIs values ranging from 1 to 60. The Wilcoxon test is used to verify whether there are significant differences between the prediction results obtained using different NLIs values and different models. Based on 19,200 experiments, the following conclusions can be drawn:
As NLIs increases, the RMSE of most machine learning models first decreases rapidly and then tends to be stable (or fluctuates around a certain value) when NLIs reaches a certain cutoff point. The cutoff point is usually related to 12 and its multiples.
The optimal NLIs value of different machine learning models is not affected by the size of the test set.
For the nonlinear machine learning models (i.e. DTR, KNNR, SVR, and NN) and ensemble machine learning models (i.e. AB and BA), it is better to select one cycle as the NLIs in international tourism demand forecasting. For the linear machine learning models (i.e. LR and RR), using one cycle as the NLIs yields acceptable prediction results, but better results may be obtained using multiple cycles. Specifically, it is better to select two cycles as the NLIs for LR and three cycles for RR. These results provide evidence that the optimal NLIs is related to the seasonality of tourism time series and can be determined according to that seasonality. It should be noted that one cycle is not always the optimal NLIs, especially for linear machine learning models. These results contribute to the development of a theory for selecting the optimal NLIs in tourism demand forecasting with machine learning models.
In most cases, there are significant differences among the prediction results obtained by different categories of models, that is, linear machine learning models, nonlinear machine learning models, and ensemble machine learning models, when the optimal NLIs is used. Surprisingly, the two simplest linear models (LR and RR) perform significantly better than the nonlinear machine learning models (DTR, KNNR, SVR, and NN) and ensemble machine learning models (AB and BA) in international tourism demand forecasting.
The above conclusions can help practitioners determine the optimal NLIs quickly and reasonably when using machine learning models to forecast international tourism demand, and they provide valuable information for making more effective policy and business decisions.
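The data conversion step and the sweep over NLIs values described in this section can be sketched as follows. This is a toy illustration under stated assumptions, not the experimental code of the study: the series is a made-up, noise-free pattern with a 12-month cycle, the learner is a plain 1-nearest-neighbor regressor written for this example, and the names `make_lagged`, `knn_predict`, and `rmse_for_nli` are hypothetical.

```python
from math import sqrt

def make_lagged(series, n_lags):
    """Data conversion: each input is the n_lags previous values, the target is the next value."""
    X = [series[t - n_lags:t] for t in range(n_lags, len(series))]
    y = [series[t] for t in range(n_lags, len(series))]
    return X, y

def knn_predict(X_train, y_train, x, k=1):
    """Average the targets of the k training windows closest to x (squared Euclidean distance)."""
    order = sorted(
        range(len(X_train)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], x)),
    )
    return sum(y_train[i] for i in order[:k]) / k

def rmse_for_nli(series, n_lags, n_test, k=1):
    """Train on all but the last n_test samples and return the one-step-ahead test RMSE."""
    X, y = make_lagged(series, n_lags)
    X_train, y_train = X[:-n_test], y[:-n_test]
    X_test, y_test = X[-n_test:], y[-n_test:]
    preds = [knn_predict(X_train, y_train, x, k) for x in X_test]
    return sqrt(sum((p - a) ** 2 for p, a in zip(preds, y_test)) / n_test)

# Made-up seasonal pattern repeated for 10 years (120 monthly observations).
pattern = [10, 12, 15, 20, 28, 35, 40, 35, 28, 20, 15, 12]
series = [float(v) for v in pattern * 10]

# Sweep a few candidate NLIs values and record the test RMSE for each.
results = {n: rmse_for_nli(series, n, n_test=24) for n in (1, 6, 12, 24)}
```

On this noise-free toy series, a single lag cannot distinguish the rising from the falling half of the cycle, whereas a full 12-month window identifies the position in the cycle exactly. Real arrival series contain trend and noise, so the sweep would need to be run on the actual data and models to locate the cutoff point.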
The study has some limitations, which may serve as avenues for future research. First, the conclusions are based on international tourist arrivals in 10 European countries, and further research is needed on the validity of these conclusions for other countries and regions. In addition, this study focuses only on univariate tourism demand forecasting; how to determine the NLIs in multivariate forecasting is therefore a promising direction for future research.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was partly supported by the Humanities and Social Science Fund of Ministry of Education of China under grant number 20YJC630002, the China Postdoctoral Science Foundation under grant numbers 2020T130318 and 2019M661000, the National Natural Science Foundation of China under grant numbers 71971124 and 71932005, and the Fundamental Research Funds for the Central Universities, NKU under grant number 63202074.
