Abstract
As a useful form of big data, search engine data (SED) on tourism-related factors have recently been introduced into tourist volume prediction, but they have been shown to affect the tourism market on different timescales (or frequency bands). This study develops a novel forecasting method that uses an emerging multiscale analysis, bivariate empirical mode decomposition (BEMD), to investigate these multiscale relationships. Three major steps are performed: (1) SED processing, which constructs an informative index from sufficient SED using statistical analyses; (2) multiscale analysis, which extracts scale-aligned common factors from the bivariate data of tourist volume and SED using BEMD; and (3) tourist volume prediction using the SED-based index. In the empirical study, the BEMD-based method with SED is used to forecast the tourist volume of Hainan, China, a global tourist attraction, and it significantly outperforms both popular techniques (which consider neither SED nor multiscales) and similar variants (which consider SED or multiscales) in accuracy and robustness.
Keywords
Introduction
Tourism forecasting has become an increasingly active topic in prediction research (Chen et al., 2012; Jiao and Chen, 2019; Volchek et al., 2019; Yang et al., 2015). According to existing studies, popular forecasting methods fall into two groups: econometric models (Athanasopoulos et al., 2018; Liang, 2014) and artificial intelligence (AI) models (Claveria et al., 2016; Sun et al., 2019). Popular econometric models for tourism forecasting include the autoregressive moving average (Du Preez and Witt, 2003; Shahrabi et al., 2013), linear regression (LR) (Li et al., 2016; Peng et al., 2017), and generalized autoregressive conditional heteroscedasticity models (Liang, 2014). Prevailing AI models include support vector regression (SVR) (Wu and Cao, 2016), back propagation neural networks (Li et al., 2018b), and the extreme learning machine (ELM) (Sun et al., 2019). Existing forecasting methods that focus on autoregressive analysis use the historical observations of the tourism market and disregard external factors (Athanasopoulos and de Silva, 2012; Jackman and Greenidge, 2010). However, some external factors have been shown to significantly drive the tourism market (Li et al., 2018a; Sun et al., 2019).
In the big data era, the promising idea of employing search engine data (SED) has emerged to capture various external factors and improve tourism forecasting, as demonstrated by the studies listed in Table 1. SED are widely regarded as a typical type of big data: they record web users' search behaviors on online search engines and directly and comprehensively reflect related external factors, which can serve as powerful predictors for the tourism market (Li et al., 2016; Li et al., 2018b; Peng et al., 2017). Table 1 indicates that diverse tourism-related SED have been introduced in tourism prediction, including SED on tourism (Sun et al., 2019), destinations (Pan et al., 2012), lodging (Bangwayo-Skeete and Skeete, 2015), traffic (Sun et al., 2019), scenic spots (Li et al., 2017; Yang et al., 2015), eating choices (Li et al., 2017), shopping (Li et al., 2017), weather (Yang et al., 2015), extreme events (Li et al., 2018a), web traffic (Gunter and Onder, 2016), and attraction tickets (Huang et al., 2017).
Available studies of tourism prediction using SED.
Note: SED: search engine data; ARMA: autoregressive moving average; AR-MIDAS: autoregressive mixed-data sampling, LR: linear regression; MA: moving average; ETS: error-trend-seasonal or exponential smoothing; DLM: dynamic linear model; SARIMA: seasonal autoregressive integrated moving average; GDFM: generalized dynamic factor model; BPNN: back propagation neural networks; KELM: kernel extreme learning machine.
SED on different tourism-related factors have exhibited different predictive powers for tourist volume on different timescales or frequency bands (i.e., spans of impacting periods) (Chen et al., 2012; Li et al., 2016). For instance, weather has been shown to affect tourist volume more significantly in the short term (Gössling et al., 2012; Li et al., 2018a, 2018b); extreme events (e.g., the deployment of the Terminal High Altitude Area Defense system in Korea in 2016) may have a medium-term effect on tourist volume; and scenic spots may exhibit a long-term predictive power (Li et al., 2016; Li et al., 2018a). The SED corresponding to each factor may therefore have a distinct multiscale relationship with tourist volume, which provides a new perspective for improving existing forecasting techniques in tourism prediction.
To model the multiscale relationship between SED and tourist volume, an emerging algorithm from the empirical mode decomposition (EMD) family is introduced in this article. The EMD family is a promising class of multiscale analysis methods with two distinctive advantages for modelling the multiscale relationship between different data. First, unlike traditional multiscale analyses (such as Fourier analysis and wavelet analysis) that use fixed bases (Xu et al., 2017), the EMD family employs empirical, adaptive, and flexible bases derived from the intrinsic features of the data, which makes it more suitable for nonstationary, nonlinear, and complex data (He et al., 2016). Second, the multivariate versions of EMD can process multiple data series jointly by extracting matched (or scale-aligned) common modes (or factors) with the same total number of modes and similar timescales, as in bivariate empirical mode decomposition (BEMD) (Chen et al., 2017; Xu et al., 2017) and multivariate empirical mode decomposition (MEMD) (Tang et al., 2020; Yuan et al., 2019). Notably, EMD family methods have already been applied in the field of prediction, including EMD for tourism volume prediction (Chen et al., 2012); BEMD for crude oil price (Wang et al., 2018) and electricity demand forecasting (Xiong et al., 2014); and MEMD for crude oil price (Tang et al., 2020) and PM2.5 (Yuan et al., 2019) prediction. However, no study has applied a multivariate version of EMD to tourism prediction. This study therefore closes this gap in the literature by introducing BEMD into tourism prediction to explore the multiscale relationship between tourist volume and sufficient SED.
The aim of this study is to propose a novel BEMD-based methodology with informative SED for tourist volume forecasting. The proposed methodology involves three main steps: SED process, multiscale analysis, and tourist volume prediction. First, basic statistical analyses, that is, principal component analysis and correlation analysis, are employed to construct an informative index from massive SED for various tourism-related factors. Second, a promising multiscale analysis for bivariate data, that is, BEMD, is introduced to capture the multiscale relationship between tourist volume and the informative SED by extracting matched modes on similar timescales. Third, a popular forecasting method, either a statistical or an AI model, is used to conduct an individual prediction on each timescale, and the final prediction is then obtained as a linear combination across timescales. Compared with existing studies, this study makes two major contributions: (1) it may be the first attempt to introduce BEMD to capture the multiscale relationship between tourist volume and the related informative SED, and to propose a novel BEMD-based method with SED for tourist volume forecasting; and (2) the effectiveness of the proposed method is empirically verified against popular forecasting techniques (original forms that do not consider SED or multiscales) and similar counterparts (that consider SED or multiscales) in predicting the tourist volume to Hainan province of China, which is a global tourist attraction.
The remaining parts of this article are organized as follows: The second section formulates the proposed methodology; the third section conducts an empirical study and discusses the effectiveness of the novel method; the fourth section concludes the study and outlines the promising directions for future research.
Methodology
This section forms a novel BEMD-based methodology with tourism-related SED for tourist volume forecasting. “Model framework” subsection presents the general framework of the methodology; “SED process, Multiscale analysis, and Tourist volume prediction” subsections elaborate the three main steps; and “Empirical design” subsection designs the empirical study.
Model framework
A BEMD-based methodology with SED is formulated for tourist volume forecasting with three main steps, namely the SED process, multiscale analysis, and tourist volume prediction, as illustrated by the general framework in Figure 1.

General framework of BEMD-based method with SED for forecasting tourist volume. BEMD: bivariate empirical mode decomposition; SED: search engine data.
SED process
With the boom of the big data era, the use of SED (e.g., from Google and Baidu) as exogenous variables for tourist volume prediction is becoming popular (Li et al., 2017). This step aims to collect sufficient SED regarding various tourism-related factors as the exogenous variables, that is, $S_{l,t}$, and to process them into a predictive index $Z_t$ with significant predictive power for the tourist volume $T_t$, where $T_t$ indicates the tourist volume at time t; $S_{l,t}$ is the lth series of SED; and $Z_t$ is the constructed SED-based index (SED-I). Four sub-steps are taken: SED collection, SED selection, index construction, and relationship investigation.
SED collection
Three processes are conducted to collect informative tourism-related SED. First, according to existing relevant research, eight popular tourism-related factors (i.e., tourism, destination, lodging, traffic, scenic spot, eating, shopping, and weather) are considered to generate the initial keywords and then obtain other related keywords using the Baidu search engine (https://index.baidu.com/) (Li et al., 2016). Second, each recommended keyword is used as the initial keyword to extend the keyword set until the creation of new keywords ceases (Yang et al., 2015). Finally, these keywords are employed to download the corresponding SED from the Baidu Index (Yang et al., 2015).
SED selection
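Effective SED with a significant predictive power for tourist volume are selected from the collected massive SED in terms of the Pearson correlation coefficient $\rho_{x,y}$. Specifically, for two series $x_t$ and $y_t$ (t = 1, …, T) with sample means $\bar{x}$ and $\bar{y}$,

$$\rho_{x,y} = \frac{\sum_{t=1}^{T}(x_t-\bar{x})(y_t-\bar{y})}{\sqrt{\sum_{t=1}^{T}(x_t-\bar{x})^2}\,\sqrt{\sum_{t=1}^{T}(y_t-\bar{y})^2}}$$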
Index construction
When the number of search query series is large, keeping all the exogenous variables in the model poses problems because of potential multicollinearity and overfitting in the forecasting model (Li et al., 2017). Therefore, this step removes the multicollinearity of the collected massive SED $S_l$ (l = 1, …, L) by constructing the informative SED-I Z. Principal component analysis (PCA) is conducted to extract key information from the SED series in terms of uncorrelated indexes, that is, principal components $Z_m$ (m = 1, …, M; M ≤ L), which are arranged in order of decreasing variance (Jolliffe and Cadima, 2016; Yao et al., 2017) as follows:
where
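The selection and index-construction sub-steps can be illustrated with a minimal sketch. It assumes that the SED series are stored as columns of an array, that the first principal component is taken as the SED-I, and that selection uses an absolute-correlation threshold; the function name `build_sed_index`, the array layout, and the synthetic data are hypothetical and are not the authors' implementation.

```python
import numpy as np

def build_sed_index(sed, volume, threshold=0.8):
    """Select SED series by |Pearson r|, then return the first principal component.

    sed    : array of shape (T, L), one column per keyword series (assumed layout)
    volume : array of shape (T,), tourist volume
    """
    sed = np.asarray(sed, dtype=float)
    volume = np.asarray(volume, dtype=float)
    # Pearson correlation of each keyword series with tourist volume.
    r = np.array([np.corrcoef(sed[:, l], volume)[0, 1] for l in range(sed.shape[1])])
    keep = np.abs(r) > threshold
    if not keep.any():
        raise ValueError("no series passes the correlation threshold")
    # Standardize the retained series, then diagonalize their covariance matrix.
    kept = sed[:, keep]
    z = (kept - kept.mean(axis=0)) / kept.std(axis=0)
    eigval, eigvec = np.linalg.eigh(np.atleast_2d(np.cov(z, rowvar=False)))
    order = np.argsort(eigval)[::-1]        # components in decreasing variance
    explained = eigval[order] / eigval.sum()
    index = z @ eigvec[:, order[0]]         # first principal component (sign is arbitrary)
    return index, keep, explained

# Synthetic illustration: three informative series and two pure-noise series.
rng = np.random.default_rng(0)
T = 83
volume = np.linspace(1.0, 2.0, T) + 0.1 * np.sin(2 * np.pi * np.arange(T) / 12)
informative = np.column_stack([volume * s + rng.normal(0, 0.02, T) for s in (1.0, 2.0, 0.5)])
noise = rng.normal(size=(T, 2))
index, keep, explained = build_sed_index(np.column_stack([informative, noise]), volume)
```

Because the sign of a principal component is arbitrary, the index may need to be flipped so that it correlates positively with tourist volume.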
Relationship investigation
Two typical relationship analyses, that is, co-integration test and Granger causality test, are applied to test the predictive power of the constructed SED-I for tourist volume in terms of a significant co-integration and a Granger causality to tourist volume.
In the co-integration test, the long-run relationships between two (or more) variables are estimated, and the residuals from the regressions are tested for stationarity (Khan et al., 2005; Kumar Narayan and Smyth, 2006). Popular co-integration analysis methods include the Engle–Granger test (Engle and Granger, 1987) and the Johansen test (Johansen, 1991). In particular, the Engle–Granger method is typically used to test the co-integration between two time series variables (Khan et al., 2005), while the Johansen method is suited to multivariable time series (Croes and Vanegas, 2008; Dritsakis, 2004; Johansen, 1988, 1991, 1995). If the time series variables share a common stochastic trend and their first differences are stationary, the variables may be jointly co-integrated (Dritsakis, 2004; Khan et al., 2005). In this study, the Engle–Granger method is employed to estimate the stable long-run (equilibrium) linear relationship between the bivariate time series data (tourist volume and SED-I), in which the two parent series must be nonstationary, I(1), and their linear combination must be stationary, I(0) (Khan et al., 2005; Webber, 2001). In particular, if two time series variables (i.e., $x_t$ and $y_t$) are co-integrated, they are integrated of the same order (i.e., 1 or greater), and the residual $u_t$ of the LR between them is stationary, as follows (Engle and Granger, 1987; Granger, 1969; Khan et al., 2005):

$$y_t = a_0 + a_1 x_t + u_t$$

where $a_0$ is the constant and $a_1$ is the coefficient of LR. The stationarity of the residual $u_t$, which corresponds to co-integration, is assessed with a unit root test, namely the augmented Dickey–Fuller test (Dickey and Fuller, 1979; Said and Dickey, 1984).
The Granger causality test assumes that the series xt does not strictly Granger cause yt if the following (Granger, 1969; Sun et al., 2019):
where
where ut and vt are the residuals of regression. A standard joint test (e.g. F- or χ2-test) is performed to verify if the coefficients
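As an illustration of the joint test, the following sketch computes the F-statistic that compares an autoregression of y with and without lagged x. It is a simplified, NumPy-only version under stated assumptions (ordinary least squares, a common lag order for both series, synthetic data); the function name `granger_f_stat` is hypothetical, and a p-value would additionally require the F distribution with (lags, dof) degrees of freedom.

```python
import numpy as np

def granger_f_stat(x, y, lags=1):
    """F-statistic for H0: lagged x adds nothing to an autoregression of y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y) - lags
    target = y[lags:]
    # Column j holds the series delayed by j steps, aligned with `target`.
    y_lags = np.column_stack([y[lags - j:len(y) - j] for j in range(1, lags + 1)])
    x_lags = np.column_stack([x[lags - j:len(x) - j] for j in range(1, lags + 1)])
    const = np.ones((n, 1))

    def rss(design):
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        return float(resid @ resid)

    rss_restricted = rss(np.hstack([const, y_lags]))       # y on its own lags only
    rss_full = rss(np.hstack([const, y_lags, x_lags]))     # plus lagged x
    dof = n - (2 * lags + 1)
    return ((rss_restricted - rss_full) / lags) / (rss_full / dof)

# Synthetic illustration: x leads y by one step, so x should Granger-cause y.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = np.zeros(200)
y[1:] = 0.8 * x[:-1] + rng.normal(scale=0.1, size=199)
f_xy = granger_f_stat(x, y)   # large: lagged x explains y
f_yx = granger_f_stat(y, x)   # small: lagged y does not explain x
```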
Multiscale analysis
Univariate EMD versions only work well if the variables involved are loosely related to each other (Rehman and Mandic, 2009) or they suffer complications, such as nonuniformity and nonalignment of timescales (Adarsh, 2016; Wei and Chen, 2012). Moreover, the EMD may produce oscillations with very disparate scales in one mode or with similar scales in different modes, which results in the “mode mixing” phenomenon (Colominas et al., 2014). To alleviate this phenomenon, the bivariate approach was proposed (Rilling et al., 2007), that is, BEMD, which treats the bivariate data as fast rotations superimposed on a slow rotation and captures the intrinsic mode functions (IMFs) jointly, synchronously, and coherently, guaranteeing that the timescales are aligned with each other (Rehman and Mandic, 2009).
In the case of bivariate data, an emerging multiscale analysis, that is, BEMD, is employed to model the multiscale relationship between tourist volume and the SED-I. BEMD is a bivariate version of the original EMD, designed especially to address high correlation and codependence in bivariate data. BEMD considers the bivariate data as fast rotations superimposed on a slower rotation and jointly extracts IMFs with timescales aligned with each other (Rilling et al., 2007). In detail, BEMD captures the multiscale relationship by extracting n pairs of matched (or scale-aligned) common modes on similar timescales, namely IMFs (i.e.,
The envelope of bivariate data is a three-dimensional tube that encloses the signal in terms of signal projections in different directions. Given bivariate data x(t) (t = 1, …, T), the scale-aligned IMFs and residue are extracted by the following eight steps (He et al., 2016; Huang et al., 1998):
(1) Compute the projection directions $L_k = 2k\pi/K$ (k = 1, …, K), where K is the number of projection directions.
(2) Compute the projection $p_{L_k}(t)$ of the bivariate input x(t) in direction $L_k$, that is, $p_{L_k}(t) = \mathrm{Re}[e^{-jL_k}x(t)]$.
(3) Identify the local maxima of the projection $p_{L_k}(t)$ and record the corresponding time indexes $t_{L_k}$.
(4) Generate the envelope $e_{L_k}(t)$ in direction $L_k$ by spline interpolation, and obtain the envelopes in all K directions by repeating steps (2) to (4).
(5) Calculate the mean m(t) of all K envelopes.
(6) Calculate the difference between the time series data x(t) and the mean value m(t): d(t) = x(t) − m(t).
(7) Check whether d(t) satisfies the conditions of an IMF. If not, replace x(t) with d(t) and repeat steps (2) to (6).
(8) If d(t) satisfies all the requirements of an IMF, calculate the residue r(t) = x(t) − d(t).
The sifting process stops when the residue satisfies one of the following termination criteria (Wei and Chen, 2012). First, the residue or the IMF is smaller than the predetermined threshold or becomes a monotonic function, such that no more IMFs can be extracted. Second, the number of zero crossings and extrema is the same as that of the successive sifting step.
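Steps (1) to (6) can be sketched as a single sifting pass on a complex-valued series, where the real and imaginary parts carry the two variables. This is a simplified illustration under explicit assumptions: linear interpolation (`np.interp`) is swapped in for the spline interpolation described above, each envelope interpolates the complex signal values at the projected maxima, the envelopes are averaged with equal weights, and the IMF check and stopping criteria are omitted. The function name `bemd_sift_once` and the test signal are hypothetical.

```python
import numpy as np

def bemd_sift_once(x, n_dirs=8):
    """One sifting pass on a complex series x: returns (detail d, envelope mean m)."""
    x = np.asarray(x, dtype=complex)
    t = np.arange(len(x))
    envelopes = []
    for k in range(1, n_dirs + 1):
        phi = 2 * np.pi * k / n_dirs                   # step (1): projection direction
        p = np.real(np.exp(-1j * phi) * x)             # step (2): projection
        is_max = (p[1:-1] > p[:-2]) & (p[1:-1] >= p[2:])
        idx = np.flatnonzero(is_max) + 1               # step (3): local maxima
        if len(idx) < 2:
            continue
        # Step (4): envelope through the signal values at the maxima.
        # ASSUMPTION: linear interpolation stands in for the spline.
        env = np.interp(t, idx, x[idx].real) + 1j * np.interp(t, idx, x[idx].imag)
        envelopes.append(env)
    if not envelopes:
        raise ValueError("no direction has enough local maxima")
    m = np.mean(envelopes, axis=0)                     # step (5): mean envelope
    return x - m, m                                    # step (6): d(t) = x(t) - m(t)

# Synthetic illustration: a fast rotation superimposed on a slow rotation.
t = np.arange(600)
slow = 2.0 * np.exp(2j * np.pi * t / 200)
fast = np.exp(2j * np.pi * t / 20)
d, m = bemd_sift_once(slow + fast)
interior = slice(100, 500)                             # ignore boundary effects
err = float(np.mean(np.abs(m[interior] - slow[interior])))
```

On this synthetic signal the mean envelope tracks the slow rotation away from the boundaries, so the detail d approximates the fast rotation; a full decomposition would iterate this pass until the IMF conditions hold.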
Tourist volume prediction
This step conducts an individual prediction at each timescale using an effective forecasting technique and obtains the final prediction as a linear combination across all timescales. In particular, an effective statistical technique (e.g., LR or seasonal autoregressive integrated moving average (SARIMA)) or AI technique (e.g., SVR, ELM, or random vector functional link (RVFL)) is introduced to model each pair of common modes and generate the corresponding individual prediction
LR
LR is a basic prediction model that may suffer from spurious regression (He et al., 2017). However, as a popular econometric model, LR is commonly used as a benchmark for prediction with SED, for example by Li et al. (2016) and Peng et al. (2017). Therefore, this article employs the LR model as a benchmark model to predict tourist volume, which can be expressed as follows:

$$y = \beta_0 + \sum_{i=1}^{k} \beta_i x_i + \varepsilon$$

where $x_i$ (i = 1, …, k) is the ith input, $\beta_j$ (j = 0, 1, …, k) are the LR coefficients, $y$ is the dependent variable, and $\varepsilon$ is the residual.
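A minimal sketch of how such an LR could be fitted with lagged inputs is given below. It assumes that the inputs are the most recent values of the tourist volume and of the SED index and that the coefficients are estimated by ordinary least squares; the function name `fit_lagged_lr`, the lag order, and the synthetic data are hypothetical.

```python
import numpy as np

def fit_lagged_lr(volume, index, lags=2, horizon=1):
    """OLS of volume[t + horizon] on the last `lags` values of volume and index."""
    volume = np.asarray(volume, dtype=float)
    index = np.asarray(index, dtype=float)
    rows, targets = [], []
    for t in range(lags - 1, len(volume) - horizon):
        past = np.concatenate([volume[t - lags + 1:t + 1], index[t - lags + 1:t + 1]])
        rows.append(np.concatenate([[1.0], past]))   # leading 1 carries beta_0
        targets.append(volume[t + horizon])
    X, y = np.array(rows), np.array(targets)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # least-squares coefficients
    return beta, X, y

# Synthetic illustration with known coefficients (1.0, 0.5, 0.3) and no noise.
rng = np.random.default_rng(2)
idx_series = rng.normal(size=60)
vol_series = np.zeros(60)
for t in range(59):
    vol_series[t + 1] = 1.0 + 0.5 * vol_series[t] + 0.3 * idx_series[t]
beta, X, y = fit_lagged_lr(vol_series, idx_series, lags=1, horizon=1)
```

Setting `horizon` to 2, 3, or 4 gives a direct multistep variant, in which a separate regression is fitted for each forecasting horizon.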
SARIMA
SARIMA is a popular econometric model for time series data. It has the common form

$$\varphi_p(B)\,\Phi_P(B^S)\,(1-B)^d\,(1-B^S)^D\,x_t = \theta_q(B)\,\Theta_Q(B^S)\,e_t$$

where $x_t$ is the observed value at time t (t = 1, …, n), $e_t$ is the error, $B$ is the backshift operator, $S$ is the seasonal period, $(1-B)^d$ is the nonseasonal differencing operation, $(1-B^S)^D$ is the seasonal differencing operation, $\varphi_p(B)$ is the autoregressive operator, $\Phi_P(B^S)$ is the seasonal autoregressive operator, $\theta_q(B)$ is the moving average operator, and $\Theta_Q(B^S)$ is the seasonal moving average operator.
SVR
SVR is a typical AI tool that maps the original data into a high-dimensional space and minimizes the generalization error via LR (Cortes and Vapnik, 1995). The training data can be described as
where f(xi) denotes the prediction result for yi,
where
ELM
ELM is an extended version of the single hidden layer feedforward neural network (SLFN), in which the main model parameters (i.e., the weights and biases of the hidden nodes) are randomly generated without iterative training. Thus, compared with popular iterative search methods, the ELM significantly enhances the learning speed while still achieving good generalization performance (Tang et al., 2015).
For a typical SLFN with M hidden nodes and N samples (xi, yi) (i = 1,2,…, N) (where
where g(x) is the activation function,
In ELM, the weights wi and biases bi of the hidden nodes are randomly fixed without a training process (Huang et al., 2006) to generate the output matrix of the hidden layer H according to equation (14), while the output weights β are obtained by
RVFL
Similarly, RVFL can also randomly generate the weights and biases of hidden nodes without an iterative training process. In contrast to ELM, RVFL incorporates a direct link from the input nodes to the output nodes
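The difference between ELM and RVFL can be illustrated with a small sketch. It assumes Gaussian random hidden weights and biases, a sigmoid activation, and output weights obtained by a least-squares fit through the Moore–Penrose pseudo-inverse; RVFL is represented by appending the raw inputs to the hidden-layer outputs as a direct link. The function name `fit_random_network`, the number of hidden nodes, and the toy data are hypothetical.

```python
import numpy as np

def fit_random_network(X, y, n_hidden=30, direct_link=False, seed=0):
    """Fit an ELM (direct_link=False) or an RVFL-style network (direct_link=True)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))   # random input weights, never trained
    b = rng.normal(size=n_hidden)                 # random hidden biases, never trained

    def features(A):
        H = 1.0 / (1.0 + np.exp(-(A @ W + b)))    # sigmoid hidden-layer output
        # The direct link passes the raw inputs straight to the output layer.
        return np.hstack([H, A]) if direct_link else H

    beta = np.linalg.pinv(features(X)) @ y        # least-squares output weights
    return lambda A: features(A) @ beta

# Toy illustration: fit one period of a sine curve.
X_demo = np.linspace(-3, 3, 200).reshape(-1, 1)
y_demo = np.sin(X_demo[:, 0])
elm = fit_random_network(X_demo, y_demo)
rvfl = fit_random_network(X_demo, y_demo, direct_link=True)
elm_mse = float(np.mean((elm(X_demo) - y_demo) ** 2))
rvfl_mse = float(np.mean((rvfl(X_demo) - y_demo) ** 2))
```

Only the output weights are estimated, in a single linear solve, which is why no iterative training is needed in either network.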
Empirical design
To verify the effectiveness of the novel BEMD-based methodology with SED, the tourist volume to Hainan of China, which is a global tourist attraction, is considered to be the target, and popular and similar forecasting techniques are introduced as benchmarks. The following subsections present the study data, benchmarks, evaluation criteria, and model specification.
Data descriptions
The domestic tourist volume to Hainan, China, which is a global tourist attraction, is taken as the study sample (Li et al., 2017). The monthly tourist volumes are obtained from Wind (http://www.wind.com.cn/) for the period January 2011 to November 2017 (83 samples). In addition, SED are used as exogenous variables for tourist volume prediction. Following existing studies, 219 tourism-related SED series, covering eight popular tourism-related factors (i.e., tourism, destination, lodging, traffic, scenic location, eating, shopping, and weather), are collected from the Baidu search engine (https://index.baidu.com/) (Li et al., 2017). Furthermore, Pearson correlation analysis is conducted, and 16 keywords with an absolute correlation coefficient above 0.8 are selected as powerful predictors. To retain the information carried by the selected SED (i.e., exogenous variables) while avoiding potential multicollinearity and overfitting, PCA is employed to construct an SED index. This SED index reflects the common components of the SED, representing approximately 86.54% of the total variance. A strong concordance between tourist volume and the SED index is present. To predict future tourist volumes, the tourist volume and SED index with lagged historical observations are selected as inputs. All the time series data (i.e., monthly tourist volumes and SED index), as the independent variables, are then divided into a training dataset (the first 90% of the sample period) and a testing dataset (the last 10% of the sample period) to construct the forecasting model. Multistep-ahead predictions at horizons of 1–4 month(s) are conducted.
Figure 2 illustrates the normalized series of the tourist volume and the SED-I. A potential positive impact of the constructed SED-I on the tourist volume to Hainan is observed, with a Pearson coefficient of 0.88. The time series clearly display a trend and a seasonal pattern. Therefore, the X-12 seasonal adjustment method (Madaleno et al., 2017) is employed in this study to further analyze the trend, seasonal, and irregular components of the time series, as shown in Figures S1 and S2 in the Online Supplemental Material. The comparison results confirm that (1) the SED-I and the tourist volume exhibit similar general trends, with both showing a significant and fluctuating ascent from January 2011 to November 2017; (2) the two series have distinct seasonality characteristics, reaching troughs in the low seasons for tourism (from January to April each year) and peaks in the high seasons (from August to October); and (3) a regular lead–lag relationship exists, that is, the troughs of SED-I usually precede those of the tourist volume by approximately 1 month in the low seasons, while the peaks lag by approximately 2 months in the high seasons because of a long increasing period of online searching behaviors (Yang et al., 2015; Zhang et al., 2017). These patterns, that is, similar general trends, distinct seasonality features, and a regular lead–lag relationship, imply a strong potential predictive power of the SED for tourist volume.
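The chronological split described above can be sketched as follows. The rounding rule is an assumption, since the text states only the 90%/10% proportions; the function name `chronological_split` is hypothetical.

```python
def chronological_split(series, train_fraction=0.9):
    """Split a time-ordered sequence without shuffling: early part trains, late part tests."""
    cut = int(len(series) * train_fraction)   # ASSUMPTION: round down to a whole month
    return series[:cut], series[cut:]

# 83 monthly observations (January 2011 to November 2017), indexed 0..82.
months = list(range(83))
train, test = chronological_split(months)
```

Under this rounding rule, 83 observations yield 74 training months and 9 testing months.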

Normalized series of tourist volume to Hainan of China (blue) and the SED-based index (red). SED: search engine data.
Furthermore, to verify the robustness of the proposed model, the proposed methods are compared with the X-12-based methods (without the seasonal component), as shown in Figure S3 in the Online Supplemental Material. The performance of the X-12-based model (without the seasonal component) is essentially equal to that of the proposed model. This result indicates that the proposed BEMD-based model already accounts for the various periodicities (including the seasonal component) in its inputs (Yu et al., 2015). Therefore, a separate seasonal analysis might not be necessary.
Benchmarks
Using the proposed framework (refer to Figure 1), five popular forecasting techniques, that is, LR, SARIMA, SVR, ELM, and RVFL, are introduced to form five novel BEMD-based methods with lagged SED (denoted as the M4 methods), that is, SED-BEMD-LR, SED-BEMD-SARIMA, SED-BEMD-SVR, SED-BEMD-ELM, and SED-BEMD-RVFL. To verify the superiority of the proposed method, three types of benchmarks are designed for comparison: (1) original forms without SED or multiscale analysis, (2) extended variants with SED, and (3) extended variants based on EMD.
Evaluation criteria
In the field of prediction, various evaluation criteria have been used to assess forecast accuracy, including the mean absolute percent error (MAPE), root mean square error (RMSE), mean square error (MSE), root mean square percentage error (RMSPE), and mean absolute deviation (Hyndman and Koehler, 2006; Li et al., 2015; Wang et al., 2018; Witt and Witt, 1992). These evaluation criteria essentially measure the degree of dispersion between predicted and actual values (Chu, 2009; Kim and Malek, 2018). Previous studies usually choose two criteria (Sun et al., 2019; Tang, 2018), three criteria (Xu et al., 2019; Zhang et al., 2017), or four criteria (Li et al., 2018b) to reflect forecast accuracy. In this study, the MAPE and RMSE are used as the evaluation criteria, which is consistent with previous research on tourism demand forecasting (Li et al., 2017; Li and Rob, 2019; Shahrabi et al., 2013; Song et al., 2011; Sun et al., 2019):
where D is the size of the testing data set, and
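For reference, the two criteria can be computed as in the sketch below, which follows their standard definitions. Whether MAPE is reported as a fraction or multiplied by 100 is an assumption here (a fraction is returned); the function names and sample values are hypothetical.

```python
import math

def mape(actual, predicted):
    """Mean absolute percent error, returned as a fraction (0.05 means 5%)."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error, in the same units as the data."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Illustrative values only, not data from the study.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
mape_value = mape(actual, predicted)   # (0.1 + 0.1 + 0.0) / 3
rmse_value = rmse(actual, predicted)   # sqrt((100 + 400 + 0) / 3)
```

MAPE is scale-free and therefore comparable across series, whereas RMSE penalizes large errors more heavily and keeps the units of the tourist volume.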
From a statistical perspective, the Diebold–Mariano (DM) test is performed (Diebold and Mariano, 1995). Given the loss function of the MSE, the DM statistic can be defined as follows:
where
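A minimal sketch of the DM statistic under squared-error loss is given below. It assumes one-step-ahead forecasts (so that only the lag-0 autocovariance of the loss differential enters the variance) and a two-sided p-value from the standard normal distribution; longer horizons would require additional autocovariance terms. The function name `dm_test` and the sample errors are hypothetical.

```python
import math

def dm_test(errors_a, errors_b):
    """Diebold-Mariano statistic with squared-error loss (one-step-ahead case)."""
    d = [ea ** 2 - eb ** 2 for ea, eb in zip(errors_a, errors_b)]   # loss differential
    n = len(d)
    mean_d = sum(d) / n
    gamma0 = sum((v - mean_d) ** 2 for v in d) / n                  # lag-0 autocovariance
    stat = mean_d / math.sqrt(gamma0 / n)
    # Two-sided p-value under the asymptotic standard normal distribution.
    p_value = math.erfc(abs(stat) / math.sqrt(2))
    return stat, p_value

# Illustrative forecast errors only: model B has uniformly smaller errors.
errors_a = [1.0, -1.0, 2.0, -2.0]
errors_b = [0.5, -0.5, 1.0, -1.0]
dm_stat, dm_p = dm_test(errors_a, errors_b)
```

A positive statistic means that the first model has the larger average loss; with so few observations the normal approximation is only indicative.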
Model specification
In SVR, the Gaussian RBF kernel function is employed; the regularization and kernel parameters are determined via the grid search method (Tang et al., 2012). In the ELM and RVFL, the number of hidden nodes is similarly determined via the grid searching method; the input weights and hidden biases are randomly fixed based on the Gaussian distribution; the sigmoidal function
Empirical results
For illustration and verification, the novel BEMD-based learning paradigms with SED, that is, SED-BEMD-LR, SED-BEMD-SARIMA, SED-BEMD-SVR, SED-BEMD-ELM, and SED-BEMD-RVFL, are used to predict the tourist volume to Hainan, China, and their results are compared with those of the corresponding single and similar counterparts (refer to Table 2). The results for the three major steps of the methodology, that is, SED process, multiscale analysis, and prediction, are presented and discussed in subsections “SED process, Multiscale analysis, and Prediction results.”
Learning paradigms of proposed methodology (M4) and benchmarking methods (M1–M3).
Note: SED: search engine data; LR: linear regression; BEMD: bivariate empirical mode decomposition; SARIMA: seasonal autoregressive integrated moving average; SVR: support vector regression; ELM: extreme learning machine; RVFL: random vector functional link; EMD: empirical mode decomposition. “√” denotes that SED is considered in the related learning paradigm, whereas “—” shows that SED or multiscales are not considered.
SED process
In the first step, that is, SED process, informative SED regarding various tourism-related factors are collected and then transformed into an effective predictive index. Four sub-steps are taken, that is, SED collection, SED selection, index construction, and relationship investigation.
To statistically measure the predictive power of the constructed SED-I for tourist volume (i.e., the fourth sub-step), two typical relationship analyses, that is, the co-integration test and the Granger causality test, are applied. In a statistical test, significance is judged against a probability threshold chosen by the researcher (the significance level), which bounds the probability of wrongly rejecting a true null hypothesis. The commonly used levels of 0.1, 0.05, and 0.01 cap this probability at 10%, 5%, and 1%, respectively, and the null hypothesis is rejected when the p-value falls below the chosen level (Page, 2014). In this study, Table 3 presents the related testing results and the p-values (below the 1% significance level), which provide evidence supporting the relationship between tourist volume and SED-I. The co-integration test indicates that the SED-I series is co-integrated with the tourist volume in the long term, since the null hypothesis of no co-integration is rejected at the 1% significance level. The Granger causality test similarly indicates that the SED-I series Granger-causes the tourist volume, since the null hypothesis that SED-I does not Granger-cause tourist volume is rejected at the 1% significance level. These findings imply that the informative SED-I, which has a significant relationship with the tourist volume, can be introduced as a powerful predictor for tourist volume and can improve existing forecasting techniques.
Co-integration and Granger causality test results.
Note: SED-I: search engine data-based index.
Multiscale analysis
In this step, the multiscale relationship between the tourist volume to Hainan, China, and the constructed SED-I is explored with BEMD by extracting matched (or scale-aligned) common modes, which yields IMFs at different frequencies (or timescales) for each time series. All of the IMFs present changing frequencies and amplitudes: IMFs with lower frequencies (i.e., longer timescales) present long-term oscillations, and those with higher frequencies reflect short-term fluctuations (Wang et al., 2018). Figure 3 shows the extracted scale-aligned modes in increasing order of timescale, and Table 4 presents the timescales based on a Fourier transform.

Scale-aligned modes of (a) tourist volume and (b) SED-I extracted by BEMD. SED-I: search engine data-based index; BEMD: bivariate empirical mode decomposition.
Timescales of the common modes extracted by BEMD (months).
Note: BEMD: bivariate empirical mode decomposition; SED-I: search engine data-based index; IMF: intrinsic mode functions.
As shown in Figure 3 and Table 4, each pair of modes for the two series (i.e., tourist volume and SED-I) has quite similar timescales, indicating the good aligning effect of BEMD in generating matched IMFs, in terms of both the same number of IMFs (i.e., two IMFs) and similar timescales within each pair (i.e., 8–12 months for IMF 1 and 24 months for IMF 2). Therefore, an important conclusion is that BEMD applied to the bivariate data of tourist volume and SED-I can effectively detect the common factors in the two time series, according to the similar timescales within a pair but different timescales across pairs, which is basically consistent with the existing literature (Wang et al., 2018; Yu et al., 2015). This satisfactory multiscale analysis may therefore help model the relationship between the tourist volume and SED-I and improve the prediction precision for the tourist volume.
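One common way to read a timescale off a mode with a Fourier transform is to take the period of the largest spectral peak. The sketch below illustrates this under the assumption of monthly sampling and a periodogram computed with the fast Fourier transform; the function name `dominant_period` and the synthetic modes are hypothetical, and the exact procedure used for Table 4 is not specified in the text.

```python
import numpy as np

def dominant_period(mode, dt=1.0):
    """Period (in units of dt) at the largest peak of the periodogram of `mode`."""
    mode = np.asarray(mode, dtype=float)
    spectrum = np.abs(np.fft.rfft(mode - mode.mean())) ** 2
    freqs = np.fft.rfftfreq(len(mode), d=dt)
    peak = np.argmax(spectrum[1:]) + 1      # skip the zero-frequency bin
    return 1.0 / freqs[peak]

# Synthetic illustration: 96 monthly points of an annual and a biennial cycle.
t = np.arange(96)
annual_period = dominant_period(np.sin(2 * np.pi * t / 12))
biennial_period = dominant_period(np.sin(2 * np.pi * t / 24))
```

For short series the frequency grid is coarse, so an estimated period is limited to values of the form N/k months for a series of length N, which may explain why a range such as 8–12 months is reported for one mode.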
Prediction results
The final prediction is generated in a linear combination of the prediction results for tourist volume with related SED across timescales. To verify the effectiveness of the proposed method (i.e., M4 models in Table 2), a comprehensive comparison with three types of benchmarks without SED and/or EMD (M1-3 models) is conducted. Figures 4 and 5 show the comparison results of the criteria (i.e., MAPE and RMSE (bars)), respectively, with the corresponding uncertainty ranges (error bars), and Table 5 displays the results of the DM test.

Performance comparison of different methods in terms of MAPE. (a) Horizon 1; (b) horizon 2; (c) horizon 3; and (d) horizon 4. MAPE: mean absolute percent error.

Performance comparison of different methods in terms of RMSE. RMSE: root mean square error.
Results of DM test for different learning paradigms in terms of statistic (p-value).
Note: SED: search engine data; LR: linear regression; SARIMA: seasonal autoregressive integrated moving average; SVR: support vector regression; ELM: extreme learning machine; RVFL: random vector functional link. T = 83 denotes the length of the sample period.
Generally, it can be obviously obtained that the proposed BEMD-based method with SED (i.e., M4 models) is superior to the benchmarks for forecasting the tourist volume to Hainan of China at a confidence level of 90%. According to the error bars in Figures 4 and 5, the uncertainty range is consistently small and stable, which verifies the robustness of the proposed model. These results further demonstrate that the proposed BEMD-based methodology with SED can be employed as an effective forecasting technique for tourist volume, particularly in the era of big data.
Comparing the methods that consider SED (i.e., M2 and M4) with the corresponding benchmarks without SED (M1 and M3, respectively) shows that SED, as an exogenous variable, can enhance prediction accuracy in tourism forecasting. In terms of MAPE and RMSE, the forecasting methods with SED, that is, the M4 models (green bars) and M2 models (pink), significantly outperform the corresponding benchmarks without SED, that is, M3 models (yellow) and M1 models (blue), respectively, in most cases. For example, introducing SED into the M3 models to form the novel BEMD-based models, that is, SED-BEMD-LR, SED-BEMD-SARIMA, SED-BEMD-SVR, SED-BEMD-ELM, and SED-BEMD-RVFL, reduces the MAPE (and RMSE) by approximately 0.9% (5.0%), 1.5% (−0.6%), 27.6% (25.4%), 7.0% (13.6%), and 6.2% (11.2%), respectively, on average; the negative value for SARIMA indicates a slight increase in RMSE. Among the single models with SED, the M2 models (pink) are clearly superior to the corresponding M1 models (blue), with the MAPE (and RMSE) reduced by approximately 33.7% (25.9%), 11.3% (3.7%), 12.4% (5.2%), and 11.4% (12.4%) for LR, SARIMA, ELM, and RVFL, respectively. The DM test statistically supports the conclusion that SED improves prediction accuracy at a confidence level of 90% in most cases (refer to panel A in Table 5).
Some exceptions exist. For example, the SED-SVR method (pink) is inferior to the SVR model (blue) in prediction accuracy, which implies that a single model without multiscale analysis, particularly the SVR model, often suffers from local optima and parameter sensitivity (Tang et al., 2018), which hinders analysis of the complicated relationship between the tourist volume and the SED. Thus, multiscale analysis-based models have become increasingly prevalent in the field of tourism forecasting (Chen et al., 2012). With multiscale analysis, the M4 methods with SED (green) perform worse than their respective benchmarks without SED (M3 models) in one-step-ahead prediction (Figures 4(a) and 5(a)) but significantly outperform them at prediction horizons above 1 month, which is consistent with the strong predictive power of SED for tourist volume over longer horizons.
The multiscale analysis-based methods (i.e., M4 and M3 models) significantly outperform the corresponding single benchmarks (i.e., M2 and M1 models) in all cases. As shown in Figures 4 and 5, the criteria (MAPE and RMSE) of the two multiscale methods (last two bars) are considerably lower than those of the two single models (first two bars). For example, the MAPE (and RMSE) of the novel BEMD-based models are approximately 39.5% (50.9%), 3.3% (−0.7%), 65.7% (64.4%), 66.4% (68.1%), and 64.8% (65.8%) lower than those of the corresponding benchmarks (i.e., M2 models) across the five forecasting techniques, respectively. The DM test confirms the superiority of the multiscale analysis-based methods over the single methods at a confidence level of 90% (refer to panel B in Table 5). These results also confirm that multiscale analysis improves prediction performance by effectively capturing the multiscale relationship between tourist volume and the related SED. Among the individual techniques, none of the five considered (i.e., LR, SARIMA, SVR, ELM, and RVFL) uniformly outperforms the others in all cases. However, all of them are significantly improved within the BEMD-based analysis framework: the five novel M4 learning paradigms, that is, SED-BEMD-LR, SED-BEMD-SARIMA, SED-BEMD-SVR, SED-BEMD-ELM, and SED-BEMD-RVFL, are superior to the corresponding original forecasting techniques (i.e., LR, SARIMA, SVR, ELM, and RVFL), respectively, without exception. The results further verify the robustness and universality of the BEMD-based methodology in the field of tourism forecasting.
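The percentage reductions quoted above are relative changes in an error criterion. A minimal sketch of the two criteria and of the relative reduction, assuming MAPE is expressed as a fraction and the reduction is measured against the benchmark's error, is given below. The function names and the input numbers are illustrative, not values from the study.

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percent error, returned as a fraction (0.10 = 10%)."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((a - p) / a)))

def rmse(actual, predicted):
    """Root mean square error."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((a - p) ** 2)))

def relative_reduction(benchmark_error, model_error):
    """Percent by which the model lowers the benchmark's error.

    A negative result means the model's error is higher than the benchmark's.
    """
    return 100.0 * (benchmark_error - model_error) / benchmark_error

# Made-up example: the benchmark has an error of 0.20 and the model 0.15.
print(relative_reduction(0.20, 0.15))
print(relative_reduction(0.20, 0.21))  # negative: the model is slightly worse
```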
Conclusions and discussion
This article proposes a novel BEMD-based methodology with SED for forecasting tourist volume, in which the informative SED regarding various tourism-related factors are employed and analyzed in terms of their multiscale relationship with tourist volume. The proposed methodology involves three main steps, that is, SED process, multiscale analysis, and tourist volume prediction. First, the SED about various tourism-related factors are collected and transformed into an index with predictive power for tourist volume. Second, a promising multiscale analysis of bivariate data, that is, BEMD, is introduced to capture the complicated multiscale relationship between the tourist volume and SED by extracting matched common factors on similar timescales. Third, a popular prediction technique (either a statistical model or an AI model) is employed to model the tourist volume based on SED on each extracted timescale, and the ensemble prediction is calculated as a linear combination of the individual predictions on different timescales. Relative to existing research, this study makes two major contributions to the literature: (1) it is the first attempt to introduce BEMD to capture the multiscale relationship between tourist volume and the related informative SED, thereby forming a novel BEMD-based method with SED; and (2) it verifies the effectiveness of the proposed model in comparison with popular forecasting techniques (original forms that do not consider SED or multiscales) and similar counterparts (that consider SED or multiscales).
Using the tourist volume of Hainan province, China, as a study sample, the empirical results statistically support the effectiveness of the novel BEMD-based forecasting method considering SED in terms of prediction accuracy and model robustness. First, the predictive power of the SED is statistically verified by a set of predictive power tests and a comprehensive comparison between the methods with and without SED. Second, the effectiveness of multiscale analysis in improving prediction performance is statistically confirmed by a thorough comparison between the novel models (which consider the multiscale relationship between the tourist volume and SED) and their counterparts (which do not). Therefore, by considering both informative SED and effective multiscale analysis, the proposed method significantly outperforms both the original variants (without SED and multiscales) and similar variants (that consider SED or multiscales) in prediction accuracy.
To verify the suitability of the proposed method, an MEMD-based counterpart is also evaluated in terms of the MAPE and RMSE. In comparison, the MAPE (or RMSE) of the BEMD-based method is lower than that of the respective MEMD-based method by 21% (4%), 98% (114%), 32% (49%), and 5% (7%) for SARIMA, SVR, ELM, and RVFL, respectively; the LR technique is the only exception. Detailed comparison results are presented in Figure S5 in the Online Supplemental Material. Therefore, in this case, the comparison results also indicate that the proposed BEMD-based method is more suitable for tourism demand prediction with SED.
To verify the robustness of the proposed method, this study also uses 80% of the sample period as the training data set. Taking the LR technique as the individual predictor, the four-step-ahead comparison results are reported in Figure S4 in the Online Supplemental Material. The MAPEs of the learning paradigms M1 through M4 are 0.12, 0.11, 0.03, and 0.03, respectively. These results are similar to those obtained with 90% of the sample period as training data. That is, the methods with SED (i.e., M2 and M4) are better than their respective benchmarks without SED (i.e., M1 and M3), and the multiscale methods (i.e., M3 and M4) also outperform their respective benchmarks (i.e., M1 and M2). Therefore, the choice of training data set might not materially affect the prediction results.
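The robustness check above amounts to moving the boundary between training and test data. A minimal sketch of a chronological split is shown below, assuming the training length is obtained by truncating the product of the sample length and the training fraction; the rounding rule used in the study is not stated. The name `chronological_split` is hypothetical, and the series is a placeholder of length 83, matching T in the note to Table 5.

```python
def chronological_split(series, train_fraction):
    """Split a time-ordered series into training and test parts.

    Assumption: the training length is truncated to an integer; the source
    does not say how fractional lengths are rounded.
    """
    n_train = int(len(series) * train_fraction)
    # Keep temporal order: earlier observations train, later ones test.
    return series[:n_train], series[n_train:]

# Placeholder series with the same length as the sample period (T = 83).
series = list(range(83))
for fraction in (0.8, 0.9):
    train, test = chronological_split(series, fraction)
    print(fraction, len(train), len(test))
```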
To statistically ascertain the effectiveness of the proposed methods, the modified DM test is adopted (Clark and McCracken, 2001; Harvey et al., 1997). The result of the modified DM test is essentially consistent with the results of the DM test, as shown in Table S1 in the Online Supplemental Material. The modified DM test also statistically confirms the improvements made by introducing SED at a confidence level of 90% in most cases (refer to panel A in Table S1) and the superiority of the multiscale analysis-based methods over the single methods at a confidence level of 90% (refer to panel B in Table S1). These results again support the effectiveness of the proposed BEMD-based method with SED in improving prediction performance.
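For reference, the small-sample correction of Harvey et al. (1997) rescales the DM statistic by a factor that depends on the number of forecast errors n and the horizon h, and the corrected statistic is compared with a Student t distribution with n − 1 degrees of freedom. The sketch below covers only this correction factor; it is not a reconstruction of the exact procedure behind Table S1, and the function names are hypothetical.

```python
import math

def hln_correction(n_obs, horizon):
    """Harvey-Leybourne-Newbold factor: sqrt((n + 1 - 2h + h(h - 1)/n) / n)."""
    n, h = n_obs, horizon
    return math.sqrt((n + 1 - 2 * h + h * (h - 1) / n) / n)

def modified_dm(dm_stat, n_obs, horizon):
    """Rescale a DM statistic; compare the result with t(n_obs - 1)."""
    return hln_correction(n_obs, horizon) * dm_stat

# Illustrative values: nine forecast errors, horizons 1 and 4.
print(hln_correction(9, 1), hln_correction(9, 4))
print(modified_dm(2.0, 9, 1))
```

The factor is below 1 for these inputs, so the correction shrinks the statistic and makes rejection harder in short evaluation samples.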
The proposed methodology can be further extended and improved from the following perspectives. In addition to tourist volume, the proposed method can be applied to other forecasting areas, such as financial markets. In terms of the forecasting techniques, more powerful algorithms (not limited to the five popular methods in this study) can be employed. SED collection and processing may be the key factor in the proposed methodology and deserve careful technical innovation to further enhance the predictive power of SED. Besides, given the multivariate nature of the comparison, other multivariate tests (such as the SPA test) could be applied to further verify the validity of the proposed methods. We will investigate these interesting issues in the near future.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by grants from the National Natural Science Foundation of China (NSFC Nos. 71622011, 71988101 and 71971007), the National Program on Key Research Project of China (Grant No. 2016YFF0204405), the construction project of Collaborative Innovation Center of e-Tourism of Beijing (822139917160102), and the National Program for Support of Top Notch Young Professionals.
Supplemental Material
Supplemental material for this article is available online.
References
