Abstract
Numerous methodologies have been offered to forecast tourism demand; however, accurate forecasting has been a major challenge for policymakers despite its critical importance for tourism planning. Therefore, we propose and test a novel forecasting methodology that combines principal component analysis (PCA) and a long short-term memory (LSTM) network, along with the Baidu index, to forecast daily tourist arrivals for a popular tourist attraction in China. Word2Vec, a software tool launched by Google, is used to improve the coverage and accuracy of search keywords in the construction of the Baidu indexes. Before training the LSTM network, PCA is used to reduce noise and the dimensionality of the data. Considering the study’s timeframe, the impact of the COVID-19 pandemic is also assessed. The efficacy of the proposed forecasting methodology is verified, and the results show that the PCA-LSTM model outperforms other models in terms of prediction accuracy and stability. Theoretical and practical implications are discussed.
Keywords
Introduction
The tourism industry is one of the major drivers of the development of transportation, accommodation, catering, entertainment, retail, and many other industries and hence it can make a significant contribution to economic growth (Dogru and Bulut, 2018; Pérez-Rodríguez et al., 2022). Therefore, research on tourism demand forecasting has attracted considerable attention over the past five decades (Song and Hyndman, 2011). Certainly, initial studies in the context of tourism literature have primarily focused on issues related to modelling and forecasting tourism demand, owing to its paramount significance in formulating strategic plans for tourism development. Throughout the years, many forecasting methodologies and models have been proposed attempting to provide an accurate forecasting methodology (Song et al., 2019). These forecasting methodologies can be summarized into four main categories: time series models, econometric models, artificial intelligence technology, and qualitative methods (Dogru et al., 2023; Jiao and Chen, 2019; Song et al., 2019). Although the extant forecasting methodologies in the literature have made significant advancements in tourism demand forecasting (Li and Jiao, 2020), accurately forecasting tourism demand still poses a major challenge for policymakers. That is, major issues remain to be solved to improve the accuracy of tourism demand forecasting and advance the extant literature.
First, research on tourism demand forecasting mainly focuses on large areas, such as countries and regions, and rarely on provinces (Calero and Turner, 2020; Dogru et al., 2021). There is a lack of research on testing the accuracy of tourism demand forecasting methodologies in smaller geographic areas, in particular specific tourist attractions. Second, most extant studies attempt to estimate tourism demand at monthly, quarterly, or annual intervals while neglecting the significance of forecasting higher-frequency patterns such as daily tourist arrivals. Forecasting high-frequency tourist arrival data is more difficult than forecasting low-frequency tourism demand, primarily due to the challenges associated with data collection and the more advanced technical requirements (Bi et al., 2020; Li and Jiao, 2020). Third, although the utilization of an Internet search index for tourism demand forecasting is feasible, there remains a lack of well-established methodologies to find and determine the necessary keywords for an index (Li et al., 2021). Fourth, while deep learning has gained attention due to its predictive effectiveness, further investigation is needed to explore the influencing factors associated with tourist arrivals and reduce data dimensionality without compromising calculation efficiency and accuracy (Li et al., 2018). Furthermore, the tourism industry has been significantly impacted by the COVID-19 pandemic. Therefore, it is imperative to incorporate the implications of this global health crisis into tourism demand forecasting methodologies (Kocak et al., 2023).
Accordingly, the purpose of this study is to propose and test a novel forecasting methodology that combines principal component analysis (PCA) and long short-term memory (LSTM) network, along with the Baidu index, to predict daily tourist arrivals. More specifically, we use daily data to predict the number of daily tourist arrivals in Siguniang Mountain Scenic Area, a popular tourist destination in China, utilizing the combined PCA and LSTM network methodology with aid from the daily Baidu index. The Baidu index reflects the impact of various factors on tourist arrivals through search queries related to the destination, thereby facilitating lag structure identification and dimensionality reduction required for demand forecasting. Further, Word2Vec, a software tool launched by Google, is used to search keywords in the construction of the Baidu indexes. Considering the study’s timeframe, the impact of COVID-19 pandemic has also been assessed.
In so doing, this study makes several significant contributions to the extant tourism forecasting literature. First, we propose an innovative and novel methodological approach to improve the effectiveness and accuracy of tourism demand forecasting. Second, the examination of the methodological procedure fills an important gap in the extant literature by focusing on a specific tourist attraction and using daily frequency data to predict the number of daily tourist arrivals. Third, different from previous studies, we utilize Word2Vec to train big data to help find more possible Internet search keywords to increase the efficacy and accuracy of the model. Fourth, by assessing the impact of the COVID-19 pandemic on tourist arrivals, we shed light on the implications of the recent pandemic on tourism demand in general and tourism demand forecasting in particular. This examination also allows us to further evaluate how compatible our proposed forecasting methodology is with the COVID-19 pandemic situation. This study also contributes to the extant literature by presenting comprehensive evidence on the efficacy of the proposed novel forecasting methodology by comparing the findings with the model confidence set (MCS), out-of-sample
Literature review
Tourism demand forecasting
The research on tourism demand forecasting has primarily focused on the development of forecasting models and the evaluation of their performance. The line of research has thereby contributed to the development of tourism demand forecasting theories and methodologies significantly (Song et al., 2019). Furthermore, the extant studies in the field have collectively emphasized that the accuracy of forecasting models is crucial for both theory advancement and policy development in tourism demand forecasting. However, there is abundant evidence indicating that no single forecasting methodology can consistently outperform others and accurately predict estimates in all situations (Peng et al., 2021).
While the time series based econometrical forecasting methodologies have been widely utilized in the extant literature, qualitative methods, such as the Delphi techniques and scenario-building methods have also been applied in the field. However, utilizing qualitative methods alone may be less than ideal because of the difficulty in confirming their accuracy (Song et al., 2019). In a recent study, Song et al. (2019) conducted a comprehensive retrospective analysis of research on tourism demand forecasting that highlights the usage of time series models, qualitative models, artificial intelligence techniques, and other tourism demand forecasting methodologies employed in the extant literature. The findings from this review further reinforce the notion that although various forecasting methodologies exist in the extant literature that attempt to accurately forecast tourism demand, these models consistently fail to generate accurate forecasts across varying conditions. Therefore, it is necessary to further develop new methods to advance the extant literature and provide more effective and accurate forecasting methods.
Tourism demand forecasting with deep learning/artificial intelligence techniques
Artificial intelligence techniques have been increasingly applied to tourism demand forecasting models in parallel with the rapid development of computer technology (Bi et al., 2020). The primary attraction of artificial intelligence-based forecasting methodologies is that artificial intelligence techniques can capture complex time-varying patterns in a large number of time series data sets. Moreover, the technology automatically learns the complex evolution dynamics of tourism demand patterns in the data to improve forecasting performance and accuracy (Bi et al., 2020).
From a hierarchical structure point of view, artificial intelligence technology can be categorized into shallow learning methods and deep learning methods. Shallow learning methods typically have fewer hidden layers and limited generalization capabilities for complex classification problems. Common shallow learning techniques employed in the field of tourism demand forecasting include support vector regression (SVR), back propagation neural networks (BPNNs), the extreme learning machine (ELM), fuzzy time series (FTS), the rough sets approach (RSA), and grey theory (GT) models (Bi et al., 2020; Song et al., 2019). Among these shallow learning methods, SVR has been the most widely employed for forecasting stock prices, power loads, and traffic flow. Furthermore, SVR has demonstrated superior performance compared to ARIMA, ES, and BPNN models in tourism demand forecasting tasks (Chen et al., 2015). In a recent study, Mishra et al. (2021) used SVR to forecast international tourist arrivals in the world and their results showed that the prediction accuracy of the SVR model was significantly better than traditional time series models.
The SVR model, although proven to be more efficient than many time series models, does have certain limitations when compared to deep learning methods. Deep learning models offer a greater number of hidden layers and exhibit superior capability in capturing feature information and correlations within complex datasets as opposed to shallow learning methods (Bi et al., 2020; He et al., 2021). Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are the prevailing deep learning models extensively employed in the extant literature. These models have been successfully employed in machine translation (Esan et al., 2020), and tourism route recommendations (He, 2022), among other related fields.
Although deep learning methods yield more accurate results compared to shallow learning methods and time series models, further advancements have been made in the context of deep learning techniques to enhance their efficacy and accuracy. Specifically, the gated recurrent unit (GRU) and the LSTM network are two improved variants of the recurrent neural network (RNN) that can effectively mitigate the vanishing and exploding gradient problems that may occur in the basic RNN (Han et al., 2021). Hsieh (2021) conducted an empirical investigation on the accuracy of LSTM, Bi-LSTM, and GRU deep learning models for tourism demand forecasting. The results showed that the LSTM model had the best root mean square error (RMSE) value.
Some studies have also examined the improvements in the GRU and LSTM models or the development of composite models. For example, looking at a tourism flow prediction for China’s famous Huangshan Scenic area, Lu et al. (2020) presented evidence showing that the improved attention-GRU (IA-GRU) model yielded a better forecasting ability. More recently, using the monthly search engine strength data and related influencing factors, Yu and Chen (2022) proposed the SAE-LSTM prediction model with random weight initialization and showed that the proposed model generates improved forecasting results.
Tourism demand forecasting with hybrid models
Although deep learning has been successfully applied in tourism demand forecasting, there is still a scarcity of relevant studies, and the application of deep learning methods to enhance the accuracy of tourism demand forecasting remains a challenging field (Li and Jiao, 2020; Song et al., 2019). Specifically, deep learning models need to achieve better fitting effects, reduced prediction time, and improved stability for enhancing the secure application of prediction results (Fan et al. 2021). Combined and hybrid forecasting models can effectively enhance forecast accuracy compared to single forecasting models (Song et al., 2019). In this context, Internet search indices have consistently demonstrated their ability to improve the accuracy of tourism demand forecasting (Andariesta and Wasesa, 2022; Li et al., 2021). Internet search engines enable the development of indices based on extensive data that reflect user interests as well as media focus during specific time periods. Such data also encompass various factors influencing tourist arrivals (Bi et al., 2020; Li et al., 2018).
Pan et al. (2012) collected data from Google Trends and formed an Internet search index to predict hotel room usage and tourist arrivals. Their results showed that the utilization of the Internet search index improved tourism demand forecasting accuracy significantly. Yang et al. (2015) further showed that integrating search engine data into tourism demand forecasting models reduces prediction errors and improves the accuracy of tourism demand forecasting. Conducting empirical research on Beijing and Hainan provinces, Li et al. (2018) also showed that a hybrid tourism demand forecasting model, specifically the PCA-ADE-BPNN model along with the Baidu index, was superior to other models in terms of prediction accuracy for tourist flows. Moreover, Sun et al. (2019) reported that a kernel extreme learning machine (KELM) model built by combining the Baidu index and the Google index could significantly improve the prediction accuracy and robustness of tourism demand forecasting in Beijing. More recent studies have also provided evidence showing that hybrid tourism forecasting methods yield more accurate and robust results under varying conditions and hence suggest the utilization of hybrid models in tourism demand forecasting (Bi et al., 2020).
After conducting an extensive review of the existing literature in this study, it becomes apparent that a singular forecasting methodology capable of consistently outperforming others and accurately predicting estimates in all situations has yet to be developed. However, it is also evident that hybrid tourism demand forecasting models can effectively enhance forecast accuracy compared to single forecasting models, particularly when incorporating an Internet search index.
Although some hybrid models for forecasting tourism demand have been proposed and empirically tested, incorporating an Internet search index, further investigation is needed to explore the influencing factors related to tourist arrivals and reduce data dimensionality while considering efficiency and accuracy calculations (Li et al. 2018). Additionally, the establishment of a method to find and determine necessary keywords for an index remains insufficient in utilizing the Internet search index for tourism demand forecasting (Li et al., 2021). Therefore, it is crucial to carefully consider selected search engines, keyword coverage and accuracy, as well as strategies to minimize noise and irrelevant information when employing the Internet search index for tourism demand forecasting. These efforts will significantly impact the performance and accuracy of such forecasts (Li et al., 2018, 2020a, 2020b, 2020c, 2021).
Also, the majority of the extant studies attempt to estimate tourism demand at low frequencies (i.e., monthly, quarterly, and annually), while overlooking the significance of forecasting higher-frequency tourism demand, such as daily tourist arrivals. However, accurate predictions of daily arrivals can directly or indirectly contribute to the growth and development of not only the tourism industry but also other related sectors. Moreover, considering the lasting impact of the COVID-19 pandemic on the tourism industry, it becomes imperative for tourism research to incorporate this influence into demand forecasting models (Kocak et al., 2023).
Methodology
Sample and data
To evaluate the effectiveness and practicality of our proposed prediction method, we examined tourist arrivals at the famous Siguniang Mountain Scenic Area in China. The Siguniang Mountain Scenic Area is a World Natural Heritage site, national scenic area, geological park, nature reserve, and Sichuan giant panda sanctuary that has been dubbed the “Eastern Alps” (https://www.sgns.cn/). We utilized daily data on tourist arrivals from September 25th, 2015 to April 21st, 2022 obtained from the official Web site of Siguniang Mountain Scenic Area (https://www.sgns.cn/). After removing zero values during the closure of the scenic area during the early stages of the COVID-19 pandemic, the time series of daily tourist arrivals in the Siguniang Mountain Scenic Area was constructed with 2338 observations.
The main characteristics of the data trend of daily tourist arrivals in the Siguniang Mountain Scenic Area are shown in Figure 1, and the descriptive statistics are shown in Table 1. Detailed data can be obtained from the official Web site of the Siguniang Mountain Scenic Area or are available on request from the authors.
Figure 1. Trend of daily Siguniang Mountain Scenic Area arrivals.
Table 1. Descriptive statistics of Siguniang Mountain Scenic Area daily arrivals data.
Note: The ADF test is specified without a trend term; *** indicates that the null hypothesis is rejected at the 1% significance level.
Word2Vec
Word2Vec (word to vector) is a software tool launched by Google in 2013 that trains word vectors. Specifically, by training on a large text corpus, Word2Vec can quickly and effectively output a set of word vectors that can be fed into a deep network (Church, 2017; Di Gennaro et al., 2021). The Word2Vec model is a shallow neural network, where the primary computations are performed by the Continuous Bag of Words (CBOW) and Skip-gram algorithms (Adewumi et al., 2022). In our study, we adopt the Skip-gram because of its higher efficiency over the Continuous Bag of Words. The internal structure of the Skip-gram is shown in Figure 2.
Figure 2. Internal structure of Skip-gram.
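To make the Skip-gram idea concrete, the following minimal sketch shows how (central word, context word) training pairs are typically generated from a tokenized sentence. It illustrates the general technique only and is not the authors' implementation; the function name `skipgram_pairs`, the window size, and the example tokens are assumptions introduced here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) training pairs for a Skip-gram model.

    Each word is paired with every neighbour that lies within `window`
    positions on either side of it.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical tokenized sentence (illustrative only).
sentence = ["siguniang", "mountain", "scenic", "area", "tickets"]
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

A Skip-gram network is then trained to predict the context word from the central word, and the learned input weights serve as the word vectors.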
Through the given known central word
Principal component analysis
The Principal Component Analysis (PCA) is a widely used data analysis method that aims to reduce the dimensionality of data by obtaining mutually uncorrelated principal components. Additionally, PCA enables the retention of essential information from the original dataset while effectively reducing noise and irrelevant information (Li et al., 2018).
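As an illustration of the dimensionality reduction step, the sketch below extracts only the leading principal component of a small data matrix using power iteration on the sample covariance matrix. It is a simplified, standard-library-only illustration under stated assumptions: a full PCA would extract several components (for example, through an eigendecomposition), and the function name, starting vector, and toy data are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the leading principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the eigenvector with the largest
    # eigenvalue, provided the start vector is not orthogonal to it.
    v = [1.0 / (k + 1) for k in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("no variance along the starting direction")
        v = [x / norm for x in w]
    # Project the centered observations onto the component.
    scores = [sum(c[k] * v[k] for k in range(d)) for c in centered]
    return v, scores


# Toy data: the second column is exactly twice the first, so a single
# component captures all of the variance.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(data)
print(direction, scores)
```

In this toy case the two correlated columns collapse into one score per observation, which is the sense in which PCA reduces dimensionality while retaining the essential information.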
Long short-term memory
Long short-term memory (LSTM) is an improved version of the recurrent neural network (RNN) that learns temporal dependencies better and partially alleviates the vanishing gradient problem and the loss of short-term memory. Compared with the traditional RNN, the LSTM network adds a cell state to the hidden layer that stores and controls information over long periods, achieved through three “gates”: the forget gate, the input gate, and the output gate. The forget gate determines which information is discarded from the cell state. The input gate determines which information is saved in the cell state, and the output gate determines which information in the cell state is produced as output (Giang et al., 2022; Jonides et al., 2008). The internal structure of the LSTM network is presented in Figure 3.
Figure 3. Internal structure of LSTM network.
The main calculation process for the relationships among the gates and their status update is defined as follows.
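For reference, the standard LSTM formulation that matches the gate description above is commonly written as follows. The notation is an assumption introduced here and may differ from the authors' own: $x_t$ is the input at time $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid, $\odot$ the element-wise product, and $W_\ast$, $b_\ast$ the weights and biases of each gate.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(hidden state)}
\end{aligned}
```

The forget gate $f_t$ scales the previous cell state, the input gate $i_t$ scales the new candidate, and the output gate $o_t$ controls how much of the updated cell state is exposed as the hidden state.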
Hybrid model of the PCA-LSTM and the proposed forecasting model
In this study, we merged the PCA and the LSTM network into a combined PCA-LSTM model to improve the efficacy of tourism forecasting methods, as hybrid forecasting models have been shown to yield more accurate results than single models. The PCA-LSTM flowchart is shown in Figure 4.
Figure 4. Flowchart of PCA-LSTM.
First, PCA is applied to the Baidu indices to extract uncorrelated principal components that retain the essential information of the original data set. Then, the obtained principal components are fed into the LSTM network as input variables for training to predict the number of tourist arrivals at the tourist destination.
Accordingly, we propose using a PCA-LSTM network, with Baidu index, to model and forecast daily tourist arrivals at the Siguniang Mountain Scenic Area in China. The proposed model utilizes Word2Vec, which can improve the coverage and accuracy of Internet search keyword selection.
Also, while the Internet search index can improve the accuracy of tourism demand forecasting, deep learning technology can help better predict tourist arrivals. Figure 5 presents the conceptual framework of the proposed tourism demand forecasting method. The proposed hybrid tourism demand forecasting model aims to provide tourism practitioners and policymakers with a forecasting tool that accurately predicts tourist arrivals using daily data.
Figure 5. Forecasting framework using PCA-LSTM with Baidu index.
Results
This section presents the main steps and results of the analysis. First, we assessed the impact of the COVID-19 pandemic on daily tourist arrivals. Second, the Word2Vec toolkit was used to train and determine the relevant search keywords. Third, the corresponding Baidu index data were obtained through the search keywords, and the Baidu index daily time series data were constructed. Then, different prediction models were designed according to daily tourist arrivals and the daily Baidu indices, and the prediction performance of the proposed model was evaluated against alternative tourism demand forecasting models to test its efficacy.
The impact of the COVID-19 pandemic
We assess the impact of the COVID-19 pandemic on the daily tourist arrivals in the Siguniang Mountain Scenic Area. Based on this assessment, we further consider variable selection and whether it is necessary to divide the sample into a pre-COVID-19 period and a COVID-19 period and forecast the two scenarios separately.
As Figure 1 shows, the pattern of daily tourist arrivals in the Siguniang Mountain Scenic Area before and during the COVID-19 pandemic is very similar. Hence, we extracted the data during the COVID-19 pandemic, from March 31, 2020, to April 15, 2022, as one time series, and the data before the COVID-19 pandemic, from March 30, 2016, to April 14, 2018, as another time series of the same length. A Pearson correlation analysis was conducted on the two time series; the correlation coefficient was 0.6123, and the corresponding p-value was
Comparison of similarity between before and during the COVID-19 pandemic.
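The similarity check described above rests on the Pearson correlation coefficient between two equal-length series. The following minimal sketch shows the computation; the function name `pearson_r` and the short arrival series are hypothetical and do not reproduce the study's data or its reported coefficient.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical daily arrivals for two aligned windows (illustrative only).
pre_covid = [120, 340, 560, 410, 220, 180]
during_covid = [100, 300, 500, 390, 260, 150]
r = pearson_r(pre_covid, during_covid)
print(round(r, 4))
```

A coefficient close to 1 indicates that the two windows rise and fall together, which is the sense in which the two periods are described as similar.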
The results depicted in Figure 1 demonstrate that the data characteristics of the specified two time periods exhibit similarity (Atoum, 2019; Feng et al., 2018). Despite the COVID-19 pandemic having an impact on daily tourist arrivals at the Siguniang Mountain Scenic Area, this impact appears to be relatively minimal. The limited influence of the COVID-19 pandemic on the Siguniang Mountain Scenic Area can be attributed to effective efforts made by both Chinese and local governments in preventing and controlling its spread, as well as a growing understanding and mature response towards treating COVID-19 as a commonplace situation.
Based on these findings, in the subsequent prediction for the Siguniang Mountain Scenic Area, we introduced relevant variables to represent the impact of the COVID-19 pandemic. We considered the variables representing this impact as unpredictable and not significantly distinct from other variables, which would be automatically captured and learned by our proposed model. Additionally, we directly divided the data into in-sample subsets and out-of-sample subsets to simplify problem analysis and reduce computational workload. Importantly, this structure enables us to assess whether our proposed forecasting framework is compatible with large-scale public emergencies such as the COVID-19 pandemic.
Internet search keywords
We chose Chinese Wikipedia as the corpus and trained on it using the Word2Vec toolkit to find the Chinese search keywords. First, based on browsing a large number of tourism experience records and judging from experience, “tourism” was selected as the core seed word. Additionally, “scenic area”, “tickets”, “strategies”, “specialties”, and “accommodation” were selected as auxiliary seed words. Across the eight versions of the seed-word input (“core seed word”, “core seed word +1 auxiliary seed word”, and “core seed word +5 auxiliary seed words”), and with the aid of the Word2Vec toolkit, 8000 initially related words were identified by computing the cosine similarity between the trained word vectors.
Second, the initial correlative words “Siguniang Mountain” and “Sichuan” were combined and “Siguniang Mountain” was kept. According to the tourism characteristics of the Siguniang Mountain Scenic Area on the official Web site of Siguniang Mountain, through manual screening of 138 alternative search-related words, we checked their Baidu index inclusion to determine 42 terms as search keywords.
Third, in the process of checking inclusion in the Baidu index, the official Web site of the Baidu index automatically recommended 14 strongly related words, most of them related to Chengdu, the capital city of Sichuan Province, as the city is the access point for travel to the Siguniang Mountain Scenic Area. Thus, these 14 suggested strongly related words were added as search keywords.
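The cosine-similarity ranking used in the first step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the three-dimensional vectors are made up (real Word2Vec vectors usually have hundreds of dimensions), and the helper names `cosine_similarity` and `most_similar` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(seed, vectors, top_n=3):
    """Rank all other words by cosine similarity to the seed word."""
    seed_vec = vectors[seed]
    scored = [(word, cosine_similarity(seed_vec, vec))
              for word, vec in vectors.items() if word != seed]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Made-up word vectors (illustrative only).
toy_vectors = {
    "tourism": [0.9, 0.1, 0.3],
    "scenic area": [0.8, 0.2, 0.4],
    "tickets": [0.7, 0.3, 0.1],
    "algebra": [-0.2, 0.9, -0.5],
}
ranked = most_similar("tourism", toy_vectors, top_n=2)
print(ranked)
```

Words ranked near the top are candidate keywords; the manual screening and Baidu index inclusion checks described above would still be needed to arrive at the final list.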
Big data Word2Vec training keywords and finally selected keywords.
Note: ○ indicates keywords recognized and screened through Word2Vec training; □ indicates keywords recommended by the Baidu index or introduced to represent the impact of the COVID-19 pandemic; ★ indicates the finally selected keywords.
Baidu indices
We selected the Baidu index as the Internet search index time series. Although worldwide, Google is the most popular search engine with the highest adoption rate, in China, Baidu accounts for more than 70% of the market and is the first choice for Chinese netizens as a search engine. Furthermore, Yang et al. (2015) provided evidence showing that Baidu’s search query data performed better than that of Google in the Chinese market. Therefore, we utilized Baidu to construct the Internet search index, as it is more suitable for the collection of relevant search and query data on China’s Siguniang Mountain scenic area. Through keywords, we obtained daily data from the official Baidu index Web site (index.baidu.com) from September 25, 2015, to April 21, 2022, and constructed Baidu index daily time series data for the Siguniang Mountain Scenic Area.
Lag order and HEGY test results.
Lag order and ADF test results.
Forecasting using PCA-LSTM with Baidu index
In our proposed model, we estimated the daily tourist arrivals at the Siguniang Mountain Scenic Area using the combined PCA-LSTM network approach along with the Baidu indices and evaluated its prediction performance. For 2338 valid statistical values in the overall time series of daily tourist arrivals in the Siguniang Mountain Scenic Area, we selected 500 in the tail of the data to establish an out-of-sample subset for model testing, and the remaining 1838 in the front to establish an in-sample subset for model training. Before our calculation, Z-score normalization was used to standardize the data.
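The split and standardization described above can be sketched as follows. The series is synthetic and the helper names are hypothetical. Fitting the Z-score parameters on the in-sample subset only is also an assumption made here to avoid leaking test information; the text states only that Z-score normalization was applied before the calculation.

```python
import statistics


def chronological_split(series, n_test):
    """Hold out the last `n_test` observations as the out-of-sample subset."""
    if not 0 < n_test < len(series):
        raise ValueError("n_test must be between 1 and len(series) - 1")
    return series[:-n_test], series[-n_test:]


def zscore(values, mean, std):
    """Standardize values with a given mean and standard deviation."""
    return [(v - mean) / std for v in values]


# Synthetic stand-in for the 2338 daily observations (illustrative only).
arrivals = [float(1000 + (day % 7) * 50) for day in range(2338)]

train, test = chronological_split(arrivals, n_test=500)
mu = statistics.fmean(train)
sigma = statistics.pstdev(train)

train_z = zscore(train, mu, sigma)
test_z = zscore(test, mu, sigma)  # reuse the in-sample statistics
print(len(train), len(test))
```

Keeping the split chronological, rather than shuffling, preserves the time ordering that an out-of-sample forecast evaluation relies on.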
To test the efficacy of the forecasting performance of the proposed PCA-LSTM tourism demand forecasting model, which combines the characteristics of multi-variable time series forecasting and the research experience of scholars in tourism demand forecasting, we selected some other models proven to have strong forecasting performance to carry out a forecasting competition. The MCS, out-of-sample
For the parameters of the PCA-LSTM and the GRU, LSTM, and PCA-GRU models, the number of random seeds was set to 100, the number of neurons was set to 100, epochs were set to 200, and batch size was set to 80. Through trial-and-error testing, the parameters of the VAR, SVR, and LASSO models were adjusted to minimize the sample prediction error. Figure 7 shows the comparison of the forecast results of each model based on the 500-day out-of-sample forecast.
Figure 7. Comparison of prediction results of various models.
Results of the MCS test.
Results of
Robustness analysis
Although the analysis of the proposed hybrid model, which uses a PCA-LSTM network with the Baidu index to model and forecast daily tourist arrivals at the Siguniang Mountain Scenic Area in China, provided substantial evidence that it is superior to the alternative models, we conducted further analysis to test the robustness of the model under alternative benchmarks, alternative out-of-sample forecasting lengths, and an alternative destination.
The choice of benchmark may impact the model’s performance. Keeping all other conditions constant, we conducted several tests using alternative benchmarks, such as ARIMA or Bayes to examine whether the predictive performance of PCA-LSTM remains superior. Then, we applied the same methodology outlined in the analysis to estimate tourism demand forecasting and evaluate the model’s performance.
The selection of out-of-sample prediction length may influence both prediction and evaluation results of a model. Hence, we tested different out-of-sample forecasting lengths to assess their impact. Specifically, we examined whether the predictive performance of PCA-LSTM remained superior at varying levels by alternating the out-of-sample prediction length to 600 and 700 for tourist arrivals in Siguniang Mountain Scenic Area.
Furthermore, choosing a different tourist attraction could affect both prediction and evaluation results of our model. Therefore, we tested our proposed tourism demand forecasting model in an alternative tourist destination – Jiuzhai Valley – which is another famous tourist attraction in China apart from our main study destination.
The results were robust to alternative benchmarks, out-of-sample prediction lengths, and destinations. The findings from the robustness tests demonstrated that the PCA-LSTM network model exhibited superior predictive ability and was characterized as the most stable and effective methodology for forecasting tourism demand. The results from these alternative tests are not presented in the manuscript due to space limitations; however, they are available from the authors upon request.
Discussion and conclusion
In this study, we proposed and empirically tested the efficacy of a novel tourism demand forecasting methodology that combines principal component analysis (PCA) and long short-term memory (LSTM) network, along with the Baidu index, to forecast daily tourist arrivals for Siguniang Mountain Scenic Area in China. The results demonstrated that the PCA-LSTM network model accurately predicted the number of daily tourist arrivals and significantly improved out-of-sample prediction accuracy. Specifically, the out-of-sample prediction performance of the PCA-LSTM network model consistently and significantly surpassed that of the benchmark VAR model; thus, indicating an enhancement in out-of-sample prediction accuracy compared to commonly utilized tourism demand forecasting methodologies. In fact, among all evaluated models in this competition, the PCA-LSTM network model exhibited superior predictive capabilities. Furthermore, we conducted additional tests by altering benchmarks, out-of-sample predictions, and tourist attractions to assess the stability, robustness, and validity of our novel forecasting methodology - PCA-LSTM network model. The results revealed that regardless of different benchmarks used or varying lengths of out-of-sample predictions or even different tourist attractions considered; when compared with other models employed in these scenarios, the PCA-LSTM network model consistently displayed superior predictive abilities while maintaining stability as well as effectiveness as a tourism demand forecasting methodology.
The results further showed that the presented model can effectively incorporate a significant catastrophic event, such as the recent COVID-19 pandemic. In this study, we analyzed the impact of the COVID-19 pandemic on tourism demand forecasting for the Siguniang Mountain Scenic Area. Based on our findings, we introduced relevant variables to represent the impact of the COVID-19 pandemic, and this impact was automatically captured and learned by the proposed model. The results demonstrated that the PCA-LSTM network model effectively accommodates and adapts to dynamic scenarios such as the COVID-19 pandemic.
Overall, the proposed PCA-LSTM network model, a hybrid that combines principal component analysis (PCA) and a long short-term memory (LSTM) network with the Baidu index, demonstrates superior prediction performance, stability, and efficiency compared with the alternative models tested in this study.
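The preprocessing idea behind the hybrid model, reducing many correlated search-index series to a few principal components before passing fixed-length windows to a recurrent network, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 95% explained-variance threshold, the 7-day look-back window, and the synthetic data are all assumptions made for illustration, and the LSTM itself is omitted.

```python
import numpy as np

def pca_reduce(X, var_threshold=0.95):
    """Standardize columns and project onto the leading principal components.

    var_threshold is an assumed cut-off, not a value taken from the study.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0                      # avoid division by zero for constant columns
    Z = (X - mean) / std
    # SVD of the standardized data: rows of Vt are the principal directions.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)          # share of variance per component
    k = int(np.searchsorted(np.cumsum(explained), var_threshold) + 1)
    k = min(k, Vt.shape[0])                  # guard against floating-point overshoot
    return Z @ Vt[:k].T, explained[:k]

def make_windows(features, target, lookback=7):
    """Build (samples, lookback, n_features) inputs and next-day targets."""
    n = len(features) - lookback
    X_seq = np.stack([features[i:i + lookback] for i in range(n)])
    y = target[lookback:]
    return X_seq, y

# Synthetic stand-in for daily search indices (hypothetical data, 200 days x 10 keywords).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))                       # two hidden demand drivers
search_idx = latent @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(200, 10))
arrivals = 1000 + 150 * latent[:, 0] + 20 * rng.normal(size=200)

scores, ratio = pca_reduce(search_idx)                   # denoised, lower-dimensional inputs
X_seq, y = make_windows(scores, arrivals, lookback=7)    # tensors shaped for a recurrent model
print(scores.shape, X_seq.shape, y.shape)
```

The array `X_seq` has the (samples, time steps, features) layout that recurrent layers in common deep learning libraries typically expect, so it could be passed to an LSTM of the reader's choice for training.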
Theoretical and practical implications
From a theoretical perspective, the current study makes a significant contribution to the extant literature on tourism demand forecasting. Specifically, the findings suggest that incorporating various influencing factors and employing dimension reduction techniques enhances the effectiveness of tourism demand forecasting models. In this regard, the proposed PCA-LSTM network model outperforms other commonly used models in terms of precision, stability, and effectiveness. Moreover, given the complexity and volume of daily tourist arrival data, machine learning models (such as SVR, LASSO, GRU, LSTM, and PCA-GRU) yield higher prediction accuracy than traditional econometric models (e.g., VAR). Among all tested models, however, the PCA-LSTM network model demonstrates superior efficacy. Notably, Word2Vec proves instrumental in identifying numerous relevant search keywords for the Siguniang Mountain Scenic Area and contributes to the strong performance of the proposed model. These results highlight Word2Vec's potential as an efficient tool, when combined with machine learning-based approaches, for enhancing keyword coverage and accuracy in tourism demand forecasting. Additionally, the examination of the impact of the COVID-19 pandemic on tourist arrivals provides valuable insights into its implications for overall tourism demand and specifically for tourism demand forecasting (Kocak et al., 2023). Consequently, these findings and the proposed forecasting framework can serve as a reference point for future studies that account for COVID-19 in modeling and predicting tourism demand.
The findings from this study also offer several practical implications for destinations and policymakers. First, forecasting at a lower frequency, such as monthly, quarterly, or annual intervals, can be a more manageable task; however, forecasts at such low frequencies might have limited policy implications. Although forecasting daily tourist arrivals poses greater challenges, such estimates may yield more useful implications for destinations and policymakers. The PCA-LSTM network model enables the accurate prediction of daily tourist arrivals with enhanced efficacy and predictive capacity. Therefore, it can serve as a valuable tourism demand forecasting tool for tourism practitioners and policymakers who work with daily data. Second, in the era of big data, Word2Vec can be employed to enhance the coverage and accuracy of keyword selection and to aid in constructing Internet search indices that further improve the effectiveness of tourism demand forecasting. Third, integrating traditional econometric approaches with deep learning/artificial intelligence techniques into hybrid tourism demand forecasting models enhances their efficiency. Consequently, this approach better assists tourism practitioners in optimizing operations, formulating strategic pricing and marketing decisions, and planning staffing and emergency responses. Moreover, the proposed model can serve as an evidence-based decision-making tool for government authorities and destination management organizations in constructing and maintaining tourist attractions, improving supporting facilities, developing traffic infrastructure, and managing safety.
Limitations and recommendations for future research
Although this study has made significant contributions to the extant literature, it is not without limitations. The keyword screening used in constructing the Internet index was constrained by the sample selected and by the search engine used, and the results may therefore have been influenced by the reliance on Baidu indices. As user demand is affected by various factors such as policy and environment, the efficacy of other search engines, such as Google, should be examined in future studies for comparison with the Baidu index. Additionally, alternative Internet search indices might capture different influencing factors, and their relationship with tourist arrivals needs further examination. While the PCA-LSTM network model could potentially be applied to complex problems such as traffic flow, hotel accommodation, stock trend, exchange rate, and crude oil price prediction, its efficacy in these settings needs to be investigated in future studies. Furthermore, although robust results were obtained when the model was tested on an alternative tourist destination (i.e., Jiuzhai Valley), the generalizability of the findings requires further investigation.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was partly supported by Sichuan Social Science Planning Project of China (SC22B030).
