Abstract
Because of high fluctuations of tourism demand, accurate predictions of tourist arrivals are of high importance for tourism organizations. The study at hand presents an approach to enhance autoregressive prediction models by including travelers’ web search traffic as an external input attribute for tourist arrival prediction. The study proposes a novel method to identify relevant search terms and to aggregate them into a compound web-search index, used as an additional input to an autoregressive prediction approach. As methods to predict tourism arrivals, the study compares autoregressive integrated moving average (ARIMA) models with the machine learning–based technique artificial neural network (ANN). Study results show that (1) Google Trends data, mirroring travelers’ online search behavior (i.e., a big data information source), significantly increase the performance of tourist arrival prediction compared to autoregressive approaches using past arrivals alone, and (2) the machine learning technique ANN has the capacity to outperform ARIMA models.
Keywords
Introduction
Understanding, explaining, and forecasting tourism demand has been an important area of scientific enquiry and practice for a long time (cf. Witt and Witt 1995) and is also a relevant domain in contemporary tourism research (Frechtling 2002, 2011). Forecasting tourism demand is a way of reducing risk and is imperative in tourism, since the tourism service product is perishable and the process of production and consumption is inseparable, depends on complementary services, and is extremely sensitive to crises (Frechtling 2011; Vanhove 2011). Forecasting in tourism implies indicating the direction of future demand and, as such, it provides meaningful information for destination management and the various tourism suppliers (Vanhove 2011). Tourism demand modeling and forecasting use a variety of measures. However, a commonly used measure is tourist arrivals (Song and Li 2010). Managerial interest in accurate forecasts of future events and changes in tourism demand is related to the extent to which actions can be implemented to influence the demand, monitor fluctuations for planning or resource capacities as well as manage the (e.g., socioeconomic) effects of demand. Hence, the success of tourism businesses largely depends on the ability to predict tourism demand most accurately. In turn, the consequences of poor forecasting lead to poor marketing effectiveness, inefficient use of resources, and decay of sustainability (Frechtling 2011).
Accurate forecasting of tourism demand is a thought-provoking science and art (S. Li et al. 2018). As a key finding by Song and Li (2008) and Song, Qiu, and Park (2019), the methods used in modeling and forecasting tourism demand are highly diverse, and, in fact, there is no single model that consistently outperforms other models in all situations. A further challenge in demand forecasting is the access to timely and cost-effective data (Önder 2017). Other essential challenges include the lack of historical time series data and demand volatility (Frechtling 2011; Song et al. 2010). Therefore, new data sources and modeling techniques are crucial issues of contemporary tourism research, and big data sources (e.g., web traffic or search engine traffic), apparently, have shown promising potential to overcome these issues.
Even at an early stage, the use of big data in tourism research has achieved substantial improvements in terms of new theoretical as well as methodological insights (J. Li et al. 2018; Mariani et al. 2018). In recent literature, big data have already been used for tourism demand predictions (Y. Yang, Pan, and Song 2014; Önder and Gunter 2016; Höpken et al. 2017b; Höpken et al. 2018; Y. Liu, Tseng, and Tseng 2018). A growing group of international tourism scholars has concluded that, in particular, search engine traffic has the potential to significantly increase the accuracy and robustness of tourism demand forecasts (cf. Antolini and Grassini 2019; Bangwayo-Skeete and Skeete 2015; Bokelmann and Lessmann 2019; Camacho and Pacce 2018; Dergiades, Mavragani, and Pan 2018; Höpken et al. 2017b; Jackman and Naitram 2015; W. H. Kim and Malek 2018; X. Li et al. 2017; X. Li and Law 2020; Pan and Yang 2017; Park, Lee, and Song 2017; Rivera 2016; Siliverstovs and Wochner 2018; Sun et al. 2017; Sun et al. 2019; D. C. Wu, Song, and Shen 2017; X. Yang et al. 2015; Zhang et al. 2017; Önder and Gunter 2016; Önder 2017). However, further efforts to improve the performance of forecasting models by incorporating tourist online behavioral data are advocated (D. C. Wu, Song, and Shen 2017). More precisely, some researchers point out that there are particular challenges in using Google Trends data, such as the constant changes in the search result ranking algorithm and changes to the functionality of searches (Rivera 2016).
In past literature, noncausal time series models and causal econometric models are the two dominant approaches used for quantitative demand modeling (Song and Li 2008; Song and Turner 2006). More concretely, autoregressive integrated moving average (ARIMA) models (Box and Jenkins 1970) appear most frequently in the tourism literature (Song and Li 2008, p. 210; Song, Qiu, and Park 2019). Findings by Moro and Rita (2016) show that time series analyses are still the most strongly adopted methods and that, in particular, the seasonality phenomenon in tourism continues to justify this use (Kulendran and Wong 2005; J. L. Chen et al. 2019). Yet, a major advantage of econometric approaches is their ability to identify economic factors influencing tourism demand (B. Peng, Song, and Crouch 2014; Athanasopoulos, Song, and Sun 2018). Most interestingly, recent research has sought to find explanations of tourism demand beyond neoclassical economic theory by including additional explanatory variables, such as tourists’ online behavior (D. C. Wu, Song, and Shen 2017).
It has even been proposed that the emerging big data paradigm is set to transform the landscape for socioeconomic policy and research along with management and decision making (Blazquez and Domenech 2018). In fact, big data sources, like web search traffic, web usage data, or customer online feedback, are naturally related to tourism demand and, therefore, turn out to be valuable inputs for prognosticating tourism demand.
Furthermore, from a methodological perspective, machine-learning approaches, like k-nearest neighbor (k-NN), support vector machines (SVMs), or artificial neural networks (ANNs), are increasingly used for the purpose of tourism demand modeling and forecasting (Moro and Rita 2016). These artificial intelligence-based approaches offer the advantage of not depending on specific statistical characteristics of the data set at hand, such as (normal) distribution, linearity, and noncollinearity (Song and Li 2008, p. 212), and being more robust against biased, incomplete, redundant, and noisy data (X. Li et al. 2016). Thus, machine learning–based approaches typically yield superior results, not only for noncausal but, in particular, also for causal approaches, showing a high-dimensional input space (Kon and Turner 2005; Song and Li 2008; Lin, Chen, and Lee 2011; Ricardo, Goncavales, and Costa 2018).
The objective of this article is to enhance tourism demand forecasting based on past arrivals alone (i.e., autoregressive approaches), by including travelers’ induced web search traffic as an additional explanatory input variable. More concretely, the study presents a novel approach to identify and aggregate tourism-relevant search terms, and evaluates whether the corresponding web search volume, represented by Google Trends data as additional input to forecast tourist arrivals, can increase prediction accuracy compared to models using past arrivals alone (research proposition 1). As a second objective, the study uses machine learning techniques for predicting tourist arrivals (i.e., ANNs) in addition to an ARIMA model as a comparative statistical approach. As noted above, machine learning techniques are not limited to linear models and are typically more robust against data not conforming to specific statistical characteristics, like noncollinearity of input attributes or biased data. These characteristics of machine learning approaches make them a promising methodological alternative, especially in the case of more complex and diverse input data, which holds true when adding big data sources as additional input for arrival prediction. Thus, in this study we further evaluate whether the machine learning–based method of ANNs achieves better prediction accuracies than ARIMA models as a statistical approach, also in the case of a big data–enriched prediction (research proposition 2). The tourist destination of Sweden serves as a case for this study, using arrival data of major sending countries (i.e., Denmark, Finland, Norway, Russian Federation, and United Kingdom) and Google Trends data for the period 2008–2016, including text and video search, respectively.
The structure of the article is as follows: the next section gives an overview of related literature dealing with the task of tourism demand prediction when including travelers’ online search traffic. The third section discusses the study design and the methodology by presenting techniques of collecting and preparing data as well as the process of model building. The fourth section presents major findings. The fifth section summarizes and discusses insights and results, while the final section considers limitations of the study and points to possible future extensions and follow-up research activities.
Literature Review
Demand Modeling in Tourism
As outlined, literature on quantitative demand modeling consists of two major subdomains: (noncausal) time series modeling and (causal) econometric methods. Time-series models explain a target variable based on its own past values. In the course of the past decades, tourism research has made heavy use of the ARIMA approach (Box and Jenkins 1970; Song and Li 2008, p. 210; Song, Qiu, and Park 2019). Moreover, exponential smoothing models (Geurts and Ibrahim 1975; Lim and McAleer 2001; Cho 2003) as well as shift-share techniques (Fuchs et al. 2000) are found in the literature for forecasting and modeling tourism demand. Econometric approaches, by contrast, offer the advantage of enabling the analysis of causal relationships between tourism demand (i.e., the dependent variable) and its explanatory variables (B. Peng, Song, and Crouch 2014; Peng et al. 2017). Recent econometric studies proposed a broad range of possible determinants of tourism demand (Khaidi, Abu, and Muhammad 2019), for instance, consumer price index in a destination, substitute prices, gross domestic product, currency exchange rates, but also interest and unemployment rates, as well as export/import rates (Cho 2001; Song and Li 2008, p. 211; Athanasopoulos, Song, and Sun 2018). Additionally, events (especially mega-events), advertising investments (Divisekera and Kulendran 2006; Kronenberg et al. 2016), but also financial crises or terrorist attacks (Smeral 2009, 2017; Song and Lin 2010) and disasters (e.g., the SARS and H1N1 pandemics; Y. C. Chen, Kang, and Yang 2007) have been shown to significantly influence tourism demand. From a statistical point of view, the error correction model (ECM), autoregressive distributed lag models (ADLM), the time-varying parameter (TVP) model, and the vector autoregressive (VAR) model emerged as the main econometric models (Peng et al. 2017). In addition, tourism demand modeling also used the linear structural equation model (SEM; Turner and Witt 2001).
Using Google Trends Data to Predict Tourism Arrivals
In recent years, researchers all over the globe used web search traffic for predicting economic indicators (Wu and Brynjolfsson 2015). For instance, Vosen and Schmidt (2011) constructed a model to forecast private consumption using Google Trends data and revealed that the addition of web search data outperforms the majority of survey-based factors. Carrière-Swallow and Labbé (2013) could improve the prediction accuracy for automobile purchases by using online search traffic as additional input, while Hand and Guy (2012) reached the same result for cinema admissions.
Information search is a fundamental feature of consumer decision-making behavior (Xiang, Magnini, and Fesenmaier 2015). In a tourism context, travel information search is defined as the stage of the decision-making process, wherein travelers actively collect and integrate information from numerous sources prior to making their travel decision and destination choice (Vogt and Fesenmaier 1998; Fodness and Murray 1999). In fact, travel information search serves a variety of travelers’ goals, from simple ones, addressing basic functional information needs, such as knowledge about the price of a hotel, to highly emotional ones, such as understanding the symbolic meaning of destination places (Xiang, Magnini, and Fesenmaier 2015; Fuchs and Baggio 2017).
Since the late 1990s, the Internet has fundamentally changed the way tourism-related information is distributed and the way travelers search for travel products. Travelers particularly use search engines to find relevant information for all aspects of the trip, including accommodations, attractions, activities, and dining (Pan and Fesenmaier 2006; Steen Jacobsen and Munar 2012; Xiang et al. 2015; Choe, Vogt, and Fesenmaier 2017). Technically, every time a tourist interacts with the Internet, be it through a search engine, a website, a mobile phone, or a social media platform, electronic traces of this interaction can be captured, stored, and analyzed later on (Fuchs, Höpken, and Lexhagen 2014; Höpken et al. 2015). Researchers use these online data, such as search engine query volumes, amount and types of tweets, website traffic, and social media posts, for various analytical purposes, such as online customer segmentation (Pitman et al. 2010) and tourism sentiment analysis (Schmunk et al. 2014; Höpken et al. 2017a). Most important, search engine query volumes are successfully used for tourism demand forecasting and modeling (Artola, Pinto, and de Pedraza García 2015; X. Yang et al. 2015; Padhi and Pati 2017; Siliverstovs and Wochner 2018; Höpken et al. 2019).
Literature on travelers’ search behavior builds on different theoretical foundations, like economic approaches (Stabler, Papatheodorou, and Sinclair 2010; Kronenberg et al. 2016), information-processing theories (Chung et al. 2015), and theory of planned behavior (TPB; Ajzen 2005; Erawan, Krairit, and Khang 2011). In fact, the best predictor of tourist behavior in selecting travel destinations when searching online seems to be information processing (Padhi and Pati 2017, p. 36). More precisely, the TPB framework suggests that factors like attitude and behavioral control are strongly associated with tourist behavior and intention for online search and subsequent holiday bookings (E. Kim et al. 2016). Indeed, with the current high penetration of the Internet, online search engines are gratifying a wide spread of individuals’ needs compared with traditional information sources (Padhi and Pati 2017, p. 36). Thus, recent studies have impressively demonstrated that Google Trends data reflect crucial aspects of tourists’ keyword-based queries that, in turn, provide vast opportunities to investigate and predict travelers’ planned behavior (Padhi and Pati 2017; X. Yang et al. 2015; E. Kim et al. 2016; Höpken et al. 2019).
A number of researchers in the travel and tourism domain have begun to use web search data to predict tourism demand. The latter variable is typically expressed in terms of tourist arrivals. For example, Bangwayo-Skeete and Skeete (2015) underline that web search traffic increases the quality of predicting tourism demand, based on autoregressive mixed-data sampling (AR-MIDAS) models. Likewise, Önder and Gunter (2016) demonstrate that Google Trends data for text and image search enhance the quality of tourism demand forecasts, compared with simple exponential smoothing time-series models (e.g., Holt-Winters) or autoregressive models. X. Li et al. (2016) make use of search engine data for tourism forecasting with noise processing. The work by X. Yang et al. (2015) demonstrates that employing the volume of web search traffic for tourist arrivals prediction helps to improve forecasting accuracy significantly as compared to autoregressive moving average (ARMA) models. B. Pan, Wu, and Song (2012), Y. Yang, Pan, and Song (2014), and Pan and Yang (2017) use web search traffic to improve the forecast accuracy of hotel demand. More recently, the study by X. Li et al. (2017) presents an approach for a composite search index integrated into a generalized dynamic factor model (GDFM) to forecast tourist demand. Findings show that the approach increases prediction accuracy compared to a time series model and a search index-based model created by principal components analysis. Moreover, Camacho and Pacce (2018) show that Google’s search volume indices improve predictions of overnight stays in Spain, thereby outperforming models that exclude these leading indicators. Antolini and Grassini (2019) use Google Trends data to predict foreign arrivals in Italy, assessing the contribution of lagged Google Trends variables in a standard ARIMA model and in a time series regression model with seasonal dummies and autoregressive components.
In a similar way, Gunter, Önder, and Gindl (2019) integrate Google Trends data in autoregressive distributed lag (ADL) models to predict tourist arrivals in four Austrian cities. Most recently, Li and Law (2020) demonstrate that a decomposition-based approach using Google Trends data is particularly superior in forecasting turning points.
Machine Learning and Demand Prediction
Fairly recently, nonstatistical, that is, machine learning, methods have been applied to tourism demand prediction. The main advantage of machine learning techniques over statistical approaches is that they do not make any preliminary assumptions about the data, such as normal distribution, linearity, and noncollinearity (Yu and Schwartz 2006; Song and Li 2008, p. 212; Song, Qiu, and Park 2019). Concretely, the machine learning methods ANN, rough set theory, fuzzy time-series method, genetic algorithms (GAs), SVMs, and, most recently, deep learning approaches (Law, Li, and Feng 2019) are commonly used for tourism demand forecasting. Emulating the human brain, ANNs emerged as a superior model to predict tourism demand compared with ARIMA and multiple regression models, respectively (Kon and Turner 2005; Song and Li 2008, p. 212). Applications of ANNs for tourism demand prediction are presented by Law and Au (1998), Palmer, Montano, and Sese (2006), Lin et al. (2011), Claveria and Torra (2014), Çuhadar, Cogurcu, and Kukrer (2014), Claveria, Monte, and Torra (2015), and Silva et al. (2019). Rough set theory is based on decision rule induction and focuses on analyzing the classification of imprecise, uncertain, or incomplete data (Song and Li 2008, p. 213). Law and Au (1998) and Goh, Law, and Mok (2008) have used rough set theory for tourism demand modeling. The fuzzy time-series method has been shown to work particularly well in analyzing short time series with a limited number of past observations (Hadavandi et al. 2011; Tsaur and Kuo 2011). GAs, that is, optimization algorithms applying the fundamental principles of evolution (Song and Li 2008, p. 213), are useful in recognizing changes in the structure of tourism demand (Hernández-López and Cáceres-Hernández 2007; Hong et al. 2011).
Finally, SVMs are used in solving the nonlinear estimation and prediction problem and have been successfully used for tourism demand analysis by Pai et al. (2006), Hong (2006), K. Y. Chen and Wang (2007), and Pai, Hung, and Lin (2015). Importantly, empirical evidence shows that SVMs can outperform ARIMA models and traditional counterparts in predicting tourism demand (Song and Li 2008). A study by Zhang et al. (2017) hybridizes support vector regression (SVR) with the Bat algorithm to forecast tourist volume by incorporating search engine data, where the Bat algorithm is used to adjust the SVR parameters (ibid., 245). Most recently, Assaf et al. (2019) have shown the advantages of Bayesian global vector autoregressive (BGVAR) models to capture spillover effects of international tourism demand, typically accruing in touristically strongly interlinked countries, such as Southeast Asia.
However, when looking specifically at the comparison of ARIMA and ANN (the two approaches used in the study at hand), their relative superiority in predicting tourism demand is judged differently across research studies. In a study by Lin, Chen, and Lee (2011) comparing ARIMA, ANN, and MARS (multivariate adaptive regression splines) for predicting visitors to Taiwan, ARIMA outperformed ANN and MARS. Similarly, in a study by Claveria and Torra (2014), ARIMA outperformed ANN (as well as a self-exciting threshold autoregression) when forecasting overnight stays in Catalonia, especially for shorter forecasting horizons. In contrast, Aslanargun et al. (2007) showed that models with nonlinear components, like ANNs with a nonlinear activation function, can outperform linear models like ARIMA, as demonstrated in a study forecasting tourist arrivals in Turkey. Burger et al. (2001) compare a variety of classical forecasting methods, like moving average, decomposition, exponential smoothing, ARIMA, and multiple regression, with nontraditional approaches, like genetic regression and ANN, applied to forecasting the US demand to Durban, South Africa, and show that ANN reaches the overall best performance. C-F. Chen, Lai, and Yeh (2012) combine a decomposition model with an ANN model and show, based on forecasting international visitors to Taiwan, that the combined approach outperforms both an ANN learned directly on the original time series data and a traditional ARIMA model.
Preprocessing Web Search Data
Preprocessing web search data typically does not follow a standard approach. Nevertheless, three main tasks for preparing search engine traffic as input to demand prediction emerged: keyword selection, identification of time lags, and construction of search indices. Keyword selection aims at identifying potential keywords, either by making use of domain-specific knowledge (e.g., by domain-specific ontologies), by employing web information extraction and text mining methods, or by keyword recommendations offered by search engine providers (Y. Liu et al. 2012). The second task, identification of time lags, intends to identify those time lags with the strongest correlation between tourist arrivals and corresponding search queries. The third task, construction of search indices, intends to combine a multitude of different search requests into a compound search index with a high predictive power in order to avoid both multicollinearity and the overfitting phenomenon, often caused by high-dimensional time series data (Varian 2014). Y. Liu et al. (2012) used search engine requests to predict the Chinese stock market and proved that such a compound search index, which is aggregating lagged search queries, significantly increased forecasting performance. X. Yang et al. (2015) recently adopted this approach considering online search requests to forecast tourism demand.
Related to the second task, identification of time lags, the literature shows different approaches for measuring lagged correlations between predictor and target time series attributes. Y. Liu et al. (2012) combined the Pearson correlation with the Kullback-Leibler divergences, while other authors use the Pearson correlation alone (X. Yang et al. 2015; Xiaoxuan et al. 2016; Pan et al. 2017).
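The identification of time lags by Pearson correlation can be illustrated with a minimal sketch. It assumes two equally long monthly series and treats the "strongest" correlation as the highest positive coefficient; the function names `pearson` and `best_lag` as well as the parameter `max_lag` are illustrative choices and not taken from the cited studies.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)


def best_lag(arrivals, searches, max_lag=12):
    """Return (lag, r) for the lag at which past search volume
    correlates most strongly (highest r) with current arrivals."""
    best = (0, pearson(arrivals, searches))
    for lag in range(1, max_lag + 1):
        # Pair arrivals[t] with searches[t - lag].
        r = pearson(arrivals[lag:], searches[:-lag])
        if r > best[1]:
            best = (lag, r)
    return best
```

In such a setup, each query series would be shifted by its best lag before being aggregated into a compound index; how the aggregation itself is performed is not implied by this sketch.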
A final step of preprocessing search query data is the elimination of useless information by methods of noise reduction. X. Yang et al. (2015) and Xiaoxuan et al. (2016) point out the importance of noise reduction when using Google Trends data as forecasting input and propose the Hilbert-Huang transformation (HHT), which shows the capacity to reduce prediction errors significantly. Another methodological advancement has been proposed by Peng et al. (2017) by using the Hurst exponent (Hurst, Black, and Simaika 1965) to remove search queries with a low predictive power.
Methods
Data Collection and Preparation
Specification of data set
The data set used in this study contains monthly data concerning inbound tourist arrivals to Sweden from major sending countries (Denmark, Finland, Norway, the Russian Federation, and the United Kingdom) for the period between January 2008 and December 2016, resulting in 9 years of monthly time series data for each sending country. More precisely, the resulting data set consists of 108 entries, reflecting past tourist arrivals to Sweden (Statistics Sweden 2017). Figure 1 shows the time series of tourist arrivals per sending country for the complete observation period and clearly demonstrates a strong seasonality of tourist arrivals on a monthly basis.

Figure 1. Monthly tourist arrivals to Sweden per sending country between 2008 and 2016.
Besides data on tourist arrivals, the data set includes search traffic on web search engines for the different sending countries. Analogous to Bangwayo-Skeete and Skeete (2015), X. Yang et al. (2015), Önder and Gunter (2016), and Höpken et al. (2018), search traffic on the Google search engine has been selected, as Google is the predominant search engine in most sending countries (cf. Pearson CMG 2017). Google’s search volume is available via the Google Trends service, providing the relative search volume for specific keywords over time and geographical regions. Although Google’s world dominance is indisputable, in specific sending countries, like China, Japan, or Russia, local favorites are more popular than Google. In Russia, for instance, the majority of Internet users make use of Yandex for web searches, while Google search plays only a subordinate role for most Russian Internet users (Return On Now 2017). In the literature, this is known as platform bias, as users from certain countries tend to perform searches on local search engines (Dergiades, Mavragani, and Pan 2018). Since Yandex does not provide search queries older than one year, we used Google for Russian travelers as well, although a low search intensity was expected.
Collection of web search data
We used Google’s Keyword Planner for collecting the keywords that tourists tend to use when planning their trip to Sweden. Suggested keywords were used as input for retrieving appropriate time series from Google Trends, reflecting tourists’ planning behavior before visiting Sweden. Accordingly, we chose the search category “Travel and Tourism” to restrict results to search queries related to tourism. As the study’s aim is the prediction of Swedish inbound tourist arrivals with respect to the above-mentioned sending countries, the search term “Sweden” was spelled in English as well as in the official language of each sending country, that is, “sverige” for travelers from Denmark and Norway, “ruotsi” and “sverige” for Finnish travelers, as well as “швеция” and “shvetsiya” for travelers from Russia, respectively. For Finland and Russia, two terms for “Sweden” were used in order to account for all the languages spoken in the respective sending country, thus preventing the data from being biased by language (Dergiades, Mavragani, and Pan 2018). As a result, Google suggested a maximum number of 700 keywords for the sending countries Denmark and Norway, followed by 144 keyword suggestions for the sending country Finland. For Russia and the United Kingdom, Google suggested 29 and 52 keywords, respectively. Additionally, as a specific extension, each of the keyword lists was extended by adding 290 Swedish tourist destinations (i.e., regions, cities, and villages).
For the extraction of query series (defined as the daily search volume for a given search term over a specific time period), a custom automatic crawling algorithm was developed. The algorithm starts by iterating over a list of seed keywords (the keywords described above) and extracts the corresponding query series. If no query series exists for a given keyword, the algorithm skips the keyword. If the iteration finds a query series, the list of keywords is expanded by keywords that users are likely to use within the same search session, as suggested by Google. Besides obtaining time series reflecting search behavior regarding Google Search (i.e., text search), as done by previous research, this study extends search queries to tourism-related video content as well, in order to optimally reflect the search behavior of potential tourists. Table 1 shows the number of search query series, which corresponds to the number of keywords used within tourism-related search queries, for both text and video search queries collected for each sending country.
Table 1. Number of Search Query Series per Sending Country.
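The crawling procedure described above can be sketched as a breadth-first traversal over keywords. This is a minimal illustration and not the crawler used in the study: the callables `fetch_series` and `fetch_related` are hypothetical stand-ins for whatever data access is available, and the sketch assumes that newly suggested keywords are themselves crawled, which the description leaves open.

```python
from collections import deque


def crawl_query_series(seed_keywords, fetch_series, fetch_related,
                       max_keywords=None):
    """Collect query series starting from a list of seed keywords.

    fetch_series(keyword)  -> list of search volumes, or None if no
                              query series exists (hypothetical).
    fetch_related(keyword) -> iterable of related keywords (hypothetical).
    max_keywords           -> optional cap on collected series.
    """
    queue = deque(seed_keywords)
    seen = set(seed_keywords)
    collected = {}
    while queue:
        if max_keywords is not None and len(collected) >= max_keywords:
            break
        keyword = queue.popleft()
        series = fetch_series(keyword)
        if series is None:
            continue  # no query series for this keyword: skip it
        collected[keyword] = series
        # Expand the keyword list with related, not yet visited keywords.
        for related in fetch_related(keyword):
            if related not in seen:
                seen.add(related)
                queue.append(related)
    return collected
```

Keeping a `seen` set prevents the same keyword from being requested twice when two keywords suggest each other.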
Normalization of search terms
Compared to existing literature, a step of normalization enhances the process of identifying relevant search terms, intending to improve predictive power and to reduce any kind of redundancy by removing synonymous or interchangeable search terms (B. Liu 2008). More specifically, search term normalization deals with identifying linguistic variations, synonyms, or even meaningless variations caused by misspellings. In this study, we handled the following types of variations concerning query names: first, queries containing the same terms but arranged differently (e.g., “sweden skiing” vs. “skiing sweden”); second, variations only caused by the existence of stop-words (e.g., “skiing sweden” vs. “skiing in sweden”); third, different keywords belonging to the same word stem (e.g., “ski” and “skiing”); fourth, the same search term appearing in different languages (e.g., “sverige sää” and “sweden weather”). Finally, queries may also differ by the usage of special characters, in this case by Nordic special characters {Å, å, Ä, ä, Æ, æ} and {Ø, ø, Ö, ö}, which were substituted by {a} and {o}, respectively.
To cope with search term variations, as mentioned above, first, typical text preprocessing techniques were executed, that is, tokenization, character substitution, stop-word elimination, and stemming (B. Liu 2008). Second, for each query we created a word vector, based on the contained search terms, and we calculated similarities based on cosine similarity (B. Liu 2008, p. 190). The cosine similarity cos(θ) between two vectors A and B is defined as follows:

$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^2} \, \sqrt{\sum_{i} B_i^2}}$$

with $A_i$ and $B_i$ being word occurrences of word vectors A and B, respectively. Search queries that contain the same words in principle but differ in word order, stop words, or word forms will still have a cosine similarity of one and, thus, were merged.
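The normalization and merging steps can be illustrated with a minimal sketch. Several elements are simplifying assumptions made only for illustration: the stop-word list is a tiny hand-picked set, the stemmer merely strips a trailing "ing", and the names `normalize`, `cosine_similarity`, and `same_query` are hypothetical. Cross-language variants (e.g., “sverige sää” vs. “sweden weather”) are not covered here.

```python
import re
from collections import Counter
from math import sqrt

# Illustrative assumptions: a tiny stop-word list and a crude stemmer.
STOP_WORDS = {"in", "to", "the", "of", "i", "til"}
# Nordic special characters are substituted by 'a' and 'o'.
CHAR_MAP = str.maketrans("ÅåÄäÆæØøÖö", "aaaaaaoooo")


def stem(token):
    """Very rough stemmer: 'skiing' -> 'ski'."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    return token


def normalize(query):
    """Tokenize, substitute characters, drop stop words, and stem.
    Returns a bag-of-words vector; word order is discarded."""
    text = query.lower().translate(CHAR_MAP)
    tokens = re.findall(r"\w+", text)
    return Counter(stem(t) for t in tokens if t not in STOP_WORDS)


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def same_query(q1, q2, tol=1e-9):
    """True if two queries would be merged (similarity of one)."""
    return cosine_similarity(normalize(q1), normalize(q2)) >= 1.0 - tol
```

The tolerance in `same_query` is a practical detail: with floating-point arithmetic, two identical vectors may yield a value marginally below one.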
Statistical Evaluation of the Arrival Series
Statistical techniques for time series prediction usually require the time series to comply with the characteristics of stationarity (Mukherjee, White, and Wuyts 1998, p. 335; Song et al. 2010; Frechtling 2011), that is, having a constant mean and variance (weak stationarity) and auto-correlations between two values being independent of the point in time within the series (strong stationarity) (Frechtling 2002).
To achieve stationarity, the time series were first analyzed for the existence of seasonal patterns, one of the main reasons for nonstationarity, especially in case of tourism arrival series. Yearly seasonality (i.e., a seasonal frequency of 12 months) has been tested by Maravall’s QS test (Maravall 2011) and the Kruskal-Wallis test. The QS test is a variant of the Ljung-Box test executed on seasonal lags, where only positive auto-correlations are considered (Maravall 2011). By contrast, the Kruskal-Wallis test (Kruskal and Wallis 1952) is similar to the Friedman test (Friedman 1937), where the observations are checked for significant variances in their period-specific mean ranks, with the difference that period-specific values of observations are assigned to ranks over the entire observation period. According to Webel and Ollech (2018), the Kruskal-Wallis test can be understood as a one-way analysis of variance without repeated measures.
Following the methodology of Webel and Ollech (2018), the QS test was applied twice: first to the original arrival series and second to the fitted residuals of a nonseasonal ARIMA model, estimated with the Hyndman-Khandakar algorithm (Hyndman and Khandakar 2008). Similarly, we applied the Kruskal-Wallis test to the residuals of the nonseasonal ARIMA model. If the p values of the QS tests are below 0.01 or the p value of the Kruskal-Wallis test is below 0.002, the Webel-Ollech test classifies the corresponding time series as seasonal (Webel and Ollech 2018).
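The decision rule described above can be written down as a short sketch. It assumes that the p values have already been computed elsewhere and reads "the p values of the QS tests" as requiring all QS p values to fall below the threshold; the function name and this reading are assumptions, not part of the cited procedure.

```python
def is_seasonal(qs_p_values, kw_p_value, qs_alpha=0.01, kw_alpha=0.002):
    """Classify a series as seasonal from precomputed QS and Kruskal-Wallis p values.

    Assumed reading of the rule: seasonal if all QS p values are below
    qs_alpha, or if the Kruskal-Wallis p value is below kw_alpha.
    """
    if not qs_p_values:
        raise ValueError("at least one QS p value is required")
    qs_seasonal = all(p < qs_alpha for p in qs_p_values)
    kw_seasonal = kw_p_value < kw_alpha
    return qs_seasonal or kw_seasonal
```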
Test results in Table 2 clearly show that seasonal patterns exist in the arrival time series as the p value of all QS tests for all arrival series is below 0.01.
Seasonality Tests on the Arrival Time Series.
Note: KW = Kruskal-Wallis.
For that reason, we further analyzed the time series for the strength of seasonal patterns with the method of X. Yang et al. (2015), and for the existence of seasonal unit roots using the Osborn-Chui-Smith-Birchenhall (OCSB) test (Osborn et al. 1988).
In a second step, the time series were analyzed for level and trend stationarity by the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test, which, in contrast to other tests, has stationarity as the null hypothesis (Hill, Griffith, and Lim 2011), and for covariance stationarity by the Augmented Dickey-Fuller (ADF) test, which identifies nonseasonal unit roots (Baddeley and Barrowclough 2009).
Test results in Table 3 show that all time series contain seasonal unit roots (Finland and Norway), are not trend or level stationary (Denmark and Russian Federation), or both (United Kingdom). While the arrival series for the Russian Federation and the United Kingdom are neither trend nor level stationary, the KPSS test applied to the arrival series for Denmark suggests the time series to be at least trend stationary at a 90 per cent significance level. In addition to the seasonal unit roots found, the seasonal strength of all time series is close to unity, thus confirming that the time series have strong seasonal patterns. According to X. Yang et al. (2015), time series should be differenced when the measure of seasonal strength exceeds 0.64. Therefore, we first differenced all time series with lag 12 to remove strong seasonality. Second, we again differenced the arrival time series for Denmark, the Russian Federation, and the United Kingdom with lag 1 in order to eliminate both level and trend nonstationarity as identified by the KPSS tests.
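The two transformation steps can be sketched as follows. The helper below is a generic lagged difference; the synthetic series in the usage note is invented for illustration and is not the study's data.

```python
def difference(series, lag):
    """Return the lagged difference y[t] - y[t - lag] for t = lag .. len(series) - 1."""
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Illustrative monthly series: a linear trend plus a fixed 12-month pattern.
season = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0, 11, 10]
monthly = [t + season[t % 12] for t in range(36)]

seasonal_diff = difference(monthly, 12)    # removes the 12-month pattern
fully_diff = difference(seasonal_diff, 1)  # removes the remaining level shift
```

For this synthetic series, the lag-12 difference leaves a constant (the yearly trend increment), and the additional lag-1 difference reduces it to zero. Each differencing step shortens the series by the lag used.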
Stationarity Tests for the Original Arrival Data Sets.
Table 4 shows that all time series, except the arrival series for the Russian Federation, are clearly stationary after transforming the time series as described above. As the p value of the KPSS test for Russia is only slightly above the threshold (0.110 > 0.100), no further action has been taken. Differencing with lag 12 clearly eliminated seasonal unit roots, and the strength of seasonality declined significantly. Therefore, no additional transformations were necessary.
Stationarity Tests for the Transformed Arrival Data Sets.
Construction of Aggregated Web Search Indices
Both Song et al. (2010) and Y. Liu et al. (2012) see search queries as a manifestation of tourists’ preferences and needs, accounting for important trends that are relevant for the development of tourism destinations. As dimension reduction techniques can increase predictive power in the case of a high-dimensional input space, different single search queries from the same sending country are aggregated into a compound search index (cf. X. Yang et al. 2015). As specific information needs of tourists differ in the various phases of the information and decision-making process, different search queries, satisfying different needs, will have a different time lag with the corresponding tourist arrivals. Thus, before aggregating multiple search queries into a compound search index, the predominant time lag has to be identified for each search query, and each single search query is shifted by the appropriate time lag before being incorporated into the search index. The described approach proved its capability to increase forecasting performance significantly (Y. Liu et al. 2012).
Previous literature often used the Pearson correlation coefficient to identify the time lag leading to the highest correlation between a search query series and the corresponding arrival series (X. Yang et al. 2015; Xiaoxuan et al. 2016; Pan et al. 2017). Thus, the first step of the index aggregation procedure is the calculation of the Pearson correlations between each search query and the corresponding arrival series for a time lag of zero to six months in order to cope with both short- and midterm travel planning behavior (Fesenmaier et al. 2010). Based on these correlation coefficients, each search query series was shifted toward the arrival series by its most dominant time lag (i.e., the time lag with the highest correlation) and weighted by the corresponding squared correlation coefficient. The approach of shifting each single search term series by its most dominant time lag before aggregating all search term series is justified by the fact that search terms are mostly used in one dominant month ahead of arrival and, thus, the corresponding time series shows a strong peak in this dominating month. According to the literature (X. Yang et al. 2015), online search activities executed immediately prior to arrival typically have no predictive power in forecasting tourism demand. Thus, queries with a dominant time lag of zero are excluded from the index building process.
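A minimal sketch of the lag selection and weighting step is given below. It assumes that query and arrival series are equally long monthly lists and that the weighted, lag-shifted series are combined by summation; the aggregation by summation and all function names are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def dominant_lag(query, arrivals, max_lag=6):
    """Return (lag, r) for the lag in 0..max_lag that maximizes corr(query[t - lag], arrivals[t])."""
    best_lag, best_r = 0, -2.0
    for lag in range(max_lag + 1):
        r = pearson(query[:len(query) - lag], arrivals[lag:])
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r

def build_index(queries, arrivals, max_lag=6):
    """Sum lag-shifted query series weighted by r^2; queries with a dominant lag of zero are skipped."""
    n = len(arrivals)
    index = [0.0] * (n - max_lag)  # covers months t = max_lag .. n - 1
    for query in queries:
        lag, r = dominant_lag(query, arrivals, max_lag)
        if lag == 0:
            continue  # no lead time, hence excluded from the index
        for k, t in enumerate(range(max_lag, n)):
            index[k] += (r * r) * query[t - lag]
    return index
```

The index starts at month max_lag because earlier months lack the required history for the largest admissible shift.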
G. Peng et al. (2017) have shown that an input series has a higher predictive power, if it follows the same autocorrelative patterns as the target series, measured by the Hurst exponent of both series. Hence, as an attempt to further increase the predictive power of the search index, we excluded query series with a significantly different Hurst exponent than the corresponding arrival series from the search index as well. However, we could show that the exclusion of the affected time series could not improve the forecasting accuracy in this specific case.
As a final step of input parameter selection, a backward-stepwise regression was executed to successively remove query series from the search index that do not lead to a significant increase of the correlation with the corresponding arrival series (Roecker 1991), thereby increasing model parsimony and generalizability (B. Liu 2008).
To ensure stationarity of the search index series, analogous to the arrival series, the (seasonal) unit root and stationarity tests were applied to the search index time series as well. Test results in Table 5 clearly show the absence of (seasonal) unit roots and, therefore, confirm stationarity. Thus, no further transformations had to be applied to the search index time series.
Stationarity Tests for the Search Index Data Sets.
Figure 2 shows the stationary tourist arrival series and the corresponding aggregated search index series. The high conformity of the time series demonstrates a potentially high predictive power of the aggregated search index related to tourist arrivals.

Monthly tourist arrivals (solid line) and corresponding search indices (dashed line) normalized with a range between 0 and 100.
The Pearson correlations between the arrival series and the corresponding search index series, shown in Table 6, confirm this.
Correlation between Arrival Series and Corresponding Indices.
Importantly, the exact overlap of the main peaks of the arrival series and the corresponding search index series is simply caused by the fact that for each search term the corresponding search query series is shifted toward the arrival series by their dominant time lag, as explained before.
Model Building
While typical autoregressive (i.e., univariate) forecasting approaches use past arrivals as the only prediction input, the study at hand makes use of web search traffic as an additional input variable (i.e., bivariate approach) (Hill, Griffith, and Lim 2011; Frechtling et al. 2011). The aim is to evaluate whether the web search volume (for relevant search terms in the form of Google Trends data), used as additional input for predicting tourist arrivals, can increase predictive accuracy compared to using past arrivals alone (research proposition 1).
This study first makes use of a traditional ARIMA model, proposed by Box and Jenkins (1970), to predict tourist arrivals autoregressively (i.e., univariate) and a regression model with ARIMA errors for the prediction with the bivariate data sets consisting of the tourist arrivals series and the corresponding aggregated search indices. Regression models with ARIMA errors take the form of linear regression models with
$$y_t = \beta_0 + \beta_1 x_{1,t} + \dots + \beta_k x_{k,t} + \eta_t$$
where y_t is the target series, x_{1,t} to x_{k,t} are the predictor series, and η_t describes the ARIMA error for a given model (Hyndman and Athanasopoulos 2018). In both cases (i.e., univariate and bivariate), we chose appropriate ARIMA models by using the Hyndman-Khandakar algorithm, which selects the model with the lowest AIC (Akaike information criterion) by estimating models with different combinations of the p and q parameters (i.e., the order of the autoregressive part and the order of the moving average part). We determined the optimal parameter combinations by a stepwise search, traversing the model space (Hyndman and Khandakar 2008).
Because the degree of differencing is zero for all time series (the series were made stationary beforehand), the fitted models are equivalent to ARMA models for the autoregressive prediction and regression models with ARMA errors for the bivariate data sets, respectively. Table 7 shows the models fitted to the univariate as well as the bivariate data (i.e., the arrivals series with corresponding aggregated search indices).
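For the bivariate case with a single predictor, the model structure can be written out as a worked equation. This is a sketch in the standard textbook notation, assuming y_t denotes the stationary arrival series, x_t the aggregated search index, ε_t a white-noise term, and φ and θ the autoregressive and moving average coefficients; these symbols are chosen here for illustration.

$$y_t = \beta_0 + \beta_1 x_t + \eta_t, \qquad \eta_t = \phi_1 \eta_{t-1} + \dots + \phi_p \eta_{t-p} + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t$$

Dropping the regression term β_1 x_t yields the univariate ARMA(p, q) case. Candidate (p, q) combinations are then compared by their AIC, which in its general form is

$$\mathrm{AIC} = 2k - 2\ln(\hat{L})$$

with k the number of estimated parameters and L̂ the maximized likelihood, so that additional parameters are only accepted when they improve the fit sufficiently.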
ARIMA Models.
In addition to ARIMA models, we applied ANN models in this study, representing a modern machine learning approach for time series prediction (Kamel et al. 2008). ANNs represent a well-known type of machine learning technique, used for both supervised learning (i.e., classification, estimation, or prediction) and unsupervised learning (i.e., clustering) (Du and Swamy 2019). ANNs imitate the human brain and consist of neurons (called nodes) responsible for identifying certain patterns, that is, correlations of attributes (McCulloch and Pitts 1943). A specific form of ANN is a feed-forward, fully connected network, called multilayer perceptron (MLP) (Rosenblatt 1961). An MLP consists of an input layer, one or more hidden layers, and an output layer, where each node at one layer passes on information to all nodes at the consecutive layer (feed-forward); thus, each node is connected with all nodes of the subsequent layer (fully connected). In each node, an activation function takes the sum of all weighted inputs, provided by the nodes of the preceding layer, and calculates a corresponding output, passed on to all nodes of the subsequent layer. Nonlinear activation functions (e.g., sigmoid, tanh, and Gaussian) enable the MLP to learn complex and nonlinear patterns. While the nodes of the input layer simply represent the input (i.e., independent) attributes and the node of the output layer the target (i.e., dependent) attribute, the nodes of the hidden layers represent patterns that help to identify the correct target attribute values as precisely as possible. Most importantly, these patterns are not predefined; the system learns them automatically, which constitutes the high flexibility and predictive power of MLPs. In fact, the process of learning meaningful patterns means adapting the weights of the edges between two nodes in order to reduce the overall prediction error (typically the sum of squared errors (SSE) of all training data examples).
We use the gradient descent optimization algorithm in order to iteratively find the optimal weights minimizing the SSE. We also use the back-propagation algorithm to propagate the necessary weight adaptations, identified by the gradient descent method, back to the preceding layers within the network (Rumelhart, Hinton, and Williams 1986). The hyperparameters learning rate and momentum specify how fast the weights are adapted and how fast the direction of weight adaptations can be changed in order to avoid oscillating behavior. The well-known problem of the vanishing gradient constitutes an upper limit for the number of hidden layers of an MLP. Previous research has shown that in most cases a single hidden layer already leads to optimal results (Hochreiter 1998).
In this study, the ANN model represents an MLP neural network, using back-propagation for optimization (Rumelhart, Hinton, and Williams 1986) and a single hidden layer as a common architecture for ANNs in research practice (Kamel et al. 2008). More precisely, the number of neurons in the hidden layer is set to the number of input variables divided by 2 plus 1. A sigmoid activation function is employed within the hidden layer. The hyperparameter error epsilon was set to 1.0E–5 for all prediction models. The hyperparameters learning rate, momentum, and training cycles have been set differently for each prediction model (cf. Table 8).
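The described architecture and training procedure can be illustrated with a small, self-contained sketch. It is not the RapidMiner implementation used in the study: it assumes integer division for the hidden layer size, a linear output node, full-batch updates, and illustrative values for learning rate and momentum; the class and function names are hypothetical.

```python
import math
import random

def hidden_size(n_inputs):
    """Hidden nodes = number of inputs divided by 2, plus 1 (integer division assumed)."""
    return n_inputs // 2 + 1

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyMLP:
    """Single-hidden-layer MLP with sigmoid hidden nodes and a linear output (assumption)."""

    def __init__(self, n_inputs, seed=0):
        rng = random.Random(seed)
        self.h = hidden_size(n_inputs)
        # One weight row per hidden node; the last entry of each row is the bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                   for _ in range(self.h)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(self.h + 1)]
        # Velocity terms for the momentum update.
        self.v1 = [[0.0] * (n_inputs + 1) for _ in range(self.h)]
        self.v2 = [0.0] * (self.h + 1)

    def forward(self, x):
        xb = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w1]
        out = sum(w * v for w, v in zip(self.w2, hidden + [1.0]))
        return hidden, out

    def sse(self, X, y):
        return sum((self.forward(x)[1] - t) ** 2 for x, t in zip(X, y))

    def train_epoch(self, X, y, lr=0.05, momentum=0.5):
        """One full-batch gradient descent step with momentum on 0.5 * SSE."""
        g1 = [[0.0] * len(row) for row in self.w1]
        g2 = [0.0] * len(self.w2)
        for x, t in zip(X, y):
            xb = list(x) + [1.0]
            hidden, out = self.forward(x)
            err = out - t  # derivative of 0.5 * (out - t)^2 with respect to out
            for j, hj in enumerate(hidden + [1.0]):
                g2[j] += err * hj
            for j, hj in enumerate(hidden):
                # Error propagated back through the sigmoid of hidden node j.
                delta = err * self.w2[j] * hj * (1.0 - hj)
                for i, xi in enumerate(xb):
                    g1[j][i] += delta * xi
        for j in range(len(self.w2)):
            self.v2[j] = momentum * self.v2[j] - lr * g2[j]
            self.w2[j] += self.v2[j]
        for j in range(self.h):
            for i in range(len(self.w1[j])):
                self.v1[j][i] = momentum * self.v1[j][i] - lr * g1[j][i]
                self.w1[j][i] += self.v1[j][i]
```

Calling train_epoch repeatedly corresponds to the training cycles hyperparameter; a stopping rule based on an error threshold, comparable to the error epsilon mentioned above, could be added by comparing sse between consecutive epochs.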
Neural Network Models.
To summarize, the study at hand evaluates whether the machine learning technique ANN achieves higher prediction accuracy than the statistical approach of ARIMA, especially when making use of Google Trends data as an additional “exogenous” variable (research proposition 2).
Finally, we used a six-month forecasting horizon to validate the forecasting performance for a midterm prediction scenario. The remaining prediction error is usually used to measure forecast accuracy (Frechtling 2002; Song et al. 2010) and typically operationalized by the root mean square error (RMSE).
In addition to the performance measure itself, an appropriate validation method has to be chosen (Frechtling 2002; Kennedy 2010). As a traditionally used method for estimating forecasting performance in time series data, we evaluated the prediction accuracy based on an out-of-sample validation (Mozetic, Cerqueira, and Torgo 2019). The training data set consisted of 83 samples from August 2009 until June 2016 and the test data set of 6 samples, from July 2016 until December 2016 (Chatfield 2000).
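The validation setup can be sketched in a few lines. The split keeps the temporal order and holds out the last six observations, and the RMSE follows its usual definition; the function names are illustrative.

```python
import math

def train_test_split_ts(series, horizon=6):
    """Split a time series in temporal order: the last `horizon` values form the test set."""
    if not 0 < horizon < len(series):
        raise ValueError("horizon must be between 1 and len(series) - 1")
    return series[:-horizon], series[-horizon:]

def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))
```

With 89 monthly observations and a horizon of six, this split yields the 83 training and 6 test samples described above.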
As a final step of validating the quality of the learned prediction model, the Shapiro-Wilk test is used to test whether the residuals are normally distributed and, thus, do not contain any remaining patterns or information (Hill, Griffith, and Lim 2011).
Results and Discussion
In this study, we evaluated four different approaches for predicting tourist arrivals. An autoregressive approach based solely on tourist arrivals themselves and an approach based on web search traffic as additional model input were used in order to validate research proposition 1. Additionally, for both approaches, we employed an ARIMA model and a neural network MLP in order to answer research proposition 2. All four prediction models were learned for all five sending countries separately. For evaluating midterm forecasting capabilities, we evaluated the prediction models with a forecasting horizon of 6 months. As an example, Figure 3 shows monthly tourist arrivals and the corresponding predicted tourist arrivals based on the forecasting approach ANN with Google Trends data. The high prediction accuracy of the presented approach becomes evident.

Predicted values (dashed line) versus stationary arrival series (solid line) from July 2016 to December 2016 based on an artificial neural network (ANN) with Google Trends data.
Table 9 shows the overall results for all four prediction models and Sweden’s major sending countries. For the ARIMA and the ANN approach, we calculated the relative difference between the purely autoregressive models and the approach including Google Trends data (where negative values indicate an error reduction by adding Google Trends data), in order to assess research proposition 1. Furthermore, we used a Shapiro-Wilk test for both models using Google Trends data to check if residuals (error terms) comply with a normal distribution. The latter is considered an important indicator that the residuals do not include any kind of information or patterns that have not been recognized by the regression model (Kennedy 2010; Hill, Griffith, and Lim 2011). Finally, the two right-most columns show the difference between the ARIMA model and the ANN regression for the autoregressive approach as well as the approach adding Google Trends data (i.e., assessment of research proposition 2).
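The relative difference can be computed as sketched below. The text only fixes the sign convention (negative values indicate an error reduction), so the choice of the purely autoregressive RMSE as denominator is an assumption, as is the function name.

```python
def relative_change(rmse_baseline, rmse_extended):
    """Relative RMSE difference in percent; negative values mean the extended model has a lower error.

    Assumption: the baseline (purely autoregressive) RMSE is used as the denominator.
    """
    if rmse_baseline <= 0:
        raise ValueError("baseline RMSE must be positive")
    return (rmse_extended - rmse_baseline) / rmse_baseline * 100.0
```

For example, an RMSE that drops from 200 to 100 after adding the search index corresponds to a relative difference of -50%.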
Results for All Four Prediction Models and Sending Countries.
Note: RMSE = root mean square error.
The root mean squared error (RMSE) values for the four different prediction models in Table 9 show satisfactory results. When forecasting autoregressively, the ARIMA models achieved better results for the sending countries Denmark, Finland, and Russian Federation, while the autoregressive forecasts for Norway and the United Kingdom using ANNs were more accurate. On average, the prediction errors for the autoregressive ANN models were slightly higher than the RMSE achieved by the autoregressive ARIMA models (+2.4%). In contrast, nearly all the neural network–based models led to significantly more accurate forecasting results compared to the ARIMA models, when including Google Trends data in the prediction. The inclusion of Google Trends data in the ANN models led to an RMSE reduction for all sending countries (except United Kingdom) of at least 18.61% and a maximum RMSE reduction of 78.91%. On average, the inclusion of Google Trends data achieved an RMSE reduction of 47.75%. In this case, the average RMSE for the neural network models was slightly lower compared with the corresponding ARIMA models. The results of the Shapiro-Wilk test underpin the superior results of the ANN models with Google Trends data, where for all sending countries the hypothesis of normal distribution is not rejected (p values clearly >0.05). Regarding the ARIMA models, the results of the Shapiro-Wilk test indicate that at least for the sending country Denmark, the hypothesis of normal distribution of the residuals has to be rejected (p values <0.05), and, thus, ARIMA could not fit the data for the Denmark data set optimally. To summarize, we view the model fit and forecasting accuracy of all proposed approaches as sufficient to meaningfully substantiate the research propositions of this study.
Empirical findings in Table 9 demonstrate that extending a purely autoregressive approach by adding Google Trends data reduces the prediction error, both for the ARIMA and the ANN approach. On average, the prediction errors for both models (i.e., ARIMA and ANN) declined by nearly 50% compared to the corresponding autoregressive prediction errors. Thus, adding Google Trends data to an autoregressive approach to predict tourist arrivals clearly reduces the prediction error and confirms research proposition 1. This demonstrates (1) that tourists make extensive use of Internet search engines, like Google, during their information gathering and travel planning phase, and (2) that different search terms can be assigned to different travel planning phases, as search terms have a dominant time lag, which is a prerequisite for aggregating them into a compound search index.
When comparing the two different estimation methods, on average the ANN models slightly outperform the ARIMA models when including Google Trends data. Overall, the RMSE obtained by the ANN models with Google Trends data is 4.2% lower compared to the corresponding ARIMA models. Thus, ANN-based estimation models tend to outperform the ARIMA models when adding Google Trends data. Consequently, we can confirm research proposition 2 as well. This demonstrates the ability of ANNs to learn complex and nonlinear patterns effectively, which turns out to be a clear advantage relative to linear models, especially when adding multiple and diverse input data, like Google Trends data (Kamel et al. 2008). While ARIMA delivers competitive or even slightly superior results in a standard autoregressive setting, ARIMA cannot benefit from adding Google Trends data to the same extent as ANNs do. The robustness of ANNs against correlated, irrelevant, or even biased input data enables ANNs to fully benefit from the predictive power of the relevant search volume on a search engine, like Google.
However, Google Trends data is not the only new potential input to be added to tourism demand prediction. Other big data sources, such as e-reviews (user-generated content), social media interactions, network data, etc., constitute promising data input as well (Mariani et al. 2018). Therefore, we expect machine learning–based techniques, such as ANNs, to gain more attention in tourism demand forecasting and in big data analytics in the future.
Conclusions
This study presented a novel approach to extend the autoregressive time-series forecasting method by adding the web-search traffic of tourists as external input for predicting tourist arrivals. More concretely, the study introduced a new method to identify and aggregate relevant Google search terms into a compound web-search index, serving as an additional input variable to an autoregressive forecasting approach. As prediction methods, the study compared the forecasting performance (i.e., accuracy in terms of RMSE) gained by the statistical approach of ARIMA with those gained by the machine learning–based technique ANN. The forecasting study was conducted for Sweden, using arrival data of Sweden’s major sending countries (i.e., Denmark, Finland, Norway, Russian Federation, and the United Kingdom) for the period 2008–2016, and corresponding Google Trends data (i.e., text and video search) as a big data source. All statistical computations used R-Statistics, while the algorithm for retrieving search queries used Spyder® (Scientific Python Development Environment). Rapid Miner Studio® was used in the study for the machine learning processes for both learning and evaluating the prediction models.
Study results clearly show that extending purely autoregressive forecasting approaches by Google Trends data significantly reduces the prediction error. Google Trends data can be used as an effective data source to increase forecasting accuracy for the midterm range of future tourism demand and demand fluctuations. Findings of this study constitute a remarkable insight, as the web search volume in the form of Google Trends data is only one out of many available big data sources (e.g., web navigation behavior, social media interactions, and e-reviews), likely to have a similar potential to increase the accuracy of tourism demand prediction (Mariani et al. 2018). Therefore, we envisage that adding further big data sources can increase the precision of tourist arrival predictions in the future (Kamel et al. 2008).
Furthermore, the study demonstrated that when comparing statistical approaches, such as ARIMA models, with machine learning–based approaches, like ANNs, the latter tend to outperform the former when using the big data–based approach. Interestingly, the big data approach benefits relatively more from applying ANNs than the autoregressive approach, which can be explained by the high flexibility and robustness of ANNs, which are especially suitable for a multidimensional input space (Haykin 2008). Thus, for future scenarios that include additional big data sources as input to tourism demand prediction, ANNs, or similar machine learning techniques, like deep learning, will most likely play a dominant role in reliably predicting tourist arrivals.
To summarize, as the main scientific contribution, the study showed that (1) big data sources, like Google Trends data, have strong potential to increase the prediction performance of tourist arrivals compared with autoregressive approaches. Furthermore, (2) machine learning techniques, like ANN, have the potential to outperform statistical approaches, such as ARIMA, when adding search query indices obtained from Google Trends.
Additionally, the study results are of high practical relevance. Because of the perishable nature of tourism services, precise and reliable demand predictions are of utmost importance for tourism stakeholders and decision makers (Grönroos 2008; Edgell et al. 2008). The extension of traditional (e.g., autoregressive) approaches by integrating big data sources does not only increase the forecasting precision, but specifically enables the prediction of demand fluctuations in extraordinary, or even singular circumstances (e.g., shocks, such as financial and economic crises, natural disasters, mega events, and epidemics), especially when autoregressive approaches fail systematically (Y. C. Chen, Kang, and Yang 2007; Song and Lin 2010; Song, Qiu, and Park 2019).
When looking at major study limitations, first, the cosine similarity used has limited capabilities for matching semantically identical search terms when constructing the aggregated search index. Thus, in order to automatically and more powerfully detect semantically identical search terms, text-mining approaches are recommended for future studies (see Schmunk et al. 2014; Menner et al. 2016; Höpken et al. 2017a). Second, while the current study makes use of web search traffic in terms of Google Trends data, other big data sources, for example, web navigation behavior, social media interactions, and e-reviews, show a similarly strong potential to serve as input for tourism demand modeling and prediction (Kamel et al. 2008; Mariani et al. 2018). Therefore, future studies in the domain of tourism demand forecasting should extend the presented approach to other big data sources. Third, the current study is potentially limited by using a feed-forward MLP, learned by back-propagation with gradient descent, as machine learning approach. MLPs are typically limited to a relatively small number of hidden layers (i.e., only one in the case of this study) due to the vanishing gradient problem typically occurring within back-propagation over multiple hidden layers (Hochreiter et al. 2001). Deep learning approaches might best overcome this limitation, as the use of a high number of hidden layers can cope with large model complexity and flexibility and, ultimately, increase explanatory power (Schmidhuber 2015). Therefore, deep learning might constitute a promising future forecasting approach, especially when adding multiple big data sources as additional input to the prediction of tourism demand. Finally, the identified search queries and their corresponding time lags with a high correlation with future tourist arrivals are an excellent input to analyze tourists’ online search behavior (Höpken et al. 2019). Thus, identifying concrete search terms that actually precede tourism demand would constitute valuable input to destination marketing generally and search engine marketing and optimization specifically.
Acknowledgements
The authors would like to acknowledge contributions to this article from the project “The Knowledge Destination II” funded by the European Regional Development Fund (project ID 20200778) and run in collaboration with Region Halland, Sweden, and the destinations of Halmstad, Varberg, Falkenberg, Hylte, Kungsbacka, and Laholm.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
