Abstract
We examine whether Google’s search volume indices help economic agents predict, in real time, the number of travellers checked in and their overnight stays in Spain. Using a dynamic factor approach and a real-time database of vintages that reproduces the exact information available to a forecaster at each point in time, we show that models that include Google’s query volume indices outperform models that exclude these leading indicators. We are thus the first to find conclusive evidence that tourism-related queries help to improve tourism forecasts in Spain. This finding is significant for the literature because Spain is one of the world’s top tourism destinations and depends heavily on tourism.
Introduction
Spain is one of the world’s top tourism destinations, and its economy is highly dependent on tourism. In 2016, according to the World Tourism Organization, Spain ranked second in international tourism receipts, with US$60.3 billion, behind only the United States. By volume of international arrivals, Spain ranked third, with 75.6 million tourists, after France and the United States. In 2015, as reported in the latest publication of the Spanish Tourism Satellite Account, tourism activity amounted to 11.1% of gross domestic product (GDP).
Given these magnitudes, accurate forecasts of current and upcoming tourism activity are of primary importance for policy authorities in assessing overall economic developments. Timely information about the evolution of tourism is also crucial for the hospitality and tourism industry, which needs to find and develop new means of distributing travel and hospitality products and services, to manage marketing information for consumers and to provide comfort and convenience to travellers. Unfortunately, in spite of these real-time monitoring requirements, data on the number of travellers checked in and on the number of overnight stays, the two major measures of tourism in Spain, are published monthly with a 1-month lag, which makes forecasting difficult.
In this article, we follow the idea that the increasingly widespread use of the Internet by travellers has created a potentially useful source of leading tourism indicators that could help both policy authorities and the tourist industry to make early assessments of tourism performance. The tourist industry has been among the first to capitalize on new technology, and the number of travellers who use the Internet to plan and book their business and pleasure trips has grown significantly during the last decade. In line with these developments, recent literature has focused on exploiting the valuable information that search query data provide about tourists’ behaviour. Google’s dominance among search engines makes it a reliable representative from which to examine the forecasting content of search results.
Without claiming to be exhaustive, we highlight several contributions. Pan et al. (2012) showed that including information about aggregated search trends improved the weekly forecast accuracy of demand for hotel rooms in South Carolina. Jackman and Naitram (2015) found that air passenger arrivals in Barbados from Canada and the United Kingdom could be better predicted one week ahead by including a Google trends series with queries performed from those two countries. Li et al. (2017) used a generalized dynamic factor model to extract a weekly ‘search index’ based on Google trends data and obtained out-of-sample improvements in the forecast accuracy of tourist arrivals in Beijing. Yang et al. (2015) examined the predictive power of the queries entered into search engines for the number of visitors to Hainan (China). Bangwayo-Skeete and Skeete (2015) used query trends from Canada, the United States and the United Kingdom to forecast monthly tourist arrivals in five Caribbean countries 12 months ahead. Rivera (2016) found that including information about query trends from the United States helps to improve forecasting accuracy at a 12-month horizon, but not for short-term forecasts.
In spite of these promising results for other countries, the potential of Internet search data to forecast tourism in Spain has received little attention. To the best of our knowledge, only Artola and Galan (2012) presented a very specific application for the Spanish economy, namely British tourists (the Spanish tourist industry’s main clients) visiting Spain. Although they computed an adjusted indicator of the flow of British tourists with a lead of almost 1 month, the forecasting improvement delivered by their short-term models was very limited. They therefore suggested that further research explore the information available from other countries to compute leading indicators of incoming tourists.
This study aims to fill this gap by contributing to the literature in several ways. In collaboration with Google, we develop a novel data set that collects information on the volume of queries associated with specific tourism-related terms from a set of specific countries. This Google search volume data set reports the real-time evolution of queries related to various tourism industries in the online travel market and to the use of the Internet and e-commerce for travel.
The data set based on search volumes departs from Google trends data in two main respects. First, the search volume data set relates to the total number of queries for a set of terms from a specific country, while Google trends refers to the popularity of a specific term relative to the total searches performed over a specific time range and geography. Second, while Google trends data come from a periodic random sample of search data, search volumes are always collected from a larger, but fixed, sample of queries regardless of the moment when the data are extracted.
In particular, the Google search volume data set is built at the country level for queries made from Austria, Germany, France, Ireland, Italy, Switzerland, the United States and the United Kingdom, which accounted for almost two-thirds of total non-resident overnight stays in Spanish hotels during 2015. The query volumes relate to travel facilities (air, ferries, bus and rail), accommodation (hotel, holiday rental and camping), vacation packages and general matters about travel and destination (city and short trips, activities, weather and car rental). This amounts to a total of 65 series of search volumes from eight different countries in real time.
To deal with this large amount of information, we rely on dynamic factor models (Stock and Watson, 2011). Within this framework, the goal is to explain the maximum amount of variance in the search volumes with the smallest number of common factors. We therefore allow all the information contained in the series to be potentially valuable, and we extract the relevant signals on query volume dynamics in a small number of common components. We then examine the usefulness of this information for improving the accuracy of real-time short-term forecasts of the number of travellers checked in and of overnight stays.
Our results suggest that the model using search volumes yields significant forecasting improvements over benchmark predictions computed from standard autoregressive specifications. To show the advantages of our proposal, we develop a pseudo real-time forecasting exercise, carried out recursively from September 2014 to January 2016. With every new vintage of data, the model is re-estimated and the forecasts for different horizons are computed. The vintages are constructed to account for the lack of synchronicity in data publication that characterizes real-time data, by mimicking the actual chronological order of the data releases. On each forecasting day in month t, the model predicts the tourism data for month t − 1 (backcast), month t (nowcast) and month t + 1 (forecast). Although the gains depend on the forecasting horizon, we find that query volumes improve real-time forecasts of the tourist indicators at all horizons.
The structure of this article is as follows. The ‘Dynamic factor models’ section outlines the dynamic factor model, which relates the tourism indicators to be forecast to the set of Google search volumes. The ‘Empirical results’ section analyses the estimated factors and examines the empirical performance of Google query volumes in forecasting tourism indicators in Spain. The ‘Conclusions’ section concludes and proposes several lines of future research.
Dynamic factor models
Models that manage large sets of indicators typically face a trade-off between the need to reduce the data and the cost of discarding relevant information. Factor models are traditional dimensionality reduction techniques that mitigate this problem by summarizing the cross-sectional dynamics in a few common factors (Geweke, 1977; Sargent and Sims, 1977). The estimated factors can then be used to provide efficient forecasts of a target variable in a simple linear regression. Prominent examples can be found in Stock and Watson (2002a, 2002b), Bai (2003) and Forni et al. (2005).
The forecast problem can be described using two basic equations. Let $y_t$ be either the number of travellers checked in or the number of overnight stays, the target series to forecast. Let $X_t$ be an $N$-dimensional vector of search volumes. Assume that the query volumes admit a factor model representation, that is, the evolution of the time series can be decomposed as the sum of $r$ common unobserved factors, $F_t$, and their respective idiosyncratic dynamics, $e_t$:
$$X_t = \Lambda F_t + e_t,$$
where $\Lambda$ is an $N \times r$ matrix of factor loadings and $e_t$ is an $N \times 1$ vector of independent idiosyncratic disturbances. Provided that $F_{t+h}$ is available, the $h$-horizon forecasting equation is
$$y_{t+h} = \mu + \beta(L)' F_{t+h} + \alpha(L) y_t + \gamma HW_{t+h} + \varepsilon_{t+h},$$
where $\mu$ is a constant, $\beta(L)$ is a vector lag polynomial, $\alpha(L)$ is a scalar lag polynomial and $\varepsilon_{t+h}$ is the forecast error. The term $HW_t$ is a dummy variable that takes the value one if Holy Week falls in month $t$, and $\gamma$ is its coefficient. Once the model is estimated, the forecast is computed as
$$\hat{y}_{t+h} = \hat{\mu} + \hat{\beta}(L)' \hat{F}_{t+h} + \hat{\alpha}(L) y_t + \hat{\gamma} HW_{t+h},$$
where hats denote estimated quantities.
In order to estimate the unobserved common factors, we follow the influential contribution of Stock and Watson (2002a). Skipping details, the methodology estimates the dynamic factors by principal components. Following their notation, it is possible to write the nonlinear least squares objective function
$$V(\tilde{F}, \tilde{\Lambda}) = (NT)^{-1} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( x_{it} - \tilde{\lambda}_i \tilde{F}_t \right)^2$$
as a function of hypothetical values of the factors, $\tilde{F}$, and of the factor loadings, $\tilde{\Lambda}$, where $\tilde{\lambda}_i$ is the $i$th row of $\tilde{\Lambda}$.
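As a minimal sketch of the principal components step, the factors can be obtained from the eigenvectors of the cross-product matrix of the search volume series. The function name `extract_factors`, the standardization of each series and the synthetic data are assumptions made for illustration; they are not taken from the article.

```python
import numpy as np

def extract_factors(X, r):
    """Principal-components estimate of r common factors.

    X is a T x N array (T periods, N search-volume series).
    Returns (F_hat, Lambda_hat) of shapes T x r and N x r.
    """
    # Assumption: standardize each series so none dominates the variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    T, N = Z.shape
    # Eigen-decomposition of the symmetric N x N matrix Z'Z.
    eigvals, eigvecs = np.linalg.eigh(Z.T @ Z)
    # eigh returns eigenvalues in ascending order; keep the r largest.
    order = np.argsort(eigvals)[::-1][:r]
    # Normalization Lambda'Lambda / N = I_r.
    Lambda_hat = np.sqrt(N) * eigvecs[:, order]
    F_hat = Z @ Lambda_hat / N
    return F_hat, Lambda_hat

# Synthetic example: 100 months, 65 series driven by 2 common factors.
rng = np.random.default_rng(0)
F_true = rng.standard_normal((100, 2))
loadings = rng.standard_normal((65, 2))
X = F_true @ loadings.T + 0.5 * rng.standard_normal((100, 65))
F_hat, Lambda_hat = extract_factors(X, r=2)
```

The estimated factors are identified only up to sign and rotation, so in applied work they are usually interpreted through their relation to the individual series rather than through their raw values.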
Empirical results
Data description
Due to the widespread popularity of the Internet, a growing number of travellers use web search engines to plan their trips and stays. Anonymized Google searches have been used to construct weekly indices that capture the relevant information on the trips and stays that travellers take and intend to take. The search volumes used to obtain all the results in this article come from weekly reports on indexed volumes of different search term baskets related to various tourism industries, covering the period from the first week of July 2007 to the second week of January 2016.
This data set, based on search volumes, differs from data sets collected from Google trends in two main respects. First, Google trends is an index of the popularity of a specific term relative to the total searches performed over a specific time range and geography. Google trends data are typically scaled to a range of 0 to 100, while search volumes are indexed to a value of 100 at the first observation of the sample. The second distinctive feature of our data set has to do with randomization. While Google trends data come from periodic random samples of search data that change every week, search volumes are always collected from a larger, but fixed, sample of queries regardless of the moment when the data are extracted.
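The difference between the two scalings can be illustrated with a small sketch. The weekly counts and the helper names `scale_to_peak` and `rebase_to_first` are hypothetical, and the popularity index is only a stylized stand-in for how a share-based index behaves.

```python
def scale_to_peak(values):
    """Popularity-style index: rescale so the largest value equals 100."""
    peak = max(values)
    return [100.0 * (v / peak) for v in values]

def rebase_to_first(values):
    """Volume-style index: rescale so the first observation equals 100."""
    base = values[0]
    return [100.0 * (v / base) for v in values]

# Made-up weekly counts: queries for one term and total queries.
term_queries = [400.0, 500.0, 800.0, 600.0]
total_queries = [10000.0, 10000.0, 20000.0, 10000.0]

# Share of the term in total searches, then scaled to a 0-100 range.
shares = [q / n for q, n in zip(term_queries, total_queries)]
trends_like = scale_to_peak(shares)

# Total number of queries for the term, indexed to 100 at the start.
volume_like = rebase_to_first(term_queries)
```

In this made-up example the volume index peaks in the third week, when the term is searched most often, while the popularity index peaks in the fourth week, when the term has its largest share of all searches.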
Table 1 summarizes the searches related to tourism, the country of origin and the availability of the data. Classified by country of origin, search volumes show how often several travel-related topics have been searched for on Google over time. The searches were collected from Austria, Germany, France, Ireland, Italy, Switzerland, the United States and the United Kingdom, which accounted for 62% of total non-resident overnight stays in Spanish hotels during 2015.
Table 1. Query volume series available per country.
Note: The symbol a (na) means that the query volume was (not) available for that country.
The query volume indices rely on searches on travel facilities (air, ferries, bus and rail), accommodation (hotel, holiday rental and camping), vacation packages and general travel and destination (city and short trips, activities, weather and car rental). As noted above, each search volume index starts from a large sample of the total query volume related to a specific term in a specific country, divided by a constant at a point in time. The resulting figures are then normalized so that they start at 100 in the first week of July 2007. Finally, to make them comparable with the checked-in travellers and overnight stays, which are published monthly, we compute the monthly averages of the weekly indices.
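The conversion from weekly indices to monthly averages can be sketched as follows. The article does not say how weeks that straddle two months are treated, so assigning each week to the month in which it starts is an assumption, as are the function name `monthly_averages` and the sample values.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

def monthly_averages(weekly):
    """Average weekly index values within each calendar month.

    `weekly` is a list of (week_start_date, index_value) pairs.
    Assumption: a week belongs to the month in which it starts.
    """
    buckets = defaultdict(list)
    for day, value in weekly:
        buckets[(day.year, day.month)].append(value)
    return {month: mean(values) for month, values in sorted(buckets.items())}

# Made-up weekly index readings, normalized to 100 in the first week.
weekly = [
    (date(2007, 7, 2), 100.0),
    (date(2007, 7, 9), 104.0),
    (date(2007, 7, 16), 110.0),
    (date(2007, 7, 23), 106.0),
    (date(2007, 7, 30), 100.0),
    (date(2007, 8, 6), 96.0),
    (date(2007, 8, 13), 92.0),
]
monthly = monthly_averages(weekly)
```

The result is a dictionary keyed by (year, month), which can be aligned directly with the monthly tourism statistics.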
To examine the dynamics of travel-related Google searches, Figure 1 shows a weighted average of all query indices, which is computed for presentation purposes only and is not used in the empirical analysis. The figure also plots two official tourism statistics: overnight stays and the number of non-resident travellers checked in at hotels. According to the National Statistics Institute, checked-in travellers include all people who stay one or more consecutive nights in the same collective tourist accommodation. Overnight stays include every night that a traveller spends in these establishments. In this article, we focus on the versions of the tourist indicators that account only for non-residents.

Figure 1. Query index and non-resident tourism indicators. (a) Overnight stays. (b) Travellers checked-in.
The figure shows a high correlation between short-term movements in the tourist indicators and the weighted query index, which share the same strong seasonal pattern. Moreover, the averaged query index appears to start growing a few months before the beginning of each summer season, which could be related to people planning ahead for their holidays.
To remove seasonal patterns, we use year-on-year growth rates of the query indices instead of monthly growth rates of seasonally adjusted data. For comparability, we also use year-on-year growth rates for the tourist indicators in the model. According to Figure 2, tourist indicators in Spain declined sharply during the Great Recession and grew steadily thereafter. In light of the severity of the 2008 downturn and the rapid recovery of the tourism sector in 2009, the relevant question is whether query volumes can help to anticipate current and short-term tourist developments, allowing policymakers and the tourist industry to adopt pre-emptive measures.
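A short sketch shows why the year-on-year transformation removes a stable seasonal pattern. Whether the article uses percentage changes or log differences is not stated, so the percentage form below is an assumption, and the monthly figures are made up.

```python
def year_on_year_growth(series, periods_per_year=12):
    """Percentage change relative to the same month one year earlier.

    The first `periods_per_year` entries have no year-earlier value
    and are returned as None.
    """
    growth = [None] * len(series)
    for t in range(periods_per_year, len(series)):
        growth[t] = 100.0 * (series[t] / series[t - periods_per_year] - 1.0)
    return growth

# Two years of made-up monthly data with a summer peak;
# the second year is 10% above the first in every month.
year1 = [50, 55, 70, 80, 95, 120, 150, 160, 115, 85, 60, 55]
year2 = [1.1 * v for v in year1]
yoy = year_on_year_growth(year1 + year2)
```

Although the level series swings strongly between winter and summer, every year-on-year growth rate in the second year is about 10%, because each month is compared only with the same month of the previous year.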

Figure 2. Comparison of yearly growth rates. (a) Overnight stays. (b) Travellers checked-in.
Figure 2 also reveals that search volumes and tourist indicators move closely together over the sample period. In fact, the in-sample correlations of total travel-related Google queries with non-resident overnight stays and with non-resident hotel check-ins reach 0.61 and 0.58, respectively. A good example of this close relationship between search volumes and tourist indicators is given in Figure 3, which shows how the annual growth rate of each travel-related query series from Italy correlates with the annual growth rate of Italian overnight stays in Spanish hotels. In particular, we show a 2-year rolling window of that correlation for each query volume index. According to the figure, the correlations are close to one in most cases and throughout the period (vintages from July 2010 to December 2015).
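The 2-year rolling correlation can be sketched as follows, assuming monthly data so that a window covers 24 observations. The function name `rolling_correlation` and the synthetic series are illustrative only.

```python
import numpy as np

def rolling_correlation(x, y, window=24):
    """Pearson correlation of x and y over each trailing window.

    Entries before the first complete window are set to NaN.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    out = np.full(len(x), np.nan)
    for end in range(window, len(x) + 1):
        xs, ys = x[end - window:end], y[end - window:end]
        out[end - 1] = np.corrcoef(xs, ys)[0, 1]
    return out

# Synthetic growth rates: overnight stays track queries with noise.
rng = np.random.default_rng(1)
queries = rng.standard_normal(60)
stays = 0.9 * queries + 0.2 * rng.standard_normal(60)
corr = rolling_correlation(queries, stays)
```

Each entry of `corr` uses only the 24 months ending at that date, so the series shows whether the relationship is stable over time rather than summarizing it in a single full-sample number.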

Figure 3. Correlations between Italian overnight stays (Spanish hotels) and travel-related Google queries. Note: Two-year rolling-window correlations. Windows from July 2010 to January 2016.
In-sample analysis
A total of 65 series of year-on-year growth rates of query volumes are used to estimate the common factor model by principal components. The first three estimated factors are plotted in Figure 4.

Figure 4. Estimated common factors.
In order to interpret the estimated unobserved components, we follow Stock and Watson (2002a) and compute the R² of the regressions of the 65 query volume series on each of the first three factors, estimated over the full sample period. These R² values are plotted in Figures 5 and 6 as bar charts, with one chart for each factor. In Figure 5, the search volumes are grouped by category, starting with those that have the largest R² with respect to the first factor.
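The R² calculation used to interpret the factors can be sketched in a few lines. The sketch assumes each regression contains a constant and a single factor, in which case the R² equals the squared correlation; the function name `r_squared_by_factor` and the synthetic data are illustrative.

```python
import numpy as np

def r_squared_by_factor(X, F):
    """R^2 from regressing each column of X on a constant and one factor.

    X is T x N and F is T x r. Entry (i, k) of the result is the R^2
    of series i on factor k, computed as the squared correlation.
    """
    n_series, n_factors = X.shape[1], F.shape[1]
    out = np.empty((n_series, n_factors))
    for i in range(n_series):
        for k in range(n_factors):
            out[i, k] = np.corrcoef(X[:, i], F[:, k])[0, 1] ** 2
    return out

# Synthetic example with three factors and three series.
rng = np.random.default_rng(3)
F = rng.standard_normal((100, 3))
X = np.column_stack([
    F[:, 0],                              # identical to factor 1
    F[:, 1] + rng.standard_normal(100),   # noisy copy of factor 2
    rng.standard_normal(100),             # unrelated to every factor
])
R2 = r_squared_by_factor(X, F)
```

Plotting each column of `R2` as a bar chart, with the series ordered by category or by country, gives one chart per factor.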

Figure 5. R² between factors and individual queries (grouped by query).

Figure 6. R² between factors and individual queries (grouped by country).
Figure 5 shows that the first factor loads primarily on ‘pure destination’, where the R² is above 0.3 in seven out of eight cases. For the second factor, the query volumes are mainly related to ‘hotels’ and ‘bus and rail’, while ‘pure destination’ continues to be relevant. Regarding the third factor, query volumes related to ‘hotels’, ‘air’ and ‘activities at destination’ are the most significant, although the R² is greater than 0.1 in only 6 out of 65 search volume series.
In Figure 6, the query volume indices are grouped by country to examine the importance of the searches from each country in the formation of the factors. The figure shows high correlations between the first factor and the country searches, which implies that the first factor is representative of all countries. However, searches from Italy and the United States seem to play a prominent role in the formation of the second factor, while the third factor rests on the United Kingdom, Germany and Ireland.
Simulated real-time analysis
In practice, the results of the in-sample analysis are of limited use. The tourist sector is monitored in real time, where data are subject to different publication lags that must be taken into account when computing the forecasts. Accordingly, we propose a forecast evaluation exercise designed to replicate the typical situation in which the model handles the real-time data flow. For this purpose, we construct a sequence of data vintages from the final vintage data set that mimics the actual real-time vintages, in the sense that the publication delays are incorporated.
Without loss of generality, we assume that the forecasts are computed on the 15th of each month t. Given the publication lags, in month t the data set used in the forecasts includes the tourist indicators up to month t − 2. Query indices, however, are available to compute monthly averages up to month t − 1 and the average of the first 2 weeks of month t. Figure 7 shows that the latter is an accurate proxy of the monthly query average of month t.
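The partial-month proxy can be illustrated with a small sketch that compares the average of the first 2 weeks with the full monthly average. The weekly readings and the function name `partial_month_average` are made up for illustration.

```python
from statistics import mean

def partial_month_average(weekly_values, weeks_available=2):
    """Average of the first `weeks_available` weekly readings of a month."""
    return mean(weekly_values[:weeks_available])

# Made-up weekly index readings for one month.
weeks = [118.0, 122.0, 124.0, 120.0]

proxy = partial_month_average(weeks)  # available on the 15th
full = mean(weeks)                    # known only after the month ends
relative_gap = abs(proxy - full) / full
```

Repeating this comparison for every month of the sample gives a simple check of how closely the partial average tracks the full monthly average.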

Figure 7. Query indices with partial information.
In each month t, using the generated sequence of data vintages, the models recursively compute predictions of the tourist indicators for month t − 1 (backcast), month t (nowcast) and month t + 1 (forecast). Starting with the backcasts, the model
where r refers to the number of factors and m to the number of factor lags, is estimated using data up to t − 2. Then, the backcasts of t − 1 are computed as
To compute the nowcast, the model
is estimated with data up to t − 1. Then, the nowcast is computed as
where we use the backcast
Finally, the forecasting equation is re-estimated to compute forecasts
with the extended data set up to t. The forecast of t + 1 is
where
The first data vintage of this experiment refers to the data as they would have been known on 15 October 2014. Following the 3-month blocks of predictions computed from the model, the models produce predictions of the tourist indicators for September 2014 (backcast), October 2014 (nowcast) and November 2014 (forecast). Following this updating scheme, the data set is updated each month up to 15 January 2016, leading to 15 different vintages.
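One possible reading of the recursive scheme for a single vintage is sketched below. It is not the authors' exact specification: the single autoregressive lag, the omission of the Holy Week dummy, the use of the backcast and nowcast as data in later steps, and the direct one-step regression for the forecast (needed here because the factors for month t + 1 are not yet observed) are all assumptions, as are the function names.

```python
import numpy as np

def design(y, F, m, lead):
    """Rows [1, F_s, ..., F_{s-m}, y_{s+lead-1}] with target y_{s+lead}."""
    rows, targets = [], []
    for s in range(max(m, 1), len(y) - lead):
        lags = [F[s - j] for j in range(m + 1)]
        rows.append(np.concatenate([[1.0], *lags, [y[s + lead - 1]]]))
        targets.append(y[s + lead])
    return np.array(rows), np.array(targets)

def fit(y, F, m, lead=0):
    """OLS coefficients of the factor-augmented autoregression."""
    A, b = design(y, F, m, lead)
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef

def predict(coef, F, m, s, y_lag):
    """Fitted value from factors dated s, ..., s-m and a lagged target."""
    lags = [F[s - j] for j in range(m + 1)]
    return float(np.concatenate([[1.0], *lags, [y_lag]]) @ coef)

def real_time_block(y, F, t, m):
    """Backcast (t-1), nowcast (t) and forecast (t+1) made in month t.

    Only y[0..t-2] and F[0..t] are used, mimicking the publication lags.
    """
    y_known = list(y[: t - 1])
    coef = fit(np.array(y_known), F, m)
    back = predict(coef, F, m, t - 1, y_known[-1])
    y_ext = y_known + [back]            # treat the backcast as data
    coef = fit(np.array(y_ext), F, m)
    now = predict(coef, F, m, t, back)
    y_ext = y_ext + [now]               # treat the nowcast as data
    # Assumption: direct regression of y_{s+1} on factors dated s.
    coef = fit(np.array(y_ext), F, m, lead=1)
    fore = predict(coef, F, m, t, now)
    return back, now, fore

# Synthetic data: the target depends on its own lag and two factors.
rng = np.random.default_rng(2)
T, r = 90, 2
F = rng.standard_normal((T, r))
y = np.zeros(T)
for s in range(1, T):
    y[s] = 0.5 * y[s - 1] + F[s] @ np.array([1.0, -0.5]) + 0.1 * rng.standard_normal()

back, now, fore = real_time_block(y, F, t=80, m=1)
```

Calling `real_time_block` once per vintage, with `t` advancing by one month each time, reproduces the recursive structure of the exercise; values of the target dated t − 1 or later never enter the calculation.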
We are now in a position to assess the extent to which Google search data help tourism prediction. For this purpose, we compute the root mean squared error (RMSE), which is the square root of the average squared deviation of the predictions from the latest releases of the tourist indicators available in the data set. In addition to the model that incorporates the information from Google search volumes, we include as a benchmark a univariate autoregressive model, which is also estimated in pseudo real time and produces iterative forecasts.
To facilitate comparisons, Table 2 reports the RMSEs relative to the univariate autoregressive model. Hence, an entry of less than one indicates that the factor model forecast is superior to the univariate autoregressive forecast. The immediate conclusion from the table is that query volume information is beneficial for forecasting Spanish tourism. However, the relative gains from the model that uses the search volume indices depend on the number of factors and the number of factor lags included in the model. For backcasting and nowcasting, the largest gains are obtained when two factors and three lags of those factors are included in equation (3), for both overnight stays and checked-in travellers. For overnight stays, the RMSEs fall, in general, by at least 7% (for rental apartments, the largest gains are found with three factors and one lag of those factors). For checked-in travellers, the gains are somewhat smaller, generally between 6% and 10%. For forecasting, the largest gains are found with three factors and no factor lags. In that case, the relative RMSEs are, depending on the target variable, between 13% and 24% lower than those of an AR(2).
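The relative RMSE reported in the table can be computed as in the following sketch. The numbers are made up for illustration and do not come from the article.

```python
from math import sqrt

def rmse(predictions, outcomes):
    """Square root of the mean squared prediction error."""
    squared = [(p - o) ** 2 for p, o in zip(predictions, outcomes)]
    return sqrt(sum(squared) / len(squared))

def relative_rmse(model_preds, benchmark_preds, outcomes):
    """Model RMSE divided by benchmark RMSE; below 1 favours the model."""
    return rmse(model_preds, outcomes) / rmse(benchmark_preds, outcomes)

outcomes = [4.0, 6.0, 5.0, 7.0]       # made-up realized growth rates
factor_model = [4.5, 5.5, 5.5, 6.5]   # every error is 0.5 in size
benchmark_ar = [5.0, 5.0, 6.0, 6.0]   # every error is 1.0 in size
ratio = relative_rmse(factor_model, benchmark_ar, outcomes)
```

Here the ratio is 0.5, which would be read as the factor model halving the RMSE of the autoregressive benchmark.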
Table 2. Predictive accuracy: Enlarged AR (values relative to an AR model).
Note: RMSE: root mean squared error. t − 1, t and t + 1 refer to the backcasting, nowcasting and forecasting exercises, respectively; k and m refer to the number of factors and the number of lags of those factors included in the model. The forecasting sample runs from September 2014 to January 2016, which implies comparisons over 17 forecasts. Entries are the RMSE, relative to an AR model, of an autoregressive model enlarged with the first k common factors extracted by principal components from the travel-related queries.
This result confirms that query volume indices have leading forecasting ability for tourism indicators, which is clearly achieved when the model accounts for the early available search data. The promptly published search volume series are considerably more valuable in forecasting than in the backcasting and nowcasting exercises.
As a final remark, this model can be used to compute backcasts, nowcasts and forecasts on any day of the month, using information on query volumes updated until the day before the forecast is computed. As an example of how the model produces predictions, Figure 8 shows the backcast, nowcast and forecast of overnight stays in hotels obtained on 15 February 2016, along with the prediction errors. Note that the remarkable increase expected for March is associated with a base effect related to Easter.

Figure 8. Overnight stays in hotels. Backcast, nowcast and forecast made on 15 February 2016.
Conclusions
The Internet has radically changed the manner in which tourists and travellers obtain travel-related information. The evidence presented in this article, based on the performance of tourism search volumes provided by Google in a real-time exercise, provides very promising support for using search information to predict the number of non-resident travellers checked in and their overnight stays in Spain. Our finding is significant for this literature, since Spain is one of the world’s top tourism destinations and is highly dependent on tourism.
As in any big data setup, the first step is to summarize the large amount of information provided by the volume of searches. For this purpose, we assume that the query volume indices admit a factor model decomposition, in which each query volume series is the sum of a small set of common factors and an idiosyncratic component. The common factors are then used to forecast checked-in travellers and overnight stays. Within this framework, we find that the promptly published search volume series are considerably more valuable in forecasting than in the backcasting and nowcasting exercises.
Despite these promising results, the conclusions regarding the performance of the search volume series examined in this article are necessarily tentative, mainly because of the limited number of observations available for the query volume indices. As more data become available, future work on the usefulness of search volume series for forecasting tourism indicators could include using additional tourism indicators, extracting seasonal components from the time series with seasonal adjustment techniques, and using nonlinear forecasting methods.
Acknowledgement
The authors are thankful to R. Domenech, M. Cardoso, C. Ulloa, A. Urcola, M. Trias and M. Moya for their comments that have greatly improved the quality of the paper. MC acknowledges Groups of Excellence, Fundación Séneca, and Science and Technology Agency for financial support. All remaining errors are the authors’ responsibility.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work of MC was financially supported by the Groups of Excellence, Fundación Séneca, and Science and Technology Agency (project nos ECO2013-45698-P, ECO2016-76178-P and 19884/GERM/15).
