Abstract
Traditional tourism data collection includes surveys, interviews and focus groups. However, these methods are both expensive and time consuming. Moreover, there is a lag between the time of data collection and the receipt of that data for analysis. Today, almost all individuals leave digital footprints on the Internet, which can also be used for tourism research. One type of digital footprint is the photos uploaded on websites such as Flickr. The aim of this study is to determine whether the digital footprints in Flickr provide a useful indicator for tourism demand. Photos tagged with “Austria” and uploaded between 2007 and 2011 were collected using the Flickr API. Residents were distinguished from tourists using the data, and spatial analyses of the tourist-generated data were conducted. The results indicate that geotagged photos in Austria are more representative of actual tourist numbers at the city level than at the regional level.
Introduction
Tourism is one of the leading industries in the world and is continuing to grow each year. According to the United Nations World Tourism Organization (UNWTO 2014), international tourist arrivals increased by 5% to reach 1.087 billion in 2013. In order to accommodate this high and increasing demand, the tourism industry requires accurate statistics and other information regarding tourism demand. Tourism statistics are essential in areas such as planning, forecasting, marketing, and policy making (or legislation). Tourism statistics are traditionally collected from accommodation providers and guest surveys (Law 1993; Page 1995), yet these statistics tend to omit day visitors (Cockerell 1997), thus causing tourism demand to be underestimated (Wöber 1999). It is also important to receive the statistics in a timely and accurate manner, yet traditional methods such as guest surveys are both time consuming and expensive.
New and more efficient methods of data collection are needed in order for the tourism industry to work with the data more easily. One way of collecting data is from the Internet, especially from digital footprints left on travel-related websites. Digital footprints include messages on online forums, uploaded photos, or metadata regarding links clicked on a website. With the introduction of smart phones that can take high-quality photos, tourists no longer need to carry a dedicated camera on their travels. In addition, smart phones are constantly connected to the Internet, allowing phone owners to upload the photos to the Internet within seconds of taking them. In most cases, individuals take photos during their vacation and share these with their friends by uploading them to social media platforms.
Several prominent online photo-sharing websites have emerged in the past years, such as Flickr and Picasa. Photos on Flickr include metadata regarding the object of the photo (user tags), the date and time when the photo was taken, the GPS coordinates of the photo location (latitude and longitude information from the user camera), and the profile of the user who uploaded the specific photo. This metadata shows the exact location of individuals and thus can be used in many different ways by destinations. One of the advantages of this type of data is that it enables estimation of approximate numbers of individuals in open areas such as parks and other places with no admission tickets. This type of information is especially useful during free admission events in open areas to know how many individuals are there and where they go during the event.
In this study, photos on Flickr are used to show whether digital footprints constitute useful indicators for tourism demand at particular destinations. Austria has been chosen as the specific destination under study, and photos were collected from the online portal Flickr according to their designation with the tag “Austria.” Unique users were identified from the group of photo contributors and were categorized as either residents or tourists. A polynomial regression analysis was conducted using the tourist numbers identified from Flickr to see if Flickr data can be used to predict bednights at both regional and city levels. In addition, the same regression models were used for forecasting bednights using Flickr data to measure the accuracy of the models.
The article is structured as follows: in the second section, an overview of previous related work is provided, followed by a description of the data and websites used in the study. The third section presents the study methodology, including descriptions of the data collection, data cleaning, and data analysis processes. In the last section, the results of the study and implications for the tourism industry are discussed.
Literature Review
Traditional tourism research methods include primary data collection using surveys, interviews, and focus groups at destinations and hotels. However, this type of data collection is expensive, and both preparing the study and obtaining the results take a long time, which can be problematic in industry-related studies where the timeliness of the data is paramount. Additional problems arise in finding respondents, acquiring sufficient data to analyze, and generating meaningful results. Online data, on the other hand, is usually freely available, is easy to collect, and is abundant compared to the amount of data that can be collected with traditional methods.
One of the aims of social research is exploration and this includes trying different methods to conduct research, analyzing problems from different perspectives, or analyzing phenomena for the first time. The traces left behind on the Internet by users, which are called digital footprints, represent a new source of data that can be used for conducting research. Digital footprints occur in different forms such as uploaded photos, posted reviews, or clicked links. These data are collected by most websites and can be used as a data source for research and marketing. For instance, the product recommendations made for individuals by websites are based on this type of data.
Girardin et al. (2008a) categorized digital footprints into two types: active and passive. Active footprints are left by users who actively add something to the web page such as by uploading photos and writing reviews, whereas passive footprints are left as a result of interaction with a website (for instance, by browsing through products on Amazon.com, which collects user data) (Girardin et al. 2008a).
Taking photos provides evidence of travel and a memento of a vacation. With the use of smart phones and the Internet, individuals can upload their vacation photos onto social media sites such as Facebook and Twitter, as well as photo-sharing platforms like Flickr. A study among Hong Kong residents revealed that 89% of pleasure travelers take photographs, and 41% of those post their photos on photo-sharing platforms such as Flickr and Picasa (Lo et al. 2011).
Geotagged photos available online indicate where the user has been and, if the user has uploaded more than one photo, the respective times and geolocations reveal the sequence of locations visited by the user. Girardin et al. (2008b) use Flickr data to create a map of places frequented by tourists in Rome, also showing the tourist density in those places. Popescu, Grefenstette, and Moellic (2009) examine photos of 183 cities on Flickr to determine the attractiveness of various places according to the number of photos uploaded and, additionally, to identify the places visited, the duration of stay, and the panoramic spots of each destination. In a similar study by Kisilevich et al. (2010), attractive places and points of interest are identified by spatiotemporal analysis based on geotagged photos, and behaviors of Flickr, Panoramio, and Tripadvisor users are also compared. Lee, Cai, and Lee (2013) use geotagged photos on Flickr to mine points of interest in Queensland. Kádár (2014) investigates tourist activities in European cities based on tourists’ spatial patterns retrieved from geotagged photos and indicates that Vienna, Prague, and Budapest show similarities.
Identifying tourist attractions and creating tourist maps based on geotagged photos is another research theme. Lin et al. (2014) use geotagged photos from photo-sharing communities to create tourist maps of attractions; these maps are rated as better than maps generated by similar methods and as comparable to hand-designed tourist maps. Another way of creating tourist maps is by clustering Flickr photos based on their locations and identifying the popular tags for those places (Chen et al. 2009).
Tourist movements in relation to regions of attractions and the topological characteristics of regions have been examined by Zheng, Zha, and Chua (2012). The data are collected from various websites, and a database is built to depict the travel patterns of different tourists. The analysis of tourist flows from one region of attractions to another is performed using a Markov chain model. The results show that the proposed model works well with the four major cities tested. A recent study by Vu et al. (2015) used geotagged photos to explore tourist behaviors in Hong Kong. The movements of tourists in Hong Kong are analyzed based on time, place, and tourists’ origin (Western vs. Asian tourists) by using a Markov chain model for travel pattern mining.
Recommendation systems based on geotagged photos have also been examined. Jiang et al. (2013) investigated a method to identify tourism attractions based on geotagged photos and estimate the popularity of the attraction based on the number of users’ photos. Mamei, Rosi, and Zambonelli (2010) developed an intelligent recommendation system especially for first-time visitors that can learn from past tourist behaviors based on geotagged photos. Kurashima et al. (2010) created a travel route recommendation system based on the photo trails of Flickr users that estimates the user’s probability of visiting a landmark, taking into account user preferences and present location information. De Choudhury et al. (2010) used Flickr data to create automated travel itineraries and compared these itineraries with professional popular bus tours, finding that Flickr data are useful for creating meaningful travel itineraries. In another study conducted in Taiwan to evaluate tourist satisfaction, Flickr data were used to identify where tourists are and to indicate convenient transportation points (Shyang-Woei 2010). Yin et al. (2012) proposed a travel recommendation system based on geotagged photos from Panoramio.com, where users can search for regions or points of interest and generate a map of how to get there. Majid et al. (2013) introduced a personalized and context-aware recommendation system based on Flickr data that performs better than other landmark recommendation methods.
In a completely different perspective, Da Rugna, Chareyron, and Branchet (2012) attempted to identify photographers’ countries of origin based on geotagged photos and behaviors such as the length of stay at one place.
Previous research has investigated different uses of Flickr data, such as identifying user movements and points of interest at a destination or examining event-based movements; however, the representativeness of the data for the overall tourist population has yet to be examined. Specifically, how well do the data from a destination represent the actual numbers of tourists visiting that location? The purpose of this study is to investigate whether Flickr data are representative of actual tourist numbers in different regions of Austria based on the geotagged photos on Flickr.
Methodology
The Data and the Web Sites Used in the Study
In this study, active digital footprints that include photographs posted on Flickr with user-defined tags are used. Flickr (www.Flickr.com) is an online platform that allows users to share their photographs. It has nearly 80 million unique visitors worldwide and 51 million registered users who upload approximately 4.5 million photographs per day (Yahoo 2013). The photographs on Flickr are either geotagged by default from the user’s camera or smart phone or geotagged by the user after uploading the photo on Flickr by placing it on a map.
The tourism statistics relating to bednights that have been used in the analysis were collected from TourMIS (www.tourmis.info). TourMIS is an online tourism database that has monthly and annual tourism statistics (bednights, arrivals, and capacities) at the national and municipal level. Tourism statistics are entered into the database by tourism authorities such as national tourist offices (NTOs) or city tourism offices (CTOs). The data in the system are maintained by the users and updated frequently; thus, it is one of the most up-to-date tourism databases available. The Austrian data comes directly from Statistics Austria, which collects tourism statistics regarding accommodations, including the visitors’ countries of origin, from private and commercial establishments on a monthly basis (Statistics Austria 2014). While other online databases such as Eurostat, the World Travel and Tourism Council, and the World Tourism Organization make tourism statistics available to the public, they present only annual data and are not as up-to-date as TourMIS. Thus, monthly tourism bednights data from TourMIS were used in this study.
Data Collection
An application was developed by one of the authors using the public Flickr REST API (Flickr 2012) to collect data from Flickr. The application retrieves photos and the corresponding meta-data for a given place (i.e., region, city, or province) for a given time frame. The meta-data includes, among other things: (1) user-defined textual information such as title, description, and tags of the image; (2) geographical information such as longitude, latitude, and a plain-text name of the location, for example, “Innsbruck/Tyrol/Austria”; and (3) date, including the date when the photo was taken and uploaded. Additionally, information about the user is retrieved, such as the name, current location, and current occupation.
The data were collected between March and May 2013, and consisted of the metadata from photos that were uploaded to Flickr between January 1, 2007, and December 31, 2011. Actual photographs were not downloaded since the content of the photographs was not a concern for this study, but the links to the photographs including longitude, latitude, date the photograph was taken, user identification, and photo identification were downloaded as relevant data. The selection criterion for photos was tags by users of “Austria” or any of its regions, including Vienna, Burgenland, Carinthia, Styria, Upper Austria, Lower Austria, Salzburg, Tyrol, and Vorarlberg. Austria and its regions were chosen for data collection since bednights data were available for comparison. A total of 1,183,889 photos were collected for the study.
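The collection step described above can be sketched as a request to the public Flickr REST method `flickr.photos.search`. This is a minimal illustration and not the application used in the study: the function name, the placeholder API key, the page size, and the choice to filter with `has_geo` are assumptions, and the sketch only builds the request URL without sending it.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode, urlparse, parse_qs

FLICKR_REST_ENDPOINT = "https://api.flickr.com/services/rest/"


def build_search_url(api_key, tag, start, end, page=1, per_page=250):
    """Build a flickr.photos.search URL for one tag and one upload window.

    Hypothetical helper: the study does not document its exact parameters.
    """
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "tags": tag,
        # Upload window as Unix timestamps
        "min_upload_date": int(start.timestamp()),
        "max_upload_date": int(end.timestamp()),
        # Assumption: only geotagged photos are requested
        "has_geo": 1,
        # Ask for coordinates, dates, tags, and owner name with each photo
        "extras": "geo,date_taken,date_upload,tags,owner_name",
        "per_page": per_page,
        "page": page,
        "format": "json",
        "nojsoncallback": 1,
    }
    return FLICKR_REST_ENDPOINT + "?" + urlencode(params)


# Example: photos tagged "Austria" uploaded between 2007 and 2011
url = build_search_url(
    "YOUR_API_KEY",  # placeholder, not a real key
    "Austria",
    datetime(2007, 1, 1, tzinfo=timezone.utc),
    datetime(2011, 12, 31, 23, 59, 59, tzinfo=timezone.utc),
)
query = parse_qs(urlparse(url).query)
```

In practice the same request would be repeated for each region tag and for each result page, since the API returns results in pages.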
Monthly bednights data regarding Austrian total bednights and foreign bednights in nine different Austrian regions and their capitals were retrieved from TourMIS in July 2013. In total, bednights data in all forms of paid accommodations from 18 different destinations were retrieved.
Data Check and Cleaning
In Flickr, users can place their photos on a map and Flickr assigns longitude and latitude values automatically. Moreover, when users upload photos to Flickr the metadata from the camera is uploaded as well (Pereira et al. 2011). During the data collection process, the authors realized that some of the photos were anchored in the wrong places on the map. For instance, the longitude and latitude revealed that some photos tagged with Austria were actually taken outside Austrian borders. Thus, the data required cleaning by deleting all photos that had coordinates outside of Austria as well as ones that were not taken between 2007 and 2011. In addition, multiple photos by the same user at the same location on the same date were removed from the data set, since the aim of the study is to determine the number of people in a certain destination and not the number of photos.
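The cleaning rules above can be sketched as a single filter. This is an illustration under stated assumptions rather than the procedure used in the study: the record fields (`user`, `lat`, `lon`, `taken`) are hypothetical, "same location" is approximated by rounding coordinates to three decimals, and Austria is approximated by a rough bounding box, which is coarser than the real border and would still admit some nearby foreign locations.

```python
from datetime import date

# Rough bounding box around Austria (approximate values, an assumption)
AUSTRIA_BBOX = {"min_lat": 46.3, "max_lat": 49.1, "min_lon": 9.5, "max_lon": 17.2}


def clean_photos(photos, bbox=AUSTRIA_BBOX, first=date(2007, 1, 1),
                 last=date(2011, 12, 31), precision=3):
    """Drop photos outside the box or date range, then drop repeats of the
    same user at the same (rounded) location on the same day."""
    seen = set()
    kept = []
    for p in photos:
        inside = (bbox["min_lat"] <= p["lat"] <= bbox["max_lat"]
                  and bbox["min_lon"] <= p["lon"] <= bbox["max_lon"])
        if not inside or not (first <= p["taken"] <= last):
            continue
        key = (p["user"], round(p["lat"], precision),
               round(p["lon"], precision), p["taken"])
        if key in seen:
            continue  # same user, same place, same day: count once
        seen.add(key)
        kept.append(p)
    return kept


# Made-up sample records for illustration only
sample = [
    {"user": "u1", "lat": 48.2082, "lon": 16.3738, "taken": date(2009, 7, 14)},
    {"user": "u1", "lat": 48.2082, "lon": 16.3738, "taken": date(2009, 7, 14)},  # duplicate
    {"user": "u2", "lat": 52.5200, "lon": 13.4050, "taken": date(2009, 7, 14)},  # outside the box
    {"user": "u3", "lat": 47.8095, "lon": 13.0550, "taken": date(2012, 1, 5)},   # outside 2007-2011
    {"user": "u1", "lat": 48.2082, "lon": 16.3738, "taken": date(2009, 7, 15)},  # different day
]
cleaned = clean_photos(sample)
```

A polygon of the national border would be a more faithful test than the bounding box; the box is used here only to keep the sketch self-contained.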
Residents versus tourists
Since the data collected from Flickr include both tourist and resident data, the next step was to isolate the tourist data, in accordance with the study’s purpose of identifying the number of tourists. In previous studies, researchers have mainly used heuristics to separate tourists from residents. For instance, De Choudhury et al. (2010) used a 21-day time span between the first and the last date of photos taken and at least 2 points of interest visited in the same city as an indication of a tourist. Girardin et al. (2008) used a 30-day time span between the first and the last photo taken and uploaded as an indication of a tourist. Kádár and Gede (2013) used a 5-day time span for city visitors by calculating the difference between the first and last photos’ time stamps. However, there is no commonly accepted and rigorous statistical method that is used to separate tourists and residents at a destination using geotagged photos.
Identifying tourists from the data set included a few steps in this study. Firstly, the number of individuals was determined from the photos by removing any photos that were taken by the same user at the same location and on the same day.
In the next step, the subsample of users who indicated their hometown and current location on their profiles as Austria were classified as residents, while users who indicated two different locations (excluding Austria) on their profiles were classified as tourists. The two user groups were compared and the time span of their photos, the density of the photos, and the number of cities/regions visited were all found to be different. Based on these results, a multivariate logistic regression model was created that determined the criteria for classification as a tourist to be the following:
Time span: This is the difference in days between the first photo and the last photo that were uploaded on Flickr (Girardin et al. 2008).
Density of the photos in an area: The places where the photos are taken can also indicate whether an individual is a tourist or a resident, since tourists tend to take more photos around tourism attractions than residents (Kádár and Gede 2013); thus, tourists’ photos would be more clustered around specific places whereas residents’ photos would be more sparsely distributed. Density-based clustering was conducted by dividing the area of Austria into 380,000 polygons, each covering an area of 0.69 km², and identifying the number of photos in each polygon. The area of the polygons was chosen to cover enough area to calculate the density value.
The number of cities/regions visited during the same time span: The number of cities or regions in Austria visited by each individual was also identified. Under the assumption that tourists visit more than one city or region when they arrive in a country, a higher number of cities or regions visited during the time span corresponds to a greater chance that the individual is a tourist.
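The three criteria above can be computed per user as sketched below. This is an illustration only: the record fields (`user`, `lat`, `lon`, `uploaded`, `region`) are hypothetical, and a regular latitude/longitude grid with an assumed cell size stands in for the 0.69 km² polygons, whose construction is not described in detail.

```python
from collections import Counter, defaultdict
from datetime import date

CELL_DEG = 0.01  # hypothetical grid cell size in degrees (an assumption)


def grid_cell(lat, lon, cell=CELL_DEG):
    """Map a coordinate to the index of the grid cell that contains it."""
    return (int(lat // cell), int(lon // cell))


def user_features(photos):
    """Return time span, mean cell density, and regions visited per user."""
    # Number of photos (from all users) falling in each grid cell
    cell_counts = Counter(grid_cell(p["lat"], p["lon"]) for p in photos)
    by_user = defaultdict(list)
    for p in photos:
        by_user[p["user"]].append(p)
    features = {}
    for user, items in by_user.items():
        dates = [p["uploaded"] for p in items]
        density = sum(cell_counts[grid_cell(p["lat"], p["lon"])]
                      for p in items) / len(items)
        features[user] = {
            # Days between the first and the last uploaded photo
            "time_span_days": (max(dates) - min(dates)).days,
            # Average photo count of the cells this user photographed in
            "mean_cell_density": density,
            # Number of distinct regions among the user's photos
            "regions_visited": len({p["region"] for p in items}),
        }
    return features


# Made-up sample records for illustration only
sample = [
    {"user": "a", "lat": 48.2082, "lon": 16.3738, "uploaded": date(2010, 8, 1), "region": "Vienna"},
    {"user": "a", "lat": 47.8095, "lon": 13.0550, "uploaded": date(2010, 8, 4), "region": "Salzburg"},
    {"user": "b", "lat": 48.2085, "lon": 16.3731, "uploaded": date(2009, 1, 10), "region": "Vienna"},
    {"user": "b", "lat": 48.2101, "lon": 16.3600, "uploaded": date(2010, 11, 2), "region": "Vienna"},
]
features = user_features(sample)
```

In this made-up sample, user "a" has a short time span and two regions (tourist-like), while user "b" has a long time span in a single region (resident-like).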
A multivariate logistic regression analysis was conducted for the whole data set, and the results are presented in Table 1.
Multivariate Logistic Regression Results.
Note: R² = .25 (Hosmer-Lemeshow), .24 (Cox-Snell), .36 (Nagelkerke). Model χ²(3) = 556.17, p < .01.
p < .01.
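To make the classification step concrete, the sketch below fits a logistic regression with three predictors by plain gradient ascent on the log-likelihood. It is not the fitted model of the study: the training rows are made-up values, the time span is expressed in hundreds of days only to keep the toy optimization stable, and the learning rate and number of iterations are arbitrary assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, x):
    """Probability of the positive class (tourist); w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.1, epochs=5000):
    """Fit weights by batch gradient ascent on the mean log-likelihood."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = yi - predict_proba(w, xi)
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w


# Made-up rows: [time span in hundreds of days, photo density, regions visited]
X = [[0.05, 3.0, 2], [0.10, 2.5, 3], [0.07, 2.8, 2],
     [4.00, 1.0, 1], [6.50, 1.2, 1], [5.00, 0.8, 2]]
y = [1, 1, 1, 0, 0, 0]  # 1 = tourist, 0 = resident

weights = fit_logistic(X, y)
p_short_stay = predict_proba(weights, [0.06, 2.9, 2])
p_long_stay = predict_proba(weights, [5.50, 1.0, 1])
```

A statistical package would normally be used instead, since it also reports standard errors and the fit statistics listed in the table note.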
Results
Spatial Distribution
The number of tourists identified in the study sample was 38,080. These data were projected onto a map of Austria that shows the density of the tourists according to regions. This process was done by using R software. Figure 1 represents the aggregated tourist density in Austria between 2007 and 2011 based on Flickr data. The color of the dots represents the number of tourists in that region. Light colors indicate a lower number of tourists and darker colors indicate a higher number of tourists. It can be seen that tourists cluster around major cities such as Vienna and Salzburg and in the western part of the country around the mountainous areas such as Innsbruck.

Tourist density map of Austria.
To assess how accurately the number of tourists in a region or a city in Austria can be estimated, a polynomial regression was performed in which the unique Flickr users per day per region/city served as the independent variable and the bednights from the TourMIS database as the dependent variable. The data were analyzed at both regional and city levels, and the results are reported accordingly in the next sections.
Regional Level Results
In the regional-level analysis, Vienna was excluded from the sample since it is both the city and the region. The remaining sample includes Burgenland, Carinthia, Styria, Upper Austria, Lower Austria, Salzburg (region), Tyrol, and Vorarlberg.
To ensure normal distribution of the data, bednights as well as the Flickr data have been log transformed. As the scatterplot of bednights and Flickr users suggests a curvilinear relationship between these two variables instead of a linear relationship, a polynomial regression was carried out (see Table 2). The polynomial regression model is as follows:
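In general form, and assuming a second-order polynomial (the order and the symbols below are assumptions, since neither is stated in the text), a polynomial regression on the log-transformed variables can be written as:

```latex
\log(B_{R,t}) = \beta_0 + \beta_1 \log(F_{R,t}) + \beta_2 \left[\log(F_{R,t})\right]^2 + \varepsilon_{R,t}
```

Here $B_{R,t}$ denotes the bednights in region $R$ in period $t$, $F_{R,t}$ the number of Flickr tourists identified in that region and period, $\beta_0, \beta_1, \beta_2$ the coefficients to be estimated, and $\varepsilon_{R,t}$ the error term.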
Polynomial Regression Results According to Region.
“R” represents region.
The models were found to be significant for the Lower Austria, Upper Austria, Vorarlberg, and Carinthia regions, which indicates that the number of bednights increases in these regions as the number of Flickr tourists increases. However, the model was not significant for Burgenland, Salzburg, or Styria.
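A polynomial regression of this kind can be sketched as an ordinary least-squares fit on the log-transformed variables. The example is a minimal illustration, assuming a second-order polynomial and made-up data; it is not the authors' code, and the function names are hypothetical. Periods with zero Flickr users would need separate handling, because the logarithm of zero is undefined.

```python
import math


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_log_polynomial(flickr_users, bednights, degree=2):
    """Least-squares fit of log(bednights) on powers of log(flickr_users)."""
    xs = [math.log(v) for v in flickr_users]
    ys = [math.log(v) for v in bednights]
    # Normal equations (X'X) beta = X'y for the polynomial design matrix
    A = [[sum(x ** (i + j) for x in xs) for j in range(degree + 1)]
         for i in range(degree + 1)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(degree + 1)]
    return solve_linear_system(A, b)


def predict_bednights(beta, flickr_users):
    """Predict bednights on the original scale from fitted coefficients."""
    x = math.log(flickr_users)
    return math.exp(sum(bk * x ** k for k, bk in enumerate(beta)))


# Made-up data generated from known coefficients, for illustration only
true_beta = [8.0, 1.2, -0.1]
users = [5, 10, 20, 40, 80, 160]
nights = [math.exp(sum(bk * math.log(u) ** k for k, bk in enumerate(true_beta)))
          for u in users]

beta = fit_log_polynomial(users, nights)
```

Because the sample data are generated without noise, the fit recovers the generating coefficients; with real monthly series the residuals and significance tests would decide whether the model is usable.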
However, the polynomial regression on its own does not verify whether the Flickr data are useful in estimating tourist numbers. Thus, the same polynomial regression model was used to generate forecasts for 2011, and forecast accuracy was assessed using the mean absolute percentage error (MAPE) and the root mean squared error (RMSE).
Overall, the forecasting results for the regions can be described as inaccurate with high error measures. Thus, for Austrian regions, Flickr data are not useful for estimating the number of tourists in the area.
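The two accuracy measures follow their standard definitions and can be computed as sketched below. The function names and the sample numbers are illustrative assumptions and are not taken from the study's results.

```python
import math


def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root mean squared error, in the units of the data."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


# Made-up bednights for two periods, for illustration only
actual = [100.0, 200.0]
forecast = [110.0, 180.0]
error_mape = mape(actual, forecast)
error_rmse = rmse(actual, forecast)
```

MAPE is scale free, which makes it comparable across destinations of different sizes, whereas RMSE is expressed in bednights and weights large errors more heavily.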
City-Level Results
In the city-level analysis, the cities included in the sample were Vienna, Innsbruck, Graz, Linz, Eisenstadt, Klagenfurt, and Salzburg. Bregenz and St. Pölten were excluded as not enough Flickr tourists were identified in these cities to conduct the analyses.
Following the same procedure as the regional level analysis, the variables were first log transformed, before a polynomial regression was conducted.
Table 3 shows the results of the polynomial regression at the city level. All of the polynomial regression models were found to be significant for all of the cities; however, the R² values varied. Salzburg (0.73) and Vienna (0.64) both had high R² values, followed by Innsbruck and Bregenz, which had R² values greater than 0.50. Flickr tourists therefore provide a useful indicator of bednights in major cities such as Salzburg and Vienna.
Forecasting Accuracy of Regions.
Note: MAPE = mean absolute percentage error; RMSE = root mean squared error; NA = not applicable.
The insignificant models were not used in forecasting; thus, the error measures are not available.
“R” represents region.
The same forecasting methodology was followed for the cities as with the regions, and the results are shown in Table 4. Vienna, Salzburg, and Innsbruck had reasonable forecasting accuracy results, with forecasting errors of less than 30%. Although Linz and Graz also had low forecasting errors, the R² values of their models were not high (less than 0.50). These results indicate that Flickr data can in fact be used to estimate tourist numbers in some Austrian cities, but that the addition of other indicators to Flickr tourist data may improve the forecasts (Table 5).
Polynomial Regression Results for Cities.
Forecasting Accuracy of Cities.
Note: MAPE = mean absolute percentage error; RMSE = root mean squared error.
“C” represents city.
The comparison of regional and city level regression results shows that the model is a better predictor at the city level than the regional level.
Conclusion
Overall, this study shows that digital footprints in the form of geotagged photos are indicators of tourism demand. The results confirm that at the city level, the Flickr data provide a better representation of the actual tourist numbers than at the regional level. This may be due to the fact that city tourists are better represented in Flickr compared to regional tourists. Since city tourists visit points of interest that are well known and are eager to take photos (Jafari 2000), they are better represented in the overall Flickr data. In addition, the actual bednights data for the regions are higher in the winter season compared to summer, yet people spend more time outside in the summer and may take more photos in summer than in winter. In general, regional tourists in Austria are more heterogeneous than city tourists. Depending on the region of the country, different types of activities can be done such as skiing, hiking, biking, or spending time in a wellness hotel. This can influence the number of photos being taken.
Previous research has used Flickr data in different ways, but not as an indicator of tourist numbers at a destination. Overall, this study shows that Flickr data can be used as an estimation of tourists in Austrian cities. However, the results may not be representative of the whole world and it is therefore recommended to replicate the study in different cities and countries.
Although the results of the study indicate that Flickr data can provide an indication of actual tourist numbers at a destination, not all tourists post their photos on Flickr. Thus, the Flickr data by itself cannot be used to identify the number of tourists or visitors at a certain place but can give an estimation of the numbers. This study is meant to show another way of retrieving visitor data for destinations or attractions. It is not intended to replace traditional methods of data collection such as surveys, but it can complement that data. Destinations that have no way of collecting data, such as parks and attractions that do not have admissions tickets, may find this tool particularly useful in tracking visitor numbers.
The results of this study show the distribution of tourists and identify the crowded places on a map. Destination management organizations (DMOs) can use this type of information to manage tourist density in highly visited areas by implementing different types of marketing to promote the movement of tourists to lesser known areas at the destination or in neighboring areas. Moreover, the findings show where the tourists have been; DMOs at the city or national level can therefore identify the preferred routes of tourists and adjust tourist maps accordingly. In addition, the data can be used for improving and updating public transportation at the destination.
Future research can focus on different cities and on comparisons between cities. In addition, the data can be used for forecasting city tourism demand with different forecasting models to identify whether the Flickr data improve forecasting accuracy. This study shows only one example of a forecasting model using polynomial regression; however, there are other forecasting models and additional variables that may be used for forecasting tourism demand and could result in better forecasting accuracy using Flickr data.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
