Abstract
This study models the relative expenditure of tourists, in terms of budget allocation, as a function of a group of covariates. A model that captures how tourists distribute their budget among the diverse items is introduced to characterize and compare different types of tourists according to their relative expenditure patterns. For the empirical exercise, data for the period 2014–2019 collected by the Ministry of Tourism through the Inbound Tourism survey in Uruguay are analyzed by means of compositional data analysis and modeled by a Dirichlet regression. The empirical results show that the spending pattern in accommodation, food, and other items depends on the destination, the season, the nationality, and the type of accommodation. In addition, the inferential analysis reveals different typologies of tourists, providing a novel interpretation of tourist behavior from the microeconomic perspective.
JEL Classification: L83, Z30, C14.
Introduction
The level of tourism consumption by international tourists arriving in Uruguay has continuously increased in the last decade. Tourism consumption has played an important role in promoting economic growth and job creation, see Isik et al. (2018) and Danish and Wang (2018). According to Uruguay’s Statistics Office of the Ministry of Tourism, in 2019, 3,220,602 tourists entered Uruguay and the international tourism expenditure was US$ 1,753,781,316, accounting for 8% of total GDP and around 20% of exports (see Ministerio De Turismo (2020)). The same study reports that the aggregate expenditure of international tourists consisted of 30.5% on accommodation, 26.9% on food, 13.9% on purchases, 8.7% on transportation, 0.2% on tours, 12.4% on cultural and recreational spending, and 7.5% on other items. This aggregate description of the composition of expenditure does not reflect the differences that can be observed across segments of tourists. There are several empirical papers analyzing the aggregate demand of international tourists in a destination (see the review studies Crouch (1994); Divisekera (2013); Lim (1997); Song and Li (2008); Witt and Witt (1995), and references therein), but studies on the determinants and modeling of tourist demand at the microeconomic level are less numerous. In particular, most of these microeconomic studies on the determinants of individual tourist expenditure examine total expenditure, and only a few of them analyze the dependence between the different categories of expenditure of individual tourists and their determinants, see Park et al. (2020). In this paper, we do not carry out a systematic literature review, given the restrictions on the length of the paper and given that there are recent reviews on applications of CoDa analysis in the Social Sciences, Coenders and Ferrer-Rosell (2020), and on the determinants of expenditure and the methodologies used to analyze it, Park et al. (2020).
We prefer to direct the reader to these reviews so that they can directly learn about the state of the art on these topics, and the general results of these papers will be used to explain the contribution of our study.
This study analyzes how tourists distribute their expenses in a destination, and focuses on studying and modeling not only the different items of expenditure but also the dependence between them. In other words, the objective of the paper is to model the proportion of tourist spending in each category from a set of covariates. To this end, the analysis is carried out through compositional data analysis (CoDa) and, in particular, a Dirichlet regression that models the composition of expenditure across the different items. The paper studies the behavior of tourists with respect to their expenditures, using official statistical data from the Inbound Tourism Survey (Ministry of Tourism of Uruguay), which collects data on non-resident passengers who visited Uruguay between 2014 and 2019 (see Ministerio De Turismo (2020); Rodríguez and Montelongo (2016)). The set of variables surveyed includes country of origin, group composition, destination, length of stay, and expenses incurred. The objective of the survey is to determine a tourist profile that allows a sectoral analysis. Not considered in the survey are: (i) people in transit who do not pass through immigration controls, (ii) people working on ships, planes, or buses who return in less than 24 h (i.e., people not spending the night in the country), and (iii) people who enter with the aim of residing permanently in Uruguay. Spending may be reported in different currencies for each surveyed group; however, since the comparison is relative to total spending, this does not affect the analysis. This article aims to contribute to the empirical literature on tourism research through an application of CoDa tools and Dirichlet regression, focused on the analysis of the composition of tourist expenditure and its determinants. The purpose of this study is twofold.
On the one hand, we describe the marginal influence of each covariate on the relative expenditure vector by using descriptive CoDa. On the other hand, we are interested in analyzing whether the marginal influence of these covariates is significant. To analyze these marginal effects, the Dirichlet regression model is introduced. This model allows us to address the complexity of the interactions between the covariates in explaining the behavior of our variable of interest (tourist expenditure).
The rest of the study is organized as follows. Section 2 presents the data and their general characteristics using a descriptive analysis. Section 3 presents the results of the CoDa analysis and introduces the Dirichlet model. Section 4 reports the empirical results, and the final section is devoted to conclusions and discussion.
The dataset
The database comes from the Inbound Tourism survey in Uruguay, which is aimed at non-resident visitors. The data are open and accessible at Ministerio De Turismo (2020). The sample comprises 70,501 surveys conducted between 2014 and 2019. The analysis was carried out on those tourists who report spending on at least one of the items (98.5%). After removing those with zero expenditure, the database contains 69,456 observations. Each observation corresponds to one compositional data point of the sample.
This survey has various objectives: to quantify the foreign currency income of the country, to gather information about the profile of visitors and their financial behavior, and thus to allow different sectoral analyses of inbound tourism. The survey covers, based on a random sample, all non-residents who enter the country for various reasons: conferences, vacations, or business. Note that the sampling unit is not the individual but the travel group (with at least one non-resident member) at the time it leaves the country. As indicated in the survey report, it is not possible to precisely delimit the sampling frame. However, assuming some stability from year to year, the determination of sample sizes is based on the frame used in the previous year, and the weights are quantified using monthly data from the National Migration Directorate. The sample design has several stages, with a double stratification by exit border post and by fortnight. The design also includes cluster sampling by the day and time at which the survey is carried out. The selection of the days is not random because it is necessary to establish balanced daily workloads. The pollster continues conducting surveys until a predetermined number is reached. If the number of tourists on that day and time exceeds the number of surveys, a simple random sample of the necessary size is drawn. The weights that are later used in the estimation of the models are constructed taking this sample design into account.
Proportions of answers in each covariate.
Modeling the relative tourist expenditure
In this section, we propose a model to understand the association between the proportion of total tourist expenditure on each spending item (accommodation, food and meals, tours, shopping, and others) and a set of covariates. The purpose is not to model the level of spending of each group, but (i) to study how each group distributes its spending among the different items, (ii) to identify the characteristics of the group that allow a relative pattern of spending to be inferred, and (iii) to analyze the differential pattern according to the seasons of the year.
An advantage of applying compositional analysis to this type of data is that it mitigates the “self-reported overestimation effect.” It is well known (see Howard and Dailey (1979); Van de Mortel et al. (2008)) that respondents may bias their answers about expenditure for reasons of social desirability or confidentiality of their economic situation. Analyzing the proportions can reduce this phenomenon. Furthermore, since the groups have different numbers of members, it would seem arbitrary to compare the absolute expenditure of the different groups. By working with the proportions of expenditure, the data become comparable. Prior to the empirical modeling, the reported absolute expenditure is divided by the number of days that the group of tourists remains in the country to obtain the relative expenditure per day of stay. This is the dependent variable of the model, and it offers valuable information to the different tourism agents (including policy makers, entrepreneurs, and planners) given that it captures the temporal dynamics of relative expenses by item, see Pavlić et al. (2020). From this compositional point of view, a descriptive analysis of the database is carried out before fitting a regression model.
Descriptive CoDa analysis
The relative expenses according to the different covariates are described in Figures 1, 2, and 3. These graphs are called “parallel coordinate graphs.” The horizontal axis represents each of the components of the vector of expenses by category: accommodation (E-Acco), food (E-Food), culture (E-Cult), shopping (E-Shopp), tours (E-Tour), transportation (E-Transp), and other expenses (E-Others); the vertical axis represents the relative proportion of spending in each component. In the graph of each covariate, each line (marked with a different color) represents the proportions for one of the covariate’s categories.
Figure 1. Parallel Coordinates Plots: Accommodation. Note that each vertical line of points shows the comparison of the proportions of expenditure in each item, for the eight categories of accommodation.
Figure 2. Parallel Coordinates Plots: Nationality. Note that each vertical line of points shows the comparison of the proportions of expenditure in each item, for the nine categories of nationality.
Figure 3. Parallel Coordinates Plots: Days. Note that each vertical line of points shows the comparison of the proportions of expenditure in each item, for the number of days of stay (in five periods).
Regarding the accommodation choice (see Figure 1), those who stay at their own home (second residence), in hotels, or in apart hotels are the ones with the highest relative spending (per person) on accommodation. The intuitive explanation of this fact is the following. On the one hand, those who have a second residence spend the whole year on maintenance, taxes, electricity, etc., and this expense is greater than that of renting a house during the period they are in the destination. On the other hand, most of the hotels in the Sun and Beach destinations in Uruguay (which are few) have high prices, and those who go to apart hotels are generally in the timeshare modality, so they incur an expense throughout the year. They are followed by those who rent a house and then by those who choose a hostel or camping. Although spending on accommodation was expected to be relatively higher than the other spending items for those who stay in hotels or apart hotels (these options often even include some meals), there was no prior expectation for those who stay in their second residences. Probably, the expenditure of these tourist groups includes some annual maintenance costs of the house they own. Additionally, those who stay at a relative’s house do not report spending on accommodation and, instead, their highest spending is on food and shopping.
In general, those who do not have accommodation are excursionists whose highest expense is transportation. Spending on tourist packages is, in general, negligible compared to the other items. Regarding nationality (see Figure 2), Uruguayan tourists (non-residents in Uruguay), as expected, do not spend on accommodation; their highest spending shares are on souvenirs and food. Brazilians are the tourists who spend the highest proportion of their expenditure on accommodation, while Europeans have the lowest proportion of their expenditure on this item, probably because of their shorter visits.
Figure 3 shows the different behavior with reference to relative expenses according to the number of days of stay. As expected, tourists who do not spend the night in the country distribute their expenses between transportation and food. Those who stay a few days direct most of their expenses to accommodation, and those who have a long stay (more than 14 days) concentrate their expenses on food.
In the first model, the temporal variable is considered as an explanatory factor. To obtain a more robust analysis in which the values are comparable in relative terms, the destinations are grouped into Montevideo, East Coast, and the rest of the country. Following the same logic, the nationalities are grouped into Brazil, Argentina, and the Rest of the World, and the expenses into Food, Accommodation, and the remaining items. The methodology is therefore based on fitting a regression model whose output is a vector of relative expenses explained by the set of explanatory variables. The importance of this model lies in characterizing the significance and relevance of the variables that describe the different groups of tourists and in determining how they influence the relative distribution of spending across the various items. These results can show which sectors are the most relevant in the different seasons of the year and how their behavior changes over the years. They can also help decision-makers to develop policies and strategies regarding which sectors to support in a particular season of the year, or to support those that are losing interest in this tourist destination. In addition, the detection of significant covariates in the regression model will reveal possible associations that allow describing the evolution of the phenomenon and not only contemplating the dynamics. In this analysis, as the proportions are dimensionless, it is not necessary for the reports to be in the same currency. The year and month assigned to each tourist correspond to the date of the middle day of their entire stay.
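If $X_i$ denotes the random variable that indicates the expenditure in the $i$th item, $i = 1, \dots, D$, the relative expenditure in this item, $Y_i$, is given by the transformation of the vector $X$ to the simplex:
$$Y_i = \frac{X_i}{\sum_{j=1}^{D} X_j}, \qquad i = 1, \dots, D.$$
In this study, only those groups that declare spending in at least one of the items are considered, so that $\sum_{j=1}^{D} X_j > 0$ and the relative expenditures satisfy
$$Y_i \geq 0, \qquad \sum_{i=1}^{D} Y_i = 1.$$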
This restriction implies that classical statistical methods cannot be applied or must be reformulated to fulfill the assumptions they require. For example, it is not possible to directly make assumptions of normality of the data and, on the other hand, the imposed restriction leads to spurious correlations, see Chan and Bentler (1993).
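As an illustration of the transformation to the simplex, the following minimal sketch computes expenditure shares for one hypothetical group and checks that they are unaffected by the reporting currency or by dividing by the length of stay. The function name `closure`, the amounts, the exchange rate, and the number of days are all assumptions introduced for the example, not values from the survey.

```python
def closure(expenses):
    """Turn a vector of absolute expenses into shares that sum to 1."""
    total = sum(expenses)
    if total <= 0:
        # Mirrors the restriction to groups with spending in at least one item.
        raise ValueError("at least one item must have positive expenditure")
    return [x / total for x in expenses]


# Hypothetical group: accommodation, food, other items (invented amounts).
spending = [300.0, 200.0, 100.0]
shares = closure(spending)

# Same group reported in another currency (hypothetical rate) and per day of stay.
rate, days = 40.0, 5
spending_rescaled = [x * rate / days for x in spending]
shares_rescaled = closure(spending_rescaled)

print(shares)
print(shares_rescaled)
```

Because every item is multiplied by the same constant, the constant cancels in the ratio, which is why mixed currencies do not distort the composition.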
The literature on compositional data and its applications to different fields of science is vast. In recent years, a large number of studies have been published in various areas: in geology (e.g. Buccianti et al. (2006)), in medicine (e.g. physical activity times, Dumuid et al. (2017)), in education (Borba et al. (2020)), in customer satisfaction surveys (e.g. Vives-Mestres et al. (2016)), in food composition in the food industry (e.g. Leite (2016)), in genetics (e.g. Tsilimigras and Fodor (2016)), and in environmental studies (e.g. Filzmoser et al. (2009)), among others. However, there are few studies in the tourism area where these techniques have been applied. In Ferrer-Rosell et al. (2015) and Ferrer-Rosell et al. (2016), a segmentation of tourists is carried out through mixed models and the use of compositional data for spending in various areas. In Song et al. (2019), a regression model based on the log-ratio methodology is implemented to explain the expenses of air tourists from different variables. The recent paper by Coenders and Ferrer-Rosell (2020) reports an exhaustive review of papers on this topic in tourism studies and highlights the importance of studying the parts that make up the whole.
Figure 5 shows a descriptive analysis by origin of the tourist groups, revealing how the average proportions vary depending on the year and season. The following example shows how to read this figure. The coordinates for Spring–Autumn in 2017 show that Argentines account for 65%, Brazilians for 13%, and the rest of the nationalities for 22%. In all seasons, the proportion of Argentines rises by between 10% and 15% between 2016 and 2017 and then declines in 2019. Both events can be explained by the good (bad) performance of the Argentinian economy and the influence of the exchange rate. Another relevant aspect is the difference in the share of Argentines between seasons. The greatest impact of tourists from Argentina is during the mid and low seasons. Tourists from non-border countries favor the winter season over the summer season, the opposite of the Argentines. For example, in 2019, the percentages of tourists from non-neighboring countries were 38%, 30%, and 22% in the low, medium, and summer seasons, respectively. The flow of Brazilian tourists is more stable than that of other tourist origins throughout the four seasons, varying between 10% and 20%. This could indicate that Brazil is the origin least determined by the seasonality of the sun and beach tourism modality.
Figure 5. Average of the proportions of tourists by year and season according to their origin. The lower graphs zoom in on the upper ones to better show the differences.
The descriptive analysis only accounts for some marginal influences of the covariates on the relative expenditure vector; it does not provide information on whether these influences are significant or on the joint effects of the covariates. To analyze these questions, a Dirichlet regression model is proposed in the following subsection.
The model
The next step is to model the compositional data as a function of independent variables. Several statistical methodologies have been reported in the literature to model this type of data (see, for example, Aitchison (1982); Di Marzio et al. (2015)). The log-ratio methodology is in general the most used for the analysis of compositional data because of its desirable properties (see Pawlowsky-Glahn et al. (2015) and Filzmoser et al. (2018)). However, other authors highlight the virtues of Dirichlet regression models relative to log-ratio transformations (see Hijazi and Jernigan (2009); Maier (2014)). For example, log-ratio transformations can give greater importance to components that have little overall relevance for a meaningful understanding of the composition (Hijazi and Jernigan (2009)), and the estimated parameters are only interpretable within the transformed space (Maier (2014)). The Dirichlet model was first developed by Goodhardt, Ehrenberg, and Chatfield and is frequently used in Marketing research to study the patterns of repeat purchases of the brands within a product category (see Ehrenberg (1959) and Goodhardt et al. (1984)).
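This work applies the Dirichlet parametric regression, which makes it easy to study the significance and contribution of the explanatory variables in the model. In particular, a random vector $Y = (Y_1, \dots, Y_D)$ with $Y_i > 0$ and $\sum_{i=1}^{D} Y_i = 1$ follows a Dirichlet distribution with parameter vector $\alpha = (\alpha_1, \dots, \alpha_D)$, with every component positive, if its density is
$$f(y \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{D} \alpha_i\right)}{\prod_{i=1}^{D} \Gamma(\alpha_i)} \prod_{i=1}^{D} y_i^{\alpha_i - 1}.$$
Figure 6 shows different simulations obtained by varying the parameter vector (denoted α1, α2, α3, and α4 in the figure, each with positive components).
Figure 6. Level sets of the Dirichlet density and simulation of 200 observations for (a) α1 = (0.1, 0.1, 0.1), (b) α2 = (1, 1, 1), (c) α3 = (5, 5, 5), and (d) α4 = (0.5, 2, 5).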
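Simulations of this kind can be reproduced with a short sketch. The code below draws 200 observations for each of the four parameter vectors listed for Figure 6, using the standard construction of a Dirichlet vector from normalized independent Gamma draws. The helper name `sample_dirichlet`, the seed, and the summary statistics are choices made for this illustration, not part of the original analysis.

```python
import random
from statistics import pvariance


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) vector by normalising Gamma(alpha_i, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


rng = random.Random(2024)  # arbitrary fixed seed so the run is repeatable
alphas = {
    "a": (0.1, 0.1, 0.1),
    "b": (1.0, 1.0, 1.0),
    "c": (5.0, 5.0, 5.0),
    "d": (0.5, 2.0, 5.0),
}
samples = {k: [sample_dirichlet(a, rng) for _ in range(200)] for k, a in alphas.items()}

# Average share per component and spread of the first component in each panel.
means = {k: [sum(col) / len(v) for col in zip(*v)] for k, v in samples.items()}
spread = {k: pvariance([row[0] for row in v]) for k, v in samples.items()}

for key in alphas:
    print(key, [round(m, 2) for m in means[key]], round(spread[key], 3))
```

Small parameter values push the draws toward the corners of the simplex, while large values concentrate them around the mean composition; the spread of the first component makes that contrast visible without a plot.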
In order to apply Dirichlet regression, the dataset needs to consist of strictly positive values, contrary to what happens in this case. To solve this problem, Greenacre (2018) proposes to replace the zeros with small positive values. In this paper, we apply the transformation
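As an illustration only, and not necessarily the transformation applied in this study, the sketch below uses a compression that is commonly applied before fitting Dirichlet regressions: each share y is replaced by (y(n − 1) + 1/D)/n, where n is the number of observations and D the number of parts. The function name `compress_shares` and the example shares are assumptions made for the example.

```python
def compress_shares(compositions):
    """Shrink every share slightly toward 1/D so that none is exactly 0 or 1."""
    n = len(compositions)       # number of observations
    d = len(compositions[0])    # number of parts in each composition
    return [[(y * (n - 1) + 1.0 / d) / n for y in row] for row in compositions]


# Hypothetical shares (accommodation, food, others); two groups report a zero.
raw = [
    [0.50, 0.30, 0.20],
    [0.00, 0.60, 0.40],
    [0.70, 0.30, 0.00],
    [0.25, 0.25, 0.50],
]
adjusted = compress_shares(raw)

for before, after in zip(raw, adjusted):
    print(before, [round(y, 4) for y in after])
```

The adjusted rows still sum to one, and the size of the shift decreases as n grows, so with tens of thousands of observations the change to each share would be very small.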
Tsagris and Stewart (2018) proposed a modification of the Dirichlet regression to account for the presence of zeros (without transforming them into positive values). In addition, the parameter estimates take into account the sampling weights of the survey design. Dirichlet regression is an alternative way of studying not only the contribution of each covariate to the behavior of a given phenomenon (in this case tourism expenditure), but also the complexity of the interactions between the set of covariates. One of the main virtues of this method is that the parameters can be easily interpreted in terms of the contribution of each covariate to the explanation of the phenomenon as a whole. However, by construction the components of a Dirichlet vector are negatively correlated, which does not necessarily conform to the assumptions or prior expectations of the research. This is a limitation of the Dirichlet model, especially in comparison with other traditional likelihood methods.
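The negative correlation follows directly from the standard second moments of the Dirichlet distribution. Writing $\alpha_0 = \sum_{k} \alpha_k$ for the sum of the parameters (notation introduced here for illustration), for any two different components $i \neq j$:

$$\operatorname{Cov}(Y_i, Y_j) = -\frac{\alpha_i \alpha_j}{\alpha_0^{2}\,(\alpha_0 + 1)} < 0.$$

Since all parameters are positive, this covariance is negative for every admissible parameter value, so for a given set of covariate values the model cannot represent two expenditure shares that tend to rise together.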
Results
Simple models
To understand the general model, we first develop three simple models in which the vector of expenses, Y, is considered in four groups (accommodation, food, shopping, and tours) and a different set of covariates is used in each model.
In the first model (Model 1), the covariates are the number of days of stay and nationality (Argentines, Brazilians, or non-resident Uruguayans). Figure 7 shows the value of the components of the vector α′ for each nationality as the number of days of stay varies. It is observed that, as the tourist stays longer, the proportion of expenses on accommodation and tours decreases while that on shopping and food increases for all nationalities, although with different intensity. This phenomenon is particularly clear for Brazilian tourists. It is interesting to note that as the stay is prolonged, α′ increases, that is, the behaviors between groups become more homogeneous.
Figure 7. Model 1: Estimated parameters α′(i) for tourists who stay between 1 and 20 days: Argentinians (left panel), Brazilians (center panel), and Uruguayans (right panel).
The second model (Model 2) includes as covariates the number of days of stay and the tourist destination (Montevideo or Punta del Este). Figure 8 shows the value of the components of the vector α′ for each destination as the number of days of stay varies. The trends observed in Model 1, an increase in the proportion of spending allocated to food and shopping and a decrease in the proportion allocated to accommodation and tours, seem to be maintained.
Figure 8. Estimated parameters α′(i) for tourists who stay between 1 and 20 days, with destination Montevideo (left panel) and Punta del Este (right panel).
Finally, the third model (Model 3) incorporates the covariates total expense and nationality (Argentines, Brazilians, or non-resident Uruguayans). Figure 9 shows the value of the components of the vector α′ for each nationality as the total spending of the group of tourists varies. The results show that as total spending grows, both Argentines and Brazilians increase the proportion of spending on accommodation, whereas tourists of Uruguayan nationality increase the proportion spent on food. This could be explained by the fact that a significant proportion of Uruguayan tourists stay in second homes.
Figure 9. Estimated parameters α′(i) for total expenditure: Argentinians (left panel), Brazilians (center panel), and Uruguayans (right panel).
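The link between larger parameter values and more homogeneous behavior can be read from the first two moments of the Dirichlet distribution. As a sketch, assuming that α′ denotes the vector of estimated Dirichlet parameters with components $\alpha_i$ and sum $\alpha_0 = \sum_k \alpha_k$ (this notation is introduced here for illustration):

$$\mathbb{E}[Y_i] = \frac{\alpha_i}{\alpha_0}, \qquad \operatorname{Var}(Y_i) = \frac{\mathbb{E}[Y_i]\,\bigl(1 - \mathbb{E}[Y_i]\bigr)}{\alpha_0 + 1}.$$

The expected shares depend only on the relative sizes of the parameters, whereas the variance shrinks as $\alpha_0$ grows. Under this reading, an overall increase in the estimated parameters with the length of stay means that the shares of individual groups lie closer to the expected composition.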
However, if we want to analyze the global behavior of the phenomenon and control for the different effects of the set of covariates, it is necessary to formulate models (such as the one in the next section) that encompass all the information. This also gives robustness to the analysis.
General models
Coefficients of the Dirichlet general model (standard error are reported in brackets).
Model-predicted proportions (output) for different scenarios.
Remark 1: Goodness of fit of models
After fitting the Dirichlet general model, we test how well it fits the observed data.
A graphical way of visualizing the fit of the residuals in linear models is the so-called Q–Q plot, where the empirical quantiles of the standardized residuals of the sample are plotted against the quantiles of the standard normal distribution. If the fit is good, the graph is expected to approximate the identity function. For the generalized linear model (see Augustin et al. (2012) and Li (2015)), the following procedure is performed:
• The model is fitted using the Dirichlet distribution and its parameters are estimated. The residuals are calculated and the norm of each one is determined. We call these norms EmpiricQuantil.
• From the estimated parameters and the Dirichlet distribution, a sample of output vectors is simulated; these are re-modeled with the Dirichlet distribution and the norms of the new residuals are computed. We call these norms TeoQuantil.
If the model fits well, the norms of both sets of residuals should have the same distribution. To check this, both variables are represented in a Q–Q plot to compare their distributions, see Figure 10.
Figure 10. Norm of the empirical quantiles plotted against the theoretical ones to test the goodness of fit.
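The two steps of this procedure can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the study: the model has no covariates, the parameters are estimated by the method of moments rather than by maximum likelihood, the residuals are raw differences between observed and fitted shares measured with the Euclidean norm, and the "observed" data are themselves simulated stand-ins.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) vector by normalising Gamma(alpha_i, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def fit_dirichlet_moments(data):
    """Method-of-moments estimate of alpha (a simple stand-in for a likelihood fit)."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    var_first = sum((row[0] - means[0]) ** 2 for row in data) / (n - 1)
    # Var(Y_1) = m(1 - m) / (alpha_0 + 1), solved for alpha_0 = sum(alpha).
    precision = means[0] * (1.0 - means[0]) / var_first - 1.0
    return [m * precision for m in means]


def residual_norms(data, alpha):
    """Sorted Euclidean norms of the residuals (observed minus fitted mean shares)."""
    total = sum(alpha)
    fitted = [a / total for a in alpha]
    return sorted(math.dist(row, fitted) for row in data)


rng = random.Random(7)  # arbitrary fixed seed
observed = [sample_dirichlet((2.0, 3.0, 5.0), rng) for _ in range(500)]  # stand-in data

# Step 1: fit the model and compute the norms of its residuals.
alpha_hat = fit_dirichlet_moments(observed)
empiric_quantil = residual_norms(observed, alpha_hat)

# Step 2: simulate from the fitted model, refit, and compute the new norms.
simulated = [sample_dirichlet(alpha_hat, rng) for _ in range(len(observed))]
alpha_sim = fit_dirichlet_moments(simulated)
teo_quantil = residual_norms(simulated, alpha_sim)

# The pairs (teo_quantil[k], empiric_quantil[k]) are the points of the Q-Q plot.
mean_gap = sum(abs(e - t) for e, t in zip(empiric_quantil, teo_quantil)) / len(observed)
print(round(mean_gap, 4))
```

Because the stand-in data here really are Dirichlet distributed, the two sets of sorted norms stay close to the identity line; systematic departures from that line in real data would point to a weak fit.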
The adjusted
Note that the goodness of fit of the model is weak. However, due to the difficulty of the problem and the way the data was collected, this is an expected result.
Remark 2: Effect size
On the other hand, the large sample size means that all the variables are significant. In such cases, it is convenient to analyze the effect size, which indicates the relevance of each variable (for more detail, see chapters 8 and 9 in Cohen (2013)). As indicated in Sullivan and Feinn (2012), if the sample is large enough, a statistical test will generally report a significant difference (a p-value below the chosen threshold) even if the difference is meaningless or of no interest. Therefore, in these cases, the p-value is not an adequate measure for the researcher to understand the phenomenon under study. In this sense, it is necessary to analyze the effect size for generalized linear models, which is a challenge for future work.
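The dependence of the p-value on the sample size can be made concrete with a small calculation. The sketch below uses a textbook two-sample z-test with equal group sizes and unit variance, which is a deliberately simple stand-in and not the test used for the Dirichlet model; the effect size of 0.05 and the group sizes are hypothetical.

```python
from statistics import NormalDist


def two_sample_p_value(effect_size, n_per_group):
    """Two-sided p-value of a z-test for a standardised mean difference."""
    z = effect_size * (n_per_group / 2) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical standardised difference, well below Cohen's "small" benchmark of 0.2.
effect = 0.05
for n in (100, 1_000, 35_000):
    print(n, two_sample_p_value(effect, n))
```

The effect size is identical in the three cases, yet the p-value moves from clearly non-significant to far below any conventional threshold as the groups grow, which is why significance alone says little about relevance in a sample of this size.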
Conclusions and future research
Compositional data analysis, initially applied in Chemistry and Geology (Aitchison (1982)), is a standard and useful tool for examining the relative importance of different parts of a unit. However, according to Coenders and Ferrer-Rosell (2020), its application in the academic literature applied to tourism analysis is very recent.
This article contributes to this branch of the empirical literature on tourism, proposing a CoDa analysis of tourist expenditure in Uruguay and a Dirichlet model of the relative expenditure of tourists in terms of budget allocation as a function of a set of covariates. The empirical study considers incoming tourists to Uruguay whose expenditures are analyzed employing cross-sections of official statistics data (2014–2019).
This study shows that expenditure patterns in accommodation, in food, and other expense items vary (significantly) according to the season, the tourist destination, the type of accommodation, and the tourist’s nationality. In general terms, these results are consistent with those obtained by Brida et al. (2021) applying different methodologies. In this sense, this research provides inputs and tools to analyze the behavior of tourists (concerning spending patterns) as consumers visiting Uruguay.
In addition, the results of this research show that there are significant differences in the proportions of spending on accommodation, food, and shopping (or other expenses) when tourism does not take place in the high season. This is an important feature to be taken into account in the planning and marketing of tourism policies, in both the public and private spheres. Additionally, when the tourist destination is neither Montevideo nor one of the sun and beach destinations, the concentration of expenditure on accommodation is lower. Finally, note that the occupation and the type of employment of the tourist are key variables in the configuration of the pattern of consumption of tourism goods and services. These empirical results can be particularly useful for adjusting tourism supply to the behavior of demand.
As an initial contribution to the academic literature on the analysis of the tourism sector from this empirical approach, several points emerge for future research. In particular, the analysis can be extended to previous years using the CoDa and Dirichlet regressions to analyze the dynamics of the problem and how the characteristics of the tourists have evolved with time. An additional point for improving our research is to introduce alternative models (to substitute Dirichlet regression) and particularly, following Ferrer-Rosell et al. (2015) and Ferrer-Rosell et al. (2016), the transformation of compositions using log-ratio transformation. One of the weaknesses of the proposed model is the goodness of fit. We believe that a complementary modeling using log-ratio transformations could provide a solution to this problem.
Further material for future research is the consideration of additional variables (e.g., the origin of tourists, gender, or type of tour group) and the replication of the study in other regions to compare the results and analyze the robustness of the methodology.
Footnotes
Acknowledgements
Our research was supported by CSIC-UDELAR ( “Grupo de investigacion en Dinamica Economica”; ID 881928). A preliminary version of this paper was presented at the GAET seminar organized by FCEA-UdelaR (Uruguay), at the Coda Seminar, organized by Research Group on Compositional Data Analysis, University of Girona (Spain), and at the GSSI webinar, organized by Gran Sasso Science Institute-L’Aquila (Italy). The authors would like to acknowledge the many valuable suggestions made by the participants of these events.
Author’s Note
Earlier versions of this article were presented at the CoDa seminar (http://www.compositionaldata.com/) organized by the Faculty of Law, Economics and Tourism, Universidad de Girona (Spain), and at the GAET seminar organized by the Research Group in Tourism Economics and Management (GAET), UdelaR. The authors wish to thank the participants of the CoDa and GAET seminars for providing helpful and constructive comments on earlier versions of this article.
Author Contributions
All authors contributed equally. All authors have read and approved the final manuscript.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
