Abstract
The small area estimation (SAE) theory is widely used when local or domain-specific reliable estimates based on survey data are needed. Small area model-based estimates use a model that links the response variable to some auxiliary information borrowing strength from the related areas. When geographical information on the areas of interest is available, the specification of a spatial area level model can increase the estimates’ efficiency, depending on available auxiliary data. In this article, we first review the most popular area level spatial models, and we then compare their performance under two alternative scenarios of auxiliary information availability to estimate the average equivalized household income in Italian Local Labour Market Areas (LLMAs) using the EU-SILC (European Union Statistics on Income and Living Conditions) survey data. Our findings suggest that the spatial information can “fill the gap” when the covariates do not have a high predictive power, a crucial result when there is lack of auxiliary data.
Keywords
Introduction
Sample surveys have been established as the effective method to estimate parameters of interest referring to populations, sub-populations or domains (e.g., regions, young women). Therefore, surveys are designed to obtain reliable estimates with a predetermined level of precision for specific populations and sub-populations/domains, usually called planned domains (because they are planned in the design of the survey). Usually, estimation is carried out using direct estimators, which are obtained using only domain-specific survey data. However, stakeholders, policy makers, private and public agencies, governments often need estimates referring to sub-populations or domains not planned in the survey design, usually because of budget and time constraints. In many circumstances these unplanned domains are territorial domains identified by local administrative boundaries or other borders functional to statistical analysis as socio-economic districts and/or specific areas where people live, classified by their degree of urbanization. Notably, in these unplanned domains the survey sample sizes can be very small or even zero. In these cases, when cost constraints do not allow for additional surveys and/or oversampling of the study areas, it is mandatory to integrate existing information in order to harmonize it and produce credible statistics on the state and possibly the dynamics of change of the phenomenon at the local level (Giusti et al.[13]). In the literature, the methods that deals with small sample sizes are known as small area estimation (SAE) methods. For a detailed description of SAE theory, please refer to the book by Rao and Molina[38] or the reviews by Jiang and Lahiri[17] and Pfeffermann.[30]
Nowadays, employing indirect small area estimators is the most popular of the many available methods. Small area estimators are based on a model linking the study variable to available covariate information and specified to borrow strength from the related areas (Rao and Molina[38]; Chandra et al.[5]). Currently, several statistical agencies use empirical best linear unbiased predictors (EBLUPs) defined using linear mixed models (LMMs). Under LMMs the distribution of the study variable is function of area-specific random effects, which account into the model for the differences among the areas. Such predictors may be able to greatly increase the efficiency with respect to direct estimators, depending from the goodness of the model. In this article we consider area level models (Fay and Herriot[11]) that are a popular choice when data are available only in aggregated form, as it happens very often in socio-economic studies, due to confidentiality restrictions.
The use of spatial auxiliary information can be crucial in many applications of small area models, as this information can increase the estimates’ efficiency and effectiveness.
It is widely recognized that many environmental and socio-economic phenomena are characterized by a spatial distribution deriving both from nature and man’s action. For example, the spatial distribution of a pollution agent in the soil is a consequence not only of the soil geological characteristics, but also of man’s actions (e.g., through the construction of roads and buildings). The same combined action characterizes the distribution of crops in a region: man and nature work together and the result is the distribution of cultivated land that is scattered over the surface of a region (Petrucci et al.[29]). Also, socio-economic phenomena can follow spatially varying patterns as depicted in many studies on the spatial distribution of poverty and living conditions (Barbier and Hochard[2]; Curtis et al.[8]; Ezcurra et al.[10]).
A first to way to introduce spatial information in a small area model, is by specifying spatial auxiliary covariates that can be derived from administrative registers as well as from the geography of the territory under study, as it is represented by maps and spatial data (e.g., coverage, perimeters, extensions, and distances) often available through a geographic information system (GIS). In these applications, the spatial information concerning the land use usually constitutes a relevant auxiliary information; this is also true for developing countries, where official statisticians and other stakeholders can receive useful indications from satellite imagery providing maps of land use (e.g. on the existence and extent of forests, crops, grasses, sands, urban constructions, and quantity and yield of crops).
Moreover, spatial information in SAE modelling can also be included by extending the random effects model to allow for spatially correlated area effects, defined using a contiguity criterion. Among others, Cressie,[6] Singh et al,[44] Pratesi and Salvati[35], Pratesi and Salvati[36], Molina et al.,[25] Marhuenda et al.[24] and Porter et al.[31] have extended the Fay-Herriot (FH) model by including spatially correlated random effects using two different approaches: conditional autoregressive (CAR) and simultaneous autoregressive (SAR) specifications for the random effects (Anselin[1]). Both SAR and CAR allow for spatial correlation in the random area effects, while the fixed effects parameters are spatially invariant.
Another way to include spatial information in SAE is by using the hypothesis that the model for the expectation of the response variable varies spatially, given the covariates.
Two solutions are possible in this case: using P-spline bivariate smoothing (Giusti et al.[14]) and allowing the model coefficients to vary spatially (Chandra et al.[4]). Giusti et al.[14] model the spatial dependence by using P-splines bivariate smoothing that allows the auxiliary information to spatially vary. This approach is effective when data show spatial proximity effects. Chandra et al.[4] propose a nonstationary semiparametric model where coefficients change with the location of the units according to a specific distance metric.
In this paper we present a review of the most important area level models that consider spatial information for improving the efficiency of the estimates when the response variable is continuous. In particular, we focus on the effectiveness of small area estimators based on spatial models depending on the prediction power of the available auxiliary variables. In the Bayesian literature other spatial small area estimators have been proposed (see the book by Rao and Molina, Chapters 9 and 10)[38] but are not included in this article where the focus is on the models and predictors defined under the frequentist approach.
To motivate the potential benefits from using the spatial small area models depending on the availability of auxiliary variables, we use Italian data from the 2016 European Union Statistics on Income and Living Conditions (EU-SILC) survey with the aim of estimating the average equivalized income in the 611 Local Labour Market Areas (LLMAs) using auxiliary information from administrative registers (Salvati et al.[41]). In particular, the estimates are obtained using two nested sets of auxiliary variables: in one set the predictive power is higher than in the second one. The aim is to evaluate the efficiency of small area predictors that use spatial information under the two different sets of auxiliary variables.
Regarding the structure of the article, in Section 2, we review the FH model and introduce the notation. In Section 3 we present the spatial small area model where the area random effects are spatially correlated (Singh et al.[44]; Petrucci and Salvati[28]; Pratesi and Salvati[35, 36]). Section 4 is devoted to demonstrating the incorporation of the spatial structure in the data by means of a non-parametric spatial P-spline model for small area estimation (Opsomer et al.[27]; Giusti et al.[14]). Then, in Section 5 we describe the nonstationary area level linear mixed model and the corresponding small area mean estimator (Chandra et al.[4]). In Section 6 we compare the efficiency of the reviewed small area estimates of the reviewed methods in two different scenarios. In the first scenario we make use of a covariate with a high predictive power, while in the second scenario we exclude such covariate from the models. We use data from the 2016 Italian EU-SILC survey to estimate the mean equivalized income for Italian LLMAs. The covariate with the high predictive power is the area average (per-capita) taxable income, and the aim is to check how it affects the efficiency of the reviewed models. Finally, Section 7 presents concluding remarks and provides avenues for further research.
Background and Notation
The basic and most popular model to produce small area estimates using area level data is the FH model, originally proposed to estimate the median income in the United States (Fay and Herriot[11]). This approach combines the survey data with other data sources in a synthetic fitted regression with population area-level covariates.
Let
where
where
The combined model (Fay and Herriot[11]) can be written as:
and it is a special case of a linear mixed model, with the variance of
where
In practice, areas can have zero sample sizes, being referred to in this case as non-sampled areas or out-of-sample areas. In this situation estimation is typically carried out by using a synthetic approach, based on a model fitted to data coming from the sampled areas. Therefore, under model (2.3), the synthetic EBLUP predictor for the unknown population value of area d is
where
As the assumptions on the area random effects
Let the small area model be
where
where
The vector
From equation (3.1) and (3.2) we obtain the spatial area level model
Assuming independence between
and it depends on the unknown parameters
Under the model (3.3) we can derive the spatial BLUP of
where
Under the normality assumption for both
with
The prediction of the parameter of interest in out-of-sample areas is usually carried out using synthetic estimators. These estimators use the fixed part of the small area model that is fitted using the sampled areas. Therefore, under model (3.1), we define the synthetic EBLUP for the unknown target parameter of area d:
where
The analytical form of the MSE of the spatial BLUP is defined in Singh et al.,[44] and under certain conditions corresponds to the MSE of the FH BLUP derived by Prasad and Rao.[33] The MSE of the SEBLUP cannot be obtained in exact form because there is a non-linear relation between the SEBLUP and the data vector
The FH model (2.3) and its spatial version assume a linear function that links the direct survey estimators and the covariates. However, there are many cases where it is not possible to establish in advance the functional form of this relationship. This is often the case when spatial proximity effects are supposed to be present in the data. Giusti et al.[14] propose a P-spline bivariate smoothing model that can be used to specify a FH model that accounts for special effects among the areas in a straightforward manner. Moreover, the proposed semiparametric specification can be used to obtain estimates for out-of-sample areas, including those areas for which the auxiliary variables are not available. Opsomer et al.[27] propose a similar small area model at the unit level and Rao et al.[37] propose a modified version robust to the presence of outliers in the random small area effects and/or unit-level errors. The development of a robust estimator using a semiparametric regression approach could also be extended to area level models. Under a model-based direct estimation approach, Salvati et. al.[40] propose an alternative nonparametric specification for estimating small area means.
We revert to considering the FH model (1979) (2.3). The FH model assumes a linear function linking the direct survey estimators to the covariates; notably, when this assumption is not met, biased estimators of the small area target parameters can be obtained. By using P-splines an alternative semiparametric specification of the FH model that allows for non-linear functional relationships between the direct estimates and the covariates can be specified.
We now introduce the semiparametric additive model (hereafter, ‘semiparametric model’) where we consider one covariate
where
As shown by Ruppert et al.[39] and Opsomer et al.,[27] a Pspline model can be specified as a random effects model. Therefore, it is possible to combine it with the FH model to specify a semiparametric area level small area estimator that uses the well-known specification of linear mixed-model regression.
Corresponding to the
Following the same notation already introduced in the previous sections, we can write the mixed-model specification of the semiparametric Fay-Herriot model (SPFH) as:
By adding the
where
The best linear unbiased predictor (Henderson[16]) can be used to obtain model-based estimates of the small area target parameters. By using the assumption of normal distribution for the random effects and estimating
where
As already underlined, the SPFH model can be used when the geographical location of the areas is available and the final estimated are represented using maps. In these cases, the geographical information can be introduced in the analysis by using bivariate smoothing. As in Psplines the handling of nonlinear structures in the data depends on the specification of a set of basic functions, bivariate basis functions are used for bivariate smoothing. Following Giusti et al[14], in this study we assume the following model:
where
with
If there are some out-of-sample areas, a semiparametric predictor can be used under the assumption that the covariates
where
As concerns the MSE estimation of the EBLUP
In the spatial model (3.1) the fixed effects parameters are spatially invariant and the spatial relationship is captured by the error structure.
The assumption that the fixed effects associated with the model covariates do not vary spatially can be inappropriate in case of spatial nonstationarity (see Brunsdon et al., 1996[3] and the references therein). To adjust for this, Chandra et al.[4] describe a spatial nonstationary extension of the FH model (2.3).
The coordinates of an area d are its spatial location: for example, its centroid, which can be denoted by
where
The minimum MSE predictor of
where
Assuming that there are
where
In the application in Section 6 we assume a simple single parameter specification
Chandra et al.[4] propose: (a) an analytic estimation of the MSE of the
An alternative method for considering spatial nonstationarity is proposed by Benedetti et al. (2013), assuming that when there is spatial nonstationarity, the population is divided into heterogeneous latent subgroups. To identify the latent subgroups of small areas Benedetti et al. (2013) propose a simulated annealing algorithm that minimizes the sum of the estimated MSE of the small area estimates.
In this article, we present an analysis of Italian data from the 2016 EU-SILC with the aim of estimating the mean equivalized disposable income in the 611 LLMAs using auxiliary information from administrative registers. The LLMAs are unplanned domains for the EU-SILC survey. They are defined as clusters of municipalities in which the majority of the labour force lives and works, and where enterprises can find the largest amount of labour force. Being unplanned domains, many LLMAs have a small sample size that hinders the reliability of direct estimates; additionally, 292 LLMAs are out-of-sample areas (the sample size is 0). In terms of resident population, they range from approximately 3,000 (mountain areas, minor islands) to 3.8 million (Milan, Rome) of residents, a very skewed distribution of income among areas.
The equivalized disposable income is computed at the household level as the total disposable income (incomes, pensions, and other benefit after taxation). It is then divided by the equivalent household size, which is obtained using the OECD (Organisation for Economic Co-operation and Development) equivalent scale that accounts for economies of scale in the household. Under this scale the first adult in the household is assigned a weight of 1.0, other persons aged 14 or over living in the household a weight of 0.5, each child aged less than 14 a weight of 0.3. Within an household, the same value of the household equivalized disposable income is then assigned to all the members. Direct estimates of the mean equivalized disposable income (hereafter, mean income) for the LLMAs are obtained using the cross-sectional weights of the EU-SILC survey, which unfortunately do not add up to the in each LLMa to the total resident population, since these areas are not planned in the design of the survey. For this reason, as a direct estimator of the area mean
has been used, where
In our analysis, data from the EU-SILC survey are complemented with auxiliary information. We consider the two variables published by the Ministry of Economy and Finance, the total taxable income and the number of persons declaring taxable income (taxpayers). These data are available at the municipality level. Exploiting the hierarchical administrative division charactering Italy (i.e., provinces, LLMAs and municipalities), the average per capita taxable income was computed in each LLMA by using the available data on the total number of taxpayers (contributors) and the total amount of taxable incomes in each municipality. Moreover, the availability of the population size for each municipality published every year by the Istat enabled us to compute the percentage of persons aged more than 15 years who declare some income. It is important to underline that the data published by the Ministry of Economy and Finance are derived by the not yet validated taxpayers’ declarations. Therefore, they could be characterized by inconsistencies (for more methodological issues see
The average taxable income is a measure of affluence of income earners living in an area (LLMA) and indirectly measures labour market performances as well; notably, however, there are no data available at the LLMA level. The percentage of persons declaring income is influenced by the demographic structure, activity rate and incidence of income earners below a (low) threshold that do not declare any income.
We wish to point out that the mean income estimated from the EU-SILC and the average taxable income obtained from the tax registers—although strongly correlated as specified below—are different measures. The mean income from the EU-SILC is based on all income sources (i.e., taxable, and not taxable) and government transfers; moreover, it is computed considering economies of scale within households. The average taxable income is computed instead by aggregating single persons’ taxable income without taking into account government transfers and household composition.
A preliminary analysis indicated that both the auxiliary variables—the average taxable income and the percentage of persons declaring income—are highly correlated with the mean income at the LLMA level, with a linear correlation of about 0.64 and 0.61, respectively. Nevertheless, when used as covariates in the area level model, the predictive power of the two auxiliary variables is different since they are correlated (linear correlation 0.56). The average taxable income is the covariate that alone reduces the most the area level variance, with respect to the percentage of persons declaring income, in the FH model. Therefore, our aim is to test how the reviewed small-area estimators perform in two different scenarios: (a) when both the auxiliary variables are used in the model; and (b) when we use only the percentage of persons declaring income. We call the first scenario as ‘with average taxable income’ and the second scenario as ‘without average taxable income’. Specifically, we aim at ascertain if the spatial information boosts the efficiency in both the scenarios, and whether it is capable of offsetting the lack of the highly predictive auxiliary variable when this is omitted from the model.
We first present the model parameters. In Table 1 we report the estimated model coefficients and variance components obtained under the scenario with average taxable income for the FH model, the spatial Fay-Herriot model (SFH), the nonstationary Fay-Herriot model (NSFH) and the semiparametric Fay-Herriot model (SPFH). In Table 2 we report the same results for the scenario without average taxable income. All the regression coefficients are statistically different from zero, and their direct comparison between the two scenarios is not interesting because the auxiliary variables are different. We focus instead on the variance components of the different models. In the scenario with average taxable income,
Estimated Coefficients and Variance Components of the Area Level Small Area Models. Scenario with Average Taxable Income.
Estimated Coefficients and Variance Components of the Area Level Small Area Models. Scenario with Average Taxable Income.
Estimated Coefficients and Variance Components of the Area Level Small Area Models. Scenario Without Average Taxable Income.
The efficiency of the predictors based on the different small area models can be measured by the estimated coefficient of variations. Table 3 reports the distribution among the LLMAs of the coefficient of variation (CV)—a measure of efficiency—of the predictors using the different small area models under the two scenarios. As expected, the CVs of the FH predictors increase by about 13 per cent in mean from the scenario with average taxable income to the scenario without, and similarly for the NSFH predictors, with an increase of 15.6 per cent in mean. The estimated spatial clustering in the NSFH model (
Distribution of the Coefficient of Variation (per cent) of the Area Level Small Area Predictors.
In Figure 1 we map the predicted values of the mean income obtained under the different small area models considering the scenario with the average taxable income. Out-of-sample areas are predicted according to the predictors discussed in sections 2, 3, 4 and 5. Darker colours indicate higher mean income, and we use a common scale among the four small area predictors.
Maps of the Estimated Mean Income for each LLMA in Italy in 2015. Scenario with Average Taxable Income.
As expected, the maps appear similar because in this scenario, with the average taxable income, the spatial autocorrelation in the conditional distribution (of the target given the covariates) is trivial and the SFH, NSFH and SPFH models tend to be similar (or equal) to the FH model.
Figure 2 presents the same maps as in Figure 1, but for the scenario without average taxable income. In this case, the spatial information is able to mimic the behaviour of the omitted powerful covariate and plays a role in differentiating the predictions under the four small area models to an extent. The differences are expected and justified by the different structure and specification of the models. The out-of-sample areas predictions can benefit from the spatial locations of the LLMAs under the NSFH and SPFH models. Indeed, these two models allow the use of spatial data to build small area predictors that use spatial random effects, while the SPFH predictors are reduced to the synthetic predictor as under the FH model. Compared to the FH predictions, the SFH, NSFH and SPFH predictions show a higher mean income in the LLMAs of the north-east. It is worth noting that the SPFH predictions are exceptionally smooth, which is expected even if, in this case, this phenomenon seems to over-shrink the spatial distribution of the estimates: the high mean income in the LLMAs of the north (mostly north-east), which decreases coming down to the centre and further in the south. The degree of smoothness of the SPFH predictions is due to the choice of the P-spline and the number of knots; it can be adjusted and fine-tuned while working with them.
Maps of the Estimated Mean Income for each LLMA in Italy in 2015. Scenario without Average Taxable Income.
In conclusion, the spatial correlation (or clustering) in the conditional distribution of the target, given the auxiliary variables, is negligible when we use the average taxable, which has a high predictive power, as the auxiliary variable—which has a high predictive power. Moreover, the performances of the different small area predictors are quite similar and better than the direct estimator. Therefore, in this case, spatial information does not boost the performance—in terms of efficiency—of the small area SFH, NSFH, and SPFH predictors. When we do not include the average taxable income in the models, the spatial correlation allows the small area models SFH and SPFH to offset the lack of the predictive average taxable income auxiliary variable and leads to obtaining credible estimates.
In this article we first reviewed the most popular area level small area models for a continuous response variable that include spatial auxiliary information in the model specification. We then evaluated the effect of the spatial information on the small area estimates under two alternative scenarios depending on the predictive power of the models’ covariates. Specifically, the spatial pattern of the direct estimates among the areas was modelled in a parametric (Spatial FH, Non Stationary FH models) and in a semiparametric way (Semiparametric FH model), to better evaluate the use of the spatial information with or without high predictive covariates.
The results show that, when spatial information on the areas is available—and this is often the case—the specification of a spatial area level model can be crucial, especially when area-level covariates with a high predictive power are not available. In these situations, the conditional distribution of the target variable given the covariates can be characterized by a high spatial correlation/clustering that a spatial SAE model can exploit for increasing the efficiency with respect to a standard Fay-Herriot model. In our application, the spatial information included in the Semiparametric and Spatial Fay-Herriot models maintains about the same efficiency between the two compared scenarios, one where a high predictive covariate is present, and the second when this covariate is excluded from the models.
The findings presented in this article can support all those researchers that need to produce local estimates using data with a spatial structure, although they are of course not exhaustive of all the complexities that can emerge when analysing this kind of data.
The models considered in this article include the SFH model (Singh et al.[44]; Petrucci and Salvati[28]; Pratesi and Salvati[35, 36]), the SPFH model (Giusti et al.[14]) and the spatially nonstationary Fay-Herriot model (Chandra et al.[4]), as well as the Fay and Herriot[11] model.
We decided to focus on these models since area level estimation has received growing attention in the past few years in many fields, such as in socio-economic analyses, where the availability of local estimates is often crucial for policy purposes. In these fields, the lack of survey and/or population microdata with detailed geographical information—due to privacy issues—often makes it impossible to resort to unit-level small area models. In such cases, area level models usually represent a more flexible alternative. However, especially when the interest is in estimating sensitive and/or complex phenomena at the local level, such as income and poverty estimates, the lack of area-level covariates with a high predictive power can impact the efficiency of SAE estimates.
When the interest is in analysing spatial data, a relevant issue is the Modifiable Area Unit Problem (MAUP), which has been discussed in the spatial analysis literature since the 1930s (Unwin[45]). The MAUP is a possible source of bias that can drastically affect the results of statistical analyses. Specifically, the bias can be present when spatial characteristics measured in specific points are aggregated at a larger geographical level (e.g., larger areas). In these cases, the summary values of interest, for example the target variable totals, rates or proportions, can be strongly affected by the chosen specification of the areas’ borders. This is what occurs when we define covariates at the area level aggregating individual values.
Although MAUP is well known, few studies have been dedicated to the study of MAUP in the SAE context. Pratesi[34] provides some results that can be used to evaluate the robustness of SAE unit-level methods to alternative aggregations of point-based measures within the small areas of interest. Specifically, Pratesi[34] performed a simulation study to evaluate the so-called MAUP scale effect, the effect under which the results are different when changing the scale of aggregation of the data. Following this approach, further studies could be devoted to the evaluation of the MAUP in the context of spatial SAE models, as these could serve as a guide when different geographical small areas can be specified using the same data (e.g., aggregating microdata at different area levels).
Another relevant issue when applying area-level SAE models is the lack of population information to be used in building covariates (auxiliary variables) at the area level. Indeed, the Fay-Herriot model and all the spatial models we reviewed in this article assume that the auxiliary variables are measured without error—in other words, are based on censuses or population registers. Ybarra and Lohr[49] proposed a modified version of the FH model where the auxiliary variables can be affected by measurement error; for example, when they are estimates of population information affected by sampling variability. The Ybarra-Lohr model can therefore be useful when population information is unavailable but other data sources are available, including new ones (e.g., big data) that are not representative of all the target population (Marchetti et al.[23]). Future studies could be devoted to the specification of a spatial Ybarra-Lohr model, as the joint use of geographical information on the areas, and of covariates affected by any kind of measurement error could be of help in many relevant applications.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The authors received no financial support for the research, authorship and/or publication of this article.
