The Use of Spatial Information in Area-level Models: An Evaluation Based on Auxiliary Data Availability

Abstract

Small area estimation (SAE) theory is widely used when reliable local or domain-specific estimates based on survey data are needed. Small area model-based estimates use a model that links the response variable to some auxiliary information, borrowing strength from the related areas. When geographical information on the areas of interest is available, the specification of a spatial area level model can increase the estimates’ efficiency, depending on the available auxiliary data. In this article, we first review the most popular area level spatial models, and we then compare their performance under two alternative scenarios of auxiliary information availability to estimate the average equivalized household income in Italian Local Labour Market Areas (LLMAs) using the EU-SILC (European Union Statistics on Income and Living Conditions) survey data. Our findings suggest that the spatial information can “fill the gap” when the covariates do not have a high predictive power, a crucial result when there is a lack of auxiliary data.

AMS Subject Classification: 62D05, 62G05, 62H11

Keywords

EU-SILC data; Fay-Herriot model; simultaneous autoregressive process; semiparametric models; spatial nonstationarity

1. Introduction

Sample surveys are established as an effective method for estimating parameters of interest for populations, sub-populations or domains (e.g., regions, young women). Surveys are therefore designed to obtain reliable estimates with a predetermined level of precision for specific populations and sub-populations/domains, usually called planned domains (because they are planned in the design of the survey). Usually, estimation is carried out using direct estimators, which are obtained using only domain-specific survey data. However, stakeholders, policy makers, private and public agencies, and governments often need estimates referring to sub-populations or domains that, usually because of budget and time constraints, were not planned in the survey design. In many circumstances these unplanned domains are territorial domains identified by local administrative boundaries or by other borders that are functional to statistical analysis, such as socio-economic districts and/or specific areas where people live, classified by their degree of urbanization. Notably, in these unplanned domains the survey sample sizes can be very small or even zero. In these cases, when cost constraints do not allow for additional surveys and/or oversampling of the study areas, it is necessary to integrate existing information in order to harmonize it and produce credible statistics on the state, and possibly the dynamics of change, of the phenomenon at the local level (Giusti et al.^[13]). In the literature, the methods that deal with small sample sizes are known as small area estimation (SAE) methods. For a detailed description of SAE theory, please refer to the book by Rao and Molina^[38] or the reviews by Jiang and Lahiri^[17] and Pfeffermann.^[30]

Among the many available methods, indirect small area estimators are currently the most popular. Small area estimators are based on a model linking the study variable to available covariate information and specified to borrow strength from the related areas (Rao and Molina^[38]; Chandra et al.^[5]). Currently, several statistical agencies use empirical best linear unbiased predictors (EBLUPs) defined using linear mixed models (LMMs). Under LMMs the distribution of the study variable is a function of area-specific random effects, which account in the model for the differences among the areas. Such predictors may greatly increase the efficiency with respect to direct estimators, depending on the goodness of fit of the model. In this article we consider area level models (Fay and Herriot^[11]), which are a popular choice when data are available only in aggregated form, as happens very often in socio-economic studies because of confidentiality restrictions.

The use of spatial auxiliary information can be crucial in many applications of small area models, as this information can increase the estimates’ efficiency and effectiveness.

It is widely recognized that many environmental and socio-economic phenomena are characterized by a spatial distribution deriving both from nature and man’s action. For example, the spatial distribution of a pollution agent in the soil is a consequence not only of the soil geological characteristics, but also of man’s actions (e.g., through the construction of roads and buildings). The same combined action characterizes the distribution of crops in a region: man and nature work together and the result is the distribution of cultivated land that is scattered over the surface of a region (Petrucci et al.^[29]). Also, socio-economic phenomena can follow spatially varying patterns as depicted in many studies on the spatial distribution of poverty and living conditions (Barbier and Hochard^[2]; Curtis et al.^[8]; Ezcurra et al.^[10]).

A first way to introduce spatial information in a small area model is by specifying spatial auxiliary covariates that can be derived from administrative registers as well as from the geography of the territory under study, as it is represented by maps and spatial data (e.g., coverage, perimeters, extensions, and distances) often available through a geographic information system (GIS). In these applications, the spatial information concerning land use usually constitutes relevant auxiliary information; this is also true for developing countries, where official statisticians and other stakeholders can receive useful indications from satellite imagery providing maps of land use (e.g., on the existence and extent of forests, crops, grasses, sands, urban constructions, and quantity and yield of crops).

Moreover, spatial information in SAE modelling can also be included by extending the random effects model to allow for spatially correlated area effects, defined using a contiguity criterion. Among others, Cressie,^[6] Singh et al.,^[44] Pratesi and Salvati,^{[35, 36]} Molina et al.,^[25] Marhuenda et al.^[24] and Porter et al.^[31] have extended the Fay-Herriot (FH) model by including spatially correlated random effects using two different approaches: conditional autoregressive (CAR) and simultaneous autoregressive (SAR) specifications for the random effects (Anselin^[1]). Both SAR and CAR allow for spatial correlation in the random area effects, while the fixed effects parameters are spatially invariant.

Another way to include spatial information in SAE is by using the hypothesis that the model for the expectation of the response variable varies spatially, given the covariates.

Two solutions are possible in this case: using P-spline bivariate smoothing (Giusti et al.^[14]) and allowing the model coefficients to vary spatially (Chandra et al.^[4]). Giusti et al.^[14] model the spatial dependence by using P-splines bivariate smoothing that allows the auxiliary information to spatially vary. This approach is effective when data show spatial proximity effects. Chandra et al.^[4] propose a nonstationary semiparametric model where coefficients change with the location of the units according to a specific distance metric.

In this paper we present a review of the most important area level models that consider spatial information for improving the efficiency of the estimates when the response variable is continuous. In particular, we focus on how the effectiveness of small area estimators based on spatial models depends on the predictive power of the available auxiliary variables. In the Bayesian literature other spatial small area estimators have been proposed (see the book by Rao and Molina, Chapters 9 and 10)^[38] but are not included in this article, where the focus is on the models and predictors defined under the frequentist approach.

To motivate the potential benefits from using the spatial small area models depending on the availability of auxiliary variables, we use Italian data from the 2016 European Union Statistics on Income and Living Conditions (EU-SILC) survey with the aim of estimating the average equivalized income in the 611 Local Labour Market Areas (LLMAs) using auxiliary information from administrative registers (Salvati et al.^[41]). In particular, the estimates are obtained using two nested sets of auxiliary variables: in one set the predictive power is higher than in the second one. The aim is to evaluate the efficiency of small area predictors that use spatial information under the two different sets of auxiliary variables.

Regarding the structure of the article, in Section 2 we review the FH model and introduce the notation. In Section 3 we present the spatial small area model where the area random effects are spatially correlated (Singh et al.^[44]; Petrucci and Salvati^[28]; Pratesi and Salvati^{[35, 36]}). Section 4 is devoted to the incorporation of the spatial structure in the data by means of a non-parametric spatial P-spline model for small area estimation (Opsomer et al.^[27]; Giusti et al.^[14]). Then, in Section 5 we describe the nonstationary area level linear mixed model and the corresponding small area mean estimator (Chandra et al.^[4]). In Section 6 we compare the efficiency of the small area estimates obtained with the reviewed methods in two different scenarios. In the first scenario we make use of a covariate with a high predictive power, while in the second scenario we exclude this covariate from the models. We use data from the 2016 Italian EU-SILC survey to estimate the mean equivalized income for Italian LLMAs. The covariate with the high predictive power is the area average (per-capita) taxable income, and the aim is to check how it affects the efficiency of the reviewed models. Finally, Section 7 presents concluding remarks and provides avenues for further research.

2. Background and Notation

The basic and most popular model to produce small area estimates using area level data is the FH model, originally proposed to estimate the median income in the United States (Fay and Herriot^[11]). This approach combines the survey data with other data sources in a synthetic fitted regression with population area-level covariates.

Let ϑ be the $D \times 1$ vector of the parameters of inferential interest (small area totals or means with $d = 1, \dots, D$ ). Moreover, we can assume that the design unbiased direct estimator $\hat{ϑ}$ is available and the following model holds:

\hat{ϑ} = ϑ + ε,

(2.1)

where ε is a vector of independent sampling errors with mean vector 0 and known diagonal variance matrix $R = \mathrm{diag} (a_{d})$ , containing the sampling variances of the direct estimators of the area parameters of interest. In practice, $a_{d}$ is usually unknown and a generalized variance function approach based on the whole sample is used to estimate it (see Wolter, 1985, Chapter 5^[48], and Wang and Fuller, 2003^[46]). The area level random effects model assumes that the parameters of interest (vector ϑ) are linked to a $D \times p$ matrix X of area-specific auxiliary variables (including an intercept term) by a linear relationship:

ϑ = X β + Z u,

(2.2)

where β is the $p \times 1$ vector of regression parameters, Z is a $D \times D$ matrix of known positive constants and u is a $D \times 1$ vector of independent random area effects with mean vector 0 and variance-covariance matrix $Σ_{u} = σ_{u}^{2} I_{D}$ , where $I_{D}$ is the $D \times D$ identity matrix.

The combined model (Fay and Herriot^[11]) can be written as:

\hat{ϑ} = X β + Z u + ε

(2.3)

and it is a special case of a linear mixed model, with the variance of $\hat{ϑ}$ being equal to $V = R + Z Σ_{u} Z^{T}$ . Under this model, the Best Linear Unbiased Predictor (BLUP) ${\tilde{ϑ}}^{F H} (σ_{u}^{2})$ is extensively used to obtain model-based indirect estimators of small area parameters $ϑ$ and associated measures of variability. The EBLUP of $ϑ_{d}$ , the parameter of interest in small area $d$ , is:

{\hat{ϑ}}_{d}^{F H} ({\hat{σ}}_{u}^{2}) = x_{d} \hat{β} + z_{d} \hat{u},

(2.4)

where $\hat{β}$ is the generalized least squares estimator of $β$ , equal to $(X^{T} {\hat{V}}^{- 1} X)^{- 1} X^{T} {\hat{V}}^{- 1} \hat{ϑ}$ with $\hat{V} = R + Z {\hat{Σ}}_{u} Z^{T}$ , and $\hat{u} = {\hat{Σ}}_{u} Z^{T} {(R + Z {\hat{Σ}}_{u} Z^{T})}^{- 1} (\hat{ϑ} - X \hat{β})$ is the EBLUP of u, with $x_{d}$ and $z_{d}$ the d-th rows of matrices X and Z, respectively; ${\hat{σ}}_{u}^{2}$ is an asymptotically consistent estimator of $σ_{u}^{2}$ obtained by Maximum Likelihood (ML) or Restricted Maximum Likelihood (REML) methods based on the normality assumption of the random effects (see Chapter 6 in Rao and Molina).^[38]
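As an illustration, the EBLUP in (2.4) can be sketched in a few lines of Python. This is a minimal sketch and not the code used in the application: it assumes $Z = I_{D}$, it takes the variance estimate ${\hat{σ}}_{u}^{2}$ as given instead of computing it by ML or REML, and the function name `fh_eblup` and the toy numbers are hypothetical.

```python
import numpy as np

def fh_eblup(theta_hat, X, a, sigma2_u):
    """EBLUP (2.4) under the FH model, assuming Z = I_D (sketch only).

    theta_hat : (D,) direct estimates
    X         : (D, p) area-level covariates, intercept included
    a         : (D,) sampling variances (diagonal of R)
    sigma2_u  : variance of the area effects, taken as already estimated
    """
    D = len(theta_hat)
    V = np.diag(a) + sigma2_u * np.eye(D)            # V = R + Z Sigma_u Z^T
    V_inv = np.linalg.inv(V)
    # generalized least squares estimator of beta
    beta = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ theta_hat)
    resid = theta_hat - X @ beta
    u_hat = sigma2_u * V_inv @ resid                 # predicted area effects
    return X @ beta + u_hat, beta

# hypothetical toy data for four areas
theta_hat = np.array([10.0, 12.0, 9.0, 15.0])
X = np.column_stack([np.ones(4), [1.0, 2.0, 1.5, 3.0]])
a = np.array([1.0, 4.0, 0.5, 2.0])
eblup, beta = fh_eblup(theta_hat, X, a, sigma2_u=1.0)
```

Under these assumptions the predictor for area d reduces to the convex combination $γ_{d} {\hat{ϑ}}_{d} + (1 - γ_{d}) x_{d} \hat{β}$ with $γ_{d} = {\hat{σ}}_{u}^{2} / ({\hat{σ}}_{u}^{2} + a_{d})$ , so areas with a large sampling variance are shrunk more strongly towards the regression-synthetic part.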

In practice, areas can have zero sample sizes, being referred to in this case as non-sampled areas or out-of-sample areas. In this situation estimation is typically carried out by using a synthetic approach, based on a model fitted to data coming from the sampled areas. Therefore, under model (2.3), the synthetic EBLUP predictor for the unknown population value of area d is

{\hat{ϑ}}_{d, o u t}^{S Y N} = x_{d, o u t} \hat{β},

(2.5)

where $x_{d, o u t}$ is the d-th row of the covariates matrix for out-of-sample areas. The mean squared error (MSE) of ${\hat{ϑ}}_{d}^{F H} ({\hat{σ}}_{u}^{2})$ is typically estimated using the analytic approach proposed by Prasad and Rao^[33] (see Rao and Molina, Chapter 6).^[38]

3. Spatial EBLUP

The assumptions on the area random effects u specified in the FH model (2.3) are sometimes violated; here we analyse the case where the random effects are spatially correlated. Different approaches exist in the literature to handle spatial correlation. Among others, Singh et al.,^[44] Pratesi and Salvati^{[35, 36]} and Petrucci and Salvati^[28] consider the case when the random effects follow a SAR process. Other authors, such as Leroux et al.,^[19] MacNab,^[22] You and Zhou^[50] and Lawson^[18] use the CAR process to model the spatial dependence in the area random effects. Porter et al.^[32] note that the CAR model is prevalent in the disease mapping literature. Since in this review we compare different methods to account for spatial dependence in a socio-economic application, we focus on the SAR process for the area random effects.

Let the small area model be

\hat{ϑ} = X β + Z v + ε,

(3.1)

where $X$ , $β$ , $Z$ and $ε$ are the same quantities as defined in Section 2. Vector v is defined under the following SAR process:

v = ρ W v + u,

where $ρ \in [- 1,1]$ is the (unknown) autocorrelation parameter and W is a proximity matrix (Cressie^[7]), which is defined in row standardized form. If $(I_{D} - ρ W)$ is non-singular, then v is equal to

v = {(I_{D} - ρ W)}^{- 1} u .

(3.2)

The vector u has the same structure as in Section 2: in other words, u is normally distributed with mean 0 and diagonal covariance matrix $Σ_{u} = σ_{u}^{2} I_{D}$ . Under equation (3.2) the vector v has mean 0 and a covariance matrix that depends on $(σ_{u}^{2}, ρ) :$

Σ_{v} = σ_{u}^{2} {[{(I_{D} - ρ W)}^{T} (I_{D} - ρ W)]}^{- 1} .

From equations (3.1) and (3.2) we obtain the spatial area level model

\hat{ϑ} = X β + {(I_{D} - ρ W)}^{- 1} u + ε .

(3.3)

Assuming independence between v and $ε$ , the covariance matrix of $\hat{ϑ}$ is

V = Σ_{v} + R,

and it depends on the unknown parameters $σ_{u}^{2}$ and $ρ$ .
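The construction of $Σ_{v}$ and V can be sketched as follows. This is a minimal illustration under stated assumptions: the proximity matrix is built from a hypothetical 0/1 adjacency matrix for four areas on a line, $Z = I_{D}$, and the helper names `row_standardize` and `sar_covariance` are ours, not from the cited papers.

```python
import numpy as np

def row_standardize(adjacency):
    """Divide each row of a 0/1 adjacency matrix by its row sum."""
    A = np.asarray(adjacency, dtype=float)
    return A / A.sum(axis=1, keepdims=True)

def sar_covariance(W, rho, sigma2_u):
    """Sigma_v = sigma2_u * [(I - rho W)^T (I - rho W)]^{-1}."""
    D = W.shape[0]
    B = np.eye(D) - rho * W
    return sigma2_u * np.linalg.inv(B.T @ B)

# hypothetical example: four areas on a line, neighbours share a border
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]])
W = row_standardize(adjacency)
Sigma_v = sar_covariance(W, rho=0.5, sigma2_u=2.0)
R = np.diag([1.0, 4.0, 0.5, 2.0])        # sampling variances a_d
V = Sigma_v + R                          # covariance matrix of the direct estimates
```

Setting `rho=0.0` returns $σ_{u}^{2} I_{D}$ , which recovers the covariance structure of the FH model of Section 2.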

Under the model (3.3) we can derive the spatial BLUP of $ϑ_{d} = x_{d} β + v_{d}$ , namely

{\tilde{ϑ}}_{d}^{S F H} = x_{d}^{T} \tilde{β} + z_{d}^{T} Σ_{v} V^{- 1} (\hat{ϑ} - X \tilde{β}),

(3.4)

where $\tilde{β} = {(X^{T} V^{- 1} X)}^{- 1} X^{T} V^{- 1} \hat{ϑ}$ is the generalized least squares estimator of $β$ and $z_{d}$ is the d-th column of the matrix Z, as defined in Section 2.

Under the normality assumption for both $ε$ and u, the parameters $σ_{u}^{2}$ and $ρ$ can be estimated using ML or REML. The estimated parameters ${\hat{σ}}_{u}^{2}$ and $\hat{ρ}$ can be plugged in equation (3.4) to obtain the Spatial Empirical BLUP (SEBLUP):

{\hat{ϑ}}_{d}^{S F H} = x_{d}^{T} \hat{β} + z_{d}^{T} {\hat{Σ}}_{v} {\hat{V}}^{- 1} (\hat{ϑ} - X \hat{β}),

(3.5)

with $\hat{β} = {(X^{T} {\hat{V}}^{- 1} X)}^{- 1} X^{T} {\hat{V}}^{- 1} \hat{ϑ}$ , $\hat{V} = {\hat{Σ}}_{v} + R$ and ${\hat{Σ}}_{v} = {\hat{σ}}_{u}^{2} {[{(I_{D} - \hat{ρ} W)}^{T} (I_{D} - \hat{ρ} W)]}^{- 1} .$ More details can be found in Singh et al.^[44] and Petrucci and Salvati.^[28]
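A compact numerical sketch of the SEBLUP (3.5) is given below. It assumes $Z = I_{D}$ and treats ${\hat{σ}}_{u}^{2}$ and $\hat{ρ}$ as already estimated; the function name `seblup`, the ring-shaped neighbourhood and the data are hypothetical.

```python
import numpy as np

def seblup(theta_hat, X, a, W, rho, sigma2_u):
    """Spatial EBLUP (3.5) for all areas, assuming Z = I_D (sketch only)."""
    D = len(theta_hat)
    B = np.eye(D) - rho * W
    Sigma_v = sigma2_u * np.linalg.inv(B.T @ B)      # SAR covariance of v
    V_inv = np.linalg.inv(Sigma_v + np.diag(a))      # V = Sigma_v + R
    beta = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ theta_hat)
    resid = theta_hat - X @ beta
    return X @ beta + Sigma_v @ V_inv @ resid, beta

# hypothetical toy data: four areas arranged in a ring
W = np.array([[0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0]])
theta_hat = np.array([10.0, 12.0, 9.0, 15.0])
X = np.column_stack([np.ones(4), [1.0, 2.0, 1.5, 3.0]])
a = np.array([1.0, 4.0, 0.5, 2.0])
pred, beta = seblup(theta_hat, X, a, W, rho=0.5, sigma2_u=1.0)
```

Because ${\hat{Σ}}_{v} {\hat{V}}^{- 1}$ is no longer diagonal when $\hat{ρ} \neq 0$ , the prediction for area d also uses the residuals of the other areas, which is how the model borrows strength across space.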

The prediction of the parameter of interest in out-of-sample areas is usually carried out using synthetic estimators. These estimators use the fixed part of the small area model that is fitted using the sampled areas. Therefore, under model (3.1), we define the synthetic EBLUP for the unknown target parameter of area d:

{\hat{ϑ}}_{d, o u t}^{S S Y N} = x_{d, o u t} \hat{β},

(3.6)

where $x_{d, o u t}$ is the d-th row of the covariates matrix for the out-of-sample areas.

The analytical form of the MSE of the spatial BLUP is defined in Singh et al.,^[44] and under certain conditions corresponds to the MSE of the FH BLUP derived by Prasad and Rao.^[33] The MSE of the SEBLUP cannot be obtained in exact form because there is a non-linear relation between the SEBLUP and the data vector $\hat{ϑ}$ . A Taylor linearization, similar to the one used in Prasad and Rao,^[33] leads to an approximated form of the MSE of the SEBLUP, which can be estimated under normality assumptions for u and $ε$ (see Singh et al.^[44] and Petrucci and Salvati^[28]). Alternative bootstrap procedures to estimate the MSE of the SEBLUP have been proposed by, among others, Molina et al.^[25] When the normality assumptions are violated, a robust version of the SEBLUP has been proposed by Schmid and Münnich^[43] for unit level models. Warnholz^[47] extended the robust estimation procedure for linear mixed models proposed by Sinha and Rao (2009)^[42] to area-level models that also include spatial correlation. This model was considered in the application presented in Section 6, but the results are not shown as they are consistent with those of the Spatial EBLUP. These results are available from the authors upon request.

4. Semiparametric Fay-Herriot (SPFH) Model Using Penalized Splines

The FH model (2.3) and its spatial version assume a linear function that links the direct survey estimators and the covariates. However, there are many cases where it is not possible to establish in advance the functional form of this relationship. This is often the case when spatial proximity effects are supposed to be present in the data. Giusti et al.^[14] propose a P-spline bivariate smoothing model that can be used to specify a FH model that accounts for spatial effects among the areas in a straightforward manner. Moreover, the proposed semiparametric specification can be used to obtain estimates for out-of-sample areas, including those areas for which the auxiliary variables are not available. Opsomer et al.^[27] propose a similar small area model at the unit level and Rao et al.^[37] propose a modified version robust to the presence of outliers in the random small area effects and/or unit-level errors. The development of a robust estimator using a semiparametric regression approach could also be extended to area level models. Under a model-based direct estimation approach, Salvati et al.^[40] propose an alternative nonparametric specification for estimating small area means.

We now return to the FH model (2.3) (Fay and Herriot, 1979). The FH model assumes a linear function linking the direct survey estimators to the covariates; when this assumption is not met, the estimators of the small area target parameters can be biased. P-splines provide an alternative semiparametric specification of the FH model that allows for non-linear functional relationships between the direct estimates and the covariates.

We now introduce the semiparametric additive model (hereafter, ‘semiparametric model’) with a single covariate $x_{1}$ that enters the model through $f (x_{1})$ . The function $f (\cdot)$ is unknown and is assumed to be approximated by:

f (x_{1}; α, g) = α_{0} + α_{1} x_{1} + \dots + α_{q} x_{1}^{q} + \sum_{k = 1}^{K} g_{k} {(x_{1} - κ_{k})}_{+}^{q},

(4.1)

where $α = {(α_{0}, α_{1}, \dots, α_{q})}^{T}$ is the $(q + 1) \times 1$ vector of the coefficients of the polynomial function, $g = {(g_{1}, \dots, g_{K})}^{T}$ is the vector of coefficients of the truncated P-spline basis and q is the degree of the spline, with ${(t)}_{+}^{q} = t^{q}$ if $t > 0$ , and 0 otherwise. The final part of the model takes into account the possible departures from a q-th degree polynomial fit in the relationship. Specifically, $κ_{k}$ for $k = 1, \dots, K$ are fixed knots and, for large K, the function (4.1) spans a large class of functions that can approximate many smooth functions. Guidance on the choice of the bases and knots can be found in Ruppert et al.^[39]

As shown by Ruppert et al.^[39] and Opsomer et al.,^[27] a P-spline model can be specified as a random effects model. Therefore, it is possible to combine it with the FH model to specify a semiparametric area level small area estimator that uses the well-known specification of linear mixed-model regression.
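The two parts of the approximation (4.1), the polynomial terms and the truncated power terms, can be built numerically as in the following sketch. The function name `truncated_power_basis` and the example values are hypothetical, and the sketch assumes a spline degree $q \geq 1$ .

```python
import numpy as np

def truncated_power_basis(x, knots, q=1):
    """Return the polynomial and truncated power parts of (4.1).

    x     : (D,) values of the covariate x_1
    knots : (K,) fixed knots kappa_1, ..., kappa_K
    q     : degree of the spline (assumed >= 1 in this sketch)
    """
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    # columns 1, x, ..., x^q
    X1 = np.vander(x, N=q + 1, increasing=True)
    # (x - kappa_k)_+^q : zero to the left of each knot
    P = np.clip(x[:, None] - knots[None, :], 0.0, None) ** q
    return X1, P

# hypothetical covariate values and knots
X1, P = truncated_power_basis([0.0, 1.0, 2.0, 3.0], knots=[0.5, 2.5], q=1)
```

Each column of `P` is zero up to its knot and grows polynomially afterwards, which is what lets the coefficients g capture local departures from the global polynomial.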

Corresponding to the $α$ and $g$ vectors, we obtain:

X_{1} = [\begin{matrix} 1 & x_{11} & \dots & x_{11}^{q} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ 1 & x_{1 D} & \dots & x_{1 D}^{q} \end{matrix}], P = [\begin{matrix} {(x_{11} - κ_{1})}_{+}^{q} & \dots & {(x_{11} - κ_{K})}_{+}^{q} \\ ⋮ & ⋱ & ⋮ \\ {(x_{1 D} - κ_{1})}_{+}^{q} & \dots & {(x_{1 D} - κ_{K})}_{+}^{q} \end{matrix}] .

Following the same notation already introduced in the previous sections, we can write the mixed-model specification of the semiparametric Fay-Herriot model (SPFH) as:

\hat{ϑ} = X β + X_{1} α + P g + Z u + ε .

(4.2)

Appending the columns of the $X_{1}$ matrix of model (4.2) to the X matrix, the model becomes:

\hat{ϑ} = X τ + P g + Z u + ε,

(4.3)

where $τ$ is a $(p + q + 1) \times 1$ vector of regression coefficients, the $g$ component can be considered as a $K \times 1$ vector of independent and identically distributed random variables with mean 0 and $K \times K$ variance matrix $Σ_{g} = σ_{g}^{2} I_{K}$ , and $Z$ and $u$ are defined as in (2.2). The SPFH model is thus specified as a linear mixed model with variance-covariance matrix $Σ (ξ) = P Σ_{g} P^{T} + Z Σ_{u} Z^{T} + R$ where $ξ = (σ_{g}^{2}, σ_{u}^{2})$ .

The best linear unbiased predictor (Henderson^[16]) can be used to obtain model-based estimates of the small area target parameters. By using the assumption of normal distribution for the random effects and estimating $σ_{g}^{2}$ and $σ_{u}^{2}$ by the ML or the REML (Prasad and Rao^[33]), the corresponding EBLUP for the SPFH is:

{\hat{ϑ}}^{S P F H} (\hat{ξ}) = X \hat{β} (\hat{ξ}) + \hat{Λ} (\hat{ξ}) [\hat{ϑ} - X \hat{β} (\hat{ξ})],

(4.4)

where $\hat{β} (\hat{ξ}) = {(X^{T} \hat{Σ} {(\hat{ξ})}^{- 1} X)}^{- 1} X^{T} \hat{Σ} {(\hat{ξ})}^{- 1} \hat{ϑ}$ and $\hat{Λ} (\hat{ξ}) = (P {\hat{Σ}}_{g} P^{T} + Z {\hat{Σ}}_{u} Z^{T}) \hat{Σ} {(\hat{ξ})}^{- 1}$ .
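The predictor (4.4) can be sketched numerically as follows. The sketch assumes $Z = I_{D}$ , takes the variance components as already estimated, and uses a hypothetical function name `spfh_eblup` and toy data; the spline matrix P is built here from a linear truncated basis with two illustrative knots.

```python
import numpy as np

def spfh_eblup(theta_hat, X, P, a, sigma2_g, sigma2_u):
    """EBLUP (4.4) of the semiparametric FH model, assuming Z = I_D."""
    D = len(theta_hat)
    # P Sigma_g P^T + Z Sigma_u Z^T
    random_part = sigma2_g * P @ P.T + sigma2_u * np.eye(D)
    Sigma_inv = np.linalg.inv(random_part + np.diag(a))   # Sigma(xi)^{-1}
    beta = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ theta_hat)
    resid = theta_hat - X @ beta
    Lam = random_part @ Sigma_inv                          # Lambda(xi)
    g_hat = sigma2_g * P.T @ Sigma_inv @ resid             # predicted spline coefficients
    return X @ beta + Lam @ resid, beta, g_hat

# hypothetical toy data for five areas
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones(5), x1])
P = np.clip(x1[:, None] - np.array([1.5, 2.5])[None, :], 0.0, None)
theta_hat = np.array([1.0, 2.1, 2.9, 5.2, 8.1])
a = np.array([0.5, 1.0, 0.8, 1.2, 0.6])
pred, beta, g_hat = spfh_eblup(theta_hat, X, P, a, sigma2_g=0.4, sigma2_u=0.3)
```

The returned prediction decomposes as $X \hat{β} + P \hat{g} + \hat{u}$ , so the spline part and the area effect can be inspected separately; with `sigma2_g=0.0` the sketch collapses to the FH predictor of Section 2.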

As already underlined, the SPFH model can be used when the geographical location of the areas is available and the final estimates are represented using maps. In these cases, the geographical information can be introduced in the analysis by using bivariate smoothing. As in P-splines the handling of nonlinear structures in the data depends on the specification of a set of basis functions, bivariate basis functions are used for bivariate smoothing. Following Giusti et al.,^[14] in this study we assume the following model:

f (x_{1}, x_{2}; α, g) = α_{0} + α_{1} x_{1} + α_{2} x_{2} + p_{d} g

where $p_{d}$ is the d-th row of the $D \times K$ matrix:

P = {[A ({\tilde{x}}_{d} - κ_{k})]}_{\begin{matrix} 1 \leq d \leq D \\ 1 \leq k \leq K \end{matrix}} {[A (κ_{k} - κ_{k'})]}_{1 \leq k, k' \leq K}^{- 1 / 2},

with $A (t) = ∥ t ∥^{2} \log ∥ t ∥$ , ${\tilde{x}}_{d} = (x_{1 d}, x_{2 d})$ and $κ_{k}$ , $k = 1, \dots, K$ , the knots. In this model, for each area d, ${\tilde{x}}_{d}$ contains the coordinates (that is, the latitude and longitude) of the centroid of the area (see Opsomer et al.^[27]; Ruppert et al.,^[39] Chapter 13; Kammann and Wand^[26] and French et al.^[12]). Moreover, Baldermann et al. (2018) proposed a robust version of the SPFH predictor.
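A possible numerical construction of the bivariate basis matrix P is sketched below. Since the knot matrix $[A (κ_{k} - κ_{k'})]$ has a zero diagonal, it is not positive definite in general; the sketch therefore uses an SVD-based generalized square root, which is our assumption about how to compute the power $- 1 / 2$ and not necessarily the choice made in the cited works. The function names and the coordinates are hypothetical.

```python
import numpy as np

def tps_radial(dist):
    """A(t) = ||t||^2 log||t||, with the convention A(0) = 0."""
    out = np.zeros_like(dist, dtype=float)
    pos = dist > 0
    out[pos] = dist[pos] ** 2 * np.log(dist[pos])
    return out

def tps_basis(coords, knots):
    """D x K basis matrix P from area centroids and knots (sketch only)."""
    coords = np.asarray(coords, dtype=float)
    knots = np.asarray(knots, dtype=float)
    # pairwise Euclidean distances: areas to knots, knots to knots
    d_xk = np.linalg.norm(coords[:, None, :] - knots[None, :, :], axis=2)
    d_kk = np.linalg.norm(knots[:, None, :] - knots[None, :, :], axis=2)
    Z_K = tps_radial(d_xk)
    Omega = tps_radial(d_kk)
    # SVD-based generalized square root of the knot matrix (assumption)
    U, s, Vt = np.linalg.svd(Omega)
    sqrt_Omega = (U * np.sqrt(s)) @ Vt
    return Z_K @ np.linalg.inv(sqrt_Omega)

# hypothetical centroids (four areas) and three knots
coords = [[0.5, 0.5], [1.0, 2.0], [2.5, 1.0], [0.2, 2.8]]
knots = [[0.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
P = tps_basis(coords, knots)
```

The knots in the example are chosen so that no pair lies at distance exactly one, where $A (t)$ vanishes; in practice the number and placement of knots would follow the guidance in the references cited above.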

If there are some out-of-sample areas, a semiparametric predictor can be used under the assumption that the covariates $x_{d, o u t}$ are known, including the coordinates of the centroid of the out-of-sample areas ${\tilde{x}}_{d, o u t} = (x_{1 d, o u t}, x_{2 d, o u t})$ . In these cases, the predictor can use the geographical information on the out-of-sample areas to improve the prediction. Under model (4.3) the predictor for the out-of-sample areas is:

{\hat{ϑ}}_{d, o u t}^{S P S Y N} = x_{d, o u t} \hat{β} + P_{d, o u t} \hat{g},

where $x_{d, o u t}$ is the d-th row of the covariates matrix for out-of-sample areas, $P_{d, o u t}$ denotes the d-th row of the $P$ matrix referring to the out-of-sample areas and $\hat{g} = {\hat{Σ}}_{g} P^{T} {\hat{Σ}}^{- 1} (\hat{ϑ} - X \hat{β}) .$

As concerns the MSE estimation of the EBLUP ${\hat{ϑ}}_{d}^{S P F H} (\hat{ξ})$ , Giusti et al.^[14] present an analytic approximation based on the extension of the Prasad and Rao^[33] estimator originally proposed by Opsomer et al.^[27] to deal with FH models with a general covariance structure, as is the case when a spline random component is introduced. This is the estimator used in the application presented in Section 6. As an alternative, Giusti et al.^[14] also propose a non-parametric bootstrap procedure based on the proposals by González-Manteiga et al.,^[15] Opsomer et al.^[27] and Molina et al.^[25]

5. Spatially Nonstationary FH Model

In the spatial model (3.1) the fixed effects parameters are spatially invariant and the spatial relationship is captured by the error structure.

The assumption that the fixed effects associated with the model covariates do not vary spatially can be inappropriate in case of spatial nonstationarity (see Brunsdon et al., 1996^[3] and the references therein). To adjust for this, Chandra et al.^[4] describe a spatial nonstationary extension of the FH model (2.3).

The spatial location of an area d is given by its coordinates (for example, those of its centroid) and is denoted by $l_{d}$ . Let $d i s t (l_{d}, l_{j})$ denote the distance between the spatial locations of areas d and j, and define the spatial contiguity of these two locations as $ω_{d j} = {\{1 + d i s t (l_{d}, l_{j})\}}^{- 1}$ , with $Ω = [ω_{d j}]$ a positive definite $D \times D$ matrix of spatial contiguities defined by the $l_{d}$ . Chandra et al.^[4] assume that this matrix is known. A spatially nonstationary FH model assumes the form:

\hat{ϑ} = X β + Ψ Γ + Z u + ε,

(5.1)

where $Ψ = \mathrm{diag} (x_{d}^{T}; d = 1, \dots, D)$ , $Γ = {\{γ^{T} (l_{1}), \dots, γ^{T} (l_{D})\}}^{T}$ and $γ (l) = \{γ_{k} (l); k = 1, \dots, p\}$ is a spatially varying multivariate random process of dimension p. The covariance matrix of the model (5.1) is $V = Ψ Σ Ψ^{T} + Z Σ_{u} Z^{T} + R$ , where $Σ = Ω \otimes C$ , $C = [c_{k g}]$ is a $p \times p$ covariance matrix that characterizes the correlations between the components of $γ$ at an arbitrary location l, and $\otimes$ represents the Kronecker product. The geographic scale and strength (or intensity) of the spatial correlation in the population of interest are defined by the components $Ω$ and C of the covariance matrix. Chandra et al.^[4] present more information on the specification of the covariance structure.
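The contiguities $ω_{d j}$ that populate $Ω$ can be computed directly from the area locations. The sketch below assumes that the locations are centroid coordinates and that the distance is Euclidean; both the function name `contiguity_matrix` and the coordinates are hypothetical.

```python
import math

def contiguity_matrix(centroids):
    """Omega = [omega_dj] with omega_dj = 1 / (1 + dist(l_d, l_j)).

    centroids : list of (x, y) coordinates, one per area
    The Euclidean distance is an assumption of this sketch.
    """
    D = len(centroids)
    return [
        [1.0 / (1.0 + math.dist(centroids[d], centroids[j])) for j in range(D)]
        for d in range(D)
    ]

# hypothetical centroids of three areas
Omega = contiguity_matrix([(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)])
```

By construction the diagonal entries equal one and the contiguity decays towards zero as two areas move further apart.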

The minimum MSE predictor of $ϑ$ under (5.1) is the expected value of $ϑ$ given $\hat{ϑ}$ , $Ψ$ , and the location. Under a Gaussian errors assumption, and assuming the inverse of V exists, the nonstationary EBLUP (NSEBLUP) estimator of $ϑ_{d}$ is

{\hat{ϑ}}_{d}^{N S F H} ({\hat{σ}}_{u}^{2}, {\hat{c}}_{k g}) = x_{d} \hat{β} + ψ_{d} \hat{γ} + z_{d} \hat{u},

(5.2)

where $\hat{β} = (X^{T} {\hat{V}}^{- 1} X)^{- 1} X^{T} {\hat{V}}^{- 1} \hat{ϑ}$ , $\hat{V} = Ψ \hat{Σ} Ψ^{T} + Z {\hat{Σ}}_{u} Z^{T} + R$ , and $x_{d}, ψ_{d},$ $z_{d}$ are the d-th rows of matrices X, $Ψ$ and Z, respectively. The EBLUPs of $γ$ and u are $\hat{γ} = \hat{Σ} Ψ^{T} {\hat{V}}^{- 1} (\hat{ϑ} - X \hat{β})$ and $\hat{u} = {\hat{Σ}}_{u} Z^{T} {\hat{V}}^{- 1} (\hat{ϑ} - X \hat{β})$ .

Assuming that there are $D_{o u t}$ non-sampled areas and that the covariate vectors $x_{d, o u t}$ and the spatial locations $l_{d, o u t}$ (e.g., the centroids) of these areas are known, for non-sampled areas under (5.1) the predictor assumes the form

{\hat{ϑ}}_{d, out}^{NSSYN} ({\hat{c}}_{k g}) = x_{d, out} \hat{β} + ψ_{d,out} {\hat{γ}}_{out},

(5.3)

where $x_{d, o u t}$ is the d-th row of the covariates matrix for out-of-sample areas, $ψ_{d, o u t}$ denotes the d-th row of $Ψ_{o u t} = d i a g (x_{d, o u t}^{T}; d = 1, \dots, D_{o u t})$ and ${\hat{γ}}_{out} = Σ_{out / in} Ψ^{T} {\hat{V}}^{- 1} (\hat{ϑ} - X \hat{β})$ with $Σ_{out / in} = Ω_{out / in} \otimes \hat{C} .$ Here, $Ω_{o u t / i n}$ is the known $D_{o u t} \times D$ matrix of spatial contiguities between the non-sampled areas and the sampled areas. This predictor (5.3) is a nonstationary synthetic predictor (NSSYN, Chandra et al.^[4]), which could improve on the performance of the traditional SYN predictor (2.5) by using the location of the non-sampled areas to ‘borrow strength’ from neighbouring sampled areas.

In the application in Section 6 we assume a simple single parameter specification $C = η I_{p}$ for the matrix C, where $I_{p}$ denotes the identity matrix of order p (Chandra et al.^[4]). That is, the parameter $η \geq 0$ reflects the ‘intensity’ of the data spatial clustering, and $η = 0$ corresponds to data that are spatially homogeneous.
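The nonstationary predictor (5.2) for the sampled areas can be sketched under the single parameter specification $C = η I_{p}$ . The sketch assumes $Z = I_{D}$ , takes $\hat{η}$ and ${\hat{σ}}_{u}^{2}$ as given, and builds $Ω$ from hypothetical centroids with Euclidean distances; the function names are ours.

```python
import numpy as np

def nonstationary_design(X, Omega, C):
    """Psi = diag(x_d^T) (D x Dp) and Sigma = Omega kron C (Dp x Dp)."""
    D, p = X.shape
    Psi = np.zeros((D, D * p))
    for d in range(D):
        Psi[d, d * p:(d + 1) * p] = X[d]     # x_d^T in the d-th diagonal block
    return Psi, np.kron(Omega, C)

def nsfh_eblup(theta_hat, X, a, Omega, eta, sigma2_u):
    """NSEBLUP (5.2) with C = eta * I_p and Z = I_D (sketch only)."""
    D, p = X.shape
    Psi, Sigma = nonstationary_design(X, Omega, eta * np.eye(p))
    V = Psi @ Sigma @ Psi.T + sigma2_u * np.eye(D) + np.diag(a)
    V_inv = np.linalg.inv(V)
    beta = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ theta_hat)
    resid = theta_hat - X @ beta
    gamma_hat = Sigma @ Psi.T @ V_inv @ resid    # spatially varying coefficients
    u_hat = sigma2_u * V_inv @ resid             # area effects
    return X @ beta + Psi @ gamma_hat + u_hat, beta

# hypothetical centroids, covariates and direct estimates for four areas
centroids = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
Omega = 1.0 / (1.0 + dist)
X = np.column_stack([np.ones(4), [1.0, 2.0, 1.5, 3.0]])
theta_hat = np.array([10.0, 12.0, 9.0, 15.0])
a = np.array([1.0, 4.0, 0.5, 2.0])
pred, beta = nsfh_eblup(theta_hat, X, a, Omega, eta=0.3, sigma2_u=1.0)
```

With $C = η I_{p}$ the term $Ψ Σ Ψ^{T}$ has (d, j) entry $η ω_{d j} x_{d} x_{j}^{T}$ , so a larger $η$ lets covariate effects differ more between distant areas, and `eta=0.0` returns the FH predictor.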

Chandra et al.^[4] propose: (a) an analytic estimation of the MSE of the ${\hat{ϑ}}_{d}^{N S F H} ({\hat{σ}}_{u}^{2}, {\hat{c}}_{k g})$ following Prasad and Rao^[33] (see also Datta et al., 2005^[9]); (b) a parametric bootstrap procedure for estimating this MSE; (c) a diagnostic measure for spatial nonstationarity following González-Manteiga et al. (2007)^[15] and Molina et al.^[25]

An alternative method for considering spatial nonstationarity is proposed by Benedetti et al. (2013), assuming that when there is spatial nonstationarity, the population is divided into heterogeneous latent subgroups. To identify the latent subgroups of small areas Benedetti et al. (2013) propose a simulated annealing algorithm that minimizes the sum of the estimated MSE of the small area estimates.

6. Application

In this article, we present an analysis of Italian data from the 2016 EU-SILC with the aim of estimating the mean equivalized disposable income in the 611 LLMAs using auxiliary information from administrative registers. The LLMAs are unplanned domains for the EU-SILC survey. They are defined as clusters of municipalities in which the majority of the labour force lives and works, and where enterprises can find the largest amount of labour force. Being unplanned domains, many LLMAs have a small sample size that hinders the reliability of direct estimates; additionally, 292 LLMAs are out-of-sample areas (the sample size is 0). In terms of resident population, they range from approximately 3,000 residents (mountain areas, minor islands) to 3.8 million residents (Milan, Rome), a very skewed distribution of income among areas.

The equivalized disposable income is computed at the household level as the total disposable income (incomes, pensions, and other benefit after taxation). It is then divided by the equivalent household size, which is obtained using the OECD (Organisation for Economic Co-operation and Development) equivalent scale that accounts for economies of scale in the household. Under this scale the first adult in the household is assigned a weight of 1.0, other persons aged 14 or over living in the household a weight of 0.5, each child aged less than 14 a weight of 0.3. Within an household, the same value of the household equivalized disposable income is then assigned to all the members. Direct estimates of the mean equivalized disposable income (hereafter, mean income) for the LLMAs are obtained using the cross-sectional weights of the EU-SILC survey, which unfortunately do not add up to the in each LLMa to the total resident population, since these areas are not planned in the design of the survey. For this reason, as a direct estimator of the area mean ${\hat{θ}}_{d}$ the ratio type estimator

{\hat{θ}}_{d} = \frac{\sum_{j = 1}^{n_{d}} w_{d j} y_{d j}}{\sum_{j = 1}^{n_{d}} w_{d j}},

has been used, where $y_{d j}$ denotes the equivalized income for the j-th individual in the d-th LLMA and $w_{d j}$ the associated personal weight published by the Italian National Statistical Institute, Istat. Using personal weights, the mean income is per capita, and not per household. The variances of the direct estimator, $a_{d}$, are estimated using the linearization method as implemented in the ‘survey’ package of R (Lumley, 2004^[20]; Lumley, 2017^[21]) after some simplification of the actual EU-SILC sampling design.
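As an illustration of the two steps described above, the following minimal Python sketch computes the equivalent household size under the weights stated in the text and then the weighted ratio estimator of an area mean. The function names and the toy incomes and weights are hypothetical and are not EU-SILC data; the sketch does not reproduce the variance estimation.

```python
def equivalent_size(n_aged_14_plus, n_under_14):
    """Equivalent household size under the scale described in the text."""
    if n_aged_14_plus < 1:
        raise ValueError("a household needs at least one person aged 14 or over")
    # 1.0 for the first adult, 0.5 for each other person aged 14 or over,
    # 0.3 for each child aged less than 14.
    return 1.0 + 0.5 * (n_aged_14_plus - 1) + 0.3 * n_under_14


def ratio_estimator(y, w):
    """Weighted mean sum(w * y) / sum(w) over the sampled individuals of one area."""
    if len(y) != len(w):
        raise ValueError("y and w must have the same length")
    total_weight = sum(w)
    if total_weight <= 0:
        raise ValueError("the weights must sum to a positive value")
    return sum(wj * yj for wj, yj in zip(w, y)) / total_weight


# Hypothetical household: two adults and two children under 14,
# with a total disposable income of 42,000.
size = equivalent_size(2, 2)
equivalized_income = 42000.0 / size  # assigned to every household member

# Hypothetical area with three sampled individuals (incomes and personal weights).
y_d = [20000.0, 20000.0, 14000.0]
w_d = [150.0, 150.0, 300.0]
theta_hat_d = ratio_estimator(y_d, w_d)
```

Dividing by the sum of the weights, rather than by a known population count, is what makes the estimator usable when the weights do not add up to the resident population of the area.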

In our analysis, data from the EU-SILC survey are complemented with auxiliary information. We consider two variables published by the Ministry of Economy and Finance: the total taxable income and the number of persons declaring taxable income (taxpayers). These data are available at the municipality level. Exploiting the hierarchical administrative division characterizing Italy (i.e., provinces, LLMAs and municipalities), the average per capita taxable income was computed in each LLMA by using the available data on the total number of taxpayers (contributors) and the total amount of taxable incomes in each municipality. Moreover, the availability of the population size for each municipality, published every year by Istat, enabled us to compute the percentage of persons aged more than 15 years who declare some income. It is important to underline that the data published by the Ministry of Economy and Finance are derived from taxpayers’ declarations that have not yet been validated. Therefore, they could be characterized by inconsistencies (for more methodological issues see http://www1.finanze.gov.it/pagina_dichiarazioni/dichiarazioni.html).
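The aggregation from municipalities to LLMAs described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the record layout, the LLMA codes and all figures are hypothetical, and we assume that the resident population aged over 15 is available for each municipality.

```python
from collections import defaultdict

# Hypothetical municipality records:
# (LLMA code, total taxable income, number of taxpayers, residents aged over 15)
municipalities = [
    ("LLMA_A", 30_000_000.0, 1_500, 2_000),
    ("LLMA_A", 10_000_000.0, 500, 1_000),
    ("LLMA_B", 9_000_000.0, 600, 1_000),
]


def aggregate_by_llma(records):
    """Sum municipality totals within each LLMA, then form the two covariates."""
    totals = defaultdict(lambda: [0.0, 0, 0])
    for llma, income, taxpayers, population in records:
        totals[llma][0] += income
        totals[llma][1] += taxpayers
        totals[llma][2] += population

    covariates = {}
    for llma, (income, taxpayers, population) in totals.items():
        if taxpayers <= 0 or population <= 0:
            raise ValueError(f"non-positive totals for {llma}")
        covariates[llma] = {
            # Ratio of LLMA totals, not a mean of municipality averages.
            "avg_taxable_income": income / taxpayers,
            "pct_declaring": 100.0 * taxpayers / population,
        }
    return covariates


covariates = aggregate_by_llma(municipalities)
```

Summing the totals first weights each municipality by its number of taxpayers, so a small municipality cannot dominate the LLMA average.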

The average taxable income is a measure of affluence of income earners living in an area (LLMA) and indirectly measures labour market performance as well; notably, however, there are no data available at the LLMA level. The percentage of persons declaring income is influenced by the demographic structure, activity rate and incidence of income earners below a (low) threshold who do not declare any income.

We wish to point out that the mean income estimated from the EU-SILC and the average taxable income obtained from the tax registers—although strongly correlated as specified below—are different measures. The mean income from the EU-SILC is based on all income sources (i.e., taxable, and not taxable) and government transfers; moreover, it is computed considering economies of scale within households. The average taxable income is computed instead by aggregating single persons’ taxable income without taking into account government transfers and household composition.

A preliminary analysis indicated that both auxiliary variables, the average taxable income and the percentage of persons declaring income, are highly correlated with the mean income at the LLMA level, with linear correlations of about 0.64 and 0.61, respectively. Nevertheless, when used as covariates in the area level model, the predictive power of the two auxiliary variables is different since they are correlated with each other (linear correlation 0.56). In the FH model, the average taxable income is the covariate that, taken alone, reduces the area level variance the most, compared with the percentage of persons declaring income. Therefore, our aim is to test how the reviewed small-area estimators perform in two different scenarios: (a) when both the auxiliary variables are used in the model; and (b) when we use only the percentage of persons declaring income. We call the first scenario ‘with average taxable income’ and the second scenario ‘without average taxable income’. Specifically, we aim to ascertain whether the spatial information boosts the efficiency in both scenarios, and whether it is capable of offsetting the lack of the highly predictive auxiliary variable when this is omitted from the model.

We first present the model parameters. In Table 1 we report the estimated model coefficients and variance components obtained under the scenario with average taxable income for the FH model, the spatial Fay-Herriot model (SFH), the nonstationary Fay-Herriot model (NSFH) and the semiparametric Fay-Herriot model (SPFH). In Table 2 we report the same results for the scenario without average taxable income. All the regression coefficients are statistically different from zero, and their direct comparison between the two scenarios is not interesting because the auxiliary variables are different. We focus instead on the variance components of the different models. In the scenario with average taxable income, ${\hat{σ}}_{u}^{2}$ is smaller than in the scenario without average taxable income for the FH, SFH and SPFH models, while it is 0.02 in both scenarios for the NSFH model. This implies that models that use average taxable income fit better than models that do not use it. In the case of the FH model, the EBLUP under the scenario with average taxable income is more efficient than the EBLUP under the scenario without it. We cannot say the same for the other predictors, because the spatial information comes into play. Looking at the SFH model, we observe that the spatial autocorrelation parameter of the conditional distribution of mean income direct estimates given the auxiliary variables ($ρ$) is estimated as 0.14 in the scenario with average taxable income (i.e., a very small value) and as 0.50 in the scenario without it (i.e., a quite high value). Removing the average taxable income from the model results in a remarkable increase of the spatial autocorrelation, which can improve the efficiency of the SEBLUP. A similar effect is observed with the robust version of the spatial EBLUP (results not shown here but they are available from the authors upon request). We observe a similar result also for the SPFH, where the variance component ${\hat{σ}}_{g}^{2}$ goes from 0 to 0.02. This fairly marginal increase, however, can partially offset the lack of the average taxable income. Looking at the NSFH model, the spatial clustering parameter $η$, which measures the spatial strength, goes from 0.01 (in other words, an absence of spatial clustering) to 3.82, which indicates the presence of spatial clustering. Nevertheless, it seems that the spatial clustering identified by the NSFH model does not affect the efficiency of its predictor.
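To see why a smaller area level variance makes the FH predictor more efficient, the following Python sketch evaluates the standard FH shrinkage factor and the leading term of the predictor's MSE for the two FH variance components reported in Tables 1 and 2. It is an illustration under assumptions: the sampling variance used here is a hypothetical value, the function names are ours, and the additional MSE terms due to estimating the regression coefficients and the variance component are ignored.

```python
def shrinkage_factor(sigma2_u, psi_d):
    """Weight given to the direct estimate in the FH predictor."""
    if sigma2_u < 0 or psi_d <= 0:
        raise ValueError("need sigma2_u >= 0 and psi_d > 0")
    return sigma2_u / (sigma2_u + psi_d)


def leading_mse_term(sigma2_u, psi_d):
    """Leading MSE term of the FH predictor: shrinkage factor times sampling variance."""
    return shrinkage_factor(sigma2_u, psi_d) * psi_d


psi_d = 4.0               # hypothetical sampling variance of one direct estimate
sigma2_u_with = 3.64      # FH model, scenario with average taxable income
sigma2_u_without = 5.65   # FH model, scenario without average taxable income

gamma_with = shrinkage_factor(sigma2_u_with, psi_d)
gamma_without = shrinkage_factor(sigma2_u_without, psi_d)
g1_with = leading_mse_term(sigma2_u_with, psi_d)
g1_without = leading_mse_term(sigma2_u_without, psi_d)
```

For any fixed sampling variance the leading term increases with the area level variance, so the scenario with the larger variance component yields a less efficient FH predictor, all else being equal.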

Table 1.

Estimated Coefficients and Variance Components of the Area Level Small Area Models. Scenario with Average Taxable Income.

Coefficients	FH	SFH	NSFH	SPFH
Intercept	−7.95	−6.30	−7.66	−7.94
Percentage of persons declaring income	16.83	13.94	16.47	16.81
Average taxable income	0.73	0.74	0.73	0.73
$σ_{u}^{2}$	3.64	1.96	0.02	3.60
$ρ$		0.14
$η$			0.01
$σ_{g}^{2}$				0.00

Table 2.

Estimated Coefficients and Variance Components of the Area Level Small Area Models. Scenario Without Average Taxable Income.

Coefficients	FH	SFH	NSFH	SPFH
Intercept	−6.28	−6.23	−4.64	11.24
Percentage of persons declaring income	34.26	34.40	31.78	7.29
$σ_{u}^{2}$	5.65	3.63	0.02	4.38
$ρ$		0.50
$η$			3.82
$σ_{g}^{2}$				0.02

The efficiency of the predictors based on the different small area models can be measured by the estimated coefficients of variation. Table 3 reports the distribution among the LLMAs of the coefficient of variation (CV), a measure of efficiency, of the predictors using the different small area models under the two scenarios. As expected, the mean CV of the FH predictors increases by about 13 per cent from the scenario with average taxable income to the scenario without, and similarly for the NSFH predictors, whose mean CV increases by 15.6 per cent. The estimated spatial clustering in the NSFH model ($η$) does not allow the NSFH predictor to outperform the FH one. The mean CV of the SPFH predictors increases only by 4.9 per cent between the two scenarios. The SFH predictors perform even better than the SPFH ones, with the same mean CV in the two scenarios. Therefore, the spatial information included in the SPFH and SFH models maintains about the same efficiency between the two scenarios, and the average taxable income is proven to have a good predictive power, as the FH predictors lose about 13 per cent of efficiency in mean (about 17 per cent in median). Looking at the distribution of the CVs, the SFH predictor performs the best in both scenarios.

Table 3.

Distribution of the Coefficient of Variation (per cent) of the Area Level Small Area Predictors.

	1st Q.	Median	Mean	3rd Q.
Direct	15.1	20.8	23.3	26.1
With average taxable income
FH	8.7	9.6	10.0	11.2
SFH	7.7	8.6	9.0	10.1
NSFH	8.9	9.6	9.6	10.3
SPFH	8.8	9.7	10.1	11.1
Without average taxable income
FH	10.1	11.2	11.3	12.3
SFH	7.3	8.4	9.0	10.0
NSFH	10.0	11.2	11.1	12.2
SPFH	9.3	10.2	10.6	11.9
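The relative changes discussed in the text can be recomputed from the entries of Table 3. The short Python sketch below does so for the median and the mean CV; the dictionary layout and names are ours, and small differences from the figures quoted in the text can arise because the table entries are rounded to one decimal.

```python
# (median, mean) CV in per cent, copied from Table 3.
cv_with = {"FH": (9.6, 10.0), "SFH": (8.6, 9.0), "NSFH": (9.6, 9.6), "SPFH": (9.7, 10.1)}
cv_without = {"FH": (11.2, 11.3), "SFH": (8.4, 9.0), "NSFH": (11.2, 11.1), "SPFH": (10.2, 10.6)}


def relative_increase(before, after):
    """Percentage change from 'before' to 'after'."""
    return 100.0 * (after - before) / before


cv_change = {}
for model in cv_with:
    median_with, mean_with = cv_with[model]
    median_without, mean_without = cv_without[model]
    cv_change[model] = {
        "median": relative_increase(median_with, median_without),
        "mean": relative_increase(mean_with, mean_without),
    }

for model, change in cv_change.items():
    print(f"{model}: median {change['median']:+.1f}%, mean {change['mean']:+.1f}%")
```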

In Figure 1 we map the predicted values of the mean income obtained under the different small area models considering the scenario with the average taxable income. Out-of-sample areas are predicted according to the predictors discussed in sections 2, 3, 4 and 5. Darker colours indicate higher mean income, and we use a common scale among the four small area predictors.

Figure 1.

Maps of the Estimated Mean Income for each LLMA in Italy in 2015. Scenario with Average Taxable Income.

As expected, the maps appear similar because in this scenario, with the average taxable income, the spatial autocorrelation in the conditional distribution (of the target given the covariates) is trivial and the SFH, NSFH and SPFH models tend to be similar (or equal) to the FH model.

Figure 2 presents the same maps as in Figure 1, but for the scenario without average taxable income. In this case, the spatial information is able to mimic the behaviour of the omitted powerful covariate and plays a role in differentiating the predictions under the four small area models to an extent. The differences are expected and justified by the different structure and specification of the models. The predictions for out-of-sample areas can benefit from the spatial locations of the LLMAs under the NSFH and SPFH models. Indeed, these two models allow the use of spatial data to build small area predictors that use spatial random effects, while the SFH predictors are reduced to the synthetic predictor as under the FH model. Compared to the FH predictions, the SFH, NSFH and SPFH predictions show a higher mean income in the LLMAs of the north-east. It is worth noting that the SPFH predictions are exceptionally smooth. This is expected, although in this case the smoothing seems to over-shrink the spatial distribution of the estimates, which show a high mean income in the LLMAs of the north (mostly north-east) that decreases moving down to the centre and further to the south. The degree of smoothness of the SPFH predictions is due to the choice of the P-spline and the number of knots; it can be adjusted and fine-tuned while working with them.

Figure 2.

Maps of the Estimated Mean Income for each LLMA in Italy in 2015. Scenario without Average Taxable Income.

In conclusion, the spatial correlation (or clustering) in the conditional distribution of the target, given the auxiliary variables, is negligible when we use the average taxable income, which has a high predictive power, as an auxiliary variable. Moreover, the performances of the different small area predictors are quite similar and better than that of the direct estimator. Therefore, in this case, spatial information does not boost the performance, in terms of efficiency, of the small area SFH, NSFH, and SPFH predictors. When we do not include the average taxable income in the models, the spatial correlation allows the small area models SFH and SPFH to offset the lack of the predictive average taxable income auxiliary variable and leads to credible estimates.

7. Concluding Remarks

In this article we first reviewed the most popular area level small area models for a continuous response variable that include spatial auxiliary information in the model specification. We then evaluated the effect of the spatial information on the small area estimates under two alternative scenarios depending on the predictive power of the models’ covariates. Specifically, the spatial pattern of the direct estimates among the areas was modelled in a parametric (Spatial FH, Non Stationary FH models) and in a semiparametric way (Semiparametric FH model), to better evaluate the use of the spatial information with or without highly predictive covariates.

The results show that, when spatial information on the areas is available (and this is often the case), the specification of a spatial area level model can be crucial, especially when area-level covariates with a high predictive power are not available. In these situations, the conditional distribution of the target variable given the covariates can be characterized by a high spatial correlation/clustering that a spatial SAE model can exploit to increase the efficiency with respect to a standard Fay-Herriot model. In our application, the spatial information included in the Semiparametric and Spatial Fay-Herriot models maintains about the same efficiency between the two compared scenarios, one where a highly predictive covariate is present and one where this covariate is excluded from the models.

The findings presented in this article can support all those researchers that need to produce local estimates using data with a spatial structure, although they are of course not exhaustive of all the complexities that can emerge when analysing this kind of data.

The models considered in this article include the SFH model (Singh et al.^[44]; Petrucci and Salvati^[28]; Pratesi and Salvati^{[35, 36]}), the SPFH model (Giusti et al.^[14]) and the spatially nonstationary Fay-Herriot model (Chandra et al.^[4]), as well as the Fay and Herriot^[11] model.

We decided to focus on these models since area level estimation has received growing attention in the past few years in many fields, such as in socio-economic analyses, where the availability of local estimates is often crucial for policy purposes. In these fields, the lack of survey and/or population microdata with detailed geographical information—due to privacy issues—often makes it impossible to resort to unit-level small area models. In such cases, area level models usually represent a more flexible alternative. However, especially when the interest is in estimating sensitive and/or complex phenomena at the local level, such as income and poverty estimates, the lack of area-level covariates with a high predictive power can impact the efficiency of SAE estimates.

When the interest is in analysing spatial data, a relevant issue is the Modifiable Areal Unit Problem (MAUP), which has been discussed in the spatial analysis literature since the 1930s (Unwin^[45]). The MAUP is a possible source of bias that can drastically affect the results of statistical analyses. Specifically, the bias can be present when spatial characteristics measured in specific points are aggregated at a larger geographical level (e.g., larger areas). In these cases, the summary values of interest, for example the target variable totals, rates or proportions, can be strongly affected by the chosen specification of the areas’ borders. This is what occurs when we define covariates at the area level aggregating individual values.
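A toy example may help to fix ideas on how the choice of borders changes area-level summaries. In the Python sketch below, six hypothetical point measurements lying along a line are aggregated under three alternative partitions; the data, the function name and the partitions are purely illustrative and are not taken from the application.

```python
def area_means(values, cut_points):
    """Mean of consecutive point values within the areas delimited by cut_points."""
    cuts = [0] + list(cut_points) + [len(values)]
    if any(b <= a for a, b in zip(cuts[:-1], cuts[1:])):
        raise ValueError("cut points must be strictly increasing and leave no empty area")
    return [sum(values[a:b]) / (b - a) for a, b in zip(cuts[:-1], cuts[1:])]


# Hypothetical point-level measurements ordered along one spatial dimension.
points = [1.0, 1.0, 1.0, 9.0, 9.0, 9.0]

two_areas_a = area_means(points, [3])     # border after the third point
two_areas_b = area_means(points, [2])     # same number of areas, border moved
three_areas = area_means(points, [2, 4])  # finer scale of aggregation
```

Moving the border while keeping two areas changes the second area mean from 9.0 to 7.0, and changing the number of areas produces an intermediate area with mean 5.0 that neither two-area partition shows, although the point data are identical in all three cases.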

Although MAUP is well known, few studies have been dedicated to the study of MAUP in the SAE context. Pratesi^[34] provides some results that can be used to evaluate the robustness of SAE unit-level methods to alternative aggregations of point-based measures within the small areas of interest. Specifically, Pratesi^[34] performed a simulation study to evaluate the so-called MAUP scale effect, the effect under which the results are different when changing the scale of aggregation of the data. Following this approach, further studies could be devoted to the evaluation of the MAUP in the context of spatial SAE models, as these could serve as a guide when different geographical small areas can be specified using the same data (e.g., aggregating microdata at different area levels).

Another relevant issue when applying area-level SAE models is the lack of population information to be used in building covariates (auxiliary variables) at the area level. Indeed, the Fay-Herriot model and all the spatial models we reviewed in this article assume that the auxiliary variables are measured without error; in other words, that they are based on censuses or population registers. Ybarra and Lohr^[49] proposed a modified version of the FH model where the auxiliary variables can be affected by measurement error, for example when they are estimates of population information affected by sampling variability. The Ybarra-Lohr model can therefore be useful when population information is unavailable but other data sources are available, including new ones (e.g., big data) that are not representative of the whole target population (Marchetti et al.^[23]). Future studies could be devoted to the specification of a spatial Ybarra-Lohr model, as the joint use of geographical information on the areas and of covariates affected by any kind of measurement error could be of help in many relevant applications.

Footnotes

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The authors received no financial support for the research, authorship and/or publication of this article.

ORCID iD

Stefano Marchetti

References

1. Anselin. Spatial econometrics. Methods and models. Kluwer Academic, Boston 1992.

2. Barbier, Hochard JP. Poverty and the spatial distribution of rural population. Working Paper No. 7101. https://openknowledge.worldbank.org/handle/10986/20616. License: CC BY 3.0 IGO. World Bank Group, Washington, DC 2014.

3. Brunsdon, Fotheringham, Charlton. Geographically weighted regression: a method for exploring spatial nonstationarity. Geogr Anal 1996; 28: 281–298.

4. Chandra, Salvati, Chambers. A spatially nonstationary Fay-Herriot model for small area estimation. J Surv Stat Meth 2015; 3: 109–135.

5. Chandra, Salvati, Chambers RR. Small area prediction of counts under a non-stationary spatial model. Spat Stat 2017; 20: 30–56.

6. Cressie. Small-area prediction of undercount using the General Linear Model. In: Proceedings of the Statistic Symposium 90: Measurement and Improvement of Data Quality. Ottawa: Statistics Canada 1991, pp. 93–105.

7. Cressie. Statistics for spatial data. Wiley 1993.

8. Curtis, Lee, O’Connell, Zhu. The spatial distribution of poverty and the long reach of the industrial makeup of places: New evidence on spatial and temporal regimes. Rural Soc 2019; 84: 28–65.

9. Datta, Rao JNK, Smith. On measuring the variability of small area estimators under a basic area level model. Biometrika 2005; 92: 183–196.

10. Ezcurra, Pascual, Rapún. The spatial distribution of income inequality in the European Union. Env Plan A 2007; 39: 869–890.

11. Fay, Herriot RA. Estimates of Income for Small Places: an Application of James-Stein Procedures to Census Data. J Amer Stat Assoc 1979; 74: 269–277.

12. French, Kammann, Wand. Comment on paper by Ke and Wang. J Amer Stat Assoc 2001; 96: 1285–1288.

13. Giusti, Marchetti, Pratesi, Salvati. Robust small area estimation and oversampling in the estimation of poverty indicators. Survey Research Methods 2012a; 6: 155–163.

14. Giusti, Marchetti, Pratesi, Salvati. Semiparametric Fay-Herriot model using penalized splines. J Indian Soc Agri Stat 2012b; 66: 1–14.

15. González-Manteiga, Lombardía, Molina, Morales, Santamaría. Estimation of the mean squared error of predictors of small area linear parameters under a logistic mixed model. Comput Stat Data Anal 2007; 51: 2720–2733.

16. Henderson CR. Best linear unbiased estimation and prediction under a selection model. Biometrics 1975; 31: 423–447.

17. Jiang, Lahiri. Mixed model prediction and small area estimation. Test 2006; 15: 1–96.

18. Lawson AB. Bayesian disease mapping: Hierarchical modeling in spatial epidemiology. Chapman & Hall/CRC, Boca Raton, FL 2009.

19. Leroux, Lei, Breslow. Estimation of disease rates in small areas: a new mixed model for spatial dependence. In: Statistical Models in Epidemiology, the Environment and Clinical Trials, 116, eds. M. E. Halloran and D. Berry. Springer, New York 2000, pp. 135–178.

20. Lumley. Analysis of Complex Survey Samples. J Stat Softw 2004; 9: 1–19.

21. Lumley. Survey: Analysis of complex survey samples. R package version 3.31-5, 2017.

22. MacNab YC. Hierarchical Bayesian spatial modelling of small-area rates of non-rare disease. Stat Med 2003; 22: 1761–1773.

23. Marchetti, Giusti, Pratesi, Salvati, Giannotti, Pedreschi, Rinzivillo, Pappalardo, Gabrielli. Small area model-based estimators using big data sources. J Off Stat 2015; 31: 263–281.

24. Marhuenda, Molina, Morales. Small area estimation with spatio-temporal Fay–Herriot models. Comput Stat Data Anal 2013; 58: 308–325.

25. Molina, Salvati, Pratesi. Bootstrap for estimating the MSE of the Spatial EBLUP. Comput Stat 2009; 24: 441–458.

26. Kammann, Wand MP. Geoadditive models. J Royal Stat Soc: Series C 2003; 52: 1–18.

27. Opsomer, Claeskens, Ranalli, Kauermann, Breidt FJ. Non-parametric small area estimation using penalized spline regression. J Royal Stat Soc: Series B 2008; 70: 265–286.

28. Petrucci, Salvati. Small area estimation for spatial correlation in watershed erosion assessment. J Agri, Bio Env Stat 2006; 11: 169–182.

29. Petrucci, Pratesi, Salvati. Geographic information in small area estimation: Small area models and spatially correlated random area effects. Stat Transit 2005; 7: 609–623.

30. Pfeffermann. New important developments in small area estimation. Stat Sci 2013; 28: 40–68.

31. Porter, Holan, Wikle, Cressie. Spatial Fay-Herriot models for small area estimation with functional covariates. Spat Stat 2014; 10: 27–42.

32. Porter, Wikle, Holan SH. Small area estimation via multivariate Fay–Herriot models with latent spatial dependence. Australian & New Zealand J Stat 2015; 57: 15–29.

33. Prasad NGN, Rao JNK. The estimation of mean squared error of small-area estimators. J Amer Stat Assoc 1990; 85: 163–171.

34. Pratesi. Spatial disaggregation and small-area estimation methods for agricultural surveys: solutions and perspectives. FAO Technical Report Series GO-07, 2015.

35. Pratesi, Salvati. Small area estimation: The EBLUP estimator based on spatially correlated random area effects. Stat Meth Appl 2008; 17: 113–141.

36. Pratesi, Salvati. Small area estimation in the presence of correlated random area effects. J Off Stat 2009; 25: 37–53.

37. Rao JNK, Sinha, Dumitrescu. Robust small area estimation under semi-parametric mixed models. Canadian J Stat 2014; 42: 126–141.

38. Rao JNK, Molina. Small area estimation, 2nd ed. Wiley, New York 2015.

39. Ruppert, Wand, Carroll. Semiparametric regression. Cambridge University Press, Cambridge, New York 2003.

40. Salvati, Chandra, Ranalli, Chambers. Small area estimation using a nonparametric model-based direct estimator. Comput Stat Data Anal 2010; 54: 2159–2171.

41. Salvati, Giusti, Pratesi. The use of spatial information for the estimation of poverty indicators at the small area level. In: Poverty and social exclusion. New methods of analysis, Eds G Betti and A Lemmi. Routledge 2014.

42. Sinha, Rao JNK. Robust small area estimation. Canadian J Stat 2009; 37: 381–399.

43. Schmid, Münnich RT. Spatial robust small area estimation. Stat Paper 2014; 55: 653–670.

44. Singh, Shukla, Kundu. Spatio-temporal models in small area estimation. Surv Meth 2005; 31: 183–195.

45. Unwin DJ. GIS, spatial analysis and spatial statistics. Prog Human Geo 1996; 20: 540–551.

46. Wang, Fuller. The mean squared error of small area predictors constructed with estimated area variances. J Amer Stat Assoc 2003; 98: 716–723.

47. Warnholz. Small area estimation using robust extensions to area level models. PhD thesis, Freie Universität Berlin 2016.

48. Wolter. Introduction to Variance Estimation. Springer-Verlag, New York 1985.

49. Ybarra LMR, Lohr SL. Small area estimation when auxiliary information is measured with error. Biometrika 2008; 95: 919–931.

50. You, Zhou. Hierarchical Bayes small area estimation under a spatial model with application to health survey data. Wiley Series in Surv Meth 2011; 37: 25–36.