Sparsity inducing priors are widely used in Bayesian regression analysis, and seek dimensionality reduction to avoid unnecessarily complex models. An alternative to sparsity induction are discrete mixtures, such as spike and slab priors. These ideas extend to selection of random effects, either or structured (e.g. spatially structured). In contrast to sparsity induction in mixed models with random effects, in this paper we apply sparsity priors to spatial regression for area units (lattice data), and to spatial random effects in conditional autoregressive priors. In particular, we consider the use of global-local shrinkage to distinguish areas with average predictor effects from areas where the predictor effect is amplified or diminished because the response-predictor pattern is distinct from that of most areas. The operation and utility of this approach is demonstrated using simulated data, and in a real application to diabetes related deaths in New York counties.
Sparsity inducing priors are widely used in Bayesian regression analysis, and seek dimensionality reduction to avoid unnecessarily complex models that may predict poorly. This involves estimating relatively sparse models where some of the regression coefficients are effectively zero, or have credible intervals including zero. For example, the horseshoe prior (Carvalho et al., 2009) for a collection of regression coefficients is an example of a global-local shrinkage prior, with a global hyperparameter that tends to shrink all parameters towards zero, while predictor specific local hyperparameters (following heavy-tailed half-Cauchy priors) allow some coefficients to avoid the shrinkage (Piironen & Vehtari, 2017). An alternative to sparsity induction are discrete mixtures, or spike and slab priors, which place a positive prior probability on zero values of coefficients. Here binary retention indicators specific to predictors determine whether such predictors take zero values (Kuo & Mallick, 1998), or have priors that confine predictors to effectively zero values (George & McCulloch, 1993). These retention indicators are in turn defined by retention probabilities , which may be preset or unknowns.
These ideas extend to selection of random effects. For example, Frühwirth-Schnatter and Wagner (2011) propose discrete mixture selection to unit-specific random effects with the aim of identifying units that are “average” in the sense of not deviating significantly from the overall mean. This application involves random effects. By contrast, van der Pas et al. (2014) consider sparsity inducing priors as a way of selecting distinct random effects from average effects, again for random effects. To illustrate, suppose are normal random unit intercepts in a normal linear mixed model for univariate longitudinal responses for units and times , with subject level model
Here is a scalar, is a row -vector of predictors, is a column -vector of regression coefficients, is a normal random effect (with mean zero), and the residual variance is . For the full sample of units, and at time , the model may be stated as
where is a column -vector, is a matrix, and is a column -vector.
One may aim to seek out significant predictor or random effects by shrinking the relevant regression coefficients or random effects towards zero. Applied to the random effects, the sparsity inducing prior (which shrinks less relevant effects towards zero) for the above model is
with global shrinkage parameter , and local shrinkage parameters with half Cauchy priors,
For the global parameter in Eq. (3), a half Cauchy prior may also be used (Carvalho et al., 2009).
The scope of this study
Dimensionality reduction is also relevant to structured random effects, namely effects dependent over space, time, or space-time. Spatial random effects are the focus in this paper, and in particular sparsity priors aimed at spatially varying regression coefficients in area (spatial lattice) data, as opposed to continuous spatial data (Saveliev et al., 2007). The model framework is therefore considerably distinct from that in Eqs (1) to (2).
Regarding related papers, Choi and Lawson (2016) consider a disease mapping framework with disease count responses for sets of areas. They propose a discrete mixture selection approach in models allowing spatially varying regression coefficients (SVC models). Thus predictor selection probabilities for areas and predictor determine whether local regression coefficients are zero (or effectively zero). The follow a spatially structured prior.
By contrast, a continuous shrinkage prior is proposed by Jhuang et al. (2019), for a spatial process framework (Gelfand et al., 2005), with continuous outcomes and continuous space (e.g. point data), and so using spatial priors relevant to that form of spatial data. The shrinkage prior uses the method of Carvalho et al. (2009).
In this paper, we consider spatially structured random effects, sparsity selection, and a discrete spatial framework for spatial lattice data (e.g. disease counts for administrative small areas). Sparsity inducing priors are proposed to distinguish areas with average predictor effects from areas where the predictor effect is amplified or diminished because the response-predictor pattern is distinct from that of most areas. The operation and utility of this approach is demonstrated using simulated data, and a real application to diabetes related deaths in New York counties.
Spatially varying coefficients in regressions for spatial lattice data
For concreteness, consider a spatially defined disease outcome (for lattice data). The data often consist of disease counts (for areas ), which are assumed Poisson distributed with means , where are pre-calculated Poisson offsets for expected event totals, and the represent unknown relative disease risks in areas . The expected event totals are obtained using demographic methods, with national disease rates applied to the populations of the areas; for example, see Vranckx et al. (2019); Richardson et al., 2004 (p. 1017). The interest is often in the impacts of ecologically defined area predictors on relative risks, and the coefficients are usually assumed constant over space. However spatially varying coefficients may be assumed, this being an acknowledgement of one form of spatial heterogeneity (e.g. Graif & Sampson, 2009).
Assume a Poisson model with means and area-specific coefficients for a spatially varying coefficient (SVC) model (Assunçao, 2003). The relative risk is then defined as
where is an intercept. The same framework may be applied to binomial illness outcomes , where is the population at risk, and a logit regression links disease probabilities to a spatially varying coefficient specification:
In actual estimation, varying coefficients such as (for area and predictor ) may be represented, and estimated, as a sum of the central fixed effect (say ) and zero mean spatial random effects.
The coefficients are assumed to be spatially correlated following a suitable prior, typically of conditional autoregressive or CAR form (De Oliveira, 2012). A particular form of this type of prior is proposed by Besag and co-authors (Besag et al., 1991) – see Eq. (10) below. One may also assume spatially varying intercepts in Eqs (5) or (6), for example
where is a zero mean spatial residual, typically a CAR random effect. However, models which allow for heterogeneity in regression impacts may account for most of the spatial correlation in residuals (Fotheringham et al., 2002, p. 117).
Sparsity prior for spatially varying coefficients
The SVC approach of Assuncao (2003) assumes spatially smooth coefficient effects (with a homogenous smoothing mechanism across all areas). However, in some situations, one may have subregions with distinct outcome-predictor patterns, e.g. where there is ethnic or social segregation, localized to some areas, may affect predictor effects. In such situations the usual approaches to varying coefficients may be subject to oversmoothing, not allowing for spatial discontinuity (Xu et al., 2017). Lee and Mitchell (2013) refer to step changes in the spatial pattern, although their model applies to spatial residuals. Consider a predictor which is a positive risk factor (one that enhances area disease risk). Then unusual outcome-predictor patterns might be highly elevated risk combined with a below average predictor value; highly elevated disease risk combined with an unusually high predictor value (implying a stronger than average regression effect); or low relative disease risk, but a relatively high predictor value. A suitable shrinkage mechanism would shrink the regression effect to the average effect (areas and predictors ) for typical areas, but for atypical areas one might seek local adaptivity to express stronger or weaker effects.
Here we propose a model linking local shrinkage parameters to spatially structured effects to allow local adaptivity in regression coefficients. Thus for a single predictor () and areas , we assume a regression coefficient prior
where is a global shrinkage parameter, the are local shrinkage parameters, and is the region-wide predictor effect. The local parameters are half Cauchy according to
where is a vector of spatially structured random effects. Here the spatial effects follow the prior of Besag et al. (1991, p. 8) with conditional normal form
where is a conditional variance, is the number of areas adjacent to area (termed the locality of area ), and is the average of the effects in this locality.
Adaptivity to unusual outcome-predictor patterns implies larger in some areas (and hence a more diffuse prior for . Lower apply to areas following the typical regression pattern, with the regression coefficient shrunk towards . So the parameters add extra information to the usual inferences made with disease data and observed predictors, providing a measure of outlyingness in the outcome-predictor space.
Equivalently, using the normal/gamma representation of a Cauchy variable
with . Note that the assumed parameterisation of the gamma density with random variable is
Under a Bayesian hierarchical approach (Gelfand, 2012), the likelihood is the first stage of a three stage specification. For the Poisson approach with a single predictor, area relative risks defined by
and means , the likelihood for area conditions on the data , the hyperparameters , and random effects as in Eqs (8)–(10), namely
For binomial data with disease probability
the relevant conditional likelihood is
For multiple predictors () and areas , one may similarly specify global-local shrinkage parameters specific to predictor , and area-predictor specific local hyperparameters, namely
with sets of CAR spatial effects determining the level of local adaptivity. As above, the spatial effects follow the CAR prior of Besag et al. (1991) with conditional normal prior
where is a conditional variance, and is the average of the effects in the locality of area .
Application to simulated data
We consider a simulation based on the spatial structure of the 62 counties of New York state and a Poisson likelihood structure as in Eqs (15) and (16). Expected disease counts for counties are simulated as gamma with shape 30 and rate 2, and values of a single predictor are simulated as N(0, 1). The simulation introduces contamination into the generation of spatially varying coefficients. Thus the regression term defining area relative risks is obtained as
where , and the are CAR random spatial effects as per Besag et al. (1991), with mean and precision . Additive effects are uniform U (0.75, 0.75) and apply only to a random subsample of counties. So for areas not in the subsample, the relative risk is specified as . Two contamination rates are considered to obtain the random subsamples: 20% and 33%. So in this subset of counties, the impact of the covariate is shifted away from the normal CAR terms by additive uniform effects. Disease counts are then simulated as .
We simulate 100 datasets and compare three model formulations with regard to (a) fit to each simulated dataset, measured by the leave one out information criterion (LOO-IC) of Vehtari et al. (2017); (b) recovery of the underlying parameters, specifically the mean of the simulated area regression parameters ; thus the true value is taken as recovered if the 95% credible interval of the estimated parameter contains the true value; and (c) predictive discrepancies, namely tail values (over 0.95 or under 0.05) for where are posterior predicted counts (Marshall & Spiegelhalter, 2003).
In a first analysis of the simulated data, we consider three forms of model addressing regression coefficient estimation. The first is a simple Poisson regression (model A), with a homogenous predictor effect:
with regression coefficients assigned N(0, 10) priors. This option is denoted as HPE. The second (model B) allows spatial variation in the predictor effect, as per the usual spatially varying coefficient model (Assuncao, 2003) with varying coefficients , but with no sparsity or shrinkage mechanism:
Here the follow the CAR prior of Besag et al. (1991) with precision parameter assigned a Ga (0.5, 0.0005) prior. This is denoted as the SVC option.
The third formulation (model C) uses a sparsity inducing prior (SIP for short), as per:
with the are zero-mean spatial effects following the CAR prior of Besag et al. (1991) (BYM for the authors) – see Eq. (10) and is the precision parameter.
In a second form of analysis of the simulated datasets, we allow, as in Eq. (7), for varying intercepts using random effects. In the above options (HPE, SVC, SIP) the residuals follow a scheme proposed by Besag et al. (1991, p. 7) involving both a spatially correlated CAR effect and an unstructured () random error, with the latter mainly serving to account for Poisson overdispersion, and the former representing spatially correlated but unobserved risk factors. This scheme is known as the convolution prior (Rodrigues & Assunção, 2012). Thus model A (HPE) becomes:
where the have a Besag et al. (1991) CAR prior, with precision parameter assigned a Ga (0.5, 0.0005) prior. The are iid random effects, with , and precision assigned a Ga (0.5, 0.0005) prior. The model B (SVC) option becomes
where both the and follow the Besag et al. CAR (1991) prior, with precision parameters assigned Ga (0.5, 0.0005) priors. The model C (SIP) option becomes:
where and are zero-mean spatial effects following the Besag et al. (1991) CAR prior, and the are random normal with mean 0.
Table 1 compares the performance of the three model formulations (HPE, SVC, SIP), with coefficient estimation using spatially varying coefficient models (upper panel) and varying coefficient plus spatial residuals and iid random effects (lower panel). Estimation uses rstan (Carpenter et al., 2017; Morris et al., 2019), with two chains and 2000 iterations in estimation of each model in each of the 100 simulations.
Performance evaluation on simulated data
Models with spatially varying coefficients only
Contamination rate
20%
33%
Model
% of simulations with mean recovered
% of simulations with mean recovered
HPE
54
50
SVC
92
93
SIP
99
98
Average number of observations (from 62) with tail predictive -values
Average number of observations (from 62) with tail predictive -values
HPE
16.0
11.9
SVC
3.3
3.4
SIP
1.3
1.0
Leave one out information criterion (LOO-IC)
Leave one out information criterion (LOO-IC)
HPE
429.6
434.6
SVC
333.1
336.7
SIP
328.7
331.2
Varying coefficients and intercepts
Contamination rate
20%
33%
Model
% of simulations with mean recovered
% of simulations with mean recovered
HPE
96
96
SVC
95
95
SIP
99
97
Average number of observations (from 62) with tail predictive -values
Average number of observations (from 62) with tail predictive -values
HPE
1.4
1.0
SVC
1.6
1.2
SIP
0.7
0.7
Leave one out information criterion (LOO-IC)
Leave one out information criterion (LOO-IC)
HPE
334.3
337.8
SVC
331.9
335.3
SIP
328.7
331.4
Map of counties in New York state.
It can be seen that the SIP option performs relatively well on all evaluation criteria across the board, with better parameter recovery, fewer poorly predicted disease counts, and better fit as measured by LOO-IC criteria. The performance of the SVC and HPE models is much improved when varying residuals are added. But for models for spatially varying effects alone, as under Eqs (23)–(5), without the assistance of varying residuals or iid effects, the SIP model clearly performs better. The SIP option retains the lower LOO-IC even when varying residuals are added to the models, but for the SIP option there is virtually no change in performance between the varying coefficients option in Eq. (5) and the model with varying residuals in Eq. (5). For example, the LOO-IC is very similar. Hence the sparsity inducing prior, as applied only to regression coefficients, is competitive with more heavily parameterised model variants.
Real data application: Diabetes and poverty in New York counties
A real data application involves binomial data on diabetes related deaths (International Classification of Diseases, ICD10 E10-E14) in populations of 62 New York counties for the period 2013–2017 (see Fig. 1) (CDC, 2020). The predictor is the 2015 county percentage poverty rate. Two models are compared to explain diabetes mortality rate variations between counties. The first is an SVC model with
where and follow zero-mean CAR priors as per Besag et al. (1991), and the are random normal . The second uses the sparsity inducing prior with
Both models show a generally positive association between poverty and diabetes mortality over the 62 counties. The New York state-wide parameter has posterior mean (95% CRI) of 3.24 (1.80, 4.90) under the SIP model, and 2.87 (1.16, 4.75) under the SVC model. However, there is considerable variation in county specific coefficients .
Table 2 shows the estimated coefficients under the SIP and SVC models in according to descending order of the estimated in the SIP model. The are highest for counties where the outcome-predictor pattern is unusual compared to the regression for typical counties (e.g. a high diabetes rate coupled with a relatively low poverty, given the overall positive relationship between the disease and risk factor). Diabetes death rates in Table 2 are the ratios times 100,000. The respective LOO-IC for the SVC and SIP models are 559 and 554.
Model estimates compared, diabetes mortality in New York counties
Distinct predictor-outcome profiles are associated with spatial discontinuites in the pattern of the , corresponding to areas with high . Unusual predictor-outcome profiles can be assessed by comparing county poverty and mortality to the state wide averages: the average poverty percentage across all counties is 14.5%, and the average mortality rate is 24.5 per 100,000. The highest is for Tompkins county, where poverty is considerably above average, but despite the excess poverty, diabetes mortality is below average. The regression coefficient for Tompkins under the SIP model is accordingly shrunk towards zero, with posterior mean for of 0.94, as compared to a posterior mean under the SVC model of 2.67. The second highest is for Monroe county with around average poverty, but below average diabetes mortality. The third and fourth highest (for Herkimer and Genesee counties) have below average poverty, but much above diabetes mortality. Accordingly posterior means for are adjusted upward under the SIP model to reflect the observed diabetes levels in these counties. The LOO-IC casewise values, the individual area contributions to the LOO-IC (Vehtari et al., 2017), for the five counties with highest are all lower under the SIP model.
By contrast, low values of apply to “typical” counties (in terms of the predictor-outcome relationship), and for these counties values tend to be shrunk towards the global value under the SIP prior. For example, the counties with the lowest five have posterior mean under the SIP prior between 3.17 and 3.38, and the standard deviation of these five means is 0.08. By contrast, the SVC model has mean varying from 2.73 to 3.03 for these counties, and their standard deviation is 0.11. Figure 2 shows how values are spatially clustered, while Figs 3 and 4 show the contrasting pattern in the posterior mean under the SVC and SIP models.
Conclusion
Impacts of risk factors on spatial variations in disease may be important for policy purposes. For example, assume that the overall relationship between a disease and certain risk factor, such as poverty, is positive: high levels of poverty are associated with higher disease levels. In some areas, however, there may be high disease risk but relatively low poverty. Such areas may be disadvantaged if interventions are based simply on poverty rates.
In terms of implications for spatial models and disease mapping, spatially varying coefficient models with a single global shrinkage parameter may be insufficiently adaptive to such unusual patterning in outcome-risk factor associations. By contrast, an adaptive global-local shrinkage mechanism, as part of a sparsity inducing prior, has the potential to identify distinctive predictor-outcome associations not apparent under the usual spatially varying coefficients methodology. This paper has developed a methodology for small areas, applicable using rstan, that implements an adaptive procedure for such situations.
Footnotes
Conflict of interest
The authors declare no competing interests.
References
1.
AssunçaoR. (2003). Space varying coefficient models for small area data. Environmetrics, 14(5), 453-473.
2.
BanerjeeS.CarlinB., & GelfandA. (2014). Hierarchical modeling and analysis for spatial data. Second Edition. CRC Press/Chapman & Hall.
3.
BesagJ., & YorkJ.A.Mollié (1991). Bayesian image restoration with two applications in spatial statistics. Ann Inst Statist Math, 43, 1-59
4.
CarpenterB.GelmanA.HoffmanM.LeeD.GoodrichB.BetancourtM.BrubakerM.GuoJ.LiP., & RiddellA. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1).
5.
CarvalhoC.PolsonN., & ScottJ. (2009). Handling sparsity via the horseshoe. Proceedings of Machine Learning Research, 5, 73-80.
6.
Centers for disease control and prevention (CDC) (2020). National Center for Health Statistics. Underlying Cause of Death Tables, Accessed 01-02-2019.
7.
DeOliveiraV. (2012). Bayesian analysis of conditional autoregressive models. Annals of the Institute of Statistical Mathematics, 64(1), 107-133.
8.
FotheringhamA.BrundsonC., & CharltonM. (2002). Geographically weighted regression: The analysis of spatially varying relationships. West Sussex, England: Wiley.
9.
Frühwirth-SchnatterS., & WagnerH. (2011). Bayesian variable selection for random intercept modeling of Gaussian and non-Gaussian data, in: Bayesian StatisticsBernardoJ.BayarriJ.BergerJ.DawidA.HeckermanD.SmithA. & WestM., eds, 9, 165-200.
10.
GelfandA. (2012). Hierarchical modeling for spatial data problems. Spatial Statistics, 1, 30-39.
11.
GelfandA.BanerjeeS., & GamermanD. (2005). Spatial process modelling for univariate and multivariate dynamic spatial data. Environmetrics, 16(5), 465-479.
12.
GeorgeE., & McCullochR. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88, 881-889.
13.
JhuangA.FuentesM.JonesJ.EstevesG.FancherC.FurmanM., & ReichB. (2019). Spatial signal detection using continuous shrinkage priors. Technometrics, 61(4), 494-506.
14.
GraifC., & SampsonR. (2009). Spatial heterogeneity in the effects of immigration and diversity on neighborhood homicide rates. Homicide Studies, 13(3), 242-260.
15.
KuoL., & MallickB. (1998). Variable selection for regression models. Sankhyā: The Indian Journal of Statistics, Series B, 65-81.
16.
LeeD., & MitchellR. (2013). Locally adaptive spatial smoothing using conditional auto-regressive models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 62(4), 593-608.
17.
LerouxB.LeiX.BreslowN. (2000). Estimation of disease rates in small areas: A new mixed model for spatial dependence. In M Halloran, D Berry (eds.), Statistical Models in Epidemiology, the Environment and Clinical Trials, 179-191.
18.
MarshallE., & SpiegelhalterD.J. (2003). Approximate cross-validatory predictive checks in disease mapping models. Statistics in Medicine, 22(10), 1649-1660.
19.
MorrisM., Wheeler-MartinK.SimpsonD.MooneyS.GelmanA., & DiMaggioC. (2019). Bayesian hierarchical spatial models: Implementing the Besag-York-Mollié model in stan. Spatial and Spatio-temporal Epidemiology, 31, 100301.
20.
PiironenJ., & VehtariA. (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2), 5018-5051.
21.
RodriguesE., & AssunçãoR. (2012). Bayesian spatial models with a mixture neighborhood structure. Journal of Multivariate Analysis, 109, 88-102.
22.
RichardsonS.ThomsonA.BestN., & ElliottP. (2004). Interpreting posterior relative risk estimates in disease-mapping studies. Environmental Health Perspectives, 112(9), 1016-1025.
23.
SavelievA.MukharamovaS., & ZuurA. (2007). Analysis and modelling of lattice data. In Analysing Ecological Data, Springer, New York, NY, 321-339.
24.
Van der PasS.KleijnB., & Van Der VaartA. (2014). The horseshoe estimator: Posterior concentration around nearly black vectors. Electronic Journal of Statistics, 8(2), 2585-2618.
25.
VehtariA.GelmanA., & GabryJ. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413-1432.
26.
VranckxM.NeyensT., & FaesC. (2019). Comparison of different software implementations for spatial disease mapping. Spatial and Spatio-temporal Epidemiology, 31, 100302.
27.
WatanabeS. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11, 3571-3594.