Interpolation of DHS survey data at subnational administrative level 2

Abstract

Over the last several years and within the framework of the Sustainable Development Goals, there has been a need to improve the measurement and understanding of local geographic patterns to support more decentralized decision-making and more efficient program implementation. This requires more disaggregated data that are not currently available in a nationally representative household survey.

This study explores the potential of model-based geostatistics methodology to model DHS survey indicators. We implement a stacked ensemble modeling approach that combines multiple model algorithmic methods to increase predictive validity relative to a single modeling approach. The approach captures potentially complex interactions and non-linear effects among the geospatial covariates. Three submodels are fitted to survey data for six DHS indicators using the geospatial covariates as explanatory predictors. The model prediction surfaces generated from the submodels are used as covariates in the final Bayesian geostatistical model, which is implemented through a stochastic partial differential equation approach in the integrated nested Laplace approximations.

The proposed approach can help to inform the allocation of resources and program implementation in areas that need more attention. Countries can use this approach to model other DHS survey indicators at much smaller spatial scales.

Keywords

DHS, MBG, geospatial, Bayesian geostatistical model, second subnational level, Admin 2, INLA

1. Background and objectives

1.1 Background

The Demographic and Health Surveys (DHS) Program has been a leader in collecting and providing cluster-randomized survey data on various development and health indicators. In addition to the standard open-source data files in which household and individual survey results can be tabulated by first-order subnational units (states/provinces or regions) and urban/rural strata, more surveys are now providing georeferenced data for individual clusters. The availability of Global Positioning System (GPS) coordinates for DHS, Malaria Indicator Survey (MIS), and AIDS Indicator Survey (AIS) clusters provides highly local scale information that can be linked with survey outputs for quantifying demographic and health status heterogeneities and inequities.

During the last several years and within the framework of the Sustainable Development Goals (SDGs), there has been an expressed need to improve the measurement and understanding of local geographic patterns in order to support more decentralized decision-making and more efficient program implementation [1]. This requires additional disaggregated data that are not currently available in a nationally representative household survey.

Analyses of the DHS survey indicators are conducted primarily at the national level, but also at the first subnational administrative level (Admin 1). Since estimates produced at the national level are more useful for making comparisons between nations and aggregating across large world regions, their natural audience includes international policymakers and donors [2]. However, the Admin 1 analysis does not provide comprehensive estimates at lower levels, such as the second subnational administrative level (Admin 2), where health programs are designed and implemented.

To better address the need for fine spatial and lower administrative level estimates, there are three possible options: (i) Scaling-up the nationally representative survey data collection process by increasing the sample size, survey costs, and survey time to create a representative sample at the desired administrative level. (ii) Using data derived from routine health management information systems (HMIS) from health facilities, communities, census, or other household surveys, such as data that can determine the vaccination coverage in a district. (iii) Creating spatially interpolated maps that use modeling techniques to predict values at non-surveyed locations.

Increasing DHS survey sample size to enable increased geographic disaggregation is both time consuming and expensive. Thus, the first option may not be feasible in an increasingly resource-constrained environment. With the second option, HMIS data quality is not always reliable, and the data are not easily accessed. The third option, which uses spatial modeling techniques that leverage existing survey data, spatial relationships between survey clusters, and relationships with geospatial covariates, has become increasingly popular in mapping key development indicators at high spatial resolution [3, 4].

The Bayesian spatial approach is increasingly recognized as an excellent geostatistical analysis method for addressing uncertainty in the model estimates and for being flexible and capable of handling missing data [5]. This approach has been widely used to predict and map various indicators such as those described in SAR 11 [6], poverty [7], and malaria [8, 9, 10, 11, 12, 13, 14, 15]. In these studies, environmental data layers that are thought to influence the indicators are used to explain some of the variations in prevalence across different areas. This can aid our understanding of the relationships between the indicator and the influence of climatic/environmental and socioeconomic factors [16].

Markov Chain Monte Carlo (MCMC) algorithms have been the most common method for making Bayesian statistical inferences with generalized linear geostatistical models (GLGM) [17]. MCMC is a general-purpose approach to model estimation, but it can be computationally expensive, especially with large datasets. There has been a recent increase in the application of the integrated nested Laplace approximation (INLA) methodology and software (http://www.rinla.org) in Bayesian spatial models [18]. The choice of this method over MCMC is based on the speed of calculation and the ease with which model comparison can be performed [18].

1.2 Objectives

In this study, we explore the potential of model-based geostatistics (MBG) methodology (described in the methods section) to model DHS survey indicators. More specifically, we use the stacking and ensemble model approaches to predict the indicators at a high-resolution gridded pixel level, and produce estimates at the Admin 2 level. The INLA methodology is used to create a model for predicting the indicators based on the different geospatial covariates and spatially correlated random effects, and to produce prediction maps. The report will develop R code structure and workflow for routine interpolation of survey data at the second subnational administrative level (Admin 2).

2. Methods

A modeling framework for generating standardized modeled surfaces using DHS survey data has been described in SAR 11 [6] and SAR 14 [19]. In this analysis, we employed a new geospatial modeling approach similar to that used in mapping of child growth failure [20], education attainment [21], vaccine coverage [22], HIV [23], exclusive breastfeeding [24], and childhood diarrheal diseases [25]. We adopted this method because it has been shown to improve the prediction accuracy based on the stacked generalization that allows for multiple, non-linear algorithmic mean functions to be embedded within a Gaussian process framework [26]. We detail this approach in the next sections.

2.1 DHS indicators

The DHS indicators included in the analysis were extracted from two national DHS surveys, the Kenya 2014 DHS and Ethiopia 2016 DHS. We considered data only from those surveys that had described the total number of individuals examined, the proportion of positive cases, and the coordinates of their geographical locations. From these surveys, we obtained 1,583 and 620 clusters for Kenya and Ethiopia, respectively. Table 1 describes the indicators we modeled.

Table 1
Description of DHS indicators

Indicator	Definition
Antenatal visits for pregnancy: 4+ visits	Percentage of women who had a live birth in the 5 years before the survey who had 4+ antenatal care visits
Stunting in children	Percentage of children under age 5 stunted (below $-2$ SD of height-for-age according to the WHO standard)
Wasting in children	Percentage of children under age 5 with a weight-for-height z-score (WHZ) more than 2 SD below the median of the WHO growth standards
Population living in household with an improved water source	Percentage of the de jure population living in households whose main source of drinking water is an improved source
Women age 15–49 with any anemia ${}^{**}$	Percentage of women classified as having any anemia ($<$ 12.0 g/dl for non-pregnant women and $<$ 11.0 g/dl for pregnant women)
Diphtheria-tetanus-pertussis (DPT3) received	Percentage of children age 12–23 months who had received a third DPT dose

${}^{**}$ Data not collected for the Kenya DHS 2014.

Table 2

Geospatial covariates used to develop the models

Covariates	Spatial resolution	Temporal resolution	Source
Travel time to nearest settlement $>$ 50,000 inhabitants	5 $\times$ 5 km	Static	https://map.ox.ac.uk/research-project/accessibility_to_cities/
Aridity	10 $\times$ 10 km	Annual	https://data.ceda.ac.uk/badc/cru/data/cru_ts/cru_ts_4.05/data
Diurnal temperature range	10 $\times$ 10 km	Annual	https://data.ceda.ac.uk/badc/cru/data/cru_ts/cru_ts_4.05/data
Precipitation	10 $\times$ 10 km	Annual	https://data.ceda.ac.uk/badc/cru/data/cru_ts/cru_ts_4.05/data
Potential evapotranspiration (PET)	10 $\times$ 10 km	Annual	https://data.ceda.ac.uk/badc/cru/data/cru_ts/cru_ts_4.05/data
Daily maximum temperature	10 $\times$ 10 km	Annual	https://data.ceda.ac.uk/badc/cru/data/cru_ts/cru_ts_4.05/data
Elevation	1 $\times$ 1 km	Static	http://webmap.ornl.gov
Enhanced vegetation index (EVI)	5 $\times$ 5 km	Annual	https://lpdaac.usgs.gov/products/vipphen_evi2v004/
Daytime land surface temperature (LST)	5 $\times$ 5 km	Annual	https://lpdaac.usgs.gov/products/myd11c3v006/
Diurnal difference in LST	5 $\times$ 5 km	Annual	https://lpdaac.usgs.gov/products/myd11c3v006/
Nighttime LST	5 $\times$ 5 km	Annual	https://lpdaac.usgs.gov/products/myd11c3v006/
Population distribution	1 $\times$ 1 km	Annual	https://www.worldpop.org/

2.2 Geospatial covariates

To model the DHS indicators, we assembled environmental and socioeconomic geospatial covariate data layers, which were obtained from publicly available remote sensing sources. These data included access (travel time to nearest settlement), aridity, diurnal temperature range, precipitation, potential evapotranspiration (PET), daily maximum temperature, elevation, enhanced vegetation index (EVI), daytime land surface temperature (LST), diurnal difference in LST, nighttime LST, and population categories (children under age 5, women age 15 to 49, and total population). Further description of each covariate can be obtained from the DHS Geospatial Covariate Report [27].

The geospatial covariates were selected for their potential to predict DHS indicators, and they have previously been shown to correlate with development indicators in different settings [6, 28]. Table 2 describes the spatial and temporal resolution of each geospatial covariate and its source.

2.2.1 Geospatial covariates processing

The geospatial covariate data layers used in this analysis were acquired from many data sources, and therefore have different spatial references, projections, extents, and dimensions. For example, gridded population data and elevation had a 1 $\times$ 1 km spatial resolution, EVI was at 5 $\times$ 5 km, and temperature range was at 10 $\times$ 10 km resolution. We used the ‘raster’ and ‘shapefiles’ packages in the R software [29] to (i) re-project to the same coordinate reference system (the standard-based World Geodetic System 1984), (ii) crop and mask to an extent encompassing the boundaries of the study area, and (iii) resample with bilinear interpolation to the same spatial resolution used in the modeling, 5 $\times$ 5 km.
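
The resampling step can be illustrated with a minimal sketch of bilinear interpolation on a small grid. This is not the implementation in the ‘raster’ package; the function name `bilinear`, the toy 2 $\times$ 2 grid, and the use of Python are assumptions made only for illustration.

```python
def bilinear(grid, row, col):
    """Interpolate a value at fractional (row, col) from a 2-D list of cell values."""
    n_rows, n_cols = len(grid), len(grid[0])
    # Clamp to the grid extent so points on the edge reuse the nearest cells.
    row = min(max(row, 0.0), n_rows - 1.0)
    col = min(max(col, 0.0), n_cols - 1.0)
    r0, c0 = int(row), int(col)
    r1, c1 = min(r0 + 1, n_rows - 1), min(c0 + 1, n_cols - 1)
    fr, fc = row - r0, col - c0
    # Interpolate along columns first, then along rows.
    top = grid[r0][c0] * (1 - fc) + grid[r0][c1] * fc
    bottom = grid[r1][c0] * (1 - fc) + grid[r1][c1] * fc
    return top * (1 - fr) + bottom * fr


# Hypothetical coarse covariate layer (for example, a 10 x 10 km grid).
coarse = [[10.0, 20.0],
          [30.0, 40.0]]

# Resample to a grid twice as fine by sampling at half-cell steps.
fine = [[bilinear(coarse, r / 2, c / 2) for c in range(3)] for r in range(3)]
```

In this toy case the centre of the finer grid, `fine[1][1]`, is the average of the four coarse cells. A real workflow would also have to handle the coordinate reference system and missing values, which the sketch ignores.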

Figure 1.

Geospatial modeling flowchart.

2.3 Geostatistical model

2.3.1 Overview of the modeling approach

Figure 1 depicts the geospatial modeling framework used to model the DHS indicators from the underlying covariates and to produce the gridded pixel-level and subnational estimates. The approach involved the following steps:

Step 1 – We summarized the individual-level DHS survey data to the finest spatial resolution (latitude and longitude) that represented the location of the survey cluster.

Step 2 – The processed geospatial covariates (from the previous section) and the cluster (point) level data were imported into the R environment for statistical computing [29]. We then applied the ‘raster’ package to extract the corresponding covariate pixel values at each survey cluster point.

Step 3 – The point level data (from Step 2) and their associated geospatial covariates were used in the stacked (submodels) generalization ensemble model. The prediction surfaces generated from the stacked ensemble models were then used as covariates in the final geospatial (MBG) model. The outputs of the final model are pixel-level mean estimates at the 5 $\times$ 5 km resolution.

Step 4 – We aggregated the prediction output from the final model (Step 3) to the Admin 2 level.
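
Step 1 can be sketched as a simple aggregation of individual records into cluster-level counts. The record layout, the cluster identifiers, and the function name `summarize_clusters` are hypothetical, and the sketch ignores survey weights; it is written in Python for illustration and is not the R code used in the analysis.

```python
from collections import defaultdict

# Hypothetical individual-level records:
# (cluster_id, latitude, longitude, outcome coded 0 or 1)
records = [
    ("c1", -1.28, 36.82, 1),
    ("c1", -1.28, 36.82, 0),
    ("c1", -1.28, 36.82, 1),
    ("c2", 0.51, 35.27, 0),
    ("c2", 0.51, 35.27, 0),
]


def summarize_clusters(rows):
    """Collapse individuals to one entry per cluster with counts Y (positive) and N (sampled)."""
    clusters = defaultdict(lambda: {"lat": None, "lon": None, "Y": 0, "N": 0})
    for cluster_id, lat, lon, outcome in rows:
        entry = clusters[cluster_id]
        entry["lat"], entry["lon"] = lat, lon
        entry["Y"] += outcome
        entry["N"] += 1
    return dict(clusters)


cluster_data = summarize_clusters(records)
```

Each entry then carries the cluster coordinates together with the two counts that a binomial model needs, the number of positive individuals and the number sampled.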

2.3.2 Covariate ensemble modeling using stacked generalization

Stacking (also called stacked generalization/regression) is an ensemble modeling approach that combines multiple model algorithmic methods to increase predictive validity relative to a single modeling approach. We employed this approach to capture the potential complex interactions and non-linear effects among the geospatial covariates. The ensemble approach has been shown to improve the predictive accuracy of the geostatistical models, as compared to prediction from any single method [26]. Numerous recent studies have implemented the stacking approach to derive continuous estimated surfaces of indicators of interest from DHS household surveys. These include mapping of HIV prevalence [23], vaccine coverage [22], exclusive breastfeeding [24], child growth failure [20], education attainment [21], and childhood diarrheal diseases [25].

In our analysis, we fitted three submodels to the survey data for each selected DHS indicator, using the geospatial covariates (described in Table 2) as explanatory predictors: (i) GAM: generalized additive model [30], (ii) LASSO: least absolute shrinkage and selection operator regression [31], and (iii) GBM: gradient-boosted trees [32]. The submodels were implemented in the R environment for statistical computing using the packages ‘caret’, ‘mgcv’, ‘xgboost’, and ‘glmnet’. We selected these model algorithms because they have demonstrated high predictive accuracy in previous studies [33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43].

To make better predictions and avoid overfitting, each submodel was fit using five-fold cross-validation, which generated the out-of-sample predictions that were included as explanatory geospatial covariates when fitting the geostatistical model. In addition, each submodel was fit to the full dataset, which produced the in-sample predictions that were then used as covariates when generating predictions from the full geostatistical model. A logit transformation of the predictions was used to place the out-of-sample and in-sample predictions on the same scale as the linear predictor in the geostatistical model. This process has been described in detail [23, 26].
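
The distinction between out-of-sample and in-sample submodel predictions can be sketched as follows. A one-covariate least-squares line stands in for a real submodel (GAM, LASSO, or GBM), folds are assigned deterministically instead of at random, and all data values and function names are hypothetical. The sketch is in Python for illustration only.

```python
import math


def logit(p, eps=1e-6):
    """Logit transform, clipping p so that values of 0 or 1 stay finite."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; a stand-in for a real submodel."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def out_of_fold_predictions(x, y, k=5):
    """Predict every point from a model fitted without that point's fold."""
    preds = [None] * len(x)
    for fold in range(k):
        train = [i for i in range(len(x)) if i % k != fold]
        held_out = [i for i in range(len(x)) if i % k == fold]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            preds[i] = a + b * x[i]
    return preds


# Toy covariate values and observed cluster prevalences (hypothetical).
x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
y = [0.12, 0.15, 0.22, 0.24, 0.31, 0.33, 0.41, 0.44, 0.52, 0.55]

oos = out_of_fold_predictions(x, y)        # covariates when fitting the final model
a_full, b_full = fit_line(x, y)
ins = [a_full + b_full * xi for xi in x]   # covariates when predicting from it

oos_logit = [logit(p) for p in oos]
ins_logit = [logit(p) for p in ins]
```

Because each out-of-sample prediction comes from a fit that never saw the corresponding observation, using these values as covariates limits the extent to which the final model rewards an overfitted submodel.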

2.3.3 Model specification and development

As described in the previous section, the ensemble modeling approach allows for non-linear relationships and interactions between the geospatial covariates to better predict the DHS indicators. Since the approach does not explicitly account for spatial patterns in the data, we used the Bayesian geostatistical modeling framework in our analysis to account for the spatial dependence.

For each indicator of interest, we modeled $Y_{i}$, the number of ‘positive’ individuals among those sampled at cluster location $s_{i}$, $i=1,\ldots,n$, using a binomial spatial regression with a logit link function [44, 45]. Letting $N_{i}$ be the total number of individuals sampled at cluster $s_{i}$, the model can be written as:

$\displaystyle Y_{i}\sim\textit{Binomial}(N_{i},p_{i})$

$\displaystyle\textit{logit}(p_{i})=\beta_{0}+\beta X_{i}+\omega_{i}+\varepsilon_{i}$

$\displaystyle\omega_{i}\sim GP(0,\Sigma)$

Where:

$\beta_{0}$ denotes the intercept,

$p_{i}$ is the probability, representing the underlying prevalence at cluster $s_{i}$ ,

$X_{i}=(X_{i1},X_{i2},\ldots,X_{im})$ is the vector of logit-transformed covariates for location $s_{i}$ obtained from the submodels (GAM, LASSO, and GBM), generated from the stacked ensemble covariate modeling,

$\beta=(\beta_{1},\beta_{2},\ldots,\beta_{m})$ is the vector of regression coefficients on the submodels, which represent their respective predictive weights and are constrained to sum to one [26],

$\omega_{i}$ is a correlated spatial error term, accounting for spatial autocorrelation between data points, and

$\varepsilon_{i}\sim N(0,\sigma_{\textit{nug}}^{2})$ is an independent error term known as the nugget effect.

The spatial error term $\omega_{i}$ is modeled as a Gaussian process with zero mean and a spatially structured covariance matrix $\Sigma$.
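
To make the role of each term concrete, the following sketch evaluates the linear predictor for a single cluster and maps it back to a prevalence. All numerical values, including the submodel weights, are hypothetical, and the function names are not taken from any package; Python is used only for illustration.

```python
import math


def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def cluster_prevalence(beta0, beta, x_logit, omega, epsilon):
    """p_i from logit(p_i) = beta0 + sum_k beta_k * X_ik + omega_i + epsilon_i."""
    if abs(sum(beta) - 1.0) > 1e-9:
        raise ValueError("submodel weights are expected to sum to one")
    eta = beta0 + sum(b * x for b, x in zip(beta, x_logit)) + omega + epsilon
    return inv_logit(eta)


# Hypothetical values for one cluster.
p_i = cluster_prevalence(
    beta0=0.0,
    beta=[0.5, 0.3, 0.2],         # illustrative weights on GAM, LASSO, GBM
    x_logit=[-1.0, -1.2, -0.8],   # logit-scale submodel predictions
    omega=0.1,                    # spatially correlated effect at this location
    epsilon=0.0,                  # nugget effect
)
```

With these inputs the linear predictor is $-0.92$, which corresponds to a prevalence of roughly 0.285. The sum-to-one constraint means the covariate part of the predictor is a weighted average of the submodel predictions on the logit scale.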

The spatial covariance $\Sigma$ was modeled using a stationary and isotropic Matérn function [44], given by:

$\displaystyle\Sigma(s_{i},s_{j})=\frac{\sigma^{2}}{\Gamma(\lambda)2^{\lambda-1}}\left(\kappa\,d(s_{i},s_{j})\right)^{\lambda}K_{\lambda}\left(\kappa\,d(s_{i},s_{j})\right)$

where $d(s_{i},s_{j})$ is the distance between the two locations and $\sigma^{2}$ is the spatial process variance. The term $K_{\lambda}$ denotes the modified Bessel function of the second kind and order $\lambda$, where $\lambda$ controls the degree of smoothness. The scaling parameter $\kappa$ is related to the range $r$, the distance at which the spatial correlation becomes almost null (approximately 0.1); the range is defined in the equation below. See Lindgren et al. [46] for a detailed description.

$\displaystyle r=\frac{\sqrt{8\lambda}}{\kappa}$
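
The meaning of the range can be checked numerically in the special case $\lambda=1/2$, where the Matérn function reduces to the exponential covariance $\sigma^{2}\exp(-\kappa d)$ and no Bessel function is needed. This value of $\lambda$ is chosen here only for convenience and is not necessarily the smoothness used in the analysis; the value of $\kappa$ is hypothetical, and Python is used only for illustration.

```python
import math


def exponential_matern(d, sigma2, kappa):
    """Matérn covariance for lambda = 1/2, which equals sigma2 * exp(-kappa * d)."""
    return sigma2 * math.exp(-kappa * d)


def matern_range(lam, kappa):
    """Range r = sqrt(8 * lambda) / kappa."""
    return math.sqrt(8.0 * lam) / kappa


kappa = 0.02                                        # hypothetical, in 1/km
r = matern_range(0.5, kappa)                        # 100 km for these values
corr_at_range = exponential_matern(r, 1.0, kappa)   # exp(-2), about 0.135
```

For this smoothness, the correlation at the range is $\exp(-2)\approx 0.135$, so two clusters separated by the range distance are only weakly correlated.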

The Bayesian geostatistical model was implemented through a stochastic partial differential equations (SPDE) approach with the INLA algorithm, as applied in the R-INLA package [18]. This algorithm provides an effective estimation and spatial prediction strategy for spatial data by specifying a spatial data process as well as a spatial covariance function that depends on the locations and time points at which the indicator and covariate data are collected [18]. The INLA approach provides accurate and fast results compared to MCMC algorithms, which are known to have convergence problems and to involve dense covariance matrices that increase the computational time. With MCMC, spatial and spatiotemporal estimation for large datasets could therefore require several days of computing time [18, 47, 48].

Figure 2.

Predicted surfaces for stunting in Kenya generated from the three submodels (GAM, GBM, and LASSO).

Figure 3.

Predicted surfaces for stunting in Ethiopia generated from the three submodels (GAM, GBM, and LASSO).

Figure 4.

Pixel level prediction of prevalence of the indicators modeled by using the Kenya DHS 2014: (a) Stunting, (b) Wasting, (c) Vaccine DPT3, (d) ANC visits, and (e) Water sources.

Figure 5.

Pixel level prediction of prevalence of the indicators modeled by using the Ethiopia DHS 2016: (a) Stunting, (b) Wasting, (c) Vaccine DPT3, (d) ANC visits, (e) Water sources, and (f) Women’s anemia.

Figure 6.

Second subnational administrative level estimates for Kenya DHS 2014: (a) Stunting, (b) Wasting, (c) Vaccine DPT3, (d) ANC visits, and (e) Water sources.

2.3.4 Pixel level model estimates

The prediction surfaces generated from the ensemble submodels were used as input covariates in the geostatistical models implemented in INLA. The final estimates for each indicator were generated by taking $k=1,\ldots,1000$ samples from the posterior predictive distribution. Pixel-level estimates covering each modeled country (Kenya and Ethiopia) were produced at a spatial resolution of 5 $\times$ 5 km.

2.3.5 Model estimates at admin level 2

In addition to the 5 $\times$ 5 km pixel-level estimates, we overlaid the predicted prevalence surfaces with the relevant population layer (children under age 5, women age 15 to 49, and total population) for each indicator we modeled. We then constructed estimates of each indicator at the second subnational administrative level by calculating population-weighted averages of prevalence for all grid cells within a given administrative boundary. The procedure was performed for each of the 1,000 posterior predictive samples, with final point estimates derived from the mean of these draws.
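
The aggregation can be sketched as a population-weighted mean computed separately for every posterior draw. The pixel values, population counts, district labels, and the function name `admin2_estimates` are hypothetical; the sketch is in Python for illustration and assumes every unit has a non-zero population.

```python
def admin2_estimates(draws, population, admin_ids):
    """Population-weighted mean prevalence per admin unit, computed draw by draw.

    draws: list of posterior draws, each a list of pixel prevalences
    population: population count for each pixel
    admin_ids: admin unit label for each pixel
    Returns {admin_id: (mean over draws, list of per-draw values)}.
    """
    units = sorted(set(admin_ids))
    per_draw = {u: [] for u in units}
    for draw in draws:
        for u in units:
            idx = [i for i, a in enumerate(admin_ids) if a == u]
            total_pop = sum(population[i] for i in idx)
            weighted = sum(draw[i] * population[i] for i in idx)
            per_draw[u].append(weighted / total_pop)
    return {u: (sum(v) / len(v), v) for u, v in per_draw.items()}


# Toy inputs: 2 draws over 4 pixels in 2 hypothetical districts.
draws = [[0.10, 0.30, 0.50, 0.70],
         [0.20, 0.40, 0.40, 0.60]]
population = [100, 300, 50, 50]
admin_ids = ["A", "A", "B", "B"]

estimates = admin2_estimates(draws, population, admin_ids)
```

In this toy case district A has per-draw values of 0.25 and 0.35 and a point estimate of 0.30. Keeping the per-draw values means that uncertainty intervals for each unit can be taken from the same draws.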

2.3.6 Model validation

For each of the indicator model outputs, we implemented a validation procedure and calculated a set of performance statistics. This involved using an out-of-sample cross-validation with a five-fold hold-out procedure and a comparison of the predicted values at the locations of the hold-out data with their observed values. This procedure was repeated five times without replacement so that every data point was omitted one time across the five validation runs. Standard validation statistics were then computed as measures of the predictive accuracy of the modeled estimates. These included the mean error (ME, or bias); the mean absolute error (MAE); the root-mean-squared error (RMSE, which summarizes the total variance); and the 50%, 80%, and 95% coverage of our predictive intervals, aggregated to the spatial holdout level. Each predictive metric was calculated by first simulating predictive draws using a binomial distribution. The predictive metric of interest was then calculated as a sample-size-weighted mean over the second administrative levels [22]. To complement the out-of-sample predictive validity metrics, we also calculated in-sample predictive validity metrics that used the same process but matched each data point to predictions from a model fitted with all data.
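
The validation statistics themselves are simple to compute once observed and predicted values have been paired. The sketch below is a simplified, unweighted version: it omits the binomial predictive draws and the sample-size weighting described above, and it defines the error as predicted minus observed, which is an assumption. The input values are hypothetical and Python is used only for illustration.

```python
import math


def validation_metrics(observed, predicted, lower, upper):
    """ME (bias), MAE, RMSE, and interval coverage for paired values."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    me = sum(errors) / n
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # Coverage: share of observations that fall inside their predictive interval.
    covered = sum(1 for o, lo, hi in zip(observed, lower, upper) if lo <= o <= hi)
    return {"ME": me, "MAE": mae, "RMSE": rmse, "coverage": covered / n}


# Hypothetical hold-out values for four areas.
observed = [0.20, 0.40, 0.60, 0.80]
predicted = [0.25, 0.35, 0.65, 0.70]
lower = [0.10, 0.30, 0.62, 0.60]
upper = [0.30, 0.50, 0.70, 0.90]

metrics = validation_metrics(observed, predicted, lower, upper)
```

A well-calibrated 95% interval should give a coverage close to 0.95. Values well above the nominal level indicate intervals that are wider than necessary, and values well below it indicate overconfident predictions.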

3. Results

3.1 Stacking results

Here we present results obtained from the individual submodels, which were fitted using the environmental and socioeconomic predictor variables. The submodel predictions of the DHS indicators varied spatially across the different areas of each country. Figures 2 and 3 show the areas of high and low predicted prevalence of stunting generated from the three submodels for the Kenya DHS 2014 and Ethiopia DHS 2016 surveys.

Table 3
Predictive metrics for each indicator aggregated at admin 2 (Kenya)

Indicator	Sample	ME	MAE	RMSE	50% cov	80% cov	95% cov	Correlation
Stunting	In-sample	0.0019	0.0150	0.0183	0.7012	0.9272	0.9938	0.9742
	Out_of_sample	0.0025	0.0207	0.0254	0.6470	0.8819	0.9816	0.9501
Wasting	In-sample	$-$ 0.0003	0.0065	0.0083	0.8651	0.9680	0.9950	0.9849
	Out_of_sample	0.0002	0.0083	0.0104	0.8389	0.9460	0.9826	0.9769
Vaccine DPT3	In-sample	0.0026	0.0217	0.0260	0.8930	0.9792	0.9975	0.9624
	Out_of_sample	$-$ 0.0016	0.0276	0.0341	0.8538	0.9461	0.9911	0.9288
ANC Visit	In-sample	$-$ 0.0001	0.0149	0.0199	0.7129	0.9229	0.9863	0.9844
	Out_of_sample	$-$ 0.0004	0.0209	0.0288	0.6575	0.8838	0.9701	0.9643
Water_Sources	In-sample	$-$ 0.0099	0.0321	0.0399	0.7566	0.9541	0.9947	0.9766
	Out_of_sample	$-$ 0.0148	0.0453	0.0549	0.5953	0.8731	0.9644	0.9582

Figure 7.

Second subnational administrative level estimates for Ethiopia DHS 2016: (a) Stunting, (b) Wasting, (c) Vaccine DPT3, (d) ANC visits, (e) Water sources, and (f) Women’s anemia.

3.2 Prediction maps

Prediction prevalence maps for each indicator were created using the full geospatial Bayesian model. Figures 4 and 5 show the pixel level prediction surface maps for Kenya and Ethiopia, respectively. Areas with high and low estimated prevalence of each indicator can be seen clearly across all maps.

3.3 Admin level 2 estimates

Figures 6 and 7 show the second subnational administrative level estimates that highlight areas with high and low prevalence of each indicator we modeled.

Table 4
Predictive metrics for each indicator aggregated at admin 2 (Ethiopia)

Indicator	Sample	ME	MAE	RMSE	50% cov	80% cov	95% cov	Correlation
Stunting	In-sample	0.0010	0.0257	0.0346	0.7608	0.9513	0.9939	0.9505
	Out_of_sample	0.0024	0.0403	0.0513	0.6262	0.8787	0.9777	0.8808
Wasting	In-sample	0.0018	0.0247	0.0327	0.7268	0.9221	0.9863	0.8010
	Out_of_sample	0.0009	0.0242	0.0325	0.7242	0.9221	0.9854	0.7981
Vaccine DPT3	In-sample	$-$ 0.0015	0.0628	0.0898	0.9229	0.9963	1.0000	0.9545
	Out_of_sample	$-$ 0.0076	0.0706	0.1023	0.8935	0.9786	1.0000	0.9364
ANC Visit	In-sample	$-$ 0.0054	0.0299	0.0441	0.8339	0.9705	0.9978	0.9823
	Out_of_sample	$-$ 0.0078	0.0493	0.0764	0.6513	0.9002	0.9884	0.9358
Water_Sources	In-sample	$-$ 0.0217	0.0510	0.0734	0.7663	0.9607	0.9955	0.9505
	Out_of_sample	$-$ 0.0318	0.0767	0.1156	0.6518	0.8773	0.9724	0.8535
Women_anemia	In-sample	0.0002	0.0271	0.0381	0.6792	0.9155	0.9944	0.9569
	Out_of_sample	0.0016	0.0385	0.0564	0.4965	0.7694	0.9095	0.9087

Figure 8.

Comparison of predictions for each indicator, aggregated to the second subnational administrative level with 95% uncertainty intervals, plotted against data observations from the same area aggregated to the second subnational administrative level for Kenya.

Figure 9.

3.4 Model validation metrics

Model validation was performed by calculating bias (mean error); mean absolute error (MAE); variance (RMSE); 50%, 80%, and 95% data coverage within prediction intervals; and the correlation between observed data and predictions.

Results from the validation indicated good performance for each indicator, with higher correlations generally accompanying lower MAE and RMSE values (Tables 3 and 4). The coverage values for some indicators were too high. For DPT3 vaccination, for example, the 50% prediction intervals covered between 85% and 92% of the observations. This was most likely a result of the high uncertainty that arises from the small samples at the cluster locations.

3.4.1 Comparison of model estimates versus DHS estimates

Figures 8 and 9 show the comparison estimates for each indicator produced by the models in our analysis and the equivalent estimates from the observed DHS survey data. The results indicate a high correlation between MBG and DHS estimates for most indicators.

4. Discussion and conclusion

In recent years, there has been a need for district (Admin 2) estimates that are currently not available from a DHS survey. In an increasingly resource-constrained environment, high-resolution maps of key health and development indicators, derived from cluster point data through spatial interpolation methods, offer an attractive solution.

In this analysis, we developed a methodological framework for estimating DHS indicators at the Admin 2 level. We took advantage of advances in geospatial technologies, the availability of free and open-source spatial data, and geospatial software tools relevant for spatial modeling. This framework used an ensemble modeling approach that combines multiple model algorithmic methods to increase predictive validity relative to a single modeling approach. We employed this approach to capture potentially complex interactions and non-linear effects among the geospatial covariates. We fitted three submodels (GAM, LASSO, and GBM) to the survey data for each selected DHS indicator, using the geospatial covariates as explanatory predictors. The submodels were selected for three reasons: they are available in standalone packages that require minimal data preparation once the predictor variables have been produced; they are particularly useful in cases where presence-only data are available [35, 49, 50]; and they have been used successfully in previous analyses that implemented the stacking approach to derive continuous estimated surfaces of indicators of interest from DHS household surveys. These analyses include mapping of malaria [26], HIV prevalence [23], vaccine coverage [22], exclusive breastfeeding [24], child growth failure [20], education attainment [21], and childhood diarrheal diseases [25].

We found variability in the individual model prediction outputs. For example, the LASSO model predicted lower stunting in the northeast areas of Kenya (Fig. 2) than the other algorithms did. This could be explained by the prediction uncertainties in the individual submodels [35, 49]. Using different models and combining them in an ensemble model could reduce these uncertainties [51, 52, 53]. Our findings demonstrated that the predictions from the stacking ensemble model approach were more accurate than those from the individual model algorithms. The results suggest that an ensemble model approach is preferable to predictions from any single modeling method. These findings are consistent with other studies [26, 52], which found that the ensemble model approach performed best. By developing the methodological framework within the R statistical computing environment, we have created a tool that can be used to model other health indicators.

The results from the Admin 2 level, generated from the full geostatistical model, show that the estimated prevalence of each indicator varied across the different areas in the country.

Although we have estimated the prevalence for each indicator at the pixel-level and the Admin 2 level, our study has some limitations. Our analysis used a suite of standard geospatial covariates that included those that are not directly related to the indicators we modeled. To improve predictions, further studies should restrict the model input data to those covariates associated with the indicators of interest. Due to computational limitations, we did not quantify uncertainty in the covariates and submodel estimates. Thus, further analysis should develop methods that are capable of propagating uncertainty in both the covariates and submodel estimates [23, 54].

We generated maps showing estimates of high-risk areas for each indicator we modeled. Our approach in this analysis can help inform the allocation of resources and program implementation in areas that need more attention. MBG estimates such as those described in our analysis allow interventions and programs to be directed at much smaller spatial scales, which could enable better programmatic decisions.

Acknowledgments

The authors would like to thank the external reviewers for the careful review and thoughtful comments.

This work was supported by the United States Agency for International Development (USAID) through The DHS Program (#720-OAA-18C-00083). Views expressed are those of the authors and do not necessarily reflect the views of the USAID or the United States government.

