Improving survey based estimates of malnutrition using small area estimation

Abstract

Surveys are typically designed to produce reliable estimates of target variables at national and regional levels. For unplanned domains such as zones, sample sizes are small and direct survey estimates are unreliable, although reliable estimates are needed for many purposes. The purpose of this study is to improve the direct survey estimates of the $z$ scores of malnutrition for unplanned zones by borrowing auxiliary variables from the census. We applied small area estimation under the Fay-Herriot (FH) model, linking the Ethiopian Demographic and Health Survey (DHS) with the census data. The diagnostic measures indicate that the FH model assumptions are satisfied. The empirical best linear unbiased predictors (EBLUPs) of the $z$ scores of malnutrition are more reliable, efficient and precise than the direct survey estimates for zones with small sample sizes. Direct survey estimates of malnutrition were therefore substantially improved by the EBLUPs in all zones. Zones are important domains for planning and monitoring in the country, so zonal-level estimates of the $z$ scores of malnutrition for children under five can support resource allocation, policymaking and planning.

Keywords

Fay-Herriot model, survey, census, coefficient of variation, precision, malnutrition

1. Introduction

Small area estimation (SAE) is designed to improve sample survey estimates for domains with small sample sizes by borrowing auxiliary data from related sources [1]. In general, an area (domain) is regarded as “small” if the domain-specific sample size is not large enough to support direct estimates of adequate precision [1]. Sample surveys are a cost-effective means of obtaining information for large areas or domains, and their data are used to provide reliable and efficient direct estimates at the regional and national levels [1]. Direct estimates are design-based estimates computed from the survey sample of the domain itself. For small domains, however, direct estimates have large sampling variability because the sample sizes are too small (or even zero).

Indirect, model-based estimates are an alternative that is now widely used in small area estimation. Indirect estimators borrow auxiliary variables from the census for the small areas and exploit the relationship between the target variables and the auxiliary variables [1, 2]. To produce small area estimates with an adequate level of precision, it is often necessary to use auxiliary variables to increase the effective sample size and thus the precision [1]. Model-based estimators rest on explicit models that link related small areas through the census data [1]. SAE is now commonly used in planning health, social, and other services and in allocating government funds [1]. The basic area-level small area model, known as the Fay-Herriot (FH) model [1, 3], relates direct small area estimates to area-specific (aggregated) auxiliary variables.

The performance of SAE is measured by the coefficient of variation (CV %) and the mean square prediction error (MSPE). The CV (%) expresses the sampling variability as a percentage of the estimate. Small area estimates with large CVs (%) are considered unreliable [4]. The most common parameter estimation methods under the FH model are the method of moments used by [3, 5], and maximum likelihood (ML) and restricted maximum likelihood (REML) used by [6].
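As a small illustration of how a CV (%) can be obtained, the sketch below assumes the common definition: the square root of the estimated variance (or MSPE) divided by the estimate, multiplied by 100. The function name `cv_percent` and the input values are hypothetical and are not taken from the study.

```python
import math

def cv_percent(estimate, mse):
    """Coefficient of variation in percent: 100 * sqrt(MSE) / |estimate|."""
    if estimate == 0:
        raise ValueError("CV is undefined for a zero estimate")
    return 100.0 * math.sqrt(mse) / abs(estimate)

# Hypothetical zone: point estimate 2.0 with an estimated MSPE of 0.04.
print(round(cv_percent(2.0, 0.04), 2))
```

The same function applies to a direct estimate (with its sampling variance) and to an EBLUP (with its MSPE), which is what makes the two CVs comparable.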

The study variables were the $z$ scores of malnutrition (stunting, wasting and underweight) for children under five from the 2016 Ethiopian Demographic and Health Survey (EDHS). The standardized measures of stunting (height-for-age), wasting (weight-for-height), and underweight (weight-for-age) were calculated according to the child growth standards released by the World Health Organization (WHO) [7].

According to the report in [8], an estimated 144 million children under five were stunted and 47 million were wasted globally. Most of the world’s stunted, underweight and wasted children under five lived in Asia and Africa [8]. In Ethiopia, 38%, 10% and 24% of children under five were stunted, wasted and underweight, respectively [9].

Many researchers have studied malnutrition at the regional level in the country [10, 11, 12, 13, 14, 15, 16, 17]. These studies used survey data only for planned domains at the national and regional levels. However, malnutrition also needs to be estimated for unplanned domains, namely the zones (the third administrative layer) of Ethiopia. At the zonal level, the standard direct estimation methods cannot be used because the sample sizes are small.

In the Ethiopian health system, decentralization is the most influential administrative determinant of the health service [18]. The federal ministry of health decentralized the health service in parallel with the government structures (regions, zones and districts). These administrative hierarchies are the key institutions involved in health care delivery in the country [18, 19]. Among these, the zonal governments are the bridge between the regional and district governments. The zonal health department is responsible for monitoring and evaluating health activities in the districts [18]. Estimating malnutrition at the zonal level is therefore valuable for the zonal governments and for all other governmental structures. The main objective of this study was to improve the direct survey estimates of the $z$ scores of malnutrition for unplanned domains (zones) with small sample sizes by using auxiliary variables from the census data.

The subsequent sections of this paper are organized as follows: Section 2 describes the data sources and sampling design and presents the methodology of the Fay-Herriot model, Section 3 contains the results and discussions, and finally, Section 5 presents conclusions.

2. Methods and materials

2.1 Sampling design

The 2016 EDHS used a sampling frame designed for the Ethiopian population and housing census which was carried out in 2007 by the Ethiopian central statistical agency (CSA). The 2016 sample survey was designed to provide reliable estimates of key indicators at the national and regional levels. Similarly, the sample survey was designed to provide estimates for urban and rural areas [9].

The 2016 EDHS sample was selected with a stratified two-stage design. The nine regions and two administrative cities were stratified into urban and rural areas, which produced 21 sampling strata. Within each stratum, samples of enumeration areas (EAs) were selected independently in two stages. Implicit stratification and proportional allocation were achieved at each of the lower administrative levels by sorting the sampling frame within each sampling stratum before sample selection, according to administrative units at different levels, and by using probability proportional to size selection at the first stage of sampling [9].

In the first stage, a total of 645 EAs were selected, independently within each stratum, with probability proportional to EA size. Of these, 202 EAs were urban and 443 were rural. In the second stage, 28 households per cluster were selected from the newly created household lists by equal probability systematic sampling. Height and weight measurements were collected from children aged 0–59 months, women aged 15–49 years, and men aged 15–59 years in all the selected households [9].

2.1.1 Data source and study variables

Ethiopia has nine regions and two administrative cities, which in turn are divided into zones and special zones. Of these, 95 zones, special zones, and special districts were studied as domains in this research. In the 2016 EDHS, 87 of these zones were sampled and 8 were not.

The target variables were the $z$ scores of malnutrition (stunting, wasting and underweight) of children under five, as standardized by the WHO [7]. These variables were obtained from the 2016 EDHS data [9]. The auxiliary variables were zonal-level aggregated variables (proportions) obtained from the 2007 census. The set of auxiliary variables was restricted to variables that were in the census and also available in the survey [1, 20]. It is important that these variables are defined and measured consistently in both data sources (survey and census) [20].

The gap between the 2007 population census and the 2016 EDHS data was wide because the census scheduled for 2017 was not conducted on time, owing to the country’s political instability and the COVID-19 pandemic. To manage this gap, we used the 2016 census projection figures for the urban and rural residence and sex auxiliary variables in all zones [21].

The auxiliary variables were taken from the census data in two groups: variables related to children under five and variables related to parents. For children under five, the variables were sex (male and female) [12] and age (below one year, 1–2 years and 4–5 years) [13]. For parents, the variables were sex (male and female), place of residence (urban and rural), age (15–24, 25–34, 35–44 and 45–49) [12], source of drinking water (improved and unimproved) [13], educational level (no education, primary, and secondary and above) [15, 22], literacy (literate and illiterate) [12, 23], marital status (married, never married and others), type of toilet facility (has a toilet facility and has no toilet facility) [11], number of sons who died (none, one, and two or more), number of daughters who died (none, one, and two or more), number of family members in the household (fewer than five, and five or more) [11, 12], disability (disabled and not disabled) [13, 16], and employment status (government employed, privately employed, self-employed, employer, unemployed and other employment).

2.2 Fay Herriot small area estimation

Because of the lack of sample data within small areas, models are needed to link all areas through some common parameters so as to borrow strength from census data and then to improve efficiency and reliability of estimation. The FH model was first introduced by [3]. The FH model is often used to obtain efficient estimators of the area means when the sample sizes within areas are small.

In this study, the domains are the zones of Ethiopia, and the basic area-level model is used to estimate the $z$ scores of malnutrition at the zonal level. Area (zonal) level models relate direct estimators of the study variable for each small area to the corresponding area-specific auxiliary variables. Let the large, finite population $U$ be partitioned into $m$ mutually exclusive and exhaustive domains or small areas (zones) $U_{1},\ldots,U_{m}$ with population sizes $N_{1},\ldots,N_{m}$. Let $y_{ij}$ be the $z$ score of malnutrition for individual $j$ within zone $i$, and let $N_{i}$ and $n_{i}$ be the population and sample sizes in zone $i$, respectively ($i=1,2,\ldots,m$), where $m$ is the number of small areas (zones) in the population.

The aim is to estimate the small area means at the zone level, $\bar{y}_{i}=\sum_{j=1}^{N_{i}}\frac{y_{ij}}{N_{i}}$ for $i=1,\ldots,m$. EDHS surveys are usually planned at the national or regional level; hence, whenever more detailed information is required, the sample size may not be large enough to support the release of direct survey estimates, and some smaller areas may have very small sample sizes [24, 25, 26, 27]. Additional information from other sources must therefore be exploited.

The Fay-Herriot model [3] is a basic area level model widely used in small area estimation to improve the direct survey estimates. In area level models, the area-specific auxiliary information comes in the aggregated values of some explanatory variables at the domains.

Let $y_{1},y_{2},\ldots,y_{m}$ be the direct survey estimates of the $z$ scores of malnutrition for the $m$ small areas (zones), and let $x_{1},x_{2},\ldots,x_{m}$ be the vectors of auxiliary variables associated with these small areas. The Fay-Herriot model may be written as the following two-level model:

Level 1 (Sampling model) $y_{i}\mid\theta_{i}\overset{ind}{\sim}N(\theta_{i},D_{i})$ and

Level 2 (Linking model) $\theta_{i}\mid A\overset{ind}{\sim}N(x_{i}^{T}\bm{\beta},A)$, where $D_{i}$ is the known variance of the sampling error, $\bm{\beta}=(\beta_{1},\ldots,\beta_{p})^{T}$ is a vector of unknown regression coefficients to be estimated, and $A$ is the unknown variance of the area-specific random effect to be estimated. Level 1 accounts for the sampling variability of the survey estimates $y_{i}$ of $\theta_{i}$, whereas Level 2 links $\theta_{i}$ to the vector of $p$ known area-specific auxiliary variables [28]. Two types of parameters appear in the Level 1 and Level 2 models. The area means $\theta_{i}$ are high-dimensional parameters, whereas $\bm{\beta}$ and $A$ are low-dimensional and are usually referred to as hyperparameters. In small area estimation, $\theta_{i}$ is the main target of inference, which requires estimating the unknown hyperparameters. The model defined in [3] can therefore be written as

$\displaystyle y_{i}=x_{i}^{T}\bm{\beta}+\nu_{i}+\epsilon_{i}$ (1)

In this model, $\nu_{i}$ is the random small area effect (zone effect) and $\epsilon_{i}$ is the sampling error associated with $y_{i}$. Here $y_{i}$ denotes the direct survey estimator of the $i^{\rm th}$ small area mean $\theta_{i}$, and $x_{i}$ is the corresponding $p\times 1$ vector of area-level auxiliary data. It is assumed that ($\nu_{1},\ldots,\nu_{m}$) and ($\epsilon_{1},\ldots,\epsilon_{m}$) are mutually independent random variables with $\nu_{i}\sim N(0,A=\sigma_{v}^{2})$ and $\epsilon_{i}\sim N(0,D_{i}=\sigma_{ei}^{2})$ $(i=1,\ldots,m)$, where the model variance $A$ is unknown and the sampling variances $D_{i}$ are assumed to be known.
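To make the two error components of model (1) concrete, the following minimal sketch draws direct estimates for a handful of areas. The function name `simulate_fh`, the covariates and the parameter values are hypothetical and serve only to illustrate the structure of the model.

```python
import random

def simulate_fh(X, beta, A, D, seed=0):
    """Draw y_i = x_i' beta + v_i + e_i with v_i ~ N(0, A) and e_i ~ N(0, D_i)."""
    rng = random.Random(seed)
    y = []
    for x_i, d_i in zip(X, D):
        # Linking model: area mean = regression part + random area effect.
        theta_i = sum(a * b for a, b in zip(x_i, beta)) + rng.gauss(0.0, A ** 0.5)
        # Sampling model: direct estimate = area mean + sampling error.
        y.append(theta_i + rng.gauss(0.0, d_i ** 0.5))
    return y

# Hypothetical design: intercept plus one proportion-type covariate.
X = [[1.0, 0.2], [1.0, 0.5], [1.0, 0.8]]
y = simulate_fh(X, beta=[1.0, 2.0], A=0.03, D=[0.05, 0.20, 0.01])
```

Areas with a large $D_{i}$ produce noisier direct estimates, which is the situation in which the model-based estimator is expected to help most.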

The popularity of the Fay-Herriot model (1) for small area estimation stems from the fact that it produces reliable small area statistics by building linking models for the direct estimators through auxiliary data and thereby borrowing strength from other domains. By modeling the direct estimators, the Fay-Herriot model uses the design weights to produce design-consistent small area estimators. To develop a model-based estimator of $\theta_{i}$, we need to estimate the unknown model variance $A$ [29]. SAE under the FH model is designed to estimate the $i^{\rm th}$ small area mean $\theta_{i}$, given by the model

$\displaystyle\theta_{i}=x_{i}^{T}\bm{\beta}+\nu_{i},\quad i=1,\ldots,m$ (2)

Note that this is a linking model: it links the target quantities $\theta_{i}$ of all the areas through the common regression parameter $\bm{\beta}$. Combining Eqs (1) and (2), the model can be written in matrix notation as

$\displaystyle\bm{y}=\bm{x}\bm{\beta}+\bm{\nu}+\bm{\epsilon}$ (3)

where $\bm{x}=(x_{1},\ldots,x_{m})^{T}$, $\bm{\nu}=(\nu_{1},\nu_{2},\ldots,\nu_{m})^{T}$, $\bm{\epsilon}=(\epsilon_{1},\ldots,\epsilon_{m})^{T}$, and $\bm{y}=(y_{1},\ldots,y_{m})^{T}$.

The variance-covariance matrix of $\bm{y}$ is $\bm{\Sigma}=\bm{D}+A\bm{I}_{m}$, where $\bm{D}=\textit{diag}(D_{1},D_{2},\ldots,D_{m})$ and $\bm{I}_{m}$ is the $m\times m$ identity matrix. We assume that $\textit{rank}(\bm{x})=p$, where $p$ is the number of regression parameters; $\bm{\Sigma}$ depends implicitly on the model variance component $A$ [29, 30, 31].

2.2.1 Empirical best linear unbiased predictor

The best linear unbiased predictor (BLUP) of random effects was proposed by Henderson in 1950 to find “maximum likelihood estimates” of the random effects [32]. The expression of the BLUP involves the model variance component $A$ (the random effect variance), which is typically unknown in practice. It is customary to replace $A$ with a consistent estimator $\hat{A}$. The resulting predictor is called the empirical BLUP (EBLUP).

EBLUP methods are extensively discussed in the small area estimation literature [20, 28, 29, 30, 31, 33]. In a survey application, the values $y_{i}$ are the direct survey estimators of the target small area means $\theta_{i}$ in the sampled areas, but they may be unacceptably variable in some or all of the small areas because of small sample sizes. The $D_{i}$ represent the sampling variances of the $y_{i}$. When the unknown hyperparameters $A$ and $\bm{\beta}$ are replaced by their estimates $\hat{A}$ and $\hat{\bm{\beta}}$, the best linear unbiased predictor (BLUP) of $\theta_{i}$ under the FH model (1) becomes the EBLUP, given by

$\displaystyle\hat{\theta}_{i}(\bm{y};\hat{A})=B_{i}(\hat{A})y_{i}+(1-B_{i}(\hat{A}))x_{i}^{T}\hat{\bm{\beta}}(\hat{A})$ (4)

where $B_{i}(\hat{A})=\frac{\hat{A}}{\hat{A}+D_{i}}$ $(i=1,\ldots,m)$ is a weight with $0<B_{i}<1$ and $\hat{\bm{\beta}}(\hat{A})=(\bm{x}^{T}\bm{\Sigma}^{-1}(\hat{A})\bm{x})^{-1}\bm{x}^{T}\bm{\Sigma}^{-1}(\hat{A})\bm{y}$ is the best linear unbiased estimator of $\bm{\beta}$ evaluated at $\hat{A}$. The EBLUP $\hat{\theta}_{i}(\bm{y};\hat{A})$ is thus a weighted average of the direct area-specific estimate $y_{i}$ and the regression synthetic estimate $x_{i}^{T}\hat{\bm{\beta}}(\hat{A})$, which uses the data from all areas. The weight $B_{i}$ depends on the relative sizes of the sampling variance $D_{i}$ and the estimated model variance $\hat{A}$: the smaller $D_{i}$ is relative to $\hat{A}$, the more weight is placed on the direct estimate.
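The shrinkage in Eq. (4) can be sketched in a few lines. The example assumes that $\hat{A}$ and $\hat{\bm{\beta}}$ have already been estimated; the function name `eblup` and all numerical inputs are hypothetical.

```python
def eblup(y, D, X, beta_hat, a_hat):
    """Return (B_i, EBLUP_i) pairs following Eq. (4)."""
    results = []
    for y_i, d_i, x_i in zip(y, D, X):
        b_i = a_hat / (a_hat + d_i)                              # weight B_i
        synthetic = sum(x * b for x, b in zip(x_i, beta_hat))    # x_i' beta_hat
        results.append((b_i, b_i * y_i + (1.0 - b_i) * synthetic))
    return results

# Hypothetical inputs: two areas and an intercept-only model.
pairs = eblup(y=[1.0, 3.0], D=[0.1, 0.9], X=[[1.0], [1.0]],
              beta_hat=[2.0], a_hat=0.1)
for b_i, theta_i in pairs:
    print(round(b_i, 3), round(theta_i, 3))
```

In this toy input the second area has the larger sampling variance, so its weight on the direct estimate is smaller and its EBLUP is pulled more strongly toward the synthetic value.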

For the non-sampled zones (zones with zero sample size), we used the synthetic regression estimates $\hat{\theta}_{l}=\bm{x}_{l}^{T}\hat{\bm{\beta}}$, $l=1,2,\ldots,M$, where $\bm{x}_{l}$ is the vector of zonal-level auxiliary variables for non-sampled zone $l$ and $M$ is the number of non-sampled areas (zones) [1, 20, 34].

The malnutrition $z$ scores for the non-sampled zones were therefore estimated from the census covariates alone, using the regression coefficients estimated from the sampled zones (synthetic regression estimates).

2.2.2 Estimations of the model variance components

Estimation of the variance component in the Fay-Herriot model has been studied by several authors [1, 24, 26, 31, 35]. The common model-fitting methods that deliver consistent estimators of the model variance component $A$ are the method of moments used by [3, 5] and the ML and REML methods used by [6] and further studied by [6, 29]. Note that the EBLUP depends on how the model variance component is estimated. When the estimate $\hat{A}$ takes a negative value, [5, 31] suggested truncating it at zero. They also showed that the probability of a negative estimate goes to zero as $m$ becomes large [35]. As an alternative, the ML method has been widely used in small area estimation [1, 28].

A simple method-of-moments estimator of the model variance component, proposed by [5], is given by

$\displaystyle\hat{A}^{\textit{PR}}=\frac{1}{m-p}\Big(l^{T}l-\sum_{i=1}^{m}D_{i}(1-h_{ii})\Big)$ (5)

where $h_{ii}=x_{i}^{T}(\bm{x}^{T}\bm{x})^{-1}x_{i}$, $l=\bm{y}-\bm{x}\hat{\bm{\beta}}$ is the vector of ordinary least squares residuals with $\hat{\bm{\beta}}=(\bm{x}^{T}\bm{x})^{-1}\bm{x}^{T}\bm{y}$, $m$ is the number of small areas (zones), and $p$ is the dimension of the vector of auxiliary variables $x_{i}$ [6, 20, 31].
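The moment estimator of Eq. (5), together with the truncation at zero mentioned above, can be sketched in plain Python as follows. The helper names (`dot`, `solve`, `prasad_rao`) and the toy data are hypothetical; the small Gauss-Jordan solver is included only to keep the example self-contained.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [u - f * v for u, v in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def prasad_rao(X, y, D):
    """Moment estimator of Eq. (5), truncated at zero when negative."""
    m, p = len(X), len(X[0])
    xtx = [[sum(x[j] * x[k] for x in X) for k in range(p)] for j in range(p)]
    xty = [sum(x[j] * y_i for x, y_i in zip(X, y)) for j in range(p)]
    beta = solve(xtx, xty)                                   # OLS coefficients
    rss = sum((y_i - dot(x, beta)) ** 2 for x, y_i in zip(X, y))  # l'l
    h = [dot(x, solve(xtx, x)) for x in X]                   # leverages h_ii
    a_hat = (rss - sum(d * (1.0 - h_i) for d, h_i in zip(D, h))) / (m - p)
    return max(a_hat, 0.0)

# Hypothetical data: four areas, intercept-only model, equal sampling variances.
X = [[1.0], [1.0], [1.0], [1.0]]
a_pr = prasad_rao(X, y=[1.0, 2.0, 3.0, 4.0], D=[0.1, 0.1, 0.1, 0.1])
```

When the direct estimates vary no more than the sampling variances alone would explain, the bracketed term becomes negative and the function returns zero.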

The other moment estimator, based on the weighted least squares residual sum of squares, is the Fay-Herriot estimator ($\hat{A}^{\textit{FH}}$) [3]. It is obtained by solving the following equation iteratively.

$\displaystyle 0=\frac{1}{m-p}\sum_{i=1}^{m}\frac{(y_{i}-x_{i}^{T}\hat{\bm{\beta}})^{2}}{D_{i}+\hat{A}^{\textit{FH}}}-1$ (6)

where $\hat{\bm{\beta}}=(\bm{x}^{T}\bm{\Sigma}^{-1}\bm{x})^{-1}\bm{x}^{T}\bm{\Sigma}^{-1}\bm{y}$ is the best linear unbiased estimator of $\bm{\beta}$.
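One way to solve Eq. (6) numerically is sketched below. Bisection is used here purely for simplicity; it is an assumption of this example and not necessarily the iteration used in [3] or in this study. The helper names and the toy data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [u - f * v for u, v in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fay_herriot_variance(X, y, D, tol=1e-10):
    """Root of Eq. (6) in A, found by bisection and truncated at zero."""
    m, p = len(X), len(X[0])

    def excess(a):
        # Right-hand side of Eq. (6) with beta re-estimated by GLS at this A.
        w = [1.0 / (d + a) for d in D]
        xtx = [[sum(w_i * x[j] * x[k] for x, w_i in zip(X, w))
                for k in range(p)] for j in range(p)]
        xty = [sum(w_i * x[j] * y_i for x, y_i, w_i in zip(X, y, w))
               for j in range(p)]
        beta = solve(xtx, xty)
        wrss = sum(w_i * (y_i - dot(x, beta)) ** 2
                   for x, y_i, w_i in zip(X, y, w))
        return wrss / (m - p) - 1.0

    if excess(0.0) <= 0.0:        # no positive root: truncate at zero
        return 0.0
    lo, hi = 0.0, 1.0
    while excess(hi) > 0.0:       # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical data: four areas, intercept-only model, equal sampling variances.
X = [[1.0], [1.0], [1.0], [1.0]]
a_fh = fay_herriot_variance(X, y=[1.0, 2.0, 3.0, 4.0], D=[0.1, 0.1, 0.1, 0.1])
```

With equal sampling variances and an intercept-only model, as in this toy input, the weighted residual sum of squares is a simple function of $A$ and the root can be checked by hand.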

Models are typically compared using goodness-of-fit measures such as the log-likelihood, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). The AIC of a model $M$ is defined as $\textit{AIC}(M)=-2\log(l(M))+2(p+1)$, where $l(M)$ is the likelihood function of model $M$. The model with the lowest AIC is selected. The degrees of freedom DF are often used in place of the parameter count $p$: DF coincides with $p$ for simple models, but it is not always straightforward to determine for more complex models such as the FH model [36]. The BIC is obtained from the same likelihood as $\textit{BIC}(M)=-2\log(l(M))+(p+1)\log(d)$, where $l(M)$ is the likelihood of the FH model and $d$, which scales the penalty term, is the number of observations (here, the number of sampled areas).
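Both criteria follow directly from the value of $-2\log(l(M))$. In the sketch below the function names are illustrative, `k` stands for the number of estimated parameters ($p+1$ in the formulas above) and `n` for the quantity written $d$ in the BIC formula, taken here to be the number of sampled zones. The values `k = 9` and `n = 87` are assumptions chosen because they are consistent with the underweight ML column of Table 1; the text does not state them.

```python
import math

def aic(neg2loglik, k):
    """AIC = -2 log-likelihood + 2 * (number of estimated parameters)."""
    return neg2loglik + 2 * k

def bic(neg2loglik, k, n):
    """BIC = -2 log-likelihood + (number of parameters) * log(sample size)."""
    return neg2loglik + k * math.log(n)

# -2 log-likelihood of the ML fit for underweight, as reported in Table 1.
neg2ll = -106.48
print(round(aic(neg2ll, k=9), 2), round(bic(neg2ll, k=9, n=87), 2))
```

Because the BIC penalty grows with the logarithm of the sample size, it penalizes additional covariates more heavily than the AIC whenever more than about seven areas are available.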

2.2.3 Mean square error under Fay Herriot model

One of the main reasons for relying on small area methods is to reduce the variability of the area predictions. The mean squared error (MSE) of the EBLUP under the FH model is called the mean square prediction error (MSPE). Estimating the MSPE is one of the challenging problems in model-based small area estimation [4, 27], because it requires measuring the variability associated with the EBLUP, and it is of significant practical interest. The MSPE of the EBLUP is defined as $\textit{MSPE}[\hat{\theta}_{i}(\bm{y};\hat{A})]=E[\hat{\theta}_{i}(\bm{y};\hat{A})-\theta_{i}]^{2}$ and is the most commonly used measure of uncertainty of $\hat{\theta}_{i}$, where $E$ denotes expectation and $\hat{\theta}_{i}(\bm{y};\hat{A})$ is the EBLUP of $\theta_{i}$ [29, 31]. Thus,

$\displaystyle\textit{MSPE}[\hat{\theta}_{i}(\bm{y};\hat{A})]=g_{1i}(\hat{A})+g_{2i}(\hat{A})+g_{3i}(\hat{A})+O(m^{-1})$ (7)

The second-order unbiased estimator of the MSPE of the EBLUP is available when $\hat{A}$ is estimated by the REML, ML, PR and FH methods [6, 20, 29, 35]. The mathematical details of Eq. (7) are presented in [6, 29, 31, 35].

The MSPE of the non-sampled zones was obtained from the synthetic regression through $\widehat{\textit{MSE}}(\hat{\theta}_{l})=x_{l}^{T}\Big[\sum_{i=1}^{m}\frac{x_{i}x_{i}^{T}}{D_{i}+\hat{A}}\Big]^{-1}x_{l}+\hat{A}$, where $i=1,2,\ldots,m$ indexes the sampled zones and $l=1,2,\ldots,M$ indexes the non-sampled zones [1, 20]. One advantage of small area estimation is that it can produce estimates for areas (zones) with zero sample size by borrowing information from related sources.
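For a non-sampled zone, the synthetic estimate and the MSE formula above can be evaluated as in the following sketch. It assumes that $\hat{\bm{\beta}}$ and $\hat{A}$ have already been estimated from the sampled zones; the helper names and the toy inputs are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [u - f * v for u, v in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def synthetic_estimate(X, D, beta_hat, a_hat, x_new):
    """Synthetic estimate x_l' beta_hat and its MSE for a non-sampled zone."""
    p = len(beta_hat)
    # Sum over sampled zones of x_i x_i' / (D_i + A_hat).
    info = [[sum(x[j] * x[k] / (d + a_hat) for x, d in zip(X, D))
             for k in range(p)] for j in range(p)]
    theta = dot(x_new, beta_hat)
    mse = dot(x_new, solve(info, x_new)) + a_hat
    return theta, mse

# Hypothetical inputs: four sampled zones, intercept-only model.
X = [[1.0], [1.0], [1.0], [1.0]]
theta_l, mse_l = synthetic_estimate(X, D=[0.1, 0.1, 0.1, 0.1],
                                    beta_hat=[2.5], a_hat=0.1, x_new=[1.0])
```

The second term, $\hat{A}$, does not shrink as more zones are sampled, so the MSE of a synthetic estimate is bounded below by the model variance.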

2.2.4 Generalized variance function

In most articles [25, 31, 37], the sampling variance $D_{i}$ is assumed to be known from the survey. Although $D_{i}$ can be calculated from survey data, in practice it might not be available. Thus, Fay and Herriot [3] employed the generalized variance function (GVF) to estimate the sampling variance $D_{i}$ [38]. According to [38], the GVF is useful for smoothing estimates of the sampling variance $D_{i}$. The GVF method fits the model:

$\displaystyle\log(\textit{Var}(y_{i}))=\beta_{0}+\beta_{1}y_{i}+\epsilon_{i}$ (8)

where $\log(\textit{Var}(y_{i}))$ is the dependent variable, $y_{i}$ is the independent variable, $\beta_{0}$ and $\beta_{1}$ are regression coefficients estimated by least squares, and $\epsilon_{i}\sim N(0,\sigma^{2})$, $i=1,2,\ldots,m$. The GVF estimates are obtained by back-transforming the predicted values from model (8):

$\displaystyle\widehat{\textit{GVF}}=\exp\bigg(\frac{\hat{\sigma}^{2}}{2}\bigg)\exp(\hat{\beta}_{0}+\hat{\beta}_{1}y_{i})$ (9)

where $\exp(\frac{\hat{\sigma}^{2}}{2})$ is the bias correction term. When this term is ignored, the GVF method tends to underestimate the variances [37].
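The smoothing step can be sketched as follows. Because Eq. (9) exponentiates the fitted line and applies a lognormal bias correction, the example regresses the logarithm of the direct variance estimates on $y_{i}$; that reading of model (8), the residual variance estimator with $m-2$ degrees of freedom, the function name `fit_gvf` and the toy data are all assumptions of this sketch.

```python
import math

def fit_gvf(y, var):
    """Fit log(var_i) = b0 + b1 * y_i by least squares and smooth via Eq. (9)."""
    m = len(y)
    z = [math.log(v) for v in var]
    y_bar, z_bar = sum(y) / m, sum(z) / m
    sxx = sum((y_i - y_bar) ** 2 for y_i in y)
    b1 = sum((y_i - y_bar) * (z_i - z_bar) for y_i, z_i in zip(y, z)) / sxx
    b0 = z_bar - b1 * y_bar
    rss = sum((z_i - b0 - b1 * y_i) ** 2 for y_i, z_i in zip(y, z))
    sigma2 = rss / (m - 2)                      # residual variance on log scale
    # Back-transform with the bias correction term exp(sigma2 / 2).
    smoothed = [math.exp(sigma2 / 2.0) * math.exp(b0 + b1 * y_i) for y_i in y]
    return b0, b1, sigma2, smoothed

# Hypothetical direct estimates and variances lying exactly on a log-linear curve.
y = [0.5, 1.0, 1.5, 2.0]
var = [math.exp(-3.0 + 0.5 * y_i) for y_i in y]
b0, b1, sigma2, smoothed = fit_gvf(y, var)
```

With noisy variance estimates the smoothed values no longer reproduce the inputs; they trace the fitted curve, which is the purpose of the GVF.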

Table 1

Comparisons of model variance estimators using information criteria (IC)

	Stunting			Wasting			Underweight
IC	$\hat{A}^{\textit{ML}}$	$\hat{A}^{\textit{REML}}$	$\hat{A}^{\textit{FH}}$	$\hat{A}^{\textit{ML}}$	$\hat{A}^{\textit{REML}}$	$\hat{A}^{\textit{FH}}$	$\hat{A}^{\textit{ML}}$	$\hat{A}^{\textit{REML}}$	$\hat{A}^{\textit{FH}}$
$-2$LL	$-8.92$	$-8.14$	$-7.74$	$-53.34$	$-52.56$	$-51.76$	$-106.48$	$-105.96$	$-105.88$
AIC	13.048	13.84	14.25	$-29.55$	$-28.57$	$-27.74$	$-88.48$	$-87.97$	$-87.87$
BIC	40.21	40.96	41.38	0.0396	1.011	1.84	$-66.29$	$-65.78$	$-65.68$

Figure 1.

Quantile-Quantile (Q-Q) plot for FH model residuals.

3. Results

Forty-one auxiliary variables from the 2007 census were considered for the model, and the appropriate variables were selected using forward and stepwise regression. For the $z$ scores of stunting, the selected variables were the age group 15–24 years, children aged 4–5 years, non-disabled parents, other marital status (divorced or widowed), unemployed, illiterate, and no death of daughters in a family. For the $z$ scores of wasting, the selected variables were the age group 15–24 years, children aged 4–5 years, other marital status (divorced or widowed), children aged 2–3 years, government employed, and death of only one daughter in a family. For the $z$ scores of underweight, the selected variables were family size of less than five, other marital status (divorced or widowed), married, employer, government employed, and improved water facility.

Estimating the hyperparameters $\bm{\beta}$ and $A$ is the first step in computing the EBLUPs and their MSPEs under the FH model. PR, FH, ML, and REML are the most common estimators of the model variance component $A$. All four estimators ($\hat{A}^{\textit{FH}}$, $\hat{A}^{\textit{PR}}$, $\hat{A}^{\textit{ML}}$, $\hat{A}^{\textit{REML}}$) produced non-negative estimates for the $z$ scores of malnutrition. Among them, ML performed relatively best, since it had the smallest information criteria ($-2$ log-likelihood, AIC and BIC) for all response variables (stunting, wasting and underweight) (Table 1). In addition, the CV and MSPE under the ML estimator of $A$ were smaller than under the other estimators. ML was therefore selected for this analysis.

The model parameter estimates for the $z$ scores of malnutrition were computed under the FH model, and the EBLUPs were obtained through Eq. (4). The estimated model variances ($\hat{A}^{\textit{ML}}$) were 0.035, 0.0216 and 0.007 for the $z$ scores of stunting, wasting and underweight, respectively. The estimated weights $\hat{B}_{i}$ ranged from 0.018 (for Harari) to 0.864 (for Yem special woreda) for stunting, from 0.081 (for Harari) to 0.823 (for Shinilie) for wasting, and from 0.18 (for Harari) to 0.93 (for Yem special woreda) for underweight. Because every weight lies strictly between 0 and 1, none of the EBLUPs of $\theta_{i}$ reduces to a purely synthetic or a purely direct estimate; each is a weighted combination of the direct survey estimator and the synthetic regression estimator.

3.1 Model diagnostics

The research considered bias diagnostics, the coefficient of variation (CV) and the MSPE to validate the reliability of the small area estimates under the FH model.

We applied diagnostic measures to examine the model assumptions. The normality assumption was examined using the normal Q-Q plot and the Kolmogorov-Smirnov test [39].

In Fig. 1, the points lie very close to the reference line, as expected when residuals are normally distributed, so the distribution of the residuals of the Fay-Herriot model appears normal. This is supported by the Kolmogorov-Smirnov normality test of the residuals, with $p$-values of 0.1391, 0.5533 and 0.3591 for stunting, wasting and underweight, respectively. There is therefore not enough evidence to reject the null hypothesis of normality, and the normality assumption for the residuals under the FH model is not contradicted. The diagnostic measures indicate that the model-based estimates were reliable and more stable than the corresponding direct estimates (Figs 1 and 2).
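The distance underlying the Kolmogorov-Smirnov test can be computed as in the sketch below. It assumes that the test is applied to standardized residuals $(y_{i}-x_{i}^{T}\hat{\bm{\beta}})/\sqrt{\hat{A}+D_{i}}$ compared with a standard normal distribution; the text does not specify which residuals were used, and the function names are hypothetical. Only the test statistic is computed; a $p$-value would normally be taken from a statistics library.

```python
import math

def standardized_residuals(y, X, beta_hat, a_hat, D):
    """Residuals scaled by the square root of their marginal variance A_hat + D_i."""
    return [(y_i - sum(x * b for x, b in zip(x_i, beta_hat))) / math.sqrt(a_hat + d_i)
            for y_i, x_i, d_i in zip(y, X, D)]

def ks_statistic(z):
    """Largest gap between the empirical CDF of z and the N(0, 1) CDF."""
    z = sorted(z)
    n = len(z)
    gap = 0.0
    for i, v in enumerate(z):
        cdf = 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))
        # Compare just after and just before the jump of the empirical CDF at v.
        gap = max(gap, (i + 1) / n - cdf, cdf - i / n)
    return gap

# Hypothetical standardized residuals.
d_stat = ks_statistic([-1.0, 0.0, 1.0])
```

A small statistic means that the empirical distribution of the residuals stays close to the normal curve, which matches what a Q-Q plot shows graphically.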

Figure 2.

Residuals vs Model based estimates under Fay Herriot model.

Figure 3.

Plots of GVF vs direct variance estimates.

Table 2

Regression coefficient estimates of auxiliary variables under Fay Herriot model

Variables	$\hat{\bm{\beta}}.s$	std.s	tvalue.s	$p$-value.s
(Intercept)	$-$ 9.0489	4.1201	$-$ 2.1936	0.0281
Ages 15–24	$-$ 1.9133	0.3029	$-$ 6.3171	0.0000
Ages 4–5	1.6953	1.9810	0.8558	0.3921
Not disabled	6.5614	3.7776	1.7369	0.0820
Divorced or widowed	$-$ 2.3637	0.6105	$-$ 3.8718	0.0001
Unemployed	$-$ 0.4001	0.2593	$-$ 1.5427	0.1229
Under age 1	5.7075	1.7512	3.2591	0.0011
Illiterate	1.5534	0.6128	2.5349	0.0112
One daughter death	$-$ 3.2690	1.5725	$-$ 2.0788	0.0376
Variables	$\hat{\bm{\beta}}.w$	std.w	tvalue.w	$p$-value.w
(Intercept)	$-$ 0.0519	2.8210	$-$ 0.0184	0.9853
Female	$-$ 0.3240	0.4163	$-$ 0.7785	0.4363
Ages 15–24	$-$ 1.9361	0.2637	$-$ 7.3415	0.0000
Ages 4–5	$-$ 3.3422	0.5503	$-$ 6.0737	0.0000
Divorced or widowed	$-$ 1.9918	0.5027	$-$ 3.9624	0.0001
Employer	$-$ 0.2400	0.3037	$-$ 0.7900	0.4295
Not disabled	3.4650	3.0069	1.1523	0.2490
Ages 2–3	$-$ 4.8451	1.4140	$-$ 3.4266	0.0006
Government employed	$-$ 0.8710	0.3823	$-$ 2.2783	0.0227
One daughter death	$-$ 2.1758	1.1469	$-$ 1.8970	0.0578
Variables	$\hat{\bm{\beta}}.u$	std.u	tvalue.u	$p$-value.u
(Intercept)	$-$ 0.3541	0.1483	$-$ 2.3876	0.0170
Less than 5 family	$-$ 0.8764	0.1332	$-$ 6.5807	0.0000
Divorced or widowed	$-$ 1.2792	0.3581	$-$ 3.5727	0.0004
Other-employment	$-$ 0.2613	0.2066	$-$ 1.2646	0.2060
Married	$-$ 0.2326	0.1823	$-$ 1.2759	0.2020
Government employed	$-$ 0.5750	0.2623	$-$ 2.1917	0.0284
Improved water	0.1769	0.0925	1.9123	0.0558

NB: In the coefficients of $\hat{{\beta}}.s$ , $\hat{{\beta}}.w$ and $\hat{{\beta}}.u$ ; $s$ for stunting, $w$ for wasting and $u$ for underweight.

Table 3

Summary of direct and model based estimates of malnutrition using FH model

Statistic	$N$	Mean	St. Dev.	Min	Pctl (25)	Pctl (75)	Max
Direct estimate	87	1.034	0.190	0.660	0.910	1.165	1.610
Eblup_stunting	87	1.763	0.325	0.640	1.620	1.985	2.310
Eblup_wasting	87	1.371	0.259	0.750	1.225	1.530	1.940
Eblup_underweight	87	1.029	0.142	0.750	0.940	1.120	1.400

3.2 The estimates of generalized variance function

The GVF estimates and the estimated sampling variances $\hat{D}_{i}$ are plotted against the direct $z$ score estimates of stunting, wasting and underweight in Fig. 3. The GVF-estimated variances are smoothed representations of the estimated sampling variances for stunting, wasting and underweight: the GVF smooths out the unreliable and noisy variance estimates [14, 20, 33, 37].

3.3 Interpretations for synthetic regression coefficients

The estimated regression coefficients ($\hat{\bm{\beta}}$), with their respective $p$-values and other statistics, are presented in Table 2. Parents in the 15–24 age range, divorced or widowed marital status, and parents who experienced the death of exactly one daughter were negatively associated with the $z$ scores of stunting, since their estimated coefficients were negative. By contrast, children under age one and illiteracy of parents were positively associated with the $z$ scores of stunting.

Parents in the 14–15 age range, children in the 4–5 age range, divorced or widowed marital status, children in 2–3 years of age, government employed and parents having the death of one daughter only were significant variables for wasting target variable. For the underweight target variable, less than five family size, divorced or widowed marital status, and government employed was in significant at 0.05 significant levels.

The summarized direct and EBLUP estimates of $z$ scores of malnutrition among the Ethiopian zones were presented in Table 3 with the minimum, maximum, mean and percentile values. According to EBLUP estimates under the FH model, the 75 percentile value is 1.985 (approximated to 2). Therefore, 25% of the zones (nearly 22 zones) show estimated $z$ -scores of stunting greater than or equal to 1.985.

Table 4
Improvements of estimates under Fay Herriot model

Statistic	$N$	Mean	St. Dev.	Min	Pctl (25)	Pctl (75)	Max
CV_Direct stunting	87	9.448	5.554	3.260	5.410	11.145	32.930
CV_EBLUP stunting	87	6.667	2.263	3.170	4.420	7.840	12.830
CV_gain stunting	87	21.887	14.088	2.760	11.255	29.640	61.300
CV_Direct wasting	87	9.542	5.781	3.400	5.790	10.905	41.600
CV_EBLUP wasting	87	6.712	2.088	3.290	5.110	7.870	12.240
CV_gain wasting	87	21.762	14.240	2.940	11.500	26.265	70.580
CV_Direct underweight	87	11.370	6.533	3.840	6.925	13.470	34.020
CV_EBLUP underweight	87	6.346	1.464	3.510	5.235	7.400	9.610
CV_gain underweight	87	35.377	16.436	8.590	22.855	45.970	72.100

Figure 4.

Zones (sorted by increasing CVs (%) of direct estimators).

3.4 Performance measures and improvement estimates under Fay Herriot model

The performance of SAE for the $z$ scores of malnutrition was measured by the CV. The CV (%) of the direct survey estimates and of the EBLUP estimates for each target variable are presented in Fig. 4. For all target variables, the CV (%) of the EBLUP was smaller than the CV (%) of the direct survey estimates. This means that the EBLUP estimates were more reliable than the direct survey estimates [4]. The improvement is due to the auxiliary variables borrowed from the census data. In terms of both MSPE and CV (%), the estimates of the $z$ scores of malnutrition based on the FH model were smaller than those of the direct survey estimates in almost all zones, so borrowing strength from the census data gave the EBLUP estimates clearly better performance than the direct survey estimates.
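As a minimal sketch of this performance measure, the CV (%) of an estimate can be computed as $100 \times \sqrt{\text{MSE}} / |\text{estimate}|$ , using the sampling variance for a direct estimate and the estimated MSPE for an EBLUP. The function name `cv_percent` and the numbers below are hypothetical and are not taken from the study.

```python
import math

def cv_percent(estimate, mse):
    """Coefficient of variation in percent: 100 * sqrt(MSE) / |estimate|."""
    if estimate == 0:
        raise ValueError("CV is undefined for a zero estimate")
    if mse < 0:
        raise ValueError("MSE must be non-negative")
    return 100.0 * math.sqrt(mse) / abs(estimate)

# Hypothetical values for a single zone (illustration only).
cv_direct = cv_percent(estimate=1.60, mse=0.0400)  # direct estimate and its sampling variance
cv_eblup = cv_percent(estimate=1.70, mse=0.0144)   # EBLUP and its estimated MSPE
```

In this hypothetical zone the EBLUP has the smaller CV because its MSPE is smaller than the sampling variance of the direct estimate, which is the pattern reported in Fig. 4.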

The efficiency gains in CV (%) from using the EBLUP instead of the direct survey estimates are reported in Table 4. The maximum efficiency gains were 61.30% for stunting in Bahir Dar City, 70.58% for wasting in Hawasa City and 72.1% for underweight in Shaka zone. The minimum gains were recorded in Dire Dawa for the $z$ scores of stunting (2.76%) and wasting (2.94%), and in Hareri for the $z$ scores of underweight (8.56%). This efficiency gain is one of the advantages of using the FH model for the small areas (zones) of Ethiopia. Averaged over all zones, the efficiency gain of the EBLUP over the direct estimates was 21.887% for stunting, 21.762% for wasting and 35.377% for underweight (Table 4) [31]. In general, the EBLUP estimates were more efficient than the direct estimates in all Ethiopian zones.
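The gain calculation can be sketched as follows, assuming the efficiency gain is defined as the relative reduction in CV, $100 \times (\text{CV}_{\text{direct}} - \text{CV}_{\text{EBLUP}}) / \text{CV}_{\text{direct}}$ . This definition is an assumption, since the formula is not stated in this section, and the zone names and CV values are hypothetical.

```python
def cv_gain_percent(cv_direct, cv_eblup):
    """Relative reduction in CV (%) from using the EBLUP instead of the direct estimate."""
    if cv_direct <= 0:
        raise ValueError("cv_direct must be positive")
    return 100.0 * (cv_direct - cv_eblup) / cv_direct

# Hypothetical per-zone CVs in percent: (direct, EBLUP). Not the study's values.
zones = {
    "zone_a": (30.0, 12.0),
    "zone_b": (5.0, 4.8),
    "zone_c": (11.0, 7.7),
}

gains = {name: cv_gain_percent(d, e) for name, (d, e) in zones.items()}
mean_gain = sum(gains.values()) / len(gains)   # average of the zone-level gains
best_zone = max(gains, key=gains.get)          # zone with the largest gain
```

Note that, under this definition, the mean of the zone-level gains is generally not the same as the gain computed from the mean CVs, so the average gains in Table 4 need not equal the relative difference between the mean CV rows.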

The $z$ scores of malnutrition of the non-sampled areas (zones) (Adama special zone, Amaro special woreda, Argoba special woreda, Basketo special woreda, Burayu special zone, Fiq and Jimma special zones) were estimated using synthetic regressions.

These non-sampled zones could not be estimated directly from the survey data because their sample size was zero. Instead, we estimated them by borrowing auxiliary variables under the FH model, and the MSPE and CV (%) were also estimated for these zones.
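The difference between sampled and non-sampled zones can be sketched under the standard area-level FH form [1], in which the EBLUP is a weighted combination of the direct estimate and the regression-synthetic estimate with weight $\hat{\gamma}_{i} = \hat{A} / (\hat{A} + \hat{D}_{i})$ , and a non-sampled zone receives the synthetic estimate only. The function name `fh_estimate`, the coefficients and the covariate values are hypothetical.

```python
def fh_estimate(x, beta, A, direct=None, D=None):
    """Area-level Fay-Herriot prediction for one zone.

    x, beta   : covariate values and regression coefficients (same length)
    A         : estimated variance of the random area effect
    direct, D : direct estimate and its sampling variance;
                leave as None for a non-sampled zone
    """
    if len(x) != len(beta):
        raise ValueError("x and beta must have the same length")
    synthetic = sum(xj * bj for xj, bj in zip(x, beta))
    if direct is None or D is None:
        # No survey data for this zone: regression-synthetic estimate only.
        return synthetic
    gamma = A / (A + D)  # shrinkage weight placed on the direct estimate
    return gamma * direct + (1.0 - gamma) * synthetic

# Hypothetical coefficients and zone covariates (intercept first), illustration only.
beta = [1.0, 0.5, -0.2]
x_zone = [1.0, 0.4, 0.5]

sampled = fh_estimate(x_zone, beta, A=0.02, direct=1.5, D=0.06)
non_sampled = fh_estimate(x_zone, beta, A=0.02)
```

When the sampling variance is large relative to the model variance, the weight on the direct estimate is small and the prediction moves toward the synthetic value, which is how zones with small samples borrow strength from the census covariates.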

4. Discussions

This study provided Ethiopian zonal-level estimates of the $z$ scores of malnutrition using the 2016 EDHS and the 2007 census. The main aim of the study was to improve the direct survey estimates of malnutrition for unplanned domains (zones) with small sample sizes by using auxiliary variables from the census data. Small area estimation under the FH model was an appropriate method for linking the survey data with auxiliary variables taken from the census data [1]. The performance measures indicated that the EBLUPs of malnutrition for children under five substantially improved on the direct survey estimates for almost all zones (small areas). Therefore, the hypothesis of the study was accepted.

Unlike the previous studies [10, 17], which were survey studies with planned domains (large areas), the current study estimated the $z$ scores of malnutrition for unplanned domains (small sample sizes) by small area estimation with auxiliary variables from the census data. The FH model for the $z$ scores of malnutrition was applied to obtain zonal estimates. Some of the auxiliary variables in the synthetic regression were significant: rural residence for the target variable stunting ( $z$ score); literacy level of mothers, non-educated mothers, and mothers in the 15–24 and 35–44 age ranges for wasting ( $z$ score); and married mothers and mothers living in rural areas for underweight ( $z$ score).

Unlike the previous studies [20, 31], which produced negative estimates of the model variance component $\hat{{A}}$ , in the current study the estimators of $\hat{{A}}$ produced non-negative estimates for all target variables. The plot of residuals against Ethiopian zones (sub-regions) indicated that the normality assumption for the random effects was satisfied [1, 37]. Similarly, the plot of residuals against the EBLUP estimates indicated that the normality assumption for the residual errors was satisfied [1, 37].

The main aim of this study was to increase the efficiency of the direct survey estimates by linking auxiliary variables to the sampling model. The $z$ scores of stunting, wasting and underweight for children under five were obtained from model-based estimates, which were more efficient and precise than the direct survey-based estimates [4, 34, 35]. The EBLUP results clearly indicated that using auxiliary variables from the census can bring substantial gains in efficiency under SAE. This result agrees with previous studies [31, 40].

The CVs of the model-based estimates are smaller than the CVs of the direct estimates in almost all zones. The MSEs of the direct estimates are larger than the MSPEs of the EBLUP estimates under the FH model [4, 26]. This means that the small area estimates under the FH model are more efficient and precise than the direct survey estimates, owing to the additional auxiliary variables used in the FH model [27].

5. Conclusion

In this study, we applied SAE techniques under the FH model to estimate zonal-level statistics of the $z$ scores of malnutrition by combining the survey data with the census data. The diagnostic procedures confirmed that the model-based zonal-level estimates have reasonably good precision. Based on the CV (%) and MSPE estimates of the $z$ scores of malnutrition, the EBLUPs under the FH model are efficient and reliable estimates for small areas (zones). The model-based small area estimates improve on the direct survey estimates because of the additional auxiliary covariates. The non-sampled domains (zero sample size) were also estimated under the synthetic model. We recommend that further analysis be conducted at the woreda (district) level, below the zonal level of government in Ethiopia.

Acknowledgments

We would like to thank Bahir Dar University and Debre Tabor University.

References

1. Rao JNK, Molina. Small area estimation. 2nd Ed. Hoboken. 2015.

2. Molina, Rao JNK. Small area estimation of poverty indicators. Can J Stat. 2010; 38: 369-385.

3. Fay, Herriot. Estimates of income for small places: An application of James-Stein procedures to census data. J Am Stat Assoc. 1979; 74: 269-277. doi: 10.1080/01621459.1979.10482505.

4. Datta, Kubokawa, Molina, Rao JNK. Estimation of mean squared error of model-based small area estimators. Test. 2011; 20: 367-388. doi: 10.1007/s11749-010-0206-2.

5. Prasad NGN, Rao JNK. The estimation of the mean squared error of small-area estimators. J Am Stat Assoc. 1990; 85: 163-171.

6. Datta, Lahiri. A unified measure of uncertainty of estimated best linear unbiased predictor in small area estimation problems. Stat Sin. 2000; 10: 613-627.

7. De Onis. WHO child growth standards. 2006.

8. UNICEF, WHO, World Bank. Levels and trends in child malnutrition: Key findings of the 2020 edition of the joint child malnutrition estimates. Geneva WHO. 2020; 24: 1-16.

9. CSA, ICF. Ethiopia demographic health survey. Addis Ababa, Ethiopia, and Rockville, Maryland, USA: CSA and ICF. 2016.

10. Amare, Negesse, Tsegaye, Assefa, Ayenie. Prevalence of undernutrition and its associated factors among children below five years of age in Bure Town, West Gojjam Zone, Amhara National Regional State, Northwest Ethiopia. Adv Public Heal. 2016; 2016: 8. doi: 10.1155/2016/7145708.

11. Endris, Asefa, Dube. Prevalence of malnutrition and associated factors among children in rural Ethiopia. Biomed Res Int. 2017; 2017: 6. doi: 10.1155/2017/6587853.

12. Gebre, Reddy, Mulugeta, Sedik, Kahssay. Prevalence of malnutrition and associated factors among under-five children in pastoral communities of Afar Regional State, Northeast Ethiopia: A community-based cross-sectional study. J Nutr Metab. 2019; 2019: 13. doi: 10.1155/2019/9187609.

13. Tadesse, Alemu. Urban-rural differentials in child undernutrition in Ethiopia. Int J Nutr Metab. 2015; 7: 15-23. doi: 10.5897/IJNAM2014.0171.

14. Teshome, Kogi-Makau, Getahun, Taye. Magnitude and determinants of stunting in children underfive years of age in food surplus region of Ethiopia: The case of West Gojam Zone. Ethiop J Heal Dev. 2010; 23: 98-106. doi: 10.4314/ejhd.v23i2.53223.

15. Woodruff, Wirth, Bailes, Matji, Timmer, Rohner. Determinants of stunting reduction in Ethiopia 2000-2011. Matern Child Nutr. 2017; 13. doi: 10.1111/mcn.12307.

16. Yeshaw, Kebede, Liyew, Tesema, Agegnehu, Teshale, et al. Determinants of overweight/obesity among reproductive age group women in Ethiopia: Multilevel analysis of Ethiopian demographic and health survey. BMJ Open. 2020; 10: e034963. doi: 10.1136/bmjopen-2019-034963.

17. Tekile, Woya, Basha. Prevalence of malnutrition and associated factors among under-five children in Ethiopia: Evidence from the 2016 Ethiopia demographic and health survey. BMC Res Notes. 2019; 12: 1-6. doi: 10.1186/s13104-019-4444-4.

18. Woldie, Jirra, Azene. Presence and use of legislative guidelines for the distribution of decentralized decision making authority in the Jimma Zone health system, Southwest Ethiopia. Ethiop J Health Sci. 2011; 21: 29.

19. Kitaw, Teka, G-E, Meche, Damen, Fentahun. The evolution of public health in Ethiopia. 2nd Ed. 2012.

20. Shiferaw, Galpin. Improved confidence intervals for a small area mean under the Fay-Herriot model. University of the Witwatersrand. 2016.

21. Ethiopia Central Statistical Agency. Federal demographic republic of population projection of Ethiopia from 2014–2017. 2013; 1-118.

22. Wirth, Rohner, Petry, Onyango, Matji, Bailes, et al. Assessment of the WHO stunting framework using Ethiopia as a case study. Matern Child Nutr. 2017; 13: e12310. doi: 10.1111/mcn.12310.

23. Gizaw, Woldu, Bitew. Acute malnutrition among children aged 6–59 months of the nomadic population in Hadaleala District, Afar Region, Northeast Ethiopia. Ital J Pediatr. 2018; 44: 1-10. doi: 10.1186/s13052-018-0457-1.

24. Lehtonen, Veijanen. Design-based methods of estimation for domains and small areas. Sample Surv Inference Anal. 2009; 29B: 219-249. doi: 10.1016/S0169-7161(09)00231-4.

25. Molina, Marhuenda. Sae: An R package for small area estimation. R J. 2015; 7: 81. doi: 10.32614/rj-2015-007.

26. Pfeffermann, Ben-Hur. Estimation of randomisation mean square error in small area estimation. Int Stat Rev. 2018; 87: 1-19. doi: 10.1111/insr.12289.

27. Pfeffermann. New important developments in small area estimation. Stat Sci. 2013; 28: 40-68. doi: 10.1214/12-STS395.

28. Jiang. Mixed model prediction and small area estimation. Test. 2016; 15: 1-96. doi: 10.1007/BF02595419.

29. Datta, Rao JNK, Smith. On measuring the variability of small area estimators under a basic area level model. Biometrika Trust. 2005; 92: 183-196.

30. Diao, Smith, Datta, Maiti, Opsomer. Accurate confidence interval estimation of small area parameters under the Fay-Herriot model. Scand J Stat. 2014; 41: 497-515. doi: 10.1111/sjos.12045.

31. Small area estimation: An empirical best linear unbiased prediction approach. University of Maryland. 2007.

32. Henderson. Estimation of genetic parameters. Biometrics. 1950; 6: 186-187.

33. Shiferaw, Galpin. A corrected confidence interval for a small area parameter through the weighted estimator under the basic area level model. J Iran Stat Soc. 2019; 18: 17-51. doi: 10.29252/jirss.18.1.17A.

34. Mukhopadhyay, McDowell. Small area estimation for survey data analysis using SAS software. SAS Glob. Forum. 2011; 2011: 96.

35. Datta. Model-based approach to small area estimation. Sample Surv Inference Anal. 2009; 29B: 251-288. doi: 10.1016/S0169-7161(09)00232-6.

36. Lombardía, López-Vizcaíno, Rueda. Mixed generalized Akaike information criterion for small area models. J R Stat Soc Ser A Stat Soc. 2017; 180. doi: 10.1111/rssa.12300.

37. Shiferaw. Analysis of the spatial distribution of under-5 mortality rate in local areas of South Africa. Stat J IAOS. 2020; 36: 1161-1173. doi: 10.3233/SJI-200650.

38. Wolter. Introduction to variance estimation. 2nd Ed. Chicago. 2007.

39. Brown, Chambers, Heady, Heasman. Evaluation of small area estimation methods – an application to unemployment estimates from the UK LFS. Proc. Stat. Canada Symp. 2001; 1-10. doi: 10.1201/9780203166314.ch1.

40. Islam, Chandra. Small area estimation combining data from two surveys. Commun Stat Comput. 2019; 36: 1-22. doi: 10.1080/03610918.2019.1588308.
