Abstract
A survey is typically designed to produce reliable estimates of target variables of the population at national and regional levels. For unplanned zones with small sample sizes, reliable estimates are needed for many purposes, but the direct survey estimates are unreliable. The purpose of the study is to improve the direct survey estimates of the
Introduction
“Small Area Estimation” (SAE) is designed to improve sample survey estimates for domains with small sample sizes by borrowing auxiliary data from related sources [1]. In general, an area (domain) is regarded as “small” if the domain-specific sample size is not large enough to support direct estimates of adequate precision [1]. Sample surveys are a cost-effective means of obtaining information for large areas or domains, and their data are used to provide reliable and efficient direct estimates at the regional and national levels [1]. Direct estimates are design-based estimates computed from the sample data of the domain itself. For small domains, however, direct estimates have large sampling variability because the sample sizes are too small (or even zero).
Indirect estimates based on a model-based approach are an alternative that is now widely used in small area estimation. Indirect estimators borrow auxiliary variables from the census for the small areas and gain strength from the relationship between the target variables and the auxiliary variables [1, 2]. To make estimates for small areas with an adequate level of precision, it is often necessary to use auxiliary variables to increase the effective sample size and the precision [1]. The model-based estimators are based on explicit models that provide a link to related small areas through the census data [1]. Nowadays, SAE is commonly used in planning health, social, and other services and in allocating government funds [1]. Small area models can be area-level models, such as the Fay-Herriot (FH) model [1, 3], which relate direct small area estimates to area-specific (aggregated) variables.
The performance of SAE is measured by the coefficient of variation (CV %) and the mean square prediction error (MSPE). The CV (%) expresses the sampling variability as a percentage of the estimate. Small area estimates with large CVs (%) are considered unreliable [4]. The most common methods of estimating the parameters of the FH model are the method of moments used by [3, 5], and maximum likelihood (ML) and restricted maximum likelihood (REML) used by [6].
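As a minimal sketch of how a CV (%) is obtained, the snippet below divides the square root of an estimate's mean square error by the estimate itself. The function name `cv_percent` and the numeric inputs are illustrative assumptions, not values from the study.

```python
import math

def cv_percent(estimate, mse):
    """Coefficient of variation, expressed as a percentage of the estimate."""
    if estimate == 0:
        raise ValueError("CV is undefined for a zero estimate")
    return 100.0 * math.sqrt(mse) / abs(estimate)

# Hypothetical zone: prevalence estimate 0.38 with MSE 0.0016
print(round(cv_percent(0.38, 0.0016), 2))
```

The same function applies to a direct estimate (with its design-based variance) and to a model-based estimate (with its estimated MSPE), which is what makes the two comparable.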
The study variables were the
According to the report in [8], globally, an estimated 144 million and 47 million children under five were stunted and wasted, respectively. Most of the world’s stunted, underweight and wasted children under five lived in Asia and Africa [8]. In Ethiopia, 38%, 10% and 24% of children under five were stunted, wasted and underweight, respectively [9].
Many researchers have studied malnutrition at the regional level in the country [10, 11, 12, 13, 14, 15, 16, 17]. These studies used survey data for planned domains only, at national and regional levels. However, malnutrition also needs to be estimated for the unplanned domains in Ethiopia, the zones (the third administrative layer). At the zonal level, the standard direct estimation methods cannot be used because of the small sample sizes.
In the Ethiopian health system, decentralization is regarded as the most influential administrative determinant [18]. The Federal Ministry of Health decentralized the health service in parallel with the government structures (regions, zones and districts). These administrative hierarchies are the key institutions involved in health care delivery in the country [18, 19]. Among these, the zonal governments are the bridge between the regional and the district governments. The zonal health department is responsible for monitoring and evaluating health activities in the districts [18]. Therefore, estimating malnutrition at the zonal level is valuable for the zonal governments and for all other governmental structures. The main objective of this study was improving the direct survey estimates of
The subsequent sections of this paper are organized as follows: Section 2 describes the data sources and sampling design and presents the methodology of the Fay-Herriot model, Section 3 contains the results and discussion, and finally, Section 5 presents the conclusions.
Methods and materials
Sampling design
The 2016 EDHS used the sampling frame of the Ethiopian Population and Housing Census, which was carried out in 2007 by the Ethiopian Central Statistical Agency (CSA). The 2016 sample survey was designed to provide reliable estimates of key indicators at the national and regional levels, and also for urban and rural areas [9].
A two-stage stratified sampling technique was used for the 2016 EDHS sample. The nine regions and two administrative cities were stratified into urban and rural areas, which produced 21 sampling strata. Within each stratum, samples of enumeration areas (EAs) were selected independently in two stages. Implicit stratification and proportional allocation were achieved at each of the lower administrative levels by sorting the sampling frame within each sampling stratum before sample selection, according to administrative units at different levels, and by using probability proportional to size selection at the first stage of sampling [9].
In the first stage, a total of 645 EAs were selected with probability proportional to EA size, independently in each stratum. Of these, 202 EAs were urban and 443 were rural. In the second stage, equal-probability systematic sampling was used to select 28 households per cluster from the newly created household lists. Height and weight measurements were collected from children aged 0–59 months, women aged 15–49 years, and men aged 15–59 years in all the selected households [9].
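The second-stage selection can be illustrated with a short sketch of equal-probability systematic sampling: a random start followed by a fixed interval through the household list. The function name, the seed, and the list size of 150 households are hypothetical; the first-stage probability proportional to size selection is not shown.

```python
import random

def systematic_sample(n_units, n_select, rng):
    """Equal-probability systematic sample of 0-based list positions."""
    if not 0 < n_select <= n_units:
        raise ValueError("n_select must be between 1 and n_units")
    step = n_units / n_select            # sampling interval (may be fractional)
    start = rng.random() * step          # random start in [0, step)
    picks = [int(start + j * step) for j in range(n_select)]
    return [min(p, n_units - 1) for p in picks]  # guard against float rounding

rng = random.Random(2016)                # fixed seed so the sketch is repeatable
# Hypothetical EA with 150 listed households, 28 of which are selected
households = systematic_sample(150, 28, rng)
print(len(households), households[:5])
```

A fractional interval is used so that every household keeps the same selection probability even when the list size is not a multiple of 28.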
Data source and study variables
Ethiopia has nine regions and two administrative cities, which in turn are divided into zones and special zones. Of these, 95 zones, special zones, and special districts were studied as domains in this research. Of the 95 domains, 87 zones were sampled and 8 zones were not sampled in the 2016 EDHS survey.
The target variables were the
The gap between the 2007 population census and the 2016 EDHS data is wide because the census scheduled for 2017 was not conducted on time, owing to the country’s political instability and the COVID-19 pandemic. To manage this gap, we used the 2016 census projection figures for the urban and rural residence and sex auxiliary variables in all zones [21].
The auxiliary variables were taken from the census data in two groups: variables related to children under five and variables related to parents. For children under five, the variables were sex (male and female) [12] and age (below one year, 1–2 years and 4–5 years) [13]. For parents, the variables were sex (male and female), place of residence (urban and rural), age (15–24, 25–34, 35–44 and 45–49) [12], source of drinking water (improved and unimproved) [13], educational level (no education, primary, and secondary and above) [15, 22], literacy (literate and illiterate) [12, 23], marital status (married, never married and others), type of toilet facility (has a toilet facility and has no toilet facility) [11], number of sons who died (none, one, and two or more), number of daughters who died (none, one, and two or more), number of family members in the household (fewer than five, and five or more) [11, 12], disability (disabled and not disabled) [13, 16], and employment status (government employed, privately employed, self-employed, employer, unemployed and other employment).
Fay Herriot small area estimation
Because of the lack of sample data within small areas, models are needed to link all areas through some common parameters so as to borrow strength from census data and then to improve efficiency and reliability of estimation. The FH model was first introduced by [3]. The FH model is often used to obtain efficient estimators of the area means when the sample sizes within areas are small.
In this study, the domains are zones in Ethiopia. Basic area level small area estimations of
Now the aim is to estimate the parameters of the small area mean at zone level
The Fay-Herriot model [3] is a basic area level model widely used in small area estimation to improve the direct survey estimates. In area level models, the area-specific auxiliary information comes in the aggregated values of some explanatory variables at the domains.
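For orientation, the standard textbook form of the two-level Fay-Herriot model can be written as follows. The symbols are assumptions made for this sketch: $\hat{\theta}_i$ is the direct estimate for zone $i$, $\theta_i$ the true zone-level quantity, $\mathbf{x}_i$ the vector of aggregated auxiliary variables, $D_i$ the sampling variance (treated as known), $A$ the random effect variance, and $m$ the number of sampled zones.

```latex
\begin{aligned}
\hat{\theta}_i &= \theta_i + e_i, & e_i &\sim N(0, D_i) && \text{(sampling model)} \\
\theta_i &= \mathbf{x}_i^{\top}\boldsymbol{\beta} + v_i, & v_i &\sim N(0, A) && \text{(linking model)}
\end{aligned}
\qquad i = 1, \dots, m
```

Combining the two levels gives $\hat{\theta}_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + v_i + e_i$, with $v_i$ and $e_i$ assumed independent, so the direct estimates follow a linear mixed model with area-specific random effects.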
Let
Level 1 (Sampling model)
Level 2 (Linking model)
In this model
The popularity of the Fay-Herriot model (1) for small area estimation stems from the fact that it produces reliable small area statistics by building linking models for the direct estimators through auxiliary data and then by borrowing strength from other domains. By modeling the direct estimators, the Fay-Herriot model uses the design weights to produce design consistent small area estimators. To develop a model-based estimator of
Note that this model is a linking model that links the target quantity
where
The variance covariance matrix of
The best linear unbiased predictor (BLUP) of the random effects was proposed by Henderson in 1950 to find “maximum likelihood estimates” of the random effects [32]. The expression of the BLUP involves the model variance component A (the random effect variance), which is typically unknown in practice. It is customary to replace A with a consistent estimator,
EBLUP methods are extensively discussed in the small area estimation literature [20, 28, 29, 30, 31, 33]. In a survey application, the values
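The effect of plugging an estimate of A into the BLUP can be sketched for a single zone. Under the standard Fay-Herriot model the resulting predictor is a weighted average of the direct estimate and the regression-synthetic prediction, with weight A / (A + D), where D is the zone's sampling variance. The function name and all inputs are hypothetical, and the regression prediction is assumed to have been computed already.

```python
def fh_eblup(direct, synthetic, sampling_var, model_var):
    """Shrinkage form of the Fay-Herriot EBLUP for one zone.

    direct       -- direct survey estimate (None for a non-sampled zone)
    synthetic    -- regression-synthetic prediction x'beta_hat
    sampling_var -- sampling variance D of the direct estimate
    model_var    -- estimated random effect variance A_hat
    """
    if direct is None:
        return synthetic                 # zero sample size: synthetic estimate only
    gamma = model_var / (model_var + sampling_var)   # shrinkage weight in [0, 1]
    return gamma * direct + (1.0 - gamma) * synthetic

# Hypothetical zone: noisy direct estimate pulled towards the regression prediction
print(fh_eblup(direct=0.40, synthetic=0.30, sampling_var=0.01, model_var=0.01))
```

A zone with a large sampling variance receives a small weight and is pulled strongly towards the synthetic prediction, while a precisely estimated zone stays close to its direct estimate.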
where
For the non-sampled zones (zones with zero sample size), we used the synthetic regression estimates
Where,
Variance component estimation in the Fay-Herriot model for small areas has been studied by several authors [1, 24, 26, 31, 35]. The common model fitting methods delivering consistent estimators for the model variance component (
A simple method-of-moments proposed by [5] to estimate model variance component of estimator
here
The other moment estimator of the model variance component, based on the weighted least squares residual sum of squares, is the Fay-Herriot method (
where
Models are typically compared based on goodness-of-fit measures such as the log-likelihood, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). In general terms, the value of AIC for a model M is defined as
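In their usual textbook form, both criteria penalize the maximized log-likelihood by the number of estimated parameters. The sketch below uses that form. The function names and the numeric inputs are illustrative assumptions, with `k` taken to count the regression coefficients plus the variance component and `n` the number of sampled areas.

```python
import math

def aic(loglik, k):
    """Akaike Information Criterion: -2 log L + 2k."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """Bayesian Information Criterion: -2 log L + k log(n)."""
    return -2.0 * loglik + k * math.log(n)

# Hypothetical fit: log-likelihood -100 with 3 parameters over 87 areas
print(aic(-100.0, 3), round(bic(-100.0, 3, 87), 3))
```

For both criteria a smaller value indicates a better trade-off between fit and complexity, so candidate variance estimators fitted to the same data can be ranked directly.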
One of the main reasons for relying on small area methods is to reduce the variability of the area predictions. The MSE of the EBLUP under the FH model is the mean square prediction error (MSPE), which is generally used as the measure of variability of the EBLUP. Estimating the MSPE is one of the challenging problems in model-based small area estimation [4, 27], and it is of significant practical interest. The uncertainty of the EBLUP is measured by its MSPE, which is defined as
The second order unbiased estimator of MSPE of EBLUP is available when
The MSPEs of the non-sampled zones were obtained from the following formulas via the synthetic regressions
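In generic notation, which is an assumption of this sketch rather than the paper's own symbols, the MSPE of the EBLUP for zone $i$ is the expected squared distance between the predictor and the true zone-level quantity $\theta_i$:

```latex
\mathrm{MSPE}\big(\hat{\theta}_i^{\mathrm{EBLUP}}\big)
  = E\Big[\big(\hat{\theta}_i^{\mathrm{EBLUP}} - \theta_i\big)^2\Big]
```

When the random effect variance $A$ is treated as known, the leading term of this quantity under the standard Fay-Herriot model is $g_{1i}(A) = A D_i / (A + D_i) = \gamma_i D_i$, where $D_i$ is the sampling variance and $\gamma_i = A / (A + D_i)$. Since $\gamma_i \le 1$, this term never exceeds the variance $D_i$ of the direct estimate, which is the source of the efficiency gain.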
In most articles [25, 31, 37] the sampling variance
where,
where,
Comparisons of model variance estimators using information criteria (IC)
Quantile-Quantile (Q-Q) plot for FH model residuals.
There were 41 auxiliary variables selected from the 2007 census for consideration in the model. The appropriate variables were then selected using forward and stepwise regression analysis. Thus, age groups of 15–24 years, children aged 4–5 years, non-disabled parents, other marital status (divorced or widowed), unemployed, illiterate, and no death of daughters in a family were chosen for
Estimating the hyper parameters of
The model parameter estimates of the
Model diagnostics
The research considered the bias diagnostics, the coefficients of variation (CV) and the MSPE to validate the reliability of the small area estimates under the FH model.
We applied diagnostic measures to examine the model assumptions. The normality assumption was examined using the normal Q-Q plot and the Kolmogorov-Smirnov test [39].
As seen in Fig. 1, the dots are very close to the line. If the residuals are normally distributed, the dots fall along the line. The distribution of the residuals in the Fay-Herriot model therefore looks normal. This is confirmed by the Kolmogorov-Smirnov normality test of residuals, with the
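A minimal sketch of the Kolmogorov-Smirnov check, using only the Python standard library, is given below. It assumes the residuals have already been standardized, so that the reference distribution is the standard normal. The residual values are hypothetical, and the bound 1.36 / sqrt(n) is the usual large-sample 5% critical value for a fully specified reference distribution, used here only as a rough guide.

```python
import math
from statistics import NormalDist

def ks_statistic_std_normal(residuals):
    """One-sample KS statistic: largest gap between the empirical CDF
    of the residuals and the standard normal CDF."""
    xs = sorted(residuals)
    n = len(xs)
    if n == 0:
        raise ValueError("at least one residual is required")
    std_normal = NormalDist()
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = std_normal.cdf(x)
        # the empirical CDF jumps from (i - 1) / n to i / n at x
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

# Hypothetical standardized residuals for 87 zones, placed at normal quantiles
resid = [NormalDist().inv_cdf((i - 0.5) / 87) for i in range(1, 88)]
d = ks_statistic_std_normal(resid)
print(d < 1.36 / math.sqrt(len(resid)))  # True means no evidence against normality
```

A statistic below the critical value is consistent with points lying along the reference line of the Q-Q plot.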
Residuals vs Model based estimates under Fay Herriot model.
In the plots of GVF vs direct variance estimates.
Regression coefficient estimates of auxiliary variables under Fay Herriot model
NB: In the coefficients of
Summary of direct and model based estimates of malnutrition using FH model
The estimates of the generalized variance function (GVF) and the sampling variance
Interpretations for synthetic regression coefficients
The best linear unbiased estimators
Parents in the 14–15 age range, children in the 4–5 age range, divorced or widowed marital status, children 2–3 years of age, government employed, and parents with the death of one daughter only were significant variables for the wasting target variable. For the underweight target variable, family size of less than five, divorced or widowed marital status, and government employed were insignificant at the 0.05 significance level.
The summarized direct and EBLUP estimates of
Improvements of estimates under Fay Herriot model
Zones (sorted by increasing CVs (%) of direct estimators).
The performance measure of SAE for
The efficiency gains in CV (%) from using the EBLUP instead of the direct survey estimates are reported in Table 4. The maximum efficiency gains were 61.30% for stunting in Bahir Dar City, 70.58% for wasting in Hawasa City and 72.1% for underweight in Shaka zone. On the other hand, the minimum efficiency was gained in Dire Dawa, recorded 2.76% for the
The
These non-sampled zones could not be estimated from the survey data because of their zero sample size. However, we estimated them by borrowing auxiliary variables under the FH model, and the MSPE and CV (%) were also estimated for these zones.
Discussions
This study provided the Ethiopian zonal level estimates of the
Unlike the previous studies [10, 17], both of which were survey studies with planned domains (large areas), in the current study small area estimation with unplanned domains (small sample size) of
Unlike the previous studies by [20, 31], that produced negative estimates of the model variance component of
The main aim of this study was to increase the efficiency of direct survey estimates by linking auxiliary variables in the sampling model. Thus, the findings showed that the
The CVs of the model-based estimates are smaller than the CVs of the direct estimates in almost all the zones. It is apparent that the MSEs of the direct estimates are larger than the MSPEs of the EBLUP estimates under the FH model [4, 26]. This means that the small area estimates under the FH model are more efficient and precise than the direct survey estimates, owing to the additional auxiliary variables used under the FH model [27].
Conclusion
In this study, we applied the SAE techniques under the FH model to estimate the zonal level statistics of the
Acknowledgments
We would like to thank Bahir Dar University and Debre Tabor University.
