Abstract
The 2
Keywords
Introduction
Food security exists when all people, at all times, have physical and economic access to sufficient, safe, and nutritious food that meets their dietary needs and food preferences for an active and healthy life (FAO, 2010). Conversely, food insecurity is the situation in which people do not have adequate physical, social or economic access to food. The United Nations’ (UN) Sustainable Development Goal 2 is to eliminate all forms of hunger and malnutrition and to achieve food security. The Government of India has also made it one of its highest priorities to meet this UN goal by 2030, with the aim of ‘Leaving No One Behind’. The major data source for producing estimates of food insecurity in India is the Household Consumer Expenditure Survey (HCES) of the National Sample Survey Office (NSSO), Ministry of Statistics and Programme Implementation. The HCES is designed to generate representative estimates of various indicators at the macro level, i.e., at state and national level, for the rural and urban sectors separately and combined. However, such macro level estimates cannot reflect the heterogeneity that exists at local, micro or regional levels. Despite their importance, estimates of food insecurity indicators are not available for lower administrative units, e.g. at district level or below, in the country. The lack of sufficient and representative food insecurity measures at the local or disaggregated level often constrains policy planners, government and public agencies in designing targeted interventions and policies to reduce inequality and disparity in food insecurity. Hence, to achieve inclusive development with zero food insecurity, it is necessary to obtain statistical summaries of the relevant parameters of interest for smaller domains or small areas. Small areas are created by cross classifying demographic and geographic variables, for example small geographic areas (e.g. districts), small demographic groups (e.g. age-sex groups, social groups) or a cross classification of both. The sample sizes for such small domains in the survey data are usually very small, negligible or even zero. Small area estimation (SAE) methodology provides a viable and cost effective solution for obtaining precise estimates for small areas; see, for example, Chandra et al. (2011) and Rao and Molina (2015). SAE methods essentially borrow strength from the data of other areas, other time periods or both to facilitate the estimation process.
SAE methods are generally model-based. The idea is to use statistical models to link the variable of interest with auxiliary information, e.g. Census and administrative data, for the small areas in order to define model-based estimators for these areas. Based on the level of auxiliary information available, the models used in SAE are categorized as area level or unit level. Area-level modelling is typically used when unit-level data are unavailable or, as is often the case, when model covariates (e.g. census variables) are only available in aggregate form. The Fay-Herriot model is a widely used area level model in SAE (Fay & Herriot, 1979). This model is an area level linear mixed model (Chandra, 2013; Chandra et al., 2015; Chandra & Chandra, 2015 & 2020). Standard SAE methods based on linear mixed models for continuous data can produce inefficient and sometimes invalid estimates when the variable of interest is binary. If the variable of interest is binary and the target of inference is a small area proportion, then the generalized linear mixed model with logit link function, also referred to as the logistic linear mixed model (LLMM), is generally used. An empirical plug-in predictor (EPP) under a LLMM is commonly used for the estimation of small area proportions; see, for example, Chandra et al. (2012), Rao and Molina (2015) and references therein. An alternative to the EPP is the empirical best predictor (EBP; Jiang, 2003). This predictor does not have a closed form and can only be computed via numerical approximation. This is generally not straightforward, however, and so national statistical agencies favour computation of an approximation like the EPP. In this context, when only area level data are available, an area level version of a LLMM is used for SAE; see, for example, Johnson et al. (2010), Chandra et al. (2011), Chandra et al. (2017), Chandra et al. (2018) and references therein. Unlike the Fay-Herriot model, this approach implicitly assumes simple random sampling with replacement within each area and ignores the survey weights. Unfortunately, this has the potential to seriously bias the estimates if the small area samples are seriously unbalanced with respect to key population characteristics, and consequently use of the survey weights appears to be unavoidable if one wishes to generate representative small area estimates. Recently, Chandra et al. (2019) introduced an approach that models the survey weighted estimates as binomial proportions, with an “effective sample size” chosen to match the binomial variance to the sampling variance of the estimates. This article adopts the approach of Chandra et al. (2019) to model survey weighted small area proportions under a LLMM and produces district level estimates of the proportion of food insecurity (also referred to as food insecurity prevalence or incidence of food insecurity) for rural areas of Uttar Pradesh. Note that if unit level data are available, three suitable SAE methods based on unit-level models are the ELL method (Elbers et al., 2003), the empirical Bayes (best) prediction method (Molina & Rao, 2010) and the M-quantile method (Tzavidis et al., 2008). Das and Haslett (2019) described the performance of these methods in terms of their underlying model assumptions.
The rest of the paper is organized as follows. Section 2 describes the data from the 2011–12 HCES of the NSSO and the 2011 Population Census that are used to estimate the district-wise proportion of household food insecurity for rural areas of the State of Uttar Pradesh in India. The target variable of interest, the auxiliary variables and the model specification for the SAE analysis are described in Section 3. Section 4 presents a brief overview of the SAE methodology. The empirical results, various diagnostic measures and a map showing district-level inequalities in the distribution of food insecure households in Uttar Pradesh are reported in Section 5. Finally, Section 6 provides concluding remarks.
Data description and study area
The state of Uttar Pradesh is home to around 16.16 percent of India’s population and is the most populous state in the country. It covers 243,290 square km, equal to 6.88% of the total area of the country. The analysis in this paper is restricted to rural areas because, according to the 2011 Population Census, about 78% of the population of the State lives in rural areas. The study variable is taken from the 2011–12 HCES of the NSSO in India and the auxiliary variables from the 2011 Population Census. Data obtained from these sources are used to estimate the district level incidence of food insecurity in rural Uttar Pradesh. The 2011–12 HCES used a stratified multi-stage random sampling design, with districts as the strata, villages as first stage units and households as second stage units. In the 2011–12 HCES, a total of 5916 households were surveyed from the 71 districts of Uttar Pradesh. The district sample sizes ranged from 32 to 128, with an average of 83. These district level sample sizes are relatively small, with an average sampling fraction of 0.0002 (Table 1). Consequently, it is difficult to generate reliable district level direct survey estimates with associated standard errors from this survey. We address this small sample size problem in the 2011–12 HCES data by implementing the SAE methodology, using auxiliary information from the 2011 Population Census to strengthen the limited sample data from the districts and so produce district level estimates.
Summary of sample size, sample count (i.e., number of food insecure households) and sampling fraction in 2011 HCES data
The target variable
The auxiliary variables are taken from the 2011 Population Census of India. These auxiliary variables are only available as counts at district level, and so SAE methods based on area level small area models must be employed to derive the small area estimates. Approximately 20–25 such auxiliary variables are available for use in the SAE analysis. We therefore carried out an exploratory data analysis to select a small number of auxiliary variables as appropriate covariates for SAE modelling. We also employed Principal Component Analysis (PCA) to derive composite scores for selected groups of variables. In particular, we carried out PCA separately on two groups of variables, all measured at district level and identified as S1 and S2 below. The first group (S1) consisted of the proportions of main workers by gender, the proportions of main cultivators by gender and the proportions of main agricultural labourers by gender. The first principal component (S11) for this group explained 44% of the variability in the S1 group, while adding the second component (S12) increased the explained variability to 69%. The second group (S2) consisted of the proportions of marginal cultivators by gender and the proportions of marginal agricultural labourers by gender. The first principal component (S21) for this group explained 52% of the variability in the S2 group, while adding the second component (S22) increased the explained variability to 90%.
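To make the reported percentages concrete, the following is a minimal Python sketch of how the share of variability explained by the first one or two principal components can be computed for a group of district level variables. It is an illustration only: the function names, the toy data and the choice to work with the covariance matrix of the centred variables (rather than the correlation matrix) are assumptions, since the paper does not state how the PCA was set up. Eigenvalues are obtained by simple power iteration with deflation so that no external library is needed.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows` (one row per district)."""
    n, k = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(k)]
    centred = [[row[j] - means[j] for j in range(k)] for row in rows]
    return [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(k)]
            for a in range(k)]


def leading_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and its eigenvector for a symmetric PSD matrix."""
    k = len(matrix)
    # Unequal starting weights make an exactly orthogonal start unlikely
    # (though not impossible for contrived inputs).
    vec = [float(j + 1) for j in range(k)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        new = [sum(matrix[a][b] * vec[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0.0:
            return 0.0, vec
        vec = [x / norm for x in new]
    value = sum(vec[a] * matrix[a][b] * vec[b]
                for a in range(k) for b in range(k))
    return value, vec


def explained_variance_ratios(rows, n_components=2):
    """Share of total variance explained by each of the leading components."""
    cov = covariance_matrix(rows)
    k = len(cov)
    total = sum(cov[j][j] for j in range(k))
    ratios = []
    for _ in range(n_components):
        value, vec = leading_eigenpair(cov)
        ratios.append(value / total)
        # Deflation: remove the component just found before the next pass.
        cov = [[cov[a][b] - value * vec[a] * vec[b] for b in range(k)]
               for a in range(k)]
    return ratios


# Hypothetical district level proportions (NOT Census data): 6 districts, 3 variables.
toy_group = [
    [0.42, 0.18, 0.25],
    [0.38, 0.22, 0.31],
    [0.51, 0.15, 0.20],
    [0.47, 0.19, 0.28],
    [0.35, 0.27, 0.33],
    [0.44, 0.16, 0.22],
]
ratios = explained_variance_ratios(toy_group, n_components=2)
cumulative = [sum(ratios[: i + 1]) for i in range(len(ratios))]
print("explained by PC1:", ratios[0])
print("explained by PC1 + PC2:", cumulative[1])
```

The cumulative share for two components corresponds to the kind of figure quoted above for S11 with S12 and for S21 with S22; in practice a standard PCA routine in a statistical package would normally be used instead of hand-coded power iteration.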
We fitted a generalised linear model using the direct survey estimates of the proportion of food insecure households as the response variable and the four principal component scores S11, S12, S21, S22 together with selected auxiliary variables from the 2011 Population Census as potential covariates. The final selected model included five covariates, namely the proportion of scheduled caste population (SC), literacy rate (Lit), proportion of working population (WP), index for main worker population (S11) and index for marginal worker population (S21), with an Akaike Information Criterion (AIC) value of 636.34. For this model, the null deviance is 430.88 on 70 degrees of freedom, and including the five covariates reduces the deviance to 294.72 on 65 degrees of freedom. That is, the residual deviance falls by 136.16 for a loss of five degrees of freedom, a significant reduction. We use the Hosmer-Lemeshow goodness-of-fit test to examine the fitted model (i.e., how closely the fitted values agree with the observed data). The
Small area estimation methodology
Let us assume that a finite population
If we ignore the sampling design, the sample count
with
An estimate of the corresponding proportion in area
The theoretical development is given in Chandra et al. (2019). Let define by
with
The model in Eq. (1) is based on unweighted sample counts, and hence it assumes that sampling within areas is non-informative given the values of the contextual variables and the random area effects. The small area predictor based on Eq. (2) therefore ignores the complex survey design used in the NSSO data. The sampling design used in the 2011 HCES is informative. Using the effective sample size rather than the actual sample size allows for the survey weights under complex sampling. Furthermore, the precision of an estimate from a complex sample can be higher than that from a simple random sample, because a representative sample drawn using a suitable sampling design makes better use of population data. Following Chandra et al. (2019), we model the survey weighted probability estimate for an area as a binomial proportion, with an “effective sample size” that equates the resulting binomial variance to the actual sampling variance of the survey weighted direct estimate for the area. Hence, in our analysis we replaced the “actual sample size” and the “actual sample count” with the “effective sample size” and the “effective sample count” respectively.
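The matching of variances described above can be sketched in a few lines of Python. This is a hedged illustration rather than the exact procedure of Chandra et al. (2019): the function names and the numerical inputs are hypothetical, and the effective sample count is taken here simply as the weighted proportion multiplied by the effective sample size, without any rounding.

```python
def effective_sample_size(p_w, var_p_w):
    """Solve p_w * (1 - p_w) / n = var_p_w for n.

    p_w     : survey weighted direct estimate of the area proportion
    var_p_w : estimated sampling variance of p_w under the complex design
    """
    if not 0.0 < p_w < 1.0:
        raise ValueError("p_w must lie strictly between 0 and 1")
    if var_p_w <= 0.0:
        raise ValueError("var_p_w must be positive")
    return p_w * (1.0 - p_w) / var_p_w


def effective_sample_count(p_w, n_eff):
    """Effective number of 'successes' implied by p_w and n_eff (assumed definition)."""
    return p_w * n_eff


# Hypothetical inputs, not taken from the HCES data.
p_w = 0.40        # weighted proportion of food insecure households in an area
var_p_w = 0.004   # its estimated design-based sampling variance

n_eff = effective_sample_size(p_w, var_p_w)
y_eff = effective_sample_count(p_w, n_eff)
print(n_eff, y_eff)
```

With these illustrative inputs the binomial variance 0.40 × 0.60 / n equals 0.004 when n is about 60, so an area with more than 60 sampled households would have lost information relative to simple random sampling, while an area with fewer would have gained.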
First, we examined whether the sampling design of the HCES sample data is informative. A sampling design is called informative if the distribution in the sample differs from the distribution in the population. In this case, the sampling design used to collect the survey data must be incorporated in order to make valid analytical inference about the population. Such sampling designs are also referred to as non-ignorable designs. For this purpose, the effective sample sizes and the effective sample counts for the 2011 HCES data were computed to assess whether the 2011 HCES sample is informative. See Chandra et al. (2019) for details of the calculation of the effective sample sizes and the effective sample counts. Figure 1 plots the effective sample sizes against the observed sample sizes. The effective sample counts and observed sample counts are shown in Fig. 2. It is evident from Fig. 1 that the effective sample size is smaller than the observed sample size in almost all the districts. Similarly, in Fig. 2 the effective sample counts are lower than the observed sample counts. This indicates that, compared with simple random sampling, the sampling design results in a loss of information in almost all the districts. Figure 3 presents the district-wise survey weighted and unweighted direct estimates of the proportion of household food insecurity. It can be seen from Fig. 3 that the unweighted direct estimates underestimate the proportion of food insecure households in the majority of the districts. These results provide evidence that the sampling design is informative and therefore must be accounted for in SAE. Hence, the SAE analysis reported in this paper uses effective sample sizes and effective sample counts in place of observed sample sizes and observed sample counts respectively, in order to incorporate the sampling design of the HCES data.
Effective sample size versus observed sample size in 2011 HCES data.
Effective sample count versus observed sample count in 2011 HCES data.
District-wise survey weighted direct estimates versus unweighted direct estimates of proportion of food insecure households.
The estimates of food insecurity prevalence (i.e., incidence of food insecurity) at district level for rural areas in the state of Uttar Pradesh are generated by the EPP method described in Section 4, using the five significant covariates described in Section 3. We now describe some important diagnostics used to examine the assumptions of the underlying models and to validate the empirical performance of the EPP method. Generally, two types of diagnostic measures are advised in SAE applications: (i) model diagnostics, and (ii) diagnostics for the small area estimates; see Chandra et al. (2011). The model diagnostics are applied to verify the model assumptions. The other diagnostics are used to validate the reliability of the model-based small area estimates of the incidence of food insecurity generated by the EPP method. In the model in Eq. (1), the random district specific effects are assumed to have a normal distribution with mean zero and fixed variance. If the model assumptions are satisfied, then the district level residuals are expected to be randomly distributed around zero. A histogram and a normal probability (q-q) plot can be used to examine the normality assumption. Figure 4 shows the histogram (left plot), the normal probability (q-q) plot (centre plot) and the distribution of the district-level residuals (right plot). We also use the Shapiro-Wilk test (implemented using the shapiro.test() function in R) to examine the normality of the district random effects. The Shapiro-Wilk test with
Histograms (left plot), normal q-q plots (centre plot) and distributions of the district-level residuals (right plot).
We consider three commonly used diagnostic measures for assessing the validity and the reliability of the model-based small area estimates: the bias diagnostic, the percent coefficient of variation (CV) diagnostic and the 95 percent confidence interval diagnostic. The first diagnostic assesses the validity, and the last two assess the reliability or improved precision, of the model-based small area estimates. In addition, we implemented a calibration diagnostic in which the model-based estimates are aggregated to a higher level and compared with direct survey estimates at this level; see, for example, Chandra et al. (2011). Note that here direct estimates are defined as the survey weighted direct estimates.
The bias diagnostic is based on the following idea. Since the direct estimates are unbiased estimates of the population values of interest (i.e., the true values), their regression on the true values should be linear and correspond to the identity line. If the model-based small area estimates are close to these true values, the regression of the direct estimates on these model-based estimates should be similar. We therefore plot the direct estimates (y-axis) against the model-based small area estimates (x-axis) and look for divergence of the fitted least squares regression line from the line of equality. In Fig. 5 we provide a bias diagnostic plot, defined by plotting direct survey estimates (
We now turn to the second set of diagnostics, which assess the extent to which the EPP estimates improve in precision compared with the direct estimates. The percent coefficient of variation (CV) is the estimated sampling standard error expressed as a percentage of the estimate. District level estimates with large CVs are considered unreliable. Table 2 provides a summary of the CVs of the direct and the EPP estimates. Fig. 6 presents the district-wise values of CV for the direct and EPP methods. In one of the 71 districts, the small CV (2.16%) of the direct estimate is due to an extreme value of the proportion. The sample size and sample count for this district are 64 and 58 respectively, and the direct estimate of the proportion of food insecurity is 0.967. Note that the effective sample size and effective sample count for this district are 25 and 24 respectively. In Table 2, we therefore present the summary based on 70 districts (excluding the one district with an extreme value of the proportion). In the discussion that follows we refer to the summary based on 70 districts only. The CVs of the direct estimates are larger than those of the EPP estimates. Table 2 and Fig. 6 show that the direct survey estimates of the incidence of food insecurity are unstable, with CVs that vary from 5.53 to 45.52% with an average of 14.59%. In contrast, the CV values of the EPP range from 5.12 to 24.29% with an average of 10.65%. The relative performance of the EPP compared with the direct survey estimates improves with decreasing district specific observed sample sizes. That is, the estimates computed from the EPP are more reliable and provide a better indication of food insecurity incidence in Uttar Pradesh. The district-wise 95 percent confidence intervals (CIs) generated by the direct and EPP methods are displayed in Fig. 7. The widths of the CIs are given in Fig. 8. Figures 7 and 8 show that the 95% CIs for the direct estimates are wider than the 95% CIs for the EPP.
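The two precision diagnostics used above reduce to simple arithmetic, sketched here in Python under stated assumptions. The percent CV follows the definition given in the text; the 95 percent interval is taken to be the usual normal-approximation interval (estimate ± 1.96 × standard error), which is an assumption because the paper does not spell out the interval formula. The function names and the numbers are hypothetical and are not values from Table 2 or Table 4.

```python
def percent_cv(estimate, std_error):
    """Estimated standard error expressed as a percentage of the estimate."""
    return 100.0 * std_error / estimate


def normal_ci_95(estimate, std_error):
    """Normal-approximation 95% confidence interval (assumed form)."""
    half_width = 1.96 * std_error
    return estimate - half_width, estimate + half_width


def ci_width(interval):
    """Width of a (lower, upper) interval."""
    lower, upper = interval
    return upper - lower


# Hypothetical district: same point estimate, two different standard errors.
direct_est, direct_se = 0.40, 0.05   # illustrative direct estimate
model_est, model_se = 0.40, 0.03     # illustrative model-based estimate

direct_cv = percent_cv(direct_est, direct_se)
model_cv = percent_cv(model_est, model_se)
direct_ci = normal_ci_95(direct_est, direct_se)
model_ci = normal_ci_95(model_est, model_se)

print(direct_cv, model_cv)
print(ci_width(direct_ci), ci_width(model_ci))
```

In this toy case the smaller standard error gives both a smaller CV and a narrower interval, which is the pattern the diagnostics look for when comparing the EPP with the direct estimates district by district.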
Summary of area distributions of percentage coefficients of variation (CV, %) for the direct and EPP methods applied to HCES data
Bias diagnostic plot with 
We inspect the aggregation or calibration property of the model-based district-level estimates generated by EPP at higher (e.g. State or Region) level. Let
Aggregated level estimates of incidence of food insecurity generated by direct and EPP method in different regions in Uttar Pradesh
District-wise percentage coefficient of variation (CV, %) for the direct (dotted line, 
District-wise 95 percent nominal confidence interval (95% CI) for the direct (solid line) and EPP (thin line) methods. Direct (dotted point) and EPP estimates (dash point) for the food insecurity prevalence in Uttar Pradesh are shown in the 95% CI.
District-wise width of 95 percent nominal confidence interval for the direct (solid line) and EPP (thin line) methods. Direct estimate (dotted point) and EPP estimate (dash point) are plotted in the 95% CI.
Map showing location of state of Uttar Pradesh in India.
Direct and model-based (EPP) estimates along with 95% confidence interval (95% CI) and percentage coefficient of variation (CV) of the incidence of food insecurity by District in rural areas of Uttar Pradesh
Nr: Nagar.
EPP estimates showing the spatial distribution of incidence of food insecurity by District in Uttar Pradesh.
Figure 9 shows the location of the state of Uttar Pradesh in India. In Fig. 10 we present a map showing the estimated proportion of food insecurity in the different districts in rural areas of Uttar Pradesh produced by the EPP method. This map shows the district-wise degree of inequality in the distribution of food insecurity in rural areas of Uttar Pradesh. The map is supplemented by the results set out in Table 4, where we report the district-wise estimates along with the CVs and 95% confidence intervals generated by the direct and EPP methods. The results indicate an east-west divide in the distribution of food insecurity. For example, in the western part of Uttar Pradesh there are many districts with a low incidence of food insecurity. In contrast, in the eastern part and in the Bundelkhand region (north-east) we see districts with a high incidence of food insecurity. These results should prove useful for policy planners and administrators aiming to take effective financial and administrative decisions.
Concluding remarks
In this paper we first summarise the empirical plug-in predictor (EPP) for small area proportions under an area level logistic linear mixed model. The EPP method is then applied to the 2011–12 HCES data collected by the NSSO of India to estimate the incidence of food insecurity and to produce a spatial map for the districts of rural Uttar Pradesh in India. The auxiliary variables used in this analysis were taken from the 2011 Population Census of India. Effective sample sizes were used in place of the observed sample sizes to account for the sampling design information of the 2011–12 HCES. The use of survey information through the effective sample size leads to more representative and realistic estimates of the incidence of food insecurity. The empirical results were also evaluated through several diagnostic measures, which revealed that the model-based SAE method defined by the EPP provides significant gains in efficiency for generating district level estimates of proportions of food insecurity. The spatial map produced from the estimates generated by the EPP method provides evidence of inequality in the distribution of the incidence of food insecurity across the districts of the State of Uttar Pradesh in India.
The availability of reliable district level estimates can be useful to various organizations and ministries in the Government of India, as well as to international organizations, for their policy research and strategic planning. These estimates will also be useful for budget allocation and for targeting welfare interventions by identifying the districts or regions with a high incidence of food insecurity. This application demonstrates the advantage of using the SAE technique to cope with the small sample size problem and to produce cost effective and reliable disaggregate level estimates and confidence intervals from existing survey data by combining auxiliary information from different published sources with direct survey estimates.
Footnotes
Acknowledgments
This paper is a tribute to
