Analysis of unit level models for small area estimation in crop statistics assisted with satellite auxiliary information

Abstract

Crop statistics for small areas, such as the community development block, are an increasingly important topic in agricultural statistics. Under normality assumptions, the classical Empirical Best Linear Unbiased Prediction (EBLUP) technique is effective for predicting small area means; however, the Small Area Estimation (SAE) model can be heavily affected by outliers or deviations from the assumed distribution. The purpose of this study was to estimate the variance and predict block-level wheat crop yield in the Hisar and Sirsa districts of Haryana using the classical SAE method and a robust random-effect predictor based on a slight generalization of Huber’s Proposal 2. In Sirsa district, the results of the classical and robust unit level SAE were very close, but this was not the case in Hisar district. This could be due to the influential observation found in the Hisar data set. More accurate EBLUP wheat yield estimates are obtained when the Huber-type M-estimation method is initialized by the least square regression estimator.

Keywords

EBLUP, Huber-type M-estimation, Maximum likelihood, Mean squared prediction error, NDVI, Small area estimation

1. Introduction

The bedrock of any agricultural statistics system is crop area and crop production. In India, crop area data are based on complete enumeration, whereas crop yield is estimated using a sample survey approach. The crop-cutting experiments (CCEs) of the General Crop Estimation Surveys (GCES) program serve as the foundation for the yield rate estimates. A CCE involves randomly selecting a field in which a particular crop is growing and marking out a plot of a specific size and shape in it according to preset instructions. Because the grain contains moisture on the day it is harvested, the harvested grain is weighed after drying to determine the salable form of the produce. For yield estimates, more than 800,000 CCE samples are collected annually. Estimates of crop production or yield from CCEs are recorded at district level and collated at state and country level. Although this method can provide reliable estimates at district level, it is time- and resource-consuming; as a result, enumerators do not follow the CCE procedure in many places, which leads to incorrect data reporting. In order to enhance the accuracy of the data obtained under the GCES, the ‘Improvement of Crop Statistics (ICS)’ program has been initiated. Under this program, quality assurance of the GCES field work is conducted by overseeing about 30,000 CCEs. The conclusions of the ICS study indicate that CCEs are often not carried out precisely, so the data fall short of the expected accuracy. Given the limitations of resources and infrastructure, the GCES sample size needs to be significantly reduced so that the enumerators’ workload is minimized and better supervision of CCE operations can lead to improved data quality. But a reduction in sample size will directly impact the estimator’s standard error. 
A reduced sample size is even more of a concern when producing estimates at lower administrative levels, because estimators based on the sample data alone may be unreliable. In addition, Tikkiwal and Tikkiwal (2000) and Tikkiwal and Ghiya (2004) reported that direct CCE estimates are reasonably accurate at both national and state levels but are not valid at lower levels. However, the computation of crop production statistics for a small area, such as the Community Development Block, has received increasing interest in India in recent times. This is partly because regional planning is mostly carried out on a local or regional basis and partly because of the allocation of central funds.

The Ministry of Agriculture and Farmers Welfare, Government of India (GOI), emphasizes the use of advanced technology to address the accuracy of the CCEs as well as their validity and speed of operation. This will ensure fair assessment and prompt payment of farmers’ claims. Since the advent of a range of Earth-orbiting satellites with high-resolution imagery, there has been a major increase in satellite imagery products. Several scientific studies have been designed to translate satellite information into reasonable crop yield estimates at the individual pixel and segment levels. Assessment of vegetation and monitoring of changes in vegetation patterns are central to the management and monitoring of natural resources, for example in crop vigor analysis (Thiam & Eastman, 1999). Healthy crops are characterized by strong absorption of red energy and strong reflection of near-infrared energy (Taylor et al., 1997). The strong contrast between absorption in the red band and reflection in the near-infrared band can be combined into different quantitative indices. Such combinations are referred to as vegetation indices. Since the late 1980s, numerous studies have been conducted on crop growth analysis using the Normalized Difference Vegetation Index (NDVI) to support precision farming. Several studies have shown that there is a significant correlation between yield and NDVI value. The small sample size problem can be addressed by using available auxiliary information to supplement the inadequate sample information from the study area. This is the principle behind Small Area Estimation (SAE). SAE methods are usually model-based; see, for example, Pfeffermann (2002) and Rao (2003). The idea is to use statistical models to link the variable of interest to ancillary data (e.g., administrative data, remote sensing data) for the small areas in order to define area-specific model-based estimators (Chandra, 2013; Jaslam et al., 2020). 
Such small-area models are divided into two specific groups.

(i) Area level random effect models, which are applied when the ancillary data are available only at area level. They connect the small area direct survey estimates to area-specific covariates (Fay & Herriot, 1979); and
(ii) Unit level random effect models, initially proposed by Battese et al. (1988). Such models link the unit values of the study variable to the unit-specific covariates.

In many SAE applications, the unit level small area model is rarely used, primarily because unit level data are unavailable. This study aims to produce precise block-level estimates of wheat crop yield, along with their mean squared prediction errors, using remote sensing and administrative data under a unit level random effect model.
2. Materials and methods

It is not possible to increase the number of CCEs. The only solution here is to complement and reinforce the data with relevant and influential auxiliary information to boost the reliability of small area estimates. The study variable in this study is the yield of the wheat crop. Yield data from the CCEs of wheat crop under the ICS scheme were collected for the Hisar and Sirsa districts of Haryana, India, during the period 2018–2019. We focus on estimating the wheat crop’s average yield at the block level. The district of Hisar, situated at 29.12 ${}^{\circ}$ N, 75.81 ${}^{\circ}$ E in the northern part of India, covers approximately 3,983 km ${}^{2}$ . The district’s net crop area is 4,040 km ${}^{2}$ with a cropping intensity of 178.2 per cent; of this agricultural land, rabi wheat is a key economic crop, with a total area of up to 2,240 km ${}^{2}$ . At present, Hisar district consists of 9 sub-districts (blocks), viz., Adampur, Agroha, Barwala, Hisar-1, Hisar-2, Hansi-1, Hansi-2, Narnaund and Uklana. The Sirsa district, situated at 29.12 ${}^{\circ}$ N and 75.81 ${}^{\circ}$ E, covers approximately 4,277 km ${}^{2}$ . The net sown area of the district is 3,940 km ${}^{2}$ and wheat is cultivated on around 2,430 km ${}^{2}$ . Sirsa, Rania, Ellenabad, Dabwali, Nathusari Chopta, Baragudha, and Odhan are the seven sub-districts (blocks) that constitute the Sirsa district. A total of 275 villages from 9 blocks in the Hisar district were considered for CCEs, with sample sizes ranging from 12 to 46 villages per block, and 319 villages from 7 blocks in the Sirsa district were considered for CCEs, with sample sizes ranging from 33 to 55 villages per block. The area-specific sample sizes for blocks in six districts in Haryana’s western zone range from 12 to 119 CCEs, with an average of 45. The unit-specific auxiliary variables used for this study are:

1. Area under wheat crop in the wheat-growing villages of each district, collected from administrative records maintained by the respective agricultural development agencies.

2. NDVI of the wheat crop for each village, computed from the MOD13Q1 v006 satellite image at the crop’s maturing stage (9 March 2019).

The values of the NDVI range from $+$ 1.0 to $-$ 1.0. Bare rock, sand, or snow areas usually have very low NDVI values (e.g., 0.1 or less). Sparse vegetation, such as shrubs and grasslands or senescing crops, results in moderate NDVI values (approximately 0.2 to 0.5). High NDVI values (approximately 0.6 to 0.9) correspond to dense vegetation, such as that found in temperate and tropical forests or in crops at their peak growth stage.
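To illustrate how the index behaves, the following minimal sketch computes NDVI from red and near-infrared reflectances using the standard definition NDVI $=$ (NIR $-$ Red) / (NIR $+$ Red) and maps a value onto the approximate bands quoted above. The function names, the example reflectances and the label used for values that fall between the quoted bands are assumptions made for illustration only; in this study the NDVI values are taken from the MOD13Q1 product rather than computed in this way.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    total = nir + red
    if total == 0:
        raise ValueError("NDVI is undefined when NIR + Red is zero")
    return (nir - red) / total


def land_cover_hint(value):
    """Rough interpretation using the approximate bands quoted in the text."""
    if not -1.0 <= value <= 1.0:
        raise ValueError("NDVI must lie in [-1, 1]")
    if value <= 0.1:
        return "bare rock, sand or snow"
    if 0.2 <= value <= 0.5:
        return "sparse vegetation"
    if 0.6 <= value <= 0.9:
        return "dense vegetation"
    # The text gives approximate bands only; values in the gaps are not labelled.
    return "between the quoted bands"


# Hypothetical reflectances for a vigorous crop canopy (not data from the study).
example = ndvi(nir=0.50, red=0.08)
print(round(example, 3), land_cover_hint(example))
```

Because the index is a ratio of a difference to a sum, it is bounded between $-$ 1 and $+$ 1 regardless of the overall brightness of the pixel.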

In this study, we consider unit-level SAE models that link the study variable to the unit-specific covariates, followed by the computation of the maximum likelihood (ML) and Huber-type M-estimates and the prediction of random effects and area-specific means. The theoretical approach of SAE is outlined in the following section.

The best linear unbiased prediction (BLUP) estimator combines a direct estimator with an indirect estimator so that the direct estimator gets more weight if its MSE within domain $i$ is small, and vice versa (Rao, 1999). A mixed-effects model with a random intercept at the domain level is used to model the variable of interest, i.e.

$\displaystyle y_{ij}=X_{ij}^{T}\beta+v_{i}+\varepsilon_{ij},\quad i=1,\ldots,M,\; j=1,\ldots,N_{i}.$ (1)

The random effects $v_{i}$ and residuals $\varepsilon_{ij}$ are assumed to be independent and identically distributed with

$\displaystyle v_{i}\sim N\left({0,\sigma_{v}^{2}}\right),\quad\varepsilon_{ij}\sim N\left({0,\sigma_{\varepsilon}^{2}}\right)$ (2)

and independent of one another. After accounting for the fixed part of the model, the variance of the random effects ( $\sigma_{v}^{2}$ ) can be considered a measure of variability between the domains $i$ . Assuming that the population size $N_{i}$ in domain $i$ is large, the BLUP estimator of the population mean within domain $i$ under assumption Eq. (2) is given by

$\displaystyle\bar{Y}_{B,i}=\bar{X}^{T}_{i}\beta+v_{i},$ (3)

that is, the synthetic regression estimator (SRE) and the realization of the random effect are added to yield the BLUP estimate of the population mean within area $i$ , because the mean of the residuals is close to zero. The BLUP estimate is a model-based small area estimate, as it employs random effects to account for between-area variation that is not explained by the covariates. The BLUP estimator Eq. (3) can alternatively be expressed as

$\displaystyle\bar{Y}_{B,i}=\bar{X}^{T}_{i}\beta+\gamma_{i}\left(\frac{1}{N_{i}}\sum_{j}\varepsilon_{ij}\right),$ (4)

where $\bar{X}^{T}_{i}\beta$ is the SRE based on the mixed-effects model and $\varepsilon_{ij}$ are the residuals of the mixed-effects model at the sampled population elements (Henderson, 1953). Apart from the fitted values of the SRE, the BLUP estimator Eq. (4) differs from the generalized regression (GREG) estimator by the factor

$\displaystyle\gamma_{i}=\frac{\sigma_{v}^{2}}{\sigma_{v}^{2}+\frac{\sigma_{\varepsilon}^{2}}{N_{i}}}$

which controls the weight of the bias-correction term based on model performance and the number of sampled population elements within domain $i$ . The bias-correction term is given more weight if the model has a small residual variance or the number of sampled elements within domain $i$ is large.
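The role of $\gamma_{i}$ can be seen numerically. The following minimal sketch evaluates the shrinkage factor and the BLUP form of Eq. (4), assuming that $\beta$ , $\sigma_{v}^{2}$ and $\sigma_{\varepsilon}^{2}$ are known and that the residuals are $y_{ij}-X_{ij}^{T}\beta$ at the sampled units. The function names and all numbers are hypothetical and are not taken from the study.

```python
def shrinkage_factor(sigma2_v, sigma2_e, n_i):
    """gamma_i = sigma_v^2 / (sigma_v^2 + sigma_e^2 / n_i)."""
    return sigma2_v / (sigma2_v + sigma2_e / n_i)


def blup_area_mean(xbar_i, beta, residuals_i, sigma2_v, sigma2_e):
    """Eq. (4): synthetic part plus gamma_i times the mean sample residual."""
    n_i = len(residuals_i)
    gamma_i = shrinkage_factor(sigma2_v, sigma2_e, n_i)
    synthetic = sum(x * b for x, b in zip(xbar_i, beta))
    return synthetic + gamma_i * sum(residuals_i) / n_i


# gamma_i moves towards 1 as the area sample size grows (illustrative values).
for n_i in (2, 10, 50):
    print(n_i, round(shrinkage_factor(1.0, 4.0, n_i), 3))
```

With few sampled units the estimate stays close to the synthetic part $\bar{X}^{T}_{i}\beta$ ; with many sampled units the mean residual is used almost in full.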

Marginally, the basic unit-level model is defined as

$\displaystyle y_{i}\sim N\left({X_{i}\beta,\Omega_{i}\left(\theta\right)}\right),\quad i=1,\ldots,M$ (5)

with $\theta=\left(\sigma_{v}^{2},\sigma_{\varepsilon}^{2}\right)^{T}$ and $\Omega_{i}\left(\theta\right)=\sigma_{\varepsilon}^{2}I_{i}+\sigma_{v}^{2}1_{i}1_{i}^{T}$ , where $I_{i}$ is the ( $N_{i}\times N_{i}$ ) identity matrix, $1_{i}$ is the $N_{i}$ -vector of ones, and $y_{i}$ is an $N_{i}$ -vector.

The maximum likelihood (ML) method can be used to estimate the parameters $\left({\beta,\sigma_{v}^{2},\sigma_{\varepsilon}^{2}}\right)$ of the mixed-effects model from the data. Substituting the variance components $\left({\sigma_{v}^{2},\sigma_{\varepsilon}^{2}}\right)$ in the BLUP estimator Eq. (3) with their estimated counterparts $\left(\widehat{\sigma}^{2}_{v},\widehat{\sigma}^{2}_{\varepsilon}\right)$ yields the empirical best linear unbiased prediction (EBLUP) estimator ( $\bar{Y}_{E,i}$ ).

2.1 Maximum likelihood (ML) estimator of the model parameters $\beta$ and $\theta=\left({\sigma_{v}^{2},\sigma_{\varepsilon}^{2}}\right)^{T}$

For the model Eq. (5), the non-constant part of the log-likelihood, $l\left({\tau,X,y}\right)$ , is given by

$\displaystyle-2l\left({\tau,X,y}\right)=\sum_{i=1}^{M}\log\left|{\Omega_{i}}\right|+\sum_{i=1}^{M}\left({y_{i}-X_{i}\beta}\right)^{T}\Omega_{i}^{-1}\left({y_{i}-X_{i}\beta}\right)$ (6)

where $\tau=\left({\beta^{T},\theta^{T}}\right)^{T}$ . When no contamination is assumed to be present, estimates of the parameter vector $\tau$ are obtained by means of ML. The ML estimator $\widehat{\tau}$ is defined by $l\left({\widehat{\tau},X,y}\right)=\sup_{\tau\in\Theta}l\left({\tau,X,y}\right)$ , provided $\widehat{\tau}$ is an interior point of $\Theta$ . The covariance matrix $\Omega_{i}$ ( $i=1,\ldots,M$ ) can be expressed as follows

$\displaystyle\Omega_{i}=\sigma_{\varepsilon}^{2}I_{i}+\sigma_{v}^{2}J_{i}=U\left({I_{i}+dJ_{i}}\right)=UV_{i}$

where $J_{i}=1_{i}1_{i}^{T}$ , $U=\sigma_{\varepsilon}^{2}$ , and $d=\frac{\sigma_{v}^{2}}{\sigma_{\varepsilon}^{2}}\equiv\frac{A}{U}$ with $A=\sigma_{v}^{2}$ . The main benefit of using this Hartley-Rao parametrization is that we get a separate equation for $U$ . Thus, on rewriting the log-likelihood, we obtain

$\displaystyle-2l\left({\tau,X,y}\right)=\sum_{i=1}^{M}\log\left|{V_{i}}\right|+\sum_{i=1}^{M}N_{i}\log U+\frac{1}{U}\sum_{i=1}^{M}\left({y_{i}-X_{i}\beta}\right)^{T}V_{i}^{-1}\left({y_{i}-X_{i}\beta}\right).$ (7)

As long as the maximum is not attained on the boundary of the parameter space, the maximum likelihood estimates $\widehat{\beta}$ , $\widehat{U}$ and $\widehat{d}$ are the solutions to the system of Fisher-score equations

$\displaystyle-2\left(\frac{1}{U}\right)\sum_{i=1}^{M}X_{i}^{T}V_{i}^{-1}\left(y_{i}-X_{i}\beta\right)=0$ (8)

$\displaystyle\sum_{i=1}^{M}\frac{N_{i}}{U}-\left({\frac{1}{U^{2}}}\right)\sum_{i=1}^{M}\left({y_{i}-X_{i}\beta}\right)^{T}V_{i}^{-1}\left({y_{i}-X_{i}\beta}\right)=0$ (9)

$\displaystyle\sum_{i=1}^{M}\left[1_{i}^{T}V_{i}^{-1}1_{i}-\left({\frac{1}{U}}\right)1_{i}^{T}V_{i}^{-1}\left({y_{i}-X_{i}\beta}\right)\left({y_{i}-X_{i}\beta}\right)^{T}V_{i}^{-1}1_{i}\right]=0$ (10)

The MLE of $U$ is given by

$\displaystyle\widehat{U}=\widehat{\sigma}^{2}_{\varepsilon}=\frac{1}{N}\sum^{M}_{i=1}(y_{i}-X_{i}\beta)^{T}V_{i}^{-1}(y_{i}-X_{i}\beta)$

where $N=\sum_{i=1}^{M}N_{i}$ .

Equations (8) and (10), on the other hand, have no closed-form solution and must be solved by means of iterative numerical optimization techniques.
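One way to organize such an iterative solution is to profile the likelihood over $d$ . For $V_{i}=I_{i}+dJ_{i}$ one has $V_{i}^{-1}=I_{i}-\frac{d}{1+N_{i}d}J_{i}$ and $\left|V_{i}\right|=1+N_{i}d$ , so that for any fixed $d$ Eq. (8) is a generalized least squares problem and Eq. (9) gives $U$ in closed form. The sketch below is a minimal illustration under stated assumptions, not the procedure used in this study: it handles an intercept and a single covariate, it uses a golden-section search as a stand-in for the numerical optimizer, and it runs on simulated data. All function names and simulation settings are hypothetical.

```python
import math
import random


def profile(d, areas):
    """For a fixed ratio d, return (-2 log-lik up to a constant, beta, U).

    areas: list of (x, y) pairs of equal-length lists, one pair per area,
    for the model y_ij = b0 + b1 * x_ij + v_i + e_ij.
    """
    a00 = a01 = a11 = c0 = c1 = 0.0
    for x, y in areas:
        n = len(x)
        c = d / (1.0 + n * d)           # V_i^{-1} = I - c * 1 1^T
        sx, sy = sum(x), sum(y)
        a00 += n - c * n * n            # entries of sum_i X_i^T V_i^{-1} X_i
        a01 += sx - c * n * sx
        a11 += sum(u * u for u in x) - c * sx * sx
        c0 += sy - c * n * sy           # entries of sum_i X_i^T V_i^{-1} y_i
        c1 += sum(u * w for u, w in zip(x, y)) - c * sx * sy
    det = a00 * a11 - a01 * a01
    b0 = (a11 * c0 - a01 * c1) / det    # solves Eq. (8) for this d
    b1 = (a00 * c1 - a01 * c0) / det
    quad = logdet = 0.0
    total_n = 0
    for x, y in areas:
        n = len(x)
        c = d / (1.0 + n * d)
        r = [w - b0 - b1 * u for u, w in zip(x, y)]
        quad += sum(e * e for e in r) - c * sum(r) ** 2   # r^T V_i^{-1} r
        logdet += math.log(1.0 + n * d)                   # log|V_i|
        total_n += n
    U = quad / total_n                  # closed-form solution of Eq. (9)
    return logdet + total_n * math.log(U) + total_n, (b0, b1), U


def fit_ml(areas, d_max=100.0, tol=1e-8):
    """Minimise the profiled Eq. (7) over d in [0, d_max] (golden-section search)."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = 0.0, d_max
    while hi - lo > tol:
        m1, m2 = hi - g * (hi - lo), lo + g * (hi - lo)
        if profile(m1, areas)[0] < profile(m2, areas)[0]:
            hi = m2
        else:
            lo = m1
    d = 0.5 * (lo + hi)
    _, beta, U = profile(d, areas)
    return beta, U, d * U               # (beta, sigma_e^2, sigma_v^2)


def simulate(m=30, n=20, beta=(2.0, 3.0), sd_v=1.0, sd_e=1.0, seed=1):
    """Hypothetical balanced data set drawn from model Eq. (1)."""
    rng = random.Random(seed)
    areas = []
    for _ in range(m):
        v = rng.gauss(0.0, sd_v)
        x = [rng.random() for _ in range(n)]
        y = [beta[0] + beta[1] * u + v + rng.gauss(0.0, sd_e) for u in x]
        areas.append((x, y))
    return areas


beta_hat, sigma2_e_hat, sigma2_v_hat = fit_ml(simulate())
print(beta_hat, sigma2_e_hat, sigma2_v_hat)
```

Profiling reduces the search to a single parameter, which is why a one-dimensional method suffices here; the golden-section search assumes that the profiled criterion has a single minimum on the search interval.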

Taylor-linearization can be used to estimate the domain-specific MSE of the EBLUP estimator. The MSE estimator consists of four components (Prasad & Rao, 1990).

$\displaystyle\widehat{\textit{MSE}}\left(\bar{Y}_{E,i}\right)=C_{1,i}+C_{2,i}+C_{3,i}+C_{3,i}^{*},$

where $C_{1,i}=\widehat{\gamma}_{i}\left(\frac{\widehat{\sigma}^{2}_{\varepsilon}}{N_{i}}\right)$ .

The second component accounts for uncertainty introduced by the estimation of the coefficient vector $\beta$

$\displaystyle C_{2,i}=(\bar{X}_{i}-\widehat{\gamma}_{i}\bar{x}_{i})^{T}\left(\sum^{M}_{i=1}X^{T}_{i}u_{i}X_{i}\right)^{-1}(\bar{X}_{i}-\widehat{\gamma}_{i}\bar{x}_{i}),$

with the population means $\bar{X}_{i}=(1,\bar{X}_{i1},\ldots,\bar{X}_{ip})^{T}$ and sample means $\bar{x}_{i}=(1,\bar{x}_{i1},\ldots,\bar{x}_{ip})^{T}$ of the covariates within domain $i$ . The $N_{i}\times\left({p+1}\right)$ matrix $X_{i}$ is the design matrix of the mixed-effects model Eq. (1) within domain $i$ and the $N_{i}\times N_{i}$ matrix

$\displaystyle u_{i}=\frac{1}{\widehat{\sigma}^{2}_{\varepsilon}}\left({\bm{I}}_{N_{i}}-\frac{\widehat{\gamma}_{i}}{N_{i}}{\bf 1}_{N_{i}}{\bf 1}^{T}_{N_{i}}\right)$

is built from the identity matrix ${\bm{I}}_{N_{i}}=\textit{diag}_{N_{i}}(1)$ and the column vector ${\bf 1}_{N_{i}}=\left({1_{1},\ldots,1_{N_{i}}}\right)^{T}$ .

The third component compensates for the variation introduced by the estimate of the variance components $\left({\sigma_{v}^{2},\sigma_{\varepsilon}^{2}}\right)$ and is given by

$\displaystyle C_{3,i}=N_{i}^{-2}\left(\widehat{\sigma}^{2}_{v}+\frac{\widehat{\sigma}^{2}_{\varepsilon}}{N_{i}}\right)^{-1}C_{\textit{cov}},$

with $C_{\textit{cov}}=\widehat{\sigma}^{4}_{v}\bar{V}_{vv}+\widehat{\sigma}^{4}_{\varepsilon}\bar{V}_{\varepsilon\varepsilon}+2\widehat{\sigma}^{2}_{v}\widehat{\sigma}^{2}_{\varepsilon}\bar{V}_{v\varepsilon}$ ,

where $\bar{V}_{vv}$ , $\bar{V}_{\varepsilon\varepsilon}$ and $\bar{V}_{v\varepsilon}$ are the asymptotic variances and covariances of $\widehat{\sigma}^{2}_{v}$ and $\widehat{\sigma}^{2}_{\varepsilon}$ .

The fourth component is an area-specific version of the third component and given by

$\displaystyle C^{\ast}_{3,i}=N_{i}^{-2}\left(\widehat{\sigma}^{2}_{v}+\frac{\widehat{\sigma}^{2}_{\varepsilon}}{N_{i}}\right)^{-4}C_{\textit{cov}}(\bar{y}_{i}-\widehat{\bar{y}}_{i})^{2},$

where $\bar{y}_{i}$ and $\widehat{\bar{y}}_{i}$ are the means of the samples and estimates of model Eq. (1) within domain $i$ , respectively.

2.2 Robust EBLUP method (REBLUP)

Although the standard EBLUP approach is effective for estimating small area means under normality assumptions, it is significantly affected by the presence of outliers in the data. Furthermore, in contrast to regression or location-scale models, mixed linear models lack a nice invariance structure. Notably, this implies that in the presence of contamination the parameters cannot be estimated consistently; there is an unavoidable asymptotic bias. Any approach estimates the parameter at the core model with an unknown bias when contamination is present. The bias in ML estimators can be very large, making these estimators exceedingly inefficient (Welsh & Richardson, 1997). Many researchers have recommended robust estimation procedures for mixed linear models, ranging from rather algorithmic approaches (Rocke, 1983, 1991) through the robustification of Henderson’s mixed-model equations (Fellner, 1986) to replacing the Fisher scores by robust Fréchet-differentiable statistical functionals (Bednarski & Zontek, 1996). Copt and Victoria-Feser (2006) proposed an S-estimator and provide software for balanced data. M-estimator-type methods, based on either a robustified likelihood (RML 1) or bounded-influence estimating equations (RML 2), have received considerable attention in the literature (Richardson & Welsh, 1995).

2.3 Robust M-Estimator EBLUP

When there is contamination, the ML estimates can be strongly affected. It is therefore sensible to replace the system of Fisher-score Eqs (8)–(10) by estimating equations (EE) whose influence functions are bounded, i.e., so-called bounded-influence estimating equations (BIEE). The BIEE for $\beta$ , $\sigma_{\varepsilon}^{2}$ and $\frac{\sigma_{v}^{2}}{\sigma_{\varepsilon}^{2}}$ are computed using the methodology described by Schoch (2011).

2.4 BIEE for $\beta$

For the fixed effects $\beta$ , Eq. (8) is replaced by the BIEE

$\displaystyle\sum_{i=1}^{M}\left({\frac{1}{\sqrt{U}}}\right)X_{i}^{T}V_{i}^{-\frac{1}{2}}\psi_{k}\left[{\left({\frac{1}{\sqrt{U}}}\right)V_{i}^{-\frac{1}{2}}\left({y_{i}-X_{i}\widehat{\beta}^{R}}\right)}\right]=0,$ (11)

where $\psi_{k}$ denotes the Huber $\psi$ function, $\psi_{k}(x)=\min\left\{k,\max\left(-k,x\right)\right\}$ , applied element-wise with tuning constant $k$ .

The solution of the above equation is obtained by an iteratively re-weighted least squares (IRWLS) procedure, which is the mainstay for calculating $M$ -estimates of regression (Maronna et al., 2006). Denote by $\left\{\beta\right\}^{\left(s\right)}$ the estimate of $\beta$ produced by the algorithm at the $s^{\text{th}}$ iteration ( $s=$ 1, 2, $\ldots$ ). A revised estimate is obtained as

$\displaystyle\left\{\beta\right\}^{\left({s+1}\right)}=\left({\sum_{i=1}^{M}\left({\left\{{W_{i}}\right\}^{\left(s\right)}\left\{{V_{i}^{-\frac{1}{2}}}\right\}^{\left(s\right)}X_{i}}\right)^{T}\left\{{W_{i}}\right\}^{\left(s\right)}\left\{{V_{i}^{-\frac{1}{2}}}\right\}^{\left(s\right)}X_{i}}\right)^{-1}\times\left({\sum_{i=1}^{M}\left({\left\{{W_{i}}\right\}^{\left(s\right)}\left\{{V_{i}^{-\frac{1}{2}}}\right\}^{\left(s\right)}X_{i}}\right)^{T}\left\{{W_{i}}\right\}^{\left(s\right)}\left\{{V_{i}^{-\frac{1}{2}}}\right\}^{\left(s\right)}y_{i}}\right)$

where $W_{i}=\textit{diag}\left({w_{i}}\right)$ , with $w_{i}=\left({w_{i1},\ldots,w_{iN_{i}}}\right)^{T}$ , $w_{ij}=\left[{\frac{\psi_{k}\left({r_{ij}}\right)}{r_{ij}}}\right]^{\frac{1}{2}}$ , and $r_{i}=\left({\frac{1}{\sqrt{U}}}\right)V_{i}^{-\frac{1}{2}}\left({y_{i}-X_{i}\beta}\right)$ .

Put $\tilde{X}_{i}=(\frac{1}{\sqrt{U}})W_{i}V_{i}^{-\frac{1}{2}}X_{i}$ and $\tilde{y}_{i}=(\frac{1}{\sqrt{U}})W_{i}V_{i}^{-\frac{1}{2}}y_{i}$ ; then

$\displaystyle\left\{\beta\right\}^{\left({s+1}\right)}=\left(\sum_{i=1}^{M}\left\{\tilde{X}^{T}_{i}\right\}^{(s)}\{\tilde{X}_{i}\}^{(s)}\right)^{-1}\left(\sum_{i=1}^{M}\left\{\tilde{X}^{T}_{i}\right\}^{(s)}\{\tilde{y}_{i}\}^{(s)}\right)$

Since the above equation is a classical least squares problem, we iteratively obtain revised estimates of $\beta$ by regression. To do so, we decompose $\tilde{X}$ by means of the “skinny” QR-factorization (Gentle, 2007). Write $\tilde{X}=QR$ , where $R=\left({R_{1}^{T},O^{T}}\right)^{T}$ , with $R_{1}$ a ( $p\times p$ ) upper triangular matrix and $O$ a zero matrix. $Q$ is an ( $N\times N$ ) orthogonal matrix which can be partitioned likewise: $Q=(Q_{1},Q_{2})$ , where $Q_{1}$ is an ( $N\times p$ ) matrix whose columns are orthonormal. Thus $\tilde{X}=Q_{1}R_{1}$ . Therefore, the weighted linear system $\tilde{X}\beta=\tilde{y}$ can be stated as $R_{1}\beta=Q_{1}^{T}\tilde{y}$ . Since $R_{1}$ is a ( $p\times p$ ) triangular matrix, the system is easy to solve: $\beta=R_{1}^{-1}Q_{1}^{T}\tilde{y}$ . The IRWLS algorithm now consists of solving $\left\{\beta\right\}^{\left({s+1}\right)}=\left\{{R_{1}^{-1}}\right\}^{\left(s\right)}\left\{{Q_{1}^{T}}\right\}^{\left(s\right)}\left\{\tilde{y}\right\}^{\left(s\right)}$ iteratively. The final value is taken as the estimate $\widehat{\beta}^{R}$ .
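The IRWLS idea can be sketched in a few lines for the simplest possible case. The example below is an illustration under simplifying assumptions, not the estimator used in this study: it takes $V_{i}=I_{i}$ (no area effect), treats the scale $\sqrt{U}$ as known, fits only an intercept and one covariate, and solves each weighted least squares step through the normal equations rather than the QR-factorization described above. The function names, the tuning constant $k=1.345$ and the data are hypothetical.

```python
import math


def huber_psi(r, k=1.345):
    """Huber psi function: identity on [-k, k], clipped outside."""
    return max(-k, min(k, r))


def huber_weight(r, k=1.345):
    """w = sqrt(psi_k(r) / r), with the limiting value 1 at r = 0."""
    return 1.0 if r == 0.0 else math.sqrt(huber_psi(r, k) / r)


def weighted_line(x, y, w):
    """Least squares of (w * y) on (w, w * x), i.e. every row is scaled by w_j."""
    w2 = [wj * wj for wj in w]
    s = sum(w2)
    sx = sum(a * u for a, u in zip(w2, x))
    sy = sum(a * v for a, v in zip(w2, y))
    sxx = sum(a * u * u for a, u in zip(w2, x))
    sxy = sum(a * u * v for a, u, v in zip(w2, x, y))
    det = s * sxx - sx * sx
    return (sxx * sy - sx * sxy) / det, (s * sxy - sx * sy) / det


def irwls_line(x, y, scale=1.0, k=1.345, tol=1e-10, max_iter=500):
    """Huber M-estimate of (intercept, slope) with a known residual scale."""
    b0, b1 = weighted_line(x, y, [1.0] * len(x))      # start from ordinary LS
    for _ in range(max_iter):
        w = [huber_weight((v - b0 - b1 * u) / scale, k) for u, v in zip(x, y)]
        n0, n1 = weighted_line(x, y, w)
        converged = abs(n0 - b0) + abs(n1 - b1) < tol
        b0, b1 = n0, n1
        if converged:
            break
    return b0, b1


# Hypothetical data: an exact line y = 1 + 2x with one gross outlier at x = 9.
x = [float(i) for i in range(10)]
y = [1.0 + 2.0 * u for u in x]
y[9] = 100.0
print("least squares:", weighted_line(x, y, [1.0] * 10))
print("Huber IRWLS:  ", irwls_line(x, y))
```

Observations with standardized residuals inside $[-k,k]$ keep weight one, while larger residuals are down-weighted, so the outlier pulls the robust fit far less than it pulls the least squares fit.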

2.5 BIEE for $\sigma_{\varepsilon}^{2}\equiv U$

A bounded-influence EE for $\sigma_{\varepsilon}^{2}$ that replaces the non-robust Fisher score Eq. (9) is obtained in the spirit of Huber’s Proposal 2 (Huber, 1964), as the solution $\widehat{U}^{R}$ to the bounded-influence estimating equation

$\displaystyle\left[\sum_{i=1}^{M}N_{i}\right]^{-1}\frac{1}{\delta_{k}}\sum_{i=1}^{M}\psi_{k}\left(\frac{V_{i}^{-\frac{1}{2}}r_{i}}{\sqrt{\widehat{U}^{R}}}\right)^{T}\psi_{k}\left(\frac{V_{i}^{-\frac{1}{2}}r_{i}}{\sqrt{\widehat{U}^{R}}}\right)=1$

where $\delta_{k}=E\left[{\psi_{k}\left(v\right)^{2}}\right]$ is a consistency correction term that ensures consistency of the estimate at the Gaussian core model. An updated estimate, $\left\{U\right\}^{\left(s+1\right)}$ , is given by

$\displaystyle\left\{U\right\}^{\left(s+1\right)}=\frac{1}{N\delta_{k}}\sum_{i=1}^{M}\left\{{W_{i}}\right\}^{\left(s\right)}\left\{{r_{i}^{T}}\right\}^{\left(s\right)}\left\{{V_{i}^{-1}}\right\}^{\left(s\right)}\left\{{r_{i}}\right\}^{\left(s\right)}$
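The consistency constant and the scale iteration can be illustrated for independent residuals. The sketch below is a simplified illustration, not the implementation used in this study: it assumes $V_{i}=I_{i}$ , takes the expectation in $\delta_{k}$ under a standard normal distribution, and iterates the estimating equation in the common fixed-point form $U^{(s+1)}=U^{(s)}\,(N\delta_{k})^{-1}\sum_{j}\psi_{k}(r_{j}/\sqrt{U^{(s)}})^{2}$ . The function names and the residual values are hypothetical.

```python
import math


def huber_psi(r, k=1.345):
    """Huber psi function: identity on [-k, k], clipped outside."""
    return max(-k, min(k, r))


def delta_k(k=1.345):
    """E[psi_k(Z)^2] for Z ~ N(0, 1), evaluated in closed form."""
    inside = math.erf(k / math.sqrt(2.0))                    # P(|Z| <= k)
    phi = math.exp(-0.5 * k * k) / math.sqrt(2.0 * math.pi)  # normal density at k
    tail = math.erfc(k / math.sqrt(2.0))                     # P(|Z| > k)
    return inside - 2.0 * k * phi + k * k * tail


def proposal2_scale(residuals, k=1.345, tol=1e-12, max_iter=1000):
    """Fixed-point iteration for a Huber Proposal 2 type variance estimate."""
    n = len(residuals)
    dk = delta_k(k)
    U = sum(r * r for r in residuals) / n        # non-robust starting value
    for _ in range(max_iter):
        total = sum(huber_psi(r / math.sqrt(U), k) ** 2 for r in residuals)
        new_U = U * total / (n * dk)
        if abs(new_U - U) <= tol * U:
            return new_U
        U = new_U
    return U


# Hypothetical residuals with one gross outlier.
res = [-1.2, 0.4, 0.1, -0.3, 0.8, -0.6, 0.2, 15.0]
print("classical:", sum(r * r for r in res) / len(res))
print("robust:   ", proposal2_scale(res))
```

As $k$ grows, $\delta_{k}$ tends to one and the iteration reproduces the ordinary mean of squared residuals, which is the non-robust estimate that the bounded version replaces.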

2.6 BIEE for $\frac{\sigma_{v}^{2}}{\sigma_{\varepsilon}^{2}}\equiv d$

For the estimator of $d$ , we also have to replace the non-robust Fisher-score function Eq. (10) by a bounded-influence estimating equation. In contrast to $U$ , the BIEE of $d$ has no closed-form solution. If we put $v_{i}\left(d\right)=\left({\frac{1}{\sqrt{U}}}\right)V_{i}\left(d\right)^{-\frac{1}{2}}\left({y_{i}-X_{i}\beta}\right)$ , then a robust estimate of $d$ , say $\widehat{d}^{R}$ , is obtained as the solution to

$\displaystyle\sum_{i=1}^{M}\left\{1_{i}^{T}V_{i}(\widehat{d}^{R})^{-1}1_{i}-1^{T}_{i}V_{i}(\widehat{d}^{R})^{-\frac{1}{2}}\psi_{k}[v_{i}(\widehat{d}^{R})]\psi_{k}[v_{i}(\widehat{d}^{R})]^{T}V_{i}(\widehat{d}^{R})^{-\frac{1}{2}}1_{i}\right\}=0.$ (12)

Schoch [24] proposed solving Eq. (12) by means of Brent’s root-finding algorithm (Brent, 1973). Further, Schoch (2012) showed that robust estimates of area-specific means can be computed much more easily through a robust predictor of random effects based on the proposal of Copt and Victoria-Feser (2009).

The Robust EBLUP of $y_{i}$ is given by

$\displaystyle\widehat{y}^{R}_{i}=\bar{X}^{T}_{i}\beta^{R}+\widehat{v}^{R}_{i},$ (13)

where

$\displaystyle\widehat{v}^{R}_{i}=k\frac{\widehat{A}^{R}}{\sqrt{\widehat{e}^{R}}}1^{T}_{i}V_{i}^{-\frac{1}{2}}(\widehat{d}^{R})\psi_{c}\left[\frac{1}{\sqrt{\widehat{e}^{R}}}V_{i}^{-\frac{1}{2}}(\widehat{d}^{R})[y_{i}-X_{i}\widehat{\beta}]\right].$

Estimation of the MSE of small area estimators is a challenging problem even in the case of classical EBLUP estimators. Sinha and Rao (2009) introduced a parametric bootstrap method to compute the MSE of prediction based on the robust estimates $\widehat{y}_{i}^{R}$ and $\widehat{\beta}$ :

$\displaystyle\textit{MSE}\ (\widehat{y}^{R}_{i})=E[(\widehat{y}^{R}_{i}-\widehat{y}_{i})^{2}].$ (14)

These methods are adopted in this study for variance estimation, prediction of the area-specific means and MSPE estimation.

3. Results and discussion

3.1 Unit level empirical best linear unbiased prediction estimator

To validate the assumptions of the core model, model diagnostics were conducted. The village level residual plots show that the residuals from the model are randomly distributed, and the line of fit does not differ significantly from the line $y=$ 0. Even though a few outlying points were found, most of the model residuals fall on the straight line. The Q-Q plots also support the normality hypothesis for the random area effects. Hence, the model diagnostics are satisfactory for both the Hisar and Sirsa data.

The next step is to construct the model. The Hisar district model is as follows:

$\displaystyle\textit{YIELD}_{i,j}=\alpha+\beta_{1}\textit{NDVI}_{i,j}+u_{i}+e_{i,j}$

Here, Fixed Effects: YIELD $\sim$ (Intercept) $+$ NDVI

Area-specific random effects: Blocks

No. of areas: 9 blocks

No. of observations: 275 villages

The following is the Sirsa district model:

$\displaystyle\textit{YIELD}_{i,j}=\alpha+\beta_{1}\textit{AREA}_{i,j}+\beta_{2}\textit{NDVI}_{i,j}+u_{i}+e_{i,j}$

Here, Fixed Effects: YIELD $\sim$ (Intercept) $+$ AREA $+$ NDVI

Area-specific random effects: Blocks

No. of areas: 7 blocks

No. of observations: 327 villages

After setting up the model, its parameters are estimated using the ML method. The coefficient of NDVI was found to be highly significant ( $P<$ 0.01) for both districts. The residual component of the random-effects part is 600.19 for Hisar district and 56584 for Sirsa district. After estimating the parameters of the Gaussian core model, we predicted the random effects using the EBLUP method and computed the area-specific standard error and percentage coefficient of variation (CV %). However, the presence of outliers in the data can have a significant impact on the standard EBLUP approach, even though it is effective at estimating small area means under normality assumptions. Residual vs Leverage plots for the Hisar and Sirsa districts are shown in Fig. 1. Several points with high residual and leverage can be found in both districts. The points that lie close to or outside the dashed red curves are worth investigating further. In the case of Hisar district, one village observation lies far beyond the Cook’s distance lines, i.e., it is an influential observation. Any method estimates the parameter at the model with an unknown bias in the event of contamination. The bias in ML estimators can be very large, making these estimators exceedingly inefficient (Welsh & Richardson, 1997). Hence, we turned to a robust analysis method.

3.2 Robust EBLUP method for unit level estimation

The default-mode setup of the Huber-type M-estimation methodology for SAE described by Schoch (2012) is used to estimate the parameters. If the default mode fails to converge, the safe-mode algorithm can be used, in which the method is initialized by a fast-LTS regression estimator. Estimates of the SAE model for the Hisar and Sirsa districts are given below (Table 1).

Table 1
Estimated coefficients of the SAE model under the Huber M-estimation method

District	Fixed effects			Random effects model: $\sim$ 1 $|$ Block
	Intercept	AREA	NDVI	Intercept	Residual
Hisar	3040.69	–	3845.67	1152.85	427.86
Sirsa	$-$ 387.80	0.092	6700.84	181.347	529.813

After estimating the parameters of the Gaussian core model, we predicted the random effects with the robust EBLUP method and computed the area-specific MSE of prediction with a robust parametric bootstrap technique for both estimation procedures (ML and Huber M-estimation). This was done for comparison purposes. The output of the analysis is presented in Tables 2 and 3. The mean squared error is considerably lower when EBLUP estimation is done using the Huber M-estimation methodology than with the maximum likelihood estimation procedure. It is also worth noting from Fig. 2 that the CV per cent of the two approaches varies notably in the case of Hisar but not in the case of Sirsa.

Table 2

Robustly predicted wheat yield (kg/ha) under maximum likelihood estimation

Districts	Blocks	Random effect	Fixed effect	Yield (kg/ha)	Std. Error	CV (%)
Hisar	Hansi 1	$-$ 143	4863	4721	93.9	1.99
	Hansi 2	$-$ 187	4474	4287	123	2.87
	Narnaund	350	5138	5488	104	1.89
	Hisar1	118	4765	4883	84.3	1.73
	Hisar2	$-$ 4.2	4751	4747	93.2	1.96
	Adampur	$-$ 163	4753	4591	116	2.52
	Barwala	51.7	5062	5114	96.2	1.88
	Uklana	204	4626	4830	153	3.17
	Agroha	$-$ 227	4784	4558	118	2.60
Sirsa	Sirsa	$-$ 28	4747	4720	106	2.25
	Rania	$-$ 70	4860	4790	101	2.11
	Ellenabad	25.6	5119	5145	99.6	1.94
	Dabwali	205	4911	5117	90.2	1.76
	Nathusari Chopta	139	4879	5018	80.7	1.61
	Baragudha	45.7	4845	4891	108	2.20
	Odhan	$-$ 318	4716	4398	104	2.38

Figure 1.

Residual vs Leverage plot for Hisar and Sirsa districts.

Table 3

Robustly predicted wheat yield (kg/ha) under Huber M estimation

Districts	Blocks	Random effect	Fixed effect	Yield (kg/ha)	Std. Error	CV (%)
Hisar	Hansi 1	$-$ 1037.7	6048	5010	68.1	1.36
	Hansi 2	$-$ 1532.5	5802	4269	95.6	2.24
	Narnaund	209.83	5221	5430	77	1.41
	Hisar1	$-$ 808.2	5985	5177	64.4	1.24
	Hisar2	$-$ 189.7	5977	5787	69.4	1.20
	Adampur	$-$ 233.9	5978	5744	87.3	1.52
	Barwala	165.1	5173	5338	71.4	1.38
	Uklana	$-$ 630.3	5898	5267	126	2.39
	Agroha	$-$ 1364.9	5998	4633	90.1	1.94
Sirsa	Sirsa	54.05	4760	4814	100	2.08
	Rania	$-$ 15.5	4867	4852	96.8	1.99
	Ellenabad	152.33	5115	5267	93.6	1.78
	Dabwali	253.9	4916	5170	82.7	1.60
	Nathusari Chopta	240.62	4887	5127	72.6	1.42
	Baragudha	122.37	4853	4975	99.2	1.99
	Odhan	$-$ 290.69	4729	4438	99.5	2.24
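The CV (%) column in Tables 2 and 3 is the standard error expressed as a percentage of the predicted yield. The short sketch below reproduces this for three Hisar blocks, using the yield and standard error values copied from the two tables; the function and variable names are hypothetical.

```python
def cv_percent(std_error, estimate):
    """Coefficient of variation in per cent."""
    return 100.0 * std_error / estimate


# (block, yield in kg/ha, std. error) copied from Tables 2 and 3.
ml_rows = [("Hansi 1", 4721, 93.9), ("Hansi 2", 4287, 123), ("Uklana", 4830, 153)]
huber_rows = [("Hansi 1", 5010, 68.1), ("Hansi 2", 4269, 95.6), ("Uklana", 5267, 126)]

for (block, y_ml, se_ml), (_, y_h, se_h) in zip(ml_rows, huber_rows):
    print(block, round(cv_percent(se_ml, y_ml), 2), round(cv_percent(se_h, y_h), 2))
```

For these blocks the CV under Huber M-estimation is lower than under ML, in line with the comparison reported in the text.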

Figure 2.

Block wise CVs for the wheat yield estimate by ML and Huber M estimation.

Figure 3.

REBLUP estimates of the wheat yield for Hisar and Sirsa districts.

The difference in CVs between the two approaches for Hisar may be due to the influential point observed in that district. Figure 3 shows the block level EBLUP estimates of wheat yield calculated using the robust method for the Hisar and Sirsa districts. As previously stated, the classical EBLUP approach is effective for estimating small area means under normality conditions. However, the presence of outliers in the data can have a significant impact on it. In the presence of outliers, a robust small area estimation approach should be attempted. The Residual vs. Leverage plot (Fig. 1) revealed an influential point in the Hisar district, and analysing these data with the robust SAE technique produced better estimates (in terms of CV) than the classical unit level EBLUP technique.

4. Conclusions

Robust small area estimation methods are very effective in limiting the effect of outliers on small area estimators. The Government of India (GOI) now emphasizes micro-level planning, and EBLUP estimates can make a significant contribution to resource allocation and decision-making. Such yield estimates are also useful for identifying blocks with lower crop yields, so as to draw the planner's attention to them. For the purpose of estimating micro-level yields, these techniques can be applied more broadly to data sets from other small areas and crops.

Acknowledgments

This research was supported by CCS Haryana Agricultural University. We are thankful to our universities that provided us with wonderful facilities. The authors are also grateful to the Mahalanobis National Forecast Centre, New Delhi, for providing ground truth data on wheat production.

References

Battese, G. E., & Fuller, W. A. (1981). Prediction of county crop areas using survey and satellite data. Proceedings of the Section on Survey Research Methods (pp. 500-505).

Battese, G. E., Harter, R. M., & Fuller, W. A. (1988). An error-components model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association, 83(401), 28-36.

Bednarski, & Zontek (1996). Robust estimation of parameters in a mixed unbalanced model. The Annals of Statistics, 24(4), 1493-1510.

Chandra (2013). Exploring spatial dependence in area-level random effect model for disaggregate-level crop yield estimation. Journal of Applied Statistics, 40(4), 823-842.

Copt, & Victoria-Feser, M. P. (2006). High-breakdown inference for mixed linear models. Journal of the American Statistical Association, 101(473), 292-300.

Fay III, R. E., & Herriot, R. A. (1979). Estimates of income for small places: an application of James-Stein procedures to census data. Journal of the American Statistical Association, 74(366a), 269-277.

Fellner, W. H. (1986). Robust estimation of variance components. Technometrics, 28(1), 51-60.

Henderson, C. R. (1953). Estimation of variance and covariance components. Biometrics, 9(2), 226-252.

Jaslam, P. K., & Kumar (2020). EBLUP estimate of crop yield at sub-district level in Hisar district, Haryana, India using MODIS/Terra data. Current Science (00113891), 119(12-12).

Maronna, R. A., Martin, R. D., Yohai, V. J., & Salibián-Barrera (2019). Robust statistics: theory and methods (with R). John Wiley & Sons.

Pfeffermann (2002). Small area estimation-new developments and directions. International Statistical Review, 70(1), 125-143.

Prasad, N. N., & Rao, J. N. (1990). The estimation of the mean squared error of small-area estimators. Journal of the American Statistical Association, 85(409), 163-171.

Rao, J. N. (1999). Some recent advances in model-based small area estimation. Survey Methodology, 25, 175-186.

Rao, J. N. (2003). Small Area Estimation. Hoboken, New Jersey: John Wiley and Sons Inc.

Richardson, A. M., & Welsh, A. H. (1995). Robust restricted maximum likelihood in mixed linear models. Biometrics, 1429-1439.

Rocke, D. M. (1983). Robust statistical analysis of interlaboratory studies. Biometrika, 70(2), 421-431.

Rocke, D. M. (1991). Robustness and balance in the mixed model. Biometrics, 303-309.

Schoch (2011). The robust basic unit-level small area model: A simple and fast Fisher-scoring algorithm for large datasets. Proceedings of the Conference on New Technologies and Techniques in Statistics (NTTS).

Schoch (2012). Robust unit-level small area estimation: A fast algorithm for large datasets. Austrian Journal of Statistics, 41(4), 243-265.

Sinha, S. K., & Rao, J. N. (2009). Robust small area estimation. Canadian Journal of Statistics, 37(3), 381-399.

Taylor, Wood, & Thomas (1997). Mapping yield potential with remote sensing. Precision Agriculture, 713-720.

Thiam, & Eastman, J. R. (1999). Vegetation indices. Guide to GIS and image processing (Vol. 2). Worcester, MA, USA: IDRISI Production: Clark University.

Tikkiwal, B. D., & Tikkiwal, G. C. (2000). Small area estimation in India – Crop yield and acreage statistics. International Conference on Agricultural Statistics. Washington, D.C., USA.

Tikkiwal, G. C., & Ghiya (2004). A generalized class of composite estimators with application to crop acreage estimation for small domains. Statistics in Transition, 1(6), 697-711.

Tikkiwal, G. C., Rai, P. K., & Goyal (2020). Simulation-Cum-Regression (SICURE) method of estimation for the small domains. International Journal of Computational and Theoretical Statistics, 7(1), 1-14.

Welsh, A. H., & Richardson, A. M. (1997). Approaches to the robust estimation of mixed models. In Maddala, G. S., & Rao, C. R., Handbook of Statistics: Robust Inference (Vol. 15, pp. 343-384). Elsevier.
