Model assisted estimation of sensitive proportions from randomised responses by unequal probability sampling

Abstract

In order to estimate the proportion of people bearing a sensitive characteristic in a community, a sample is selected with unequal probabilities and randomized response data are obtained. Supposing that data on a related variable are also at hand, a model-design based estimation procedure modifying that of Chaudhuri and Saha (2004) is developed and studied. Four well-known Randomized Response (RR) methods are illustrated and a one-parameter logistic regression model is tried. Empirical Bayes estimation is examined and simulated results are presented to study the resulting efficacy.

Keywords

Empirical Bayes, one-parameter logistic, qualitative characteristic, stigmatizing features, Warner’s and other RR models

AMS Subject Classification: 62D05

1. Introduction

Warner (1965) gave the pioneering randomized response (RR) technique (RRT) in a judicious attempt to gather trustworthy data on sensitive matters. Boruch (1972) provided an amended “Forced Response” technique and Kuk (1990) gave another generalization. A basic deviation, needed to cover the case when a characteristic $A$ and its complement $A^{c}$ are both potentially sensitive (presumably not envisaged in Warner’s original model), is Simmons’s RRT, called the URL approach, as described by Greenberg et al. (1969) and Horvitz et al. (1967). Since Simmons did not narrate this work elsewhere, no separate reference to him is given. Fox and Tracy (1986) and Chaudhuri and Mukerjee (1988) covered many procedures in a consolidated manner and demonstrated their uses. Chaudhuri (2011) has given a general treatment of each, permitting sample selection by general schemes and liberating it from the erstwhile compulsion of simple random sampling with replacement (SRSWR) alone. Simmons’s procedure involves one sensitive characteristic and another “unrelated” innocuous characteristic, so for simplicity it is called a URL model. Visualizing the existence of another variable shedding some informative light on the prime stigmatizing qualitative variable, Chaudhuri and Saha (2004) developed alternative estimation procedures that improve upon Kuk’s (1990) and Simmons’s URL procedures by utilizing data on an auxiliary variable through one-parameter logistic regression modeling, following Maddala (1983) and Van der Heijden and Van Gils (1996). Chaudhuri and Saha (2004) noted the failure of this logistic regression approach to cover Warner’s (1965) and Boruch’s (1972) Forced Response RRT’s. The present work addresses this deficiency. We consider unequal probability sampling throughout, and consequently Chaudhuri’s (2011) as well as Chaudhuri and Christofides’s (2013) versions of each classical RRT are appropriately amended.
We find it convenient to follow Fay and Herriot (1979) and Prasad and Rao (1990) to obtain empirical Bayes estimators for the logistic regression model parameter. Finally, a model-design based estimation procedure is applied, pointing out how to measure the resulting accuracy in estimation. Details are given in Section 2 below. Section 3 presents a simulation-based numerical comparison, followed by concluding remarks.

In summary, our innovations are as follows. The classical RRT’s mostly used SRSWR, obtaining from each RR an unbiased estimator for the population proportion and hence employing the sample mean as the final estimator. We instead use each RR to estimate unbiasedly the true value for the respondent and, taking an unequal probability sample, are able to employ general linear unbiased estimators for the required proportion. We also extend Chaudhuri and Saha’s (2004) one-parameter logistic regression modeling to cover Warner’s and Boruch’s RR, and we introduce an empirical Bayes estimation procedure for RR’s.

2. Four classical RRT’s and concerned estimation

2.1 Classical RRT’s and estimation

2.1.1 Warner’s (1965) RRT

A sampled person labelled i chosen from a population $U=(1,2,\ldots,N)$ of labelled persons is approached with a box containing a large number of identical cards with a proportion $p_{W}(0<p_{W}\neq\frac{1}{{2}}<1)$ of them marked $A$ and the rest marked $A^{c}$ . On request, he/she draws randomly one card from the box to respond

$\displaystyle I_{iW}=\begin{cases}1&\text{if the card type matches his/her characteristic }A\text{ or }A^{c},\\0&\text{otherwise}.\end{cases}$

The person’s true value is $y_{i}$, which is 1 or 0 according as $i$ bears $A$ or its complement $A^{c}$. Denoting the RR-based expectation and variance operators generically by $E_{R},V_{R}$, we have

$\displaystyle E_{R}(I_{iW})=p_{W}y_{i}+(1-p_{W})(1-y_{i}),\quad y_{i}=1(0)\text{ if }i\text{ bears }A(A^{c}),$

$\displaystyle V_{R}(I_{iW})=p_{W}(1-p_{W})\text{ since }I_{iW}=1/0\text{ and }y_{i}=1/0.$

Then, $r_{iW}=\frac{I_{iW}-(1-p_{W})}{2p_{W}-1}$ has $E_{R}(r_{iW})=y_{i}$ and $V_{iW}=V_{R}(r_{iW})=\frac{p_{W}(1-p_{W})}{(2p_{W}-1)^{2}},$ a known number.
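The unbiasedness of $r_{iW}$ and its known variance can be checked by simulation. The following Python sketch is only an illustration: the function names, the choice $p_{W}=0.7$, the seed and the number of draws are assumptions made for this example and do not come from the study.

```python
import random

def warner_response(y, p_w, rng):
    """Return I_iW: 1 if the drawn card matches the respondent's status, else 0."""
    card_is_a = rng.random() < p_w            # card marked A with probability p_W
    return 1 if card_is_a == (y == 1) else 0  # match: (A-card, y=1) or (A^c-card, y=0)

def warner_r(i_w, p_w):
    """Transformed response r_iW, unbiased for y_i under the RR device."""
    return (i_w - (1.0 - p_w)) / (2.0 * p_w - 1.0)

def warner_var(p_w):
    """Known RR variance V_iW of r_iW."""
    return p_w * (1.0 - p_w) / (2.0 * p_w - 1.0) ** 2

# Hypothetical check for a respondent with y_i = 1.
rng = random.Random(12345)
p_w = 0.7
draws = [warner_r(warner_response(1, p_w, rng), p_w) for _ in range(100_000)]
mean_r = sum(draws) / len(draws)
var_r = sum((d - mean_r) ** 2 for d in draws) / (len(draws) - 1)
print(mean_r, var_r, warner_var(p_w))
```

The empirical mean should be close to $y_{i}=1$ and the empirical variance close to `warner_var(p_w)`.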

2.1.2 Boruch’s (1972) forced response RRT

When approached as in Warner’s RRT the person labelled $i$ is given a pack of a proportion $p_{1}$ of cards marked 1, a proportion $p_{2}$ of cards marked 0 and the remaining $(1-p_{1}-p_{2}),(0<p_{1},p_{2}<1,p_{1}\neq p_{2},p_{1}+p_{2}<1)$ marked “true”. His/her response as instructed will truthfully be

$\displaystyle I_{iF}=\begin{cases}1&\text{if the card is marked 1, or is marked ``true'' and the characteristic is }A,\\0&\text{if the card is marked 0, or is marked ``true'' and the characteristic is }A^{c}.\end{cases}$

Then,

$\displaystyle E_{R}(I_{iF})=p_{1}+y_{i}(1-p_{1}-p_{2})={\rm Prob}(I_{iF}=1),$

$\displaystyle{\rm Prob}(I_{iF}=0)=p_{2}+(1-y_{i})(1-p_{1}-p_{2}).$

So,

$\displaystyle r_{iF}=\frac{I_{iF}-p_{1}}{1-p_{1}-p_{2}}\text{ has }E_{R}(r_{iF})=y_{i}$

$\displaystyle\text{and }V_{iF}=V_{R}(r_{iF})=\frac{p_{1}(1-p_{1})+y_{i}(1-p_{1}-p_{2})(p_{2}-p_{1})}{(1-p_{1}-p_{2})^{2}}=\begin{cases}\dfrac{p_{1}(1-p_{1})}{(1-p_{1}-p_{2})^{2}}&\text{if }y_{i}=0,\\[1ex]\dfrac{p_{2}(1-p_{2})}{(1-p_{1}-p_{2})^{2}}&\text{if }y_{i}=1.\end{cases}$
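A small simulation can illustrate the forced response device and the two-case variance formula. This is a sketch under assumed values ($p_{1}=0.2$, $p_{2}=0.1$, a fixed seed); the function names are hypothetical.

```python
import random

def forced_response(y, p1, p2, rng):
    """Return I_iF for a respondent with true value y (0 or 1)."""
    u = rng.random()
    if u < p1:
        return 1          # card marked 1: forced answer 1
    if u < p1 + p2:
        return 0          # card marked 0: forced answer 0
    return y              # card marked "true": report the true value

def forced_r(i_f, p1, p2):
    """Transformed response r_iF, unbiased for y_i."""
    return (i_f - p1) / (1.0 - p1 - p2)

def forced_var(y, p1, p2):
    """RR variance V_iF of r_iF; it depends on the unknown y_i."""
    d = 1.0 - p1 - p2
    return (p1 * (1.0 - p1) + y * d * (p2 - p1)) / d ** 2

# Hypothetical check for a respondent with y_i = 1.
rng = random.Random(2024)
p1, p2 = 0.2, 0.1
draws = [forced_r(forced_response(1, p1, p2, rng), p1, p2) for _ in range(100_000)]
mean_r = sum(draws) / len(draws)
print(mean_r, forced_var(1, p1, p2), forced_var(0, p1, p2))
```

Because $V_{iF}$ involves $y_{i}$, it cannot be treated as known in the same way as $V_{iW}$ in Warner's RRT.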

2.1.3 Kuk’s (1990) RRT

A box I with a proportion $p_{1K}(0<p_{1K}<1)$ of “red” cards and the rest “blue”, and a second box II with a proportion $p_{2K}(0<p_{2K}<1)$ of “red” cards and the rest “blue”, with $p_{1K}\neq p_{2K}$, are presented to a sampled person labelled $i$. He/she is asked to report the number of “red” cards drawn in $k$ independent random draws, made from box I if bearing $A$ or from box II if bearing $A^{c}$.

The number of red cards drawn, $f_{iK}$, follows a binomial distribution, so its expectation is

$E_{R}(f_{iK})=k[p_{1K}y_{i}+p_{2K}(1-y_{i})]$

and the variance is $V_{R}(f_{iK})=k[p_{1K}(1-p_{1K})y_{i}+p_{2K}(1-p_{2K})(1-y_{i})]$ .

Letting

$\displaystyle r_{iK}=\frac{\frac{f_{iK}}{k}-p_{2K}}{p_{1K}-p_{2K}},$

we have $E_{R}(r_{iK})=y_{i}$ and $V_{iK}=V_{R}(r_{iK})=a_{i}y_{i}+b_{i}$

$\displaystyle\text{with }a_{i}=\frac{1-p_{1K}-p_{2K}}{k(p_{1K}-p_{2K})},\quad b_{i}=\frac{p_{2K}(1-p_{2K})}{k(p_{1K}-p_{2K})^{2}},$

since $p_{1K}(1-p_{1K})-p_{2K}(1-p_{2K})=(p_{1K}-p_{2K})(1-p_{1K}-p_{2K})$.
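Kuk's device can be sketched in the same way. In the Python illustration below the variance is computed directly from the binomial variance of $f_{iK}$; the function names and the values $p_{1K}=0.6$, $p_{2K}=0.2$, $k=5$ are assumptions chosen for the example.

```python
import random

def kuk_response(y, p1, p2, k, rng):
    """Return f_iK, the number of red cards in k independent draws."""
    p_red = p1 if y == 1 else p2      # box I if bearing A, box II otherwise
    return sum(1 for _ in range(k) if rng.random() < p_red)

def kuk_r(f, p1, p2, k):
    """Transformed response r_iK, unbiased for y_i."""
    return (f / k - p2) / (p1 - p2)

def kuk_var(y, p1, p2, k):
    """V_iK obtained from the binomial variance of f_iK."""
    p = p1 if y == 1 else p2
    return p * (1.0 - p) / (k * (p1 - p2) ** 2)

# Hypothetical check for a respondent with y_i = 0.
rng = random.Random(99)
p1, p2, k = 0.6, 0.2, 5
draws = [kuk_r(kuk_response(0, p1, p2, k, rng), p1, p2, k) for _ in range(50_000)]
mean_r = sum(draws) / len(draws)
var_r = sum((d - mean_r) ** 2 for d in draws) / (len(draws) - 1)
print(mean_r, var_r, kuk_var(0, p1, p2, k))
```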

2.1.4 Simmons’s URL RRT

Let

$\displaystyle t_{i}=1(0)\text{ if ith person bears }B(B^{c});$

$B$ is an innocuous characteristic unrelated to $A$ . Two boxes I and II respectively containing $A$ -marked and $B$ -marked cards in proportions $p_{1U}\text{ and }(1-p_{1U})$ in box-I and $p_{2U}\text{ and }(1-p_{2U})$ in box-II, $p_{1U}\neq p_{2U}$ are presented to each sampled person labeled $i$ . On request he/she independently draws one card randomly from each box to give RR truthfully as

$\displaystyle I_{iU}=\begin{cases}1&\text{if the card type matches the trait},\\0&\text{else},\end{cases}\quad\text{using box I.}$

Similarly, $J_{iU}=1/0,$ using box II.

Then, $r_{iU}^{\prime}=\frac{(1-p_{2U})I_{iU}-(1-p_{1U})J_{iU}}{p_{1U}-p_{2U}}$ has $E_{R}(r_{iU}^{\prime})=y_{i}.$

This exercise is independently repeated once again to produce RR’s as $I_{iU}^{\prime}\text{ and }J_{iU}^{\prime}$ .

Then, $r_{iU}^{\prime\prime}=\frac{(1-p_{2U})I_{iU}^{\prime}-(1-p_{1U})J_{iU}^{\prime}}{p_{1U}-p_{2U}}$ has $E_{R}(r_{iU}^{\prime\prime})=y_{i}.$

Then, $r_{iU}=\frac{1}{2}(r_{iU}^{\prime}+r_{iU}^{\prime\prime})$ has $E_{R}(r_{iU})=y_{i}$ and $V_{iU}=V_{R}(r_{iU})$ has an unbiased estimator $v_{iU}=\frac{1}{4}(r_{iU}^{\prime}-r_{iU}^{\prime\prime})^{2}$ . From now on we shall write generically $r_{i},V_{i}$, omitting the additional subscripts.
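The repeated URL exercise and the variance estimator $v_{iU}$ can be illustrated as follows. This Python sketch assumes that a card marked $A$ elicits $y_{i}$ and a card marked $B$ elicits $t_{i}$; the function names and the values $p_{1U}=0.7$, $p_{2U}=0.3$ are hypothetical.

```python
import random

def url_pair(y, t, p1, p2, rng):
    """One exercise: responses (I, J) from box I and box II."""
    i_resp = y if rng.random() < p1 else t   # box I: A-card w.p. p1, else B-card
    j_resp = y if rng.random() < p2 else t   # box II: A-card w.p. p2, else B-card
    return i_resp, j_resp

def url_r_single(i_resp, j_resp, p1, p2):
    """r' (or r''), unbiased for y_i whatever the value of t_i."""
    return ((1.0 - p2) * i_resp - (1.0 - p1) * j_resp) / (p1 - p2)

def url_r(y, t, p1, p2, rng):
    """Return (r_iU, v_iU) from two independent repetitions."""
    r1 = url_r_single(*url_pair(y, t, p1, p2, rng), p1, p2)
    r2 = url_r_single(*url_pair(y, t, p1, p2, rng), p1, p2)
    return 0.5 * (r1 + r2), 0.25 * (r1 - r2) ** 2

# Hypothetical check for a respondent with y_i = 1, t_i = 0.
rng = random.Random(7)
p1, p2 = 0.7, 0.3
sims = [url_r(1, 0, p1, p2, rng) for _ in range(100_000)]
mean_r = sum(r for r, _ in sims) / len(sims)
var_r = sum((r - mean_r) ** 2 for r, _ in sims) / (len(sims) - 1)
mean_v = sum(v for _, v in sims) / len(sims)
print(mean_r, var_r, mean_v)
```

The average of $v_{iU}$ over repetitions should be close to the empirical variance of $r_{iU}$, which is what unbiasedness of $v_{iU}$ for $V_{iU}$ means.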

2.1.5 Estimation

Suppose a sample $s$ from $U$ is selected with probability $p(s)$ according to a design $p$, and constants $b_{si},c_{si},c_{sij}$ are chosen, each free of the elements of $\underaccent{\tilde{}}{Y}=(y_{1},\ldots,y_{i},\ldots,y_{N})$ and $\underaccent{\tilde{}}{R}=(r_{1},\ldots,r_{i},\ldots,r_{N})$, such that

$\displaystyle\sum\limits_{s\ni i}p(s)b_{si}=1,\quad C_{i}=\sum\limits_{s\ni i}p(s)b_{si}^{2}-1=\sum\limits_{s\ni i}p(s)c_{si},\quad C_{ij}=\sum\limits_{s\ni i,j}p(s)b_{si}b_{sj}-1=\sum\limits_{s\ni i,j}p(s)c_{sij}.$

It is well known, vide Chaudhuri (2011), that $e=\sum\nolimits_{i\in s}{r_{i}b_{si}}$ is an estimator for $Y=\sum\nolimits_{i=1}^{N}{y_{i}}$ with the following properties, where $E_{R}(r_{i})=y_{i},i=1,2,\ldots,N.$

Let $E_{P},V_{P}$ denote the design-based operators for expectation and variance, and let $E=E_{P}E_{R}=E_{R}E_{P}$ and $V=E_{P}V_{R}+V_{P}E_{R}=E_{R}V_{P}+V_{R}E_{P}.$ Then, vide Chaudhuri (2011),

$\displaystyle E(e)=Y,\quad V(e)=\sum\limits_{i}y_{i}^{2}C_{i}+\sum\limits_{i\neq j}\sum y_{i}y_{j}C_{ij}+\sum\limits_{i}V_{i}(1+C_{i}).$

The expressions

$\displaystyle v_{1}=\sum\limits_{i\in s}r_{i}^{2}c_{si}+\sum\limits_{i\neq j\in s}\sum r_{i}r_{j}c_{sij}+\sum\limits_{i\in s}w_{i}b_{si}$

$\displaystyle v_{2}=\sum\limits_{i\in s}r_{i}^{2}c_{si}+\sum\limits_{i\neq j\in s}\sum r_{i}r_{j}c_{sij}+\sum\limits_{i\in s}w_{i}(b_{si}^{2}-c_{si})$

are two estimators for $V(e)$ satisfying $E(v_{1})=V(e)=E(v_{2})$ ; here $w_{i}$ is $V_{i}$ if $V_{i}$ is known or is $v_{i}$ otherwise such that $E_{R}(v_{i})=V_{i}=V_{R}(r_{i})\forall i.$
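As one concrete instance of these general formulas, the Horvitz-Thompson choice $b_{si}=1/\pi_{i}$ can be sketched in Python. The constants used below, $c_{si}=(1-\pi_{i})/\pi_{i}^{2}$ and $c_{sij}=(\pi_{ij}-\pi_{i}\pi_{j})/(\pi_{i}\pi_{j}\pi_{ij})$, are the usual Horvitz-Thompson choices and are an assumption of this sketch, as are the function names and the toy population.

```python
from itertools import combinations

def ht_total(r, pi):
    """Estimator e = sum over the sample of r_i / pi_i (b_si = 1 / pi_i)."""
    return sum(r[i] / pi[i] for i in r)

def ht_v1(r, w, pi, pij):
    """Variance estimator v_1 with Horvitz-Thompson constants.

    r, w: dicts keyed by sampled unit; pi: dict of first-order inclusion
    probabilities; pij: function (i, j) -> second-order inclusion probability.
    """
    units = list(r)
    term1 = sum(r[i] ** 2 * (1.0 - pi[i]) / pi[i] ** 2 for i in units)
    term2 = 0.0
    for i, j in combinations(units, 2):      # each unordered pair counts twice
        p = pij(i, j)
        term2 += 2.0 * r[i] * r[j] * (p - pi[i] * pi[j]) / (pi[i] * pi[j] * p)
    term3 = sum(w[i] / pi[i] for i in units)
    return term1 + term2 + term3

# Toy check without randomization (r_i = y_i, w_i = 0): simple random sampling
# without replacement of n = 2 units from N = 4, enumerating all samples.
y = [1, 0, 1, 1]
pi = {i: 0.5 for i in range(4)}
pair_prob = lambda i, j: 1.0 / 6.0
ests, vs = [], []
for s in combinations(range(4), 2):
    r = {i: float(y[i]) for i in s}
    w = {i: 0.0 for i in s}
    ests.append(ht_total(r, pi))
    vs.append(ht_v1(r, w, pi, pair_prob))
mean_e = sum(ests) / len(ests)
var_e = sum((e - mean_e) ** 2 for e in ests) / len(ests)
print(mean_e, var_e, sum(vs) / len(vs))
```

In this toy design the average of $v_{1}$ over all samples reproduces the exact design variance of $e$, which is the unbiasedness property stated above.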

Citing these classical results from the literature let us suppose that the values $x_{i}$ closely related to $y_{i},i\in U$ are available.

2.2 Model based approach: One-parameter logistic regression modeling

Chaudhuri and Saha (2004) postulated for $\Pi=\frac{1}{N}\sum\nolimits_{1}^{N}y_{i},$ noting that $0<\Pi<1,$ a one-parameter logistic regression model $\log(\frac{\Pi}{1-\Pi})=\psi(\underaccent{\tilde{}}{x}),\ \underaccent{\tilde{}}{x}=(x_{1},\ldots,x_{i},\ldots,x_{N})$ and, for an estimator $\hat{\Pi}$ of $\Pi$, the model $\log(\frac{\hat{\Pi}}{1-\hat{\Pi}})=\log(\frac{\Pi}{1-\Pi})+\epsilon$, applicable provided $0<\hat{\Pi}<1,$ postulating a suitable probability distribution for the error term $\epsilon.$ They pointed out the inapplicability of this approach for improved estimation of $\Pi$ to (a) Warner’s RRT and (b) the Forced Response RRT of Boruch (1972), because the requisite conditions on $\Pi$, $\hat{\Pi}$ (generically) do not hold. So, we proceed as follows.

2.2.1 Warner’s RRT

Let

$\displaystyle\theta_{iW}=(2p_{W}-1)y_{i}+p_{W}(1-y_{i}),$

$\displaystyle\hat{\theta}_{iW}=(2p_{W}-1)r_{iW}+p_{W}(1-r_{iW})=r_{iW}(p_{W}-1)+p_{W},\quad\text{yielding }r_{iW}=\frac{\hat{\theta}_{iW}-p_{W}}{p_{W}-1}.$

Substituting $I_{iW}=1$ in $r_{iW}=\frac{I_{iW}-(1-p_{W})}{2p_{W}-1}$ gives

$\displaystyle\left.\hat{\theta}_{iW}\right|_{I_{iW}=1}=\frac{1-(1-p_{W})}{2p_{W}-1}(p_{W}-1)+p_{W}=\frac{3p_{W}^{2}-2p_{W}}{2p_{W}-1}.$

Then, taking $2p_{W}-1>0$,

$\displaystyle\left.\hat{\theta}_{iW}\right|_{I_{iW}=1}=\frac{3p_{W}^{2}-2p_{W}}{2p_{W}-1}>0\text{ if }p_{W}>\frac{2}{3}$ (1)

$\displaystyle\text{and }<1\text{ if }3p_{W}^{2}-4p_{W}+1<0.$ (2)

Similarly substituting $I_{iW}=0$ in the expression of $\hat{\theta}_{iW}$ we may write

$\displaystyle\left.\hat{\theta}_{iW}\right|_{I_{iW}=0}=\frac{3p_{W}^{2}-3p_{W}+1}{2p_{W}-1}>0\text{ if }p_{W}(1-p_{W})<\frac{1}{3}$ (3)

$\displaystyle\text{and }<1\text{ if }3p_{W}^{2}-5p_{W}+2<0.$ (4)

Also,

$V(\hat{{\theta}}_{iW})=(1-p_{W})^{2}V_{iW}>0.$

On choosing $p_{W}$ subject to Eqs (1)–(4) it is ensured that $0<\theta_{iW}<1$ and $0<\hat{{\theta}}_{iW}<1\forall i$ .
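Whether a given $p_{W}$ keeps both possible values of $\hat{\theta}_{iW}$ inside $(0,1)$ can be checked numerically from the definitions, without relying on the closed forms. The Python sketch below is illustrative only; the function names and the trial values $0.8$ and $0.6$ are assumptions.

```python
def warner_theta_hat(i_w, p_w):
    """theta-hat_iW computed from its definition for a response I_iW in {0, 1}."""
    r = (i_w - (1.0 - p_w)) / (2.0 * p_w - 1.0)
    return (2.0 * p_w - 1.0) * r + p_w * (1.0 - r)

def warner_design_ok(p_w):
    """True if theta-hat_iW lies in (0, 1) for both possible responses."""
    return all(0.0 < warner_theta_hat(i_w, p_w) < 1.0 for i_w in (0, 1))

for p_w in (0.8, 0.6):
    print(p_w, warner_theta_hat(1, p_w), warner_theta_hat(0, p_w), warner_design_ok(p_w))
```

With these trial values, $p_{W}=0.8$ passes the check while $p_{W}=0.6$ fails it because $\hat{\theta}_{iW}$ is negative when $I_{iW}=1$.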

2.2.2 Forced response RRT

$\displaystyle\theta_{iF}=(1-p_{1}-p_{2})y_{i}+(1-p_{1})(1-y_{i}),$

$\displaystyle\hat{\theta}_{iF}=(1-p_{1}-p_{2})r_{iF}+(1-p_{1})(1-r_{iF}).$

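Then, substituting $I_{iF}=1$, i.e. $r_{iF}=\frac{1-p_{1}}{1-p_{1}-p_{2}}$, in the expression for $\hat{\theta}_{iF}=(1-p_{1})-p_{2}r_{iF}$ we may write

$\displaystyle\left.\hat{\theta}_{iF}\right|_{I_{iF}=1}=\frac{(1-p_{1})(1-p_{1}-2p_{2})}{1-p_{1}-p_{2}}>0\text{ if }p_{1}+2p_{2}<1$

$\displaystyle\text{and }<1\text{ if }(1-p_{1})(1-p_{1}-2p_{2})<(1-p_{1}-p_{2}).$ (5)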

Similarly,

$\displaystyle\left.\hat{\theta}_{iF}\right|_{I_{iF}=0}=\frac{(1-p_{1})(1-p_{1}-p_{2})+p_{1}p_{2}}{1-p_{1}-p_{2}}>0$

$\displaystyle\text{and }<1\text{ if }(1-p_{1})^{2}-p_{2}(1-p_{1})+p_{1}p_{2}<(1-p_{1}-p_{2}).$ (6)

With appropriate $p_{1},p_{2}$ both $\theta_{iF}\text{ and }\hat{{\theta}}_{iF}$ take values in the open interval (0, 1).

Also, $V(\hat{{\theta}}_{iF})=p_{2}^{2}V_{iF}\forall i$ .

Though Kuk’s RRT and Simmons’s URL did not create any problem for Chaudhuri and Saha (2004) in the use of one-parameter logistic regression modeling, let us still note the following.

2.2.3 Kuk’s (1990) RRT

Here

$\displaystyle\theta_{iK}=(p_{1K}-p_{2K})y_{i}+p_{2K},$

$\displaystyle\hat{\theta}_{iK}=(p_{1K}-p_{2K})r_{iK}+p_{2K}$

both belong to (0, 1) for $k>0$ . Now,

$\displaystyle V(\hat{\theta}_{iK})=(p_{1K}-p_{2K})^{2}V_{iK}=\begin{cases}(p_{1K}-p_{2K})^{2}(a_{i}+b_{i})&\text{if }y_{i}=1,\\(p_{1K}-p_{2K})^{2}b_{i}&\text{if }y_{i}=0,\end{cases}$

$\displaystyle\text{with }a_{i}=\frac{1-p_{1K}-p_{2K}}{k(p_{1K}-p_{2K})}\text{ and }b_{i}=\frac{p_{2K}(1-p_{2K})}{k(p_{1K}-p_{2K})^{2}}.$

2.2.4 URL model of Simmons

Let

$\displaystyle\hat{\theta}_{iU}=\frac{p_{1U}-p_{2U}}{2-(p_{1U}+p_{2U})}\left(r_{iU}+\frac{1-p_{1U}}{p_{1U}-p_{2U}}\right)$

$\displaystyle\text{and }\theta_{iU}=\left.\hat{\theta}_{iU}\right|_{r_{iU}=y_{i}}.$

Both $\hat{{\theta}}_{iU},\theta_{iU}$ are points in (0, 1). From now on we shall write $\hat{{\theta}}_{i},\theta_{i}$ generically omitting additional subscripts.

For simplicity, let us write

$\displaystyle L_{i}=\operatorname{logit}(\theta_{i})=\log\left(\frac{\theta_{i}}{1-\theta_{i}}\right),$

$\displaystyle\hat{L}_{i}=\operatorname{logit}(\hat{\theta}_{i})=L_{i}+\epsilon_{i},$

$\displaystyle\epsilon_{i}\sim N(0,\hat{V}_{iL}),\text{ where ``$\sim$'' means ``distributed independently as''.}$

Here $\displaystyle V_{iL}=V[\operatorname{logit}(\hat{\theta}_{i})]=V\left[\log\left(\frac{\hat{\theta}_{i}}{1-\hat{\theta}_{i}}\right)\right]=\frac{V(\hat{\theta}_{i})}{\left|\theta_{i}(1-\theta_{i})\right|}$ and it may be estimated by $\displaystyle\hat{V}_{iL}=\frac{\hat{V}(\hat{\theta}_{i})}{\left|\hat{\theta}_{i}(1-\hat{\theta}_{i})\right|}.$
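The logit transformation and the variance estimate displayed above translate directly into code. The Python sketch below uses hypothetical function names; the Warner example values ($p_{W}=0.8$, $I_{iW}=1$) are assumptions for illustration.

```python
import math

def logit(theta):
    """L = log(theta / (1 - theta)), defined for 0 < theta < 1."""
    return math.log(theta / (1.0 - theta))

def logit_var_estimate(theta_hat, var_theta_hat):
    """V-hat_iL as displayed in the text: V-hat(theta-hat) / |theta-hat (1 - theta-hat)|."""
    return var_theta_hat / abs(theta_hat * (1.0 - theta_hat))

# Hypothetical Warner example with p_W = 0.8 and response I_iW = 1.
p_w = 0.8
r = (1 - (1.0 - p_w)) / (2.0 * p_w - 1.0)
theta_hat = r * (p_w - 1.0) + p_w
v_iw = p_w * (1.0 - p_w) / (2.0 * p_w - 1.0) ** 2
var_theta_hat = (1.0 - p_w) ** 2 * v_iw          # V(theta-hat_iW) = (1 - p_W)^2 V_iW
print(logit(theta_hat), logit_var_estimate(theta_hat, var_theta_hat))
```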

2.3 Empirical Bayes estimation

Let, further, the model satisfy

$L_{i}\sim N(x_{i}\beta,\psi).$

Here $\beta$ and $\psi$ are unknown constants.

Now $\beta$ and $\psi$ will be estimated iteratively following the empirical best linear unbiased predictor approach with the Fay-Herriot (1979) model, as described by Prasad and Rao (1990) in small area estimation. The procedure is briefly described as follows.

Letting

$\displaystyle\hat{\beta}=\frac{\sum\limits_{i\in s}\hat{L}_{i}x_{i}/(\psi+\hat{V}_{iL})}{\sum\limits_{i\in s}x_{i}^{2}/(\psi+\hat{V}_{iL})}$

and noting

$\displaystyle\sum\limits_{i\in s}\frac{[\hat{L}_{i}-\hat{\beta}x_{i}]^{2}}{\psi+\hat{V}_{iL}}$ (7)

is distributed as a chi-square variable with $(n-1)$ degrees of freedom, $n$ being the sample size, $\psi$ and $\beta$ are estimated by the method of moments, solving the equation (7) $=$ ( $n-1$ ) iteratively (Newton-Raphson).

Let these estimates be $\hat{{\psi}}$ and $\hat{{\beta}}.$ Then, $L_{{i}}$ may be estimated by the empirical Bayes estimate

$\displaystyle\hat{L}_{\textit{iEB}}=\left(\frac{\hat{\psi}}{\hat{\psi}+\hat{V}_{iL}}\right)\hat{L}_{i}+\left(\frac{\hat{V}_{iL}}{\hat{\psi}+\hat{V}_{iL}}\right)\hat{\beta}x_{i}=\lambda_{i},\text{ say.}$

Then, $\theta_{i}$ is estimated by

$\hat{{\theta}}_{i(\textit{EB})}=\frac{e^{\lambda_{i}}}{1+e^{\lambda_{i}}}.$
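The moment equation and the shrinkage step can be sketched as follows. This Python illustration makes several assumptions that are not from the text: $\hat{\beta}$ is taken as the weighted least squares slope through the origin (with $x_{i}^{2}$ in the denominator), bisection is used instead of Newton-Raphson to solve (7) $=n-1$ because the left-hand side decreases in $\psi$, $\hat{\psi}$ is set to 0 when no positive solution exists, and all $\hat{V}_{iL}$ are positive. All function names and the toy data are hypothetical.

```python
import math

def wls_beta(l_hat, x, v, psi):
    """Weighted least squares slope with weights 1 / (psi + V-hat_iL)."""
    num = sum(l * xi / (psi + vi) for l, xi, vi in zip(l_hat, x, v))
    den = sum(xi * xi / (psi + vi) for xi, vi in zip(x, v))
    return num / den

def moment_stat(l_hat, x, v, psi):
    """Left-hand side of the moment equation (7) at a given psi."""
    b = wls_beta(l_hat, x, v, psi)
    return sum((l - b * xi) ** 2 / (psi + vi) for l, xi, vi in zip(l_hat, x, v))

def fit_psi_beta(l_hat, x, v, tol=1e-10, max_iter=200):
    """Solve moment_stat(psi) = n - 1 by bisection; truncate psi at 0."""
    target = len(l_hat) - 1
    if moment_stat(l_hat, x, v, 0.0) <= target:
        psi = 0.0
    else:
        lo, hi = 0.0, 1.0
        while moment_stat(l_hat, x, v, hi) > target:   # bracket the root
            hi *= 2.0
        for _ in range(max_iter):
            mid = 0.5 * (lo + hi)
            if moment_stat(l_hat, x, v, mid) > target:
                lo = mid
            else:
                hi = mid
            if hi - lo < tol:
                break
        psi = 0.5 * (lo + hi)
    return psi, wls_beta(l_hat, x, v, psi)

def eb_theta(l_hat, x, v, psi, beta):
    """Shrink each L-hat_i towards beta * x_i and map back by the inverse logit."""
    out = []
    for l, xi, vi in zip(l_hat, x, v):
        gamma = psi / (psi + vi)
        lam = gamma * l + (1.0 - gamma) * beta * xi
        out.append(1.0 / (1.0 + math.exp(-lam)))
    return out

# Toy data (hypothetical): four units with a common covariate value.
l_hat, x, v = [0.0, 4.0, 0.0, 4.0], [1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]
psi_hat, beta_hat = fit_psi_beta(l_hat, x, v)
theta_eb = eb_theta(l_hat, x, v, psi_hat, beta_hat)
print(psi_hat, beta_hat, theta_eb)
```

For this toy input the equation reduces to $16/(\psi+1)=3$, so the solver should return $\hat{\psi}$ close to $13/3$ and $\hat{\beta}=2$.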

We follow Prasad and Rao (1990) to estimate the mean square error (MSE) of $\hat{{L}}_{{\textit{iEB}}}$ in the following way:

Let

$\displaystyle g_{1i}=(1-\hat{L}_{\textit{iEB}})\hat{V}_{\textit{iL}},$

$\displaystyle g_{2i}=[\hat{L}_{\textit{iEB}}]^{2}\frac{x_{i}^{2}}{\sum\limits_{1}^{n}\left(\frac{x_{i}^{2}}{\hat{\psi}+\hat{V}_{\textit{iL}}}\right)},$

$\displaystyle g_{3i}=\frac{(\hat{V}_{\textit{iL}})^{2}}{(\hat{\psi}+\hat{V}_{\textit{iL}})^{3}}\frac{2}{\sum\limits_{1}^{n}\left(\frac{1}{\hat{\psi}+\hat{V}_{\textit{iL}}}\right)}.$

Then the MSE of $\hat{{L}}{}_{\textit{iEB}}=\lambda_{{i}}$ is estimated as $m_{{i}}=g_{1i}+g_{2i}+2g_{3i}$ .

Then, the MSE of $\hat{\theta}_{i(\textit{EB})}$ is estimated on noting the well-known formula using Taylor series

$\displaystyle V[f(\hat{t}(x))]\approx\left.\left[\frac{\partial f(\hat{t}(x))}{\partial x}\right]^{2}\right|_{\hat{t}(x)=t(x)}V(\hat{t}(x)).$

If $t(x)$ is estimated by $\hat{t}(x),$ then $V[f(\hat{t}(x))]$ is estimated by $\left[\frac{\partial f(\hat{t}(x))}{\partial x}\right]^{2}\hat{V}(\hat{t}(x))$ .
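For the inverse-logit map $f(\lambda)=e^{\lambda}/(1+e^{\lambda})$ the derivative is $f(\lambda)\{1-f(\lambda)\}$, so the Taylor formula takes a simple form. The Python sketch below assumes that the formula is applied with $\hat{t}=\lambda_{i}$ and an MSE estimate for $\lambda_{i}$ supplied by the user; the function names are hypothetical.

```python
import math

def expit(lam):
    """Inverse logit: e^lam / (1 + e^lam)."""
    return 1.0 / (1.0 + math.exp(-lam))

def delta_mse_theta(lam, mse_lam):
    """First-order (Taylor) MSE estimate for expit(lam) given an MSE estimate for lam."""
    th = expit(lam)
    return (th * (1.0 - th)) ** 2 * mse_lam

# Numerical check of the derivative used above at an arbitrary point.
lam0, h = 0.3, 1e-6
numeric_derivative = (expit(lam0 + h) - expit(lam0 - h)) / (2.0 * h)
analytic_derivative = expit(lam0) * (1.0 - expit(lam0))
print(numeric_derivative, analytic_derivative, delta_mse_theta(0.0, 1.0))
```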

Then the Empirical Best Linear Unbiased Predictor for $y_{i}$ in Warner’s (1965) model is the following:

$\displaystyle r_{\textit{iEB}}=\frac{p_{W}-\hat{\theta}_{i(\textit{EB})}}{1-p_{W}}\text{ has }E_{R}(r_{\textit{iEB}})=y_{i}\text{ and }V_{\textit{iEB}}=V_{R}(r_{\textit{iEB}})=\frac{M_{i}}{(1-p_{W})^{2}},$

where $M_{i}$ is the true MSE of the estimate of $\lambda_{i}$ .

Let us note from Chaudhuri (2011) that $e_{\textit{EB}}=\sum\nolimits_{i\in s}{r_{\textit{iEB}}b_{si}}$ is an estimator for $Y=\sum\nolimits_{i=1}^{N}{y_{i}}$ with the following properties.

Let $E_{P},V_{P}$ denote the design-based operators for expectation and variance and $E_{M},V_{M}$ the model-based operators for expectation and variance. Let $E=E_{P}E_{R}E_{M}$ and $V=E_{P}E_{R}V_{M}+E_{P}V_{R}E_{M}+V_{P}E_{R}E_{M}$ . Then it may be noted that $E_{P}V_{R}E_{M}(e_{\textit{EB}})=0.$ Here, considering the Horvitz-Thompson (HT, 1952) approach, the estimator for $\theta$ is $e_{\textit{EB}}=\frac{1}{N}\sum\nolimits_{i\in s}{\frac{r_{\textit{iEB}}}{\pi_{i}}}$ , where $\pi_{i}=\sum\nolimits_{s\ni i}{p(s)}$ is the first-order inclusion probability of $i$ .

$\displaystyle E(e_{\textit{EB}})=E_{P}E_{R}E_{M}(e_{\textit{EB}})=\frac{1}{N}E_{P}E_{R}\sum\limits_{i\in s}\frac{p_{W}-\hat{\theta}_{i(\textit{EB})}}{(1-p_{W})\pi_{i}}$

$\displaystyle=\frac{1}{N}E_{P}\sum\limits_{i\in s}\frac{p_{W}-(2p_{W}-1)y_{i}-p_{W}(1-y_{i})}{(1-p_{W})\pi_{i}}=\frac{1}{N}E_{P}\left(\sum\limits_{i\in s}\frac{y_{i}}{\pi_{i}}\right)$

(for Warner’s Model).

$\displaystyle V(e_{\textit{EB}})=E_{P}E_{R}V_{M}(e_{\textit{EB}})+E_{P}V_{R}E_{M}(e_{\textit{EB}})+V_{P}E_{R}E_{M}(e_{\textit{EB}}),$

$\displaystyle E_{P}E_{R}V_{M}(e_{\textit{EB}})=\frac{1}{N^{2}(1-p_{W})^{2}}E_{P}E_{R}\sum\limits_{i\in s}\frac{M_{i}}{\pi_{i}^{2}}=\frac{1}{N^{2}(1-p_{W})^{2}}E_{P}\sum\limits_{i\in s}\frac{M_{i}}{\pi_{i}^{2}}=\frac{1}{N^{2}(1-p_{W})^{2}}\sum\limits_{1}^{N}\frac{M_{i}}{\pi_{i}},$

where $M_{i}=$ MSE of $\hat{L}_{\textit{iEB}}$ equal to $\lambda_{{i}}$ .

Now,

$\displaystyle E_{P}V_{R}E_{M}(e_{\textit{EB}})=E_{P}V_{R}\sum\limits_{i\in s}\frac{p_{W}-\theta_{i}}{N(1-p_{W})\pi_{i}}=0,$

$\displaystyle V_{P}E_{R}E_{M}(e_{\textit{EB}})=\frac{1}{N^{2}}V_{P}E_{R}\sum\limits_{i\in s}\frac{p_{W}-\theta_{i}}{\pi_{i}(1-p_{W})}=\frac{1}{N^{2}}V_{P}\left(\sum\limits_{i\in s}\frac{y_{i}}{\pi_{i}}\right)=\frac{1}{N^{2}}\sum\limits_{i<j}\sum(\pi_{i}\pi_{j}-\pi_{ij})\left(\frac{y_{i}}{\pi_{i}}-\frac{y_{j}}{\pi_{j}}\right)^{2}.$

The variance $V(e_{\textit{EB}})$ may be estimated by

$\displaystyle m(e_{\textit{EB}})=\frac{1}{N^{2}(1-p_{W})^{2}}\sum\limits_{i\in s}\frac{m_{i}}{\pi_{i}^{2}}+\frac{1}{N^{2}}\left[T-\sum\limits_{i<j\in s}\sum(\pi_{i}\pi_{j}-\pi_{ij})\left(\frac{m_{i}}{\pi_{i}}-\frac{m_{j}}{\pi_{j}}\right)^{2}\frac{1}{\pi_{ij}}\right]$

where

$\displaystyle T=\sum\limits_{i<j\in s}\sum(\pi_{i}\pi_{j}-\pi_{ij})\left(\frac{r_{\textit{iEB}}}{\pi_{i}}-\frac{r_{\textit{jEB}}}{\pi_{j}}\right)^{2}\frac{1}{\pi_{ij}}\text{ and }m_{i}=\text{ estimate of }M_{i}=g_{1i}+g_{2i}+2g_{3i}.$

Here $\pi_{ij}=\sum\nolimits_{s\ni i,j}{p(s)}$ is the second-order inclusion probability of $i,j\ (i\neq j)$ . For the other RR models we may similarly produce empirical Bayes estimators and their MSE estimators.

3. Simulation results

We use the data given by Chaudhuri and Saha (2004) on $U=(1,2,\ldots,N)$ with $N=113$ households (hh), for which the following are given for $i\in U$ :

$t_{i}=1$ if one person chosen from the $i^{\rm th}$ hh prefers cricket to football, and 0 otherwise;

$a_{i}$ is the size of the $i^{\rm th}$ hh, of which one member is chosen and questioned;

$x_{i}$ is the $i^{\rm th}$ hh’s expenses on “necessaries” last month;

$y_{i}=1$ if the person chosen from the $i^{\rm th}$ hh earns in a dubious/clandestine way, and 0 if not.

The problem is to estimate $\theta=\frac{\sum\nolimits_{1}^{N}{y_{i}}}{N}$ .

A sample of $n=33$ hh’s is selected by the Hartley-Rao (1962) scheme using household size as the size-measure. Details given in the cited source are omitted. The table below gives, for the four RRT’s, performance criteria for (1) the original design-cum-RR based estimators for $\theta$ and (2) the revised logistic-cum-EB based alternatives. The criteria are (a) the Actual Coverage Percent (ACP), i.e. the percentage of 1000 replicated samples for which the confidence interval (CI) ${\rm est}\pm 1.96\sqrt{{\rm MSE(est)}}$ covers $\theta=\frac{93}{113}=0.823$ , (b) the average over the 1000 replicates of the estimated coefficient of variation (ACV) $=100\frac{\sqrt{{\rm MSE(est)}}}{\hat{\theta}}$ and (c) the Average Length (AL) of the calculated CI. The closer the ACP is to 95%, and the smaller the ACV and the AL, the better the estimate $\hat{\theta}$ for $\theta$ .

Table 1
Performances of original versus EB modifications in four RRT’s

RRT	ACP (Original/Revised)	ACV (Original/Revised)	AL (Original/Revised)
Warner’s	80.1/89.5	29.3/22.4	529.3/504.5
Boruch’s	83.1/93.8	28.9/24.3	623.9/723.5
Kuk’s	94.3/95.8	21.9/20.6	527.8/505.6
URL	95.3/94.9	26.5/18.3	424.3/428.5
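
The three criteria reported in Table 1 can be computed from replicated estimates along the following lines. This Python sketch is an illustration only: the function name, the use of closed intervals for coverage and the two toy replicates are assumptions, and the code does not reproduce the figures in the table.

```python
import math

def summarize(estimates, mse_estimates, theta_true, z=1.96):
    """Return ACP, ACV and AL over replicates of (estimate, estimated MSE)."""
    covered, cv_sum, len_sum = 0, 0.0, 0.0
    for est, mse in zip(estimates, mse_estimates):
        half = z * math.sqrt(mse)                 # half-length of the CI
        if est - half <= theta_true <= est + half:
            covered += 1
        cv_sum += 100.0 * math.sqrt(mse) / est    # estimated CV in percent
        len_sum += 2.0 * half
    n_rep = len(estimates)
    return {"ACP": 100.0 * covered / n_rep,
            "ACV": cv_sum / n_rep,
            "AL": len_sum / n_rep}

# Two toy replicates (hypothetical numbers) with theta = 0.823.
result = summarize([0.8, 0.5], [0.01, 0.0001], 0.823)
print(result)
```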

4. Discussion and conclusion

We follow Chaudhuri’s (2011) recommendation to choose varying probability samples when employing Randomized Response (RR) techniques (RRT). By way of illustration, we estimate a sensitive finite population proportion using a Horvitz-Thompson (HT, 1952) estimator, expressed as a function of suitable RR-based estimators of the indicator denoting whether a person bears the sensitive characteristic of interest. In lieu of such a traditional RR-based estimator, we try for a possible improvement an Empirical Bayes (EB) estimator, taking cues from Prasad and Rao’s (1990) work in the context of Small Area Estimation (SAE). With this we combine the possible advantage of utilizing auxiliary data by postulating an appropriate logistic regression model. Our empirical findings in Table 1 largely support these approaches: the revised estimators show a lower ACV for all four RRT’s and an ACP closer to 95% for Warner’s and Boruch’s RRT’s, although the AL increases for Boruch’s RRT and, marginally, for the URL model.

Acknowledgments

A reviewer’s helpful comments that led to this improved version are gratefully acknowledged.

References

Boruch, R. F. (1972). Relations among statistical methods for assuring confidentiality of social research data. Soc Sci Res, (1), 403-411.

Chaudhuri & Saha (2004). Utilizing covariates by logistic regression modeling in improved estimation of population proportions bearing stigmatizing features through randomized responses in complex surveys. Jour Ind Soc Agricultural Stat, 58(2), 190-211.

Chaudhuri (2011). Randomized Response and Indirect Questioning Techniques in Surveys. Chapman and Hall, CRC Press, Taylor & Francis Group, Boca Raton, FL.

Chaudhuri & Christofides, T. C. (2013). Indirect Questioning in Sample Surveys. Springer-Verlag, Berlin, Heidelberg.

Chaudhuri & Mukerjee (1988). Randomized Response: Theory and Techniques. Marcel Dekker, New York.

Fox, J. A., & Tracy, P. E. (1986). Randomized Response: A Method for Sensitive Surveys. Sage, London.

Fay, R. E., & Herriot, R. A. (1979). Estimation of income from small places: An application of James-Stein procedures to census data. J Am Stat Assoc, 74, 269-277.

Greenberg, B. G., Abul-Ela, A. A., Simmons, W. R., & Horvitz, D. G. (1969). The unrelated question randomized response model: Theoretical framework. J Am Stat Assoc, 64, 520-539.

Hartley, H. O., & Rao, J. N. K. (1962). Sampling with unequal probabilities and without replacement. Ann Math Stat, 33, 350-374.

Heijden, P. G. M., & Gils, G. V. (1996). Some logistic regression models for randomized response data. Statistical Modelling Proc 11th Int Workshop, Orvieto, Italy, 341-348.

Horvitz, D. G., & Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. J Amer Statist Assoc, 47, 663-685.

Horvitz, D. G., Shah, B. V., & Simmons, W. R. (1967). The unrelated question RR model. Proc Soc Stat Sec ASA, 65-72.

Kuk, A. Y. C. (1990). Asking sensitive questions indirectly. Biometrika, 77, 436-438.

Maddala, G. S. (1983). Limited Dependent and Qualitative Variables in Econometrics. Cambridge Press, New York.

Prasad, N. G. N., & Rao, J. N. K. (1990). The estimation of the mean squared errors of small area estimators. J Amer Statist Assoc, 85, 163-171.

Van der Heijden, P. G., & van Gils (1996). Some logistic regression models for randomized response data. Proceedings of the 11th International Workshop on Statistical Modelling, Orvieto, Italy, 15-19.

Warner, S. L. (1965). RR: A survey technique for eliminating evasive answer bias. Jour Amer Stat Assoc, 60, 63-69.
