To estimate the proportion of people bearing a sensitive characteristic in a community, a sample is selected with unequal probabilities and randomized response data are obtained. Supposing data on a related variable are also at hand, a model-design based estimation procedure modifying that of Chaudhuri and Saha (2004) is developed and studied. Four well-known Randomized Response (RR) methods are illustrated and a one-parameter logistic regression model is tried. Empirical Bayes estimation is examined and simulation results are presented to study the resulting efficacy.
Warner (1965) gave the pioneering randomized response (RR) technique (RRT) in a judicious attempt to gather trustworthy data on sensitive matters. Boruch (1972) provided an amended “Forced Response” technique and Kuk (1990) gave another generalization. A basic departure, needed to cover the case when a characteristic and its complement are both potentially sensitive (presumably not envisaged in Warner’s original model), is Simmons’s RRT, called the URL approach, as described by Greenberg et al. (1969) and Horvitz et al. (1967). Since Simmons did not describe this work elsewhere himself, no separate reference to him is given. Fox and Tracy (1986) and Chaudhuri and Mukerjee (1988) covered many procedures in a consolidated manner and demonstrated their uses. Chaudhuri (2011) has given a general treatment of each, permitting sample selection by general schemes and so freeing it from the erstwhile restriction to simple random sampling with replacement (SRSWR) alone. Simmons’s procedure involves one sensitive characteristic and another “unrelated” innocuous characteristic, so for simplicity it is called a URL model. Visualizing the existence of another variable shedding some informative light on the prime stigmatizing qualitative variable, Chaudhuri and Saha (2004) developed alternative estimation procedures that improve upon Kuk’s (1990) and Simmons’s URL procedures by utilizing data on an auxiliary variable through one-parameter logistic regression modeling, following Maddala (1983) and Van der Heijden and Van Gils (1996). Chaudhuri and Saha (2004) noted the failure of this logistic regression approach to cover Warner’s (1965) and Boruch’s (1972) Forced Response RRT’s. The present work addresses this deficiency. Throughout we consider unequal probability sampling, and consequently Chaudhuri’s (2011) as well as Chaudhuri and Christofides’s (2013) versions of each classical RRT are appropriately amended.
We find it convenient to follow Fay and Herriot (1979) and Prasad and Rao (1990) to obtain empirical Bayes estimators for the logistic regression model parameter. Finally, a model-design based estimation procedure is applied, and we point out how to measure the resulting accuracy in estimation. Details are given in Section 2 below. Section 3 presents a simulation-based numerical comparison, followed by concluding remarks.
In summary, our innovations are as follows. The classical RRT’s mostly used SRSWR, obtaining from each RR an unbiased estimator of the population proportion and hence employing the sample mean as the final estimator. We instead use each RR to unbiasedly estimate the respondent’s true value and, taking an unequal probability sample, are able to employ general linear unbiased estimators for the required proportion. We also extend Chaudhuri and Saha’s (2004) one-parameter logistic regression modeling to cover Warner’s and Boruch’s RRT’s, and we introduce an empirical Bayes estimation procedure for RR data.
Four classical RRT’s and concerned estimation
Classical RRT’s and estimation
Warner’s (1965) RRT
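A sampled person labelled $i$, chosen from a population $U = (1, \ldots, i, \ldots, N)$ of $N$ labelled persons, is approached with a box containing a large number of identical cards, a proportion $p$ $(0 < p < 1,\ p \neq 1/2)$ of them marked $A$ and the rest marked $A^c$. On request, he/she randomly draws one card from the box and responds $I_i = 1$ if the card type matches his/her true characteristic and $I_i = 0$ if it does not.
The person’s true value is $y_i$, which is 1 or 0 according as $i$ bears $A$ or its complement $A^c$. Writing $E_R$, $V_R$ generically for the RR-based expectation and variance operators, we have
$$E_R(I_i) = p\,y_i + (1-p)(1-y_i) = (1-p) + (2p-1)\,y_i, \qquad V_R(I_i) = p(1-p).$$
Then $r_i = \dfrac{I_i - (1-p)}{2p-1}$ satisfies $E_R(r_i) = y_i$ and $V_R(r_i) = \dfrac{p(1-p)}{(2p-1)^2} = V_i$, a known number.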
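As a minimal sketch of Warner’s device, the Python code below simulates one respondent’s answer and the usual unbiased transformation. The symbols are assumptions made for illustration: `p` is the proportion of cards marked A (with p different from 1/2), `y` is the respondent’s true 0/1 value, the response is 1 for a match and 0 otherwise, and the estimate is (response − (1 − p))/(2p − 1). The function names are hypothetical.

```python
import random

def warner_response(y, p, rng):
    """Return 1 if the drawn card matches the respondent's true status, else 0."""
    card_is_A = rng.random() < p          # card is marked A with probability p
    bears_A = (y == 1)
    return 1 if card_is_A == bears_A else 0

def warner_estimate(resp, p):
    """Transform one response into an unbiased estimate of y (needs p != 0.5)."""
    return (resp - (1 - p)) / (2 * p - 1)

def warner_variance(p):
    """RR variance of the estimate; it does not depend on y."""
    return p * (1 - p) / (2 * p - 1) ** 2

# Illustration: repeat the device many times for a respondent with y = 1.
rng = random.Random(2024)
p = 0.7
draws = [warner_estimate(warner_response(1, p, rng), p) for _ in range(20000)]
mean_r = sum(draws) / len(draws)          # should be close to the true y = 1
print(round(mean_r, 2), round(warner_variance(p), 4))
```

A single estimate is never 0 or 1 (here it is 1.75 or −0.75); only its RR expectation equals the true value, which is why such estimates are combined across respondents.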
Boruch’s (1972) forced response RRT
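When approached as in Warner’s RRT, the person labelled $i$ is given a pack of cards, a proportion $p_1$ of them marked 1, a proportion $p_2$ marked 0 and the remaining proportion $1 - p_1 - p_2$ $(> 0)$ marked “true”. His/her response, as instructed, will truthfully be $I_i = 1$ if the card drawn is marked 1, $I_i = 0$ if it is marked 0, and $I_i = y_i$, the person’s true value (1 or 0), if it is marked “true”.
Then, with $E_R$, $V_R$ denoting RR-based expectation and variance,
$$E_R(I_i) = p_1 + (1 - p_1 - p_2)\,y_i.$$
So, $r_i = \dfrac{I_i - p_1}{1 - p_1 - p_2}$ satisfies $E_R(r_i) = y_i$ and
$$V_R(r_i) = \frac{p_1(1-p_1) + y_i\,[\,p_2(1-p_2) - p_1(1-p_1)\,]}{(1 - p_1 - p_2)^2},$$
which is unbiasedly estimated on replacing $y_i$ by $r_i$.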
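The forced response device can be sketched in Python as follows. This is an illustration under assumed notation, not the authors’ code: `p1` and `p2` are the proportions of cards marked 1 and 0, the rest are marked “true”, and the estimate is (response − p1)/(1 − p1 − p2). Because the RR variance is linear in the true value, plugging the estimate in for the true value gives an unbiased variance estimate. Function names are hypothetical.

```python
import random

def forced_response(y, p1, p2, rng):
    """Simulate one forced-response answer for a respondent with true value y."""
    u = rng.random()
    if u < p1:
        return 1        # card marked 1: forced answer 1
    if u < p1 + p2:
        return 0        # card marked 0: forced answer 0
    return y            # card marked "true": truthful answer

def forced_estimate(resp, p1, p2):
    """Unbiased estimate of y from one response (needs p1 + p2 < 1)."""
    return (resp - p1) / (1 - p1 - p2)

def forced_variance_estimate(r, p1, p2):
    """Unbiased estimate of the RR variance of r, plugging r in for y."""
    v0 = p1 * (1 - p1)  # variance of the response when y = 0
    v1 = p2 * (1 - p2)  # variance of the response when y = 1
    return (v0 + r * (v1 - v0)) / (1 - p1 - p2) ** 2

rng = random.Random(7)
p1, p2 = 0.2, 0.3
sims = [forced_estimate(forced_response(1, p1, p2, rng), p1, p2) for _ in range(20000)]
mean_r = sum(sims) / len(sims)   # should be close to the true y = 1
print(round(mean_r, 2))
```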
Kuk’s (1990) RRT
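A box I with a proportion $\theta_1$ of “red” and the remaining “blue” cards, and a second box II with a proportion $\theta_2$ $(\theta_2 \neq \theta_1)$ of “red” and the remaining “blue” cards, are presented to a sampled person labelled $i$. He/she is asked to report the number $f_i$ of “red” cards drawn in $k$ independent random draws from box I if bearing $A$, or from box II if bearing $A^c$.
With $y_i$ (1 or 0) the person’s true value and $E_R$, $V_R$ the RR-based expectation and variance, the expected number of red cards drawn is, by the binomial distribution,
$$E_R(f_i) = k\,[\theta_1 y_i + \theta_2 (1 - y_i)],$$
and the variance is $V_R(f_i) = k\,[\theta_1(1-\theta_1)\,y_i + \theta_2(1-\theta_2)(1-y_i)]$.
Letting $r_i = \dfrac{f_i/k - \theta_2}{\theta_1 - \theta_2}$, we get $E_R(r_i) = y_i$ and $V_R(r_i) = \dfrac{\theta_1(1-\theta_1)\,y_i + \theta_2(1-\theta_2)(1-y_i)}{k\,(\theta_1-\theta_2)^2}$, which is unbiasedly estimated on replacing $y_i$ by $r_i$.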
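A short Python sketch of Kuk’s device follows, under assumed notation: `theta1` and `theta2` are the proportions of red cards in boxes I and II, `k` is the number of draws, `f` is the reported number of red cards, and the estimate is (f/k − theta2)/(theta1 − theta2). The helper `exact_mean_estimate` (a hypothetical name) sums over the binomial distribution to check unbiasedness exactly rather than by simulation.

```python
import random
from math import comb

def kuk_response(y, theta1, theta2, k, rng):
    """Number of red cards in k independent draws from box I (y=1) or box II (y=0)."""
    theta = theta1 if y == 1 else theta2
    return sum(1 for _ in range(k) if rng.random() < theta)

def kuk_estimate(f, theta1, theta2, k):
    """Unbiased estimate of y from the reported count f (needs theta1 != theta2)."""
    return (f / k - theta2) / (theta1 - theta2)

def kuk_variance_estimate(r, theta1, theta2, k):
    """Unbiased estimate of the RR variance of r, plugging r in for y."""
    v1 = theta1 * (1 - theta1)
    v2 = theta2 * (1 - theta2)
    return (v2 + r * (v1 - v2)) / (k * (theta1 - theta2) ** 2)

def exact_mean_estimate(y, theta1, theta2, k):
    """Exact RR expectation of the estimate via the binomial probabilities."""
    theta = theta1 if y == 1 else theta2
    return sum(
        comb(k, f) * theta ** f * (1 - theta) ** (k - f) * kuk_estimate(f, theta1, theta2, k)
        for f in range(k + 1)
    )

rng = random.Random(11)
f = kuk_response(1, 0.6, 0.3, 5, rng)
print(f, kuk_estimate(f, 0.6, 0.3, 5))
print(exact_mean_estimate(1, 0.6, 0.3, 5), exact_mean_estimate(0, 0.6, 0.3, 5))
```

Increasing `k` reduces the RR variance in proportion to 1/k, at the cost of a longer interview.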
Simmons’s URL RRT
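Let $y_i = 1$ if $i$ bears $A$ and 0 otherwise, and $x_i = 1$ if $i$ bears $B$ and 0 otherwise, where
$B$ is an innocuous characteristic unrelated to $A$. Two boxes I and II, containing $A$-marked and $B$-marked cards in proportions $p_1 : (1-p_1)$ in box I and $p_2 : (1-p_2)$ in box II $(p_1 \neq p_2)$, are presented to each sampled person labelled $i$. On request he/she independently draws one card randomly from each box to give the RR truthfully as $I_i = 1$ if the type of card drawn from box I matches his/her characteristic ($A$ or $B$) and $I_i = 0$ otherwise.
Similarly, $J_i$ is obtained using box II.
Then, with $E_R$, $V_R$ denoting RR-based expectation and variance,
$$E_R(I_i) = p_1 y_i + (1-p_1)\,x_i, \qquad E_R(J_i) = p_2 y_i + (1-p_2)\,x_i, \qquad r_i' = \frac{(1-p_2)\,I_i - (1-p_1)\,J_i}{p_1 - p_2}, \qquad E_R(r_i') = y_i.$$
This exercise is independently repeated once again to produce RR’s as $I_i'$, $J_i'$.
Then, $r_i'' = \dfrac{(1-p_2)\,I_i' - (1-p_1)\,J_i'}{p_1 - p_2}$ also has $E_R(r_i'') = y_i$.
Then, $r_i = \tfrac{1}{2}(r_i' + r_i'')$ satisfies $E_R(r_i) = y_i$, and $V_R(r_i) = V_i$ has an unbiased estimator $v_i = \tfrac{1}{4}(r_i' - r_i'')^2$. From now on we shall write $r_i$, $V_i$, $v_i$ generically, omitting any additional subscripts.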
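The two-box unrelated-characteristic device with one independent repetition can be sketched in Python as below. The notation is assumed for illustration: `y` and `x` are the true 0/1 values for the sensitive and the innocuous characteristic, `p1` and `p2` are the proportions of A-marked cards in boxes I and II, each round gives ((1 − p2)·I − (1 − p1)·J)/(p1 − p2), and the two rounds are averaged. Half the squared difference of the two rounds, divided by two again, estimates the RR variance of the average. Function names are hypothetical.

```python
import random

def url_pair(y, x, p1, p2, rng):
    """One round: draw a card from each box and answer about A or B accordingly."""
    I = y if rng.random() < p1 else x   # box I gives an A-card with probability p1
    J = y if rng.random() < p2 else x   # box II gives an A-card with probability p2
    return I, J

def url_single(I, J, p1, p2):
    """Unbiased estimate of y from one round (needs p1 != p2)."""
    return ((1 - p2) * I - (1 - p1) * J) / (p1 - p2)

def url_estimate(y, x, p1, p2, rng):
    """Average of two independent rounds and an unbiased estimate of its RR variance."""
    r1 = url_single(*url_pair(y, x, p1, p2, rng), p1, p2)
    r2 = url_single(*url_pair(y, x, p1, p2, rng), p1, p2)
    return (r1 + r2) / 2, (r1 - r2) ** 2 / 4

rng = random.Random(5)
results = [url_estimate(1, 0, 0.7, 0.2, rng) for _ in range(20000)]
mean_r = sum(r for r, _ in results) / len(results)   # close to the true y = 1
mean_v = sum(v for _, v in results) / len(results)   # close to the RR variance of r
print(round(mean_r, 2), round(mean_v, 2))
```

The repetition is what makes a variance estimate available without knowing the innocuous value `x`.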
Estimation
Supposing a sample from is selected with probability according to a design and choosing constants each free of elements of such that
it is well-known, vide Chaudhuri (2011) that is an estimator for with the following properties where ,
Let denote the design based operators for expectation, variance and let and Then it is well known, vide [3] that
are two estimators for satisfying ; here is if is known or is otherwise such that
Citing these classical results from the literature let us suppose that the values closely related to are available.
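To make the design-based step concrete, here is a hedged Python sketch of one familiar special case of such a linear estimator: the Horvitz-Thompson form applied to RR-based estimates. It is an illustration under stated assumptions, not the paper’s general formula. The assumptions are a fixed-sample-size design with positive second-order inclusion probabilities `pij`, RR-based unbiased estimates `r` with RR variance estimates `v`, and a Yates-Grundy type variance estimator augmented by the sum of `v[i]/pi[i]`. All names are hypothetical.

```python
def ht_proportion(r, pi, N):
    """Horvitz-Thompson type estimate of the population proportion from RR estimates."""
    return sum(ri / p for ri, p in zip(r, pi)) / N

def ht_variance_estimate(r, v, pi, pij, N):
    """Yates-Grundy form on the r's plus the RR component (fixed-size design assumed)."""
    n = len(r)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            weight = (pi[i] * pi[j] - pij[i][j]) / pij[i][j]
            total += weight * (r[i] / pi[i] - r[j] / pi[j]) ** 2
    total += sum(vi / p for vi, p in zip(v, pi))   # contribution of the RR device
    return total / N ** 2

# Toy example: simple random sampling without replacement, n = 2 out of N = 4,
# so pi = n/N = 0.5 and pij = n(n-1)/(N(N-1)) = 1/6; responses from Warner's device.
r = [1.75, -0.75]
v = [1.3125, 1.3125]
pi = [0.5, 0.5]
pij = [[0.5, 1 / 6], [1 / 6, 0.5]]
est = ht_proportion(r, pi, 4)
var_est = ht_variance_estimate(r, v, pi, pij, 4)
print(est, var_est)
```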
Model based approach: One-parameter logistic regression modeling
Chaudhuri and Saha (2004) postulated for noting that a one-parameter logistic regression model and for an estimator for , the model , applicable provided postulating a suitable probability distribution for the error term They pointed out the inapplicability of this approach for improved estimation of to (a) Warner’s RRT and (b) Forced Response RRT of Boruch (1972) because the requisite conditions on , (generically) do not hold. So, we proceed as follows.
Warner’s RRT
Let
Then,
Similarly substituting in the expression of we may write
Also,
On choosing subject to Eqs (1)–(4) it is ensured that and .
Forced response RRT
Then, substituting in (ii) we may write
Similarly,
With appropriate both take values in the open interval (0, 1).
Also, .
Although Kuk’s RRT and Simmons’s URL posed no problem for Chaudhuri and Saha (2004) in using one-parameter logistic regression modeling, let us still note the following.
Kuk’s (1990) RRT
Here
both belong to (0, 1) for . Now,
URL model of Simmons
Let
Both are points in (0, 1). From now on we shall write generically omitting additional subscripts.
For simplicity, let us write
Empirical Bayes estimation
Let, further, the model satisfy
here is an unknown constant and is another unknown constant.
The two unknown constants will be estimated iteratively following the empirical best linear unbiased predictor (EBLUP) approach under the Fay-Herriot (1979) model, as described by Prasad and Rao (1990) for small area estimation. The procedure is briefly described as follows.
Letting
and noting
is distributed as a chi-square variable with degrees of freedom equal to () supposing is the sample-size, it is easy to estimate by iteration and by method of moments aided by iteration (Newton-Raphson) to solve the Eq. (7) ().
Let these estimates be and Then, may be estimated by the empirical Bayes estimate
Then, is estimated by
We follow Prasad and Rao (1990) to estimate the mean square error (MSE) of in the following way:
Let
Then the MSE of is estimated as .
Then, MSE of is estimated on noting the well-known formula using Taylor Series
If is estimated by then is estimated by .
Then the Empirical Best Linear Unbiased Predictor for in Warner’s (1965) model is the following:
where is the true MSE of the estimate of .
Let us note from Chaudhuri (2011) that is an estimator for with the following properties.
Let denote the design based operator for expectation, variance and the model based operators for expectation and variance. Let and . Then it may be noted that Here considering Horvitz-Thompson’s (HT, 1952) approach the estimator for is where is defined as the first order inclusion probability of as defined as .
(for Warner’s Model).
where MSE of equal to .
Now,
The variance may be estimated by
where
here is described as the second order inclusion probability of as . For the other RR models we may similarly produce empirical Bayes estimators and their MSE estimators.
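The iterative estimation step of this section can be illustrated with a generic Fay-Herriot type fit. The Python sketch below is an assumption-laden illustration rather than the authors’ exact procedure: it takes direct estimates `z`, a single covariate `x` with no intercept, known sampling variances `D`, a regression coefficient `beta` estimated by weighted least squares, and a model variance `A` obtained by solving a moment equation with n − 1 degrees of freedom through Newton-type updates truncated at zero. The empirical Bayes estimate then shrinks each `z[i]` towards `beta * x[i]`. All names are hypothetical.

```python
def wls_beta(z, x, D, A):
    """Weighted least squares slope with weights 1 / (A + D_i)."""
    w = [1.0 / (A + d) for d in D]
    num = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
    den = sum(wi * xi * xi for wi, xi in zip(w, x))
    return num / den

def fit_fay_herriot(z, x, D, tol=1e-10, max_iter=200):
    """Estimate (beta, A) by alternating WLS and a moment equation for A."""
    n = len(z)
    A = 0.0
    for _ in range(max_iter):
        beta = wls_beta(z, x, D, A)
        res2 = [(zi - beta * xi) ** 2 for zi, xi in zip(z, x)]
        Q = sum(r / (A + d) for r, d in zip(res2, D))
        slope = sum(r / (A + d) ** 2 for r, d in zip(res2, D))
        if slope == 0.0:
            break                       # perfect fit: keep the current A
        A_new = max(0.0, A + (Q - (n - 1)) / slope)
        converged = abs(A_new - A) < tol
        A = A_new
        if converged:
            break
    return wls_beta(z, x, D, A), A

def eb_estimates(z, x, D, beta, A):
    """Shrink each direct estimate towards the regression prediction."""
    return [(A * zi + d * beta * xi) / (A + d) for zi, xi, d in zip(z, x, D)]

z = [0.2, 0.5, 0.9, 0.1, 0.7, 0.4]
x = [1.0] * 6
D = [0.01] * 6
beta_hat, A_hat = fit_fay_herriot(z, x, D)
eb = eb_estimates(z, x, D, beta_hat, A_hat)
print(round(beta_hat, 4), round(A_hat, 4), [round(e, 3) for e in eb])
```

The shrinkage weight A/(A + D_i) is close to 1 when the sampling variance is small relative to the model variance, so precise direct estimates are left almost unchanged.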
Simulation results
We use the data given by Chaudhuri and Saha (2004) on with 113 households for which are given for , 1 if one person chosen from the household (hh) prefers cricket to football, 0, else.
is the size of i household of which one member is chosen and questioned.
is the i hh expenses on “necessaries” last month.
if the person chosen from i hh earns in dubious/clandestine way/if not.
Problem is to estimate .
A sample of 33 hh’s is selected by the Hartley-Rao (1962) scheme using household size as the size-measure. Details given in the cited source are omitted. The table below gives, for the four RRT’s, performance criteria for (1) the original design-cum-RR based estimators of the population proportion and (2) the revised logistic-cum-EB based alternatives. The criteria are (a) the Actual Coverage Percentage (ACP), i.e. the percentage of the 1000 replicated samples for which the 95% confidence interval (CI) covers the true proportion, (b) the average over the 1000 replicates of the estimated coefficient of variation (ACV), and (c) the Average Length (AL) of the calculated CI’s. The closer the ACP is to 95%, and the smaller the ACV and the AL, the better the estimator.
Performances of original versus EB modifications in four RRT’s

| RRT | ACP (Original/Revised) | ACV (Original/Revised) | AL (Original/Revised) |
|---|---|---|---|
| Warner’s | 80.1/89.5 | 29.3/22.4 | 529.3/504.5 |
| Boruch’s | 83.1/93.8 | 28.9/24.3 | 623.9/723.5 |
| Kuk’s | 94.3/95.8 | 21.9/20.6 | 527.8/505.6 |
| URL | 95.3/94.9 | 26.5/18.3 | 424.3/428.5 |
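The three performance criteria can be computed from replicated samples along the following lines. This Python sketch assumes a normal-theory interval of the form estimate ± 1.96 × (estimated standard error) and a coefficient of variation expressed as a percentage of the estimate. Neither the interval form nor any scaling behind the tabulated AL values is confirmed by the text, so the numbers it produces are not directly comparable with the table. The function name is hypothetical.

```python
from math import sqrt

def performance_criteria(estimates, variance_estimates, theta, z=1.96):
    """Return (ACP, ACV, AL) over replicates; assumes nonzero estimates."""
    covered = 0
    cvs, lengths = [], []
    for e, v in zip(estimates, variance_estimates):
        half = z * sqrt(v)                    # half-length of the interval
        if e - half <= theta <= e + half:
            covered += 1
        cvs.append(100.0 * sqrt(v) / e)       # estimated CV in percent
        lengths.append(2.0 * half)
    m = len(estimates)
    return 100.0 * covered / m, sum(cvs) / m, sum(lengths) / m

# Two toy replicates with true proportion 0.45.
acp, acv, al = performance_criteria([0.5, 0.6], [0.01, 0.04], 0.45)
print(acp, round(acv, 2), round(al, 3))
```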
Discussion and conclusion
We follow Chaudhuri’s (2011) recommendation to choose varying probability samples when employing Randomized Response (RR) techniques (RRT). By way of illustration, we estimate a sensitive finite population proportion using a Horvitz-Thompson (HT, 1952) estimator as a function of suitable RR-based estimators of the indicator of a person’s bearing the sensitive characteristic of interest. In lieu of such a traditional RR-based estimator, we try for a possible improvement through an Empirical Bayes (EB) estimator, taking our cue from Prasad and Rao’s (1990) work in the context of Small Area Estimation (SAE). We combine this with the possible advantage of utilizing auxiliary data by postulating an appropriate logistic regression model. Our empirical findings, as illustrated in the table, largely support these approaches: the ACV falls for all four RRT’s and the ACP moves markedly closer to 95% for Warner’s and Boruch’s RRT’s, although the AL increases for Boruch’s RRT and marginally for the URL model.
Acknowledgments
A reviewer’s helpful comments that led to this improved version are gratefully acknowledged.
References
1. Boruch, R. F. (1972). Relations among statistical methods for assuring confidentiality of social research data. Soc Sci Res, (1), 403-411.
2. Chaudhuri, A., & Saha, A. (2004). Utilizing covariates by logistic regression modeling in improved estimation of population proportions bearing stigmatizing features through randomized responses in complex surveys. Jour Ind Soc Agricultural Stat, 58(2), 190-211.
3. Chaudhuri, A. (2011). Randomized Response and Indirect Questioning Techniques in Surveys. Chapman and Hall, CRC Press, Taylor & Francis Group, Boca Raton, FL.
4. Chaudhuri, A., & Christofides, T. C. (2013). Indirect Questioning in Sample Surveys. Springer-Verlag, Berlin, Heidelberg.
5. Chaudhuri, A., & Mukerjee, R. (1988). Randomized Response: Theory and Techniques. Marcel Dekker, New York.
6. Fox, J. A., & Tracy, P. E. (1986). Randomized Response: A Method for Sensitive Surveys. Sage, London.
7. Fay, R. E., & Herriot, R. A. (1979). Estimation of income from small places: An application of James-Stein procedures to census data. J Am Stat Assoc, 74, 269-277.
8. Greenberg, B. G., Abul-Ela, A. A., Simmons, W. R., & Horvitz, D. G. (1969). The unrelated question randomized response model: Theoretical framework. J Am Stat Assoc, 64, 520-539.
9. Hartley, H. O., & Rao, J. N. K. (1962). Sampling with unequal probabilities and without replacement. Ann Math Stat, 33, 350-374.
10. Heijden, P. G. M., & Gils, G. V. (1996). Some logistic regression models for randomized response data. Statistical Modelling Proc 11th Int Workshop, Orvieto, Italy, 341-348.
11. Horvitz, D. G., & Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. J Am Stat Assoc, 47, 663-685.
12. Horvitz, D. G., Shah, B. V., & Simmons, W. R. (1967). The unrelated question RR model. Proc Soc Stat Sec ASA, 65-72.
13. Kuk, A. Y. C. (1990). Asking sensitive questions indirectly. Biometrika, 77, 436-438.
14. Maddala, G. S. (1983). Limited Dependent and Qualitative Variables in Econometrics. Cambridge Press, New York.
15. Prasad, N. G. N., & Rao, J. N. K. (1990). The estimation of the mean squared errors of small area estimators. J Am Stat Assoc, 85, 163-171.
16. Van der Heijden, P. G., & van Gils, G. (1996). Some logistic regression models for randomized response data. Proceedings of the 11th International Workshop on Statistical Modelling, Orvieto, Italy, 15-19.
17. Warner, S. L. (1965). RR: A survey technique for eliminating evasive answer bias. J Am Stat Assoc, 60, 63-69.