A Logistic Regression Extension for the Randomized Response Simple and Crossed Models: Theoretical Results and Empirical Evidence

Abstract

We propose some theoretical and empirical advances by supplying the methodology for analyzing the factors that influence two sensitive variables when data are collected by randomized response (RR) survey modes. First, we provide the framework for obtaining the maximum likelihood estimates of logistic regression coefficients under the RR simple and crossed models, then we carry out a simulation study to assess the performance of the estimation procedure. Finally, logistic regression analysis is illustrated by considering real data about cannabis use and legalization and about abortion and illegal immigration. The empirical results bring out certain considerations about the effect of the RR and direct questioning survey modes on the estimates. The inference about the sign and the significance of the regression coefficients can contribute to the debate on whether the RR approach is an effective survey method to reduce misreporting and improve the validity of analyses.

Keywords

maximum likelihood estimates, prevalence estimates, privacy protection, randomized response theory, sensitive questions

In survey practice, it is a well-established fact that investigating sensitive issues concerning deviant, undesirable, illegal, dishonest, incriminating, or simply personal and confidential traits by direct questioning (DQ) produces high nonresponse rates and misleading reporting. Nonresponse and untruthful responses represent nonsampling errors that can seriously flaw the quality of the collected data with harmful consequences for the validity of the final analyses. The problem cannot be completely solved, but it may be somewhat limited by assuring respondents of a high degree of anonymity, thereby enhancing their cooperation. Many solutions have been proposed to serve this purpose. One approach for encouraging greater cooperation is the elicitation of the requested information without posing any direct sensitive question to the survey participants who are thus not obliged to openly declare whether they bear stigmatizing traits. Hence, privacy is protected since true responses remain known only to the respondents, and consequently, their true status remains uncertain and undisclosed to both the interviewer and the researcher. This approach covers different procedures which are grouped under the general term of indirect questioning techniques (e.g., Chaudhuri 2011; Chaudhuri and Christofides 2013). In terms of the amount of contributions produced in the literature, the randomized response (RR) theory (RRT) plays a role of primary importance among the indirect questioning survey modes. The technique was introduced by Warner (1965) in order to collect reliable sensitive information without jeopardizing respondents’ privacy. Greenberg et al. (1969) proposed a first variant on the Warner method, called the unrelated-question model, which requires one sensitive statement and one innocuous, neutral statement (e.g., “My mother was born between January and March”). 
The RRT, initially conceived for dealing with binary variables denoting the presence or absence of a sensitive behavior, was rapidly adapted for more complex situations and has generated over time a vast amount of literature, including both theoretical and empirical contributions, most of which are included in the monographs by Fox and Tracy (1986), Chaudhuri and Mukerjee (1988), Chaudhuri (2011), Chaudhuri and Christofides (2013), Tian and Tang (2014), Chaudhuri, Christofides, and Rao (2016), Fox (2016), and in the two special issues of the international journal Model Assisted Statistics and Applications (Vol. 9, No. 1, 2014; Vol. 10, No. 4, 2015).

The idea underlying the binary RRT is intuitive and fascinating. Respondents are provided with a randomization device (for instance, a die, a deck of cards, a spinner, a box with colored and numbered balls) which is used to select one question from a set, some or all of which are related to the sensitive variable. The outcome of the device determines which question is selected and answered with a “yes” or “no” response. Since the respondent is instructed not to reveal to anyone which question is being answered, and both the interviewer and the researcher are kept blind about the outcome of the device, the technique enables the respondents to reply without revealing their true status, which remains unknown, and consequently, privacy is protected. Although the individual responses cannot be used to discover the true status of the respondents, the responses collected from all the survey participants can be used for inferential purposes. In fact, the randomization device, designed and controlled by the researcher, generates a probabilistic relation between the sensitive question and a released answer which is used to make inferences about unknown parameters of interest, in particular the prevalence of the sensitive attribute in the target population.

The Warner procedure applies to a population which is ideally divided into two mutually exclusive groups according to whether respondents have or do not have the sensitive attribute in question. However, in real studies, interest may be focused on more than one sensitive variable. With this in mind, Christofides (2005) proposed an RR procedure for investigating two sensitive characteristics at the same time using two randomization devices. Tian et al. (2007) proposed a method for assessing the association of two binary variables which uses a nonsensitive question instead of a randomization device. Lee, Sedory, and Singh (2013) proposed two different RR designs, called the simple model and the crossed model, to simultaneously collect data on two sensitive attributes and estimate their prevalence in the population together with other characteristics of association. The recent contribution by Ewemooje, Amahia, and Abedola (2017) also goes in the direction of examining the prevalence of two related sensitive characteristics. Perri, Pelle, and Stranges (2016) employed the crossed model by Lee et al. (2013) in a real survey to investigate induced abortion and the illegal immigration status of female immigrants. Although during the study a number of sociodemographic variables were surveyed by means of face-to-face interviews, the main emphasis of the research focused on estimating the prevalence of the two sensitive attributes, for the entire population and subgroups of it, without using other individual characteristics. Indeed, in social and behavioral sciences, interest generally goes beyond determining the prevalence of deviant behaviors, stigmatizing traits, or incriminating attitudes but rather extends to the associations between several variables. 
Motivated by this need, the present article aims at combining the collected RR data with other nonsensitive information on respondents and estimating the determinants of the sensitive behaviors by including available covariates in a bivariate logistic regression model that assumes, as dependent variables, the observed “yes” and “no” responses stemming from the RR crossed model questioning format. For completeness, theoretical developments are also extended to the simple model, even if no empirical analysis is supplied.

Due to the artificial variance that affects the response variable when subject to RR, traditional regression techniques are not suitable for RR data and need to be adapted to this new framework. The idea of applying logistic regression models to RR dates back to Maddala (1983), who complained about the lack of methods for estimating the factors that affect sensitive behaviors. Scheers and Dayton (1998) developed logistic regression models for the Warner device and the unrelated-question technique. Van den Hout, Van der Heijden, and Gilchrist (2007) elaborated the multivariate logistic regression model for two RR variables as presented by Glonek and McCullagh (1995), while Corstange (2009) proposed a method to estimate the parameters of a hidden logistic regression model. For the unrelated-question technique, Hsieh, Lee, and Shen (2010) presented a logistic regression model in the presence of missing covariates. Recently, Hsieh et al. (2016) used a logistic regression model based on an RR design which jointly considers the Warner method and unrelated-question technique and can effectively improve the efficiency of the maximum likelihood (ML) estimator of Scheers and Dayton (1998). Cruyff et al. (2016) provided a review of regression procedures for RR data, including univariate and multivariate logistic regression. Applications of the RR logistic regression model may be found, among others, in Kerkvliet (1994), Elffers, Van der Heijden, and Hezemans (2003), Lensvelt-Mulders et al. (2006), Van den Hout et al. (2007), Jann, Jerke, and Krumpal (2012), Krumpal (2012), Wolter and Preisendörfer (2013), and Korndörfer, Krumpal, and Schmukle (2014).

Following the theoretical approach discussed in Van den Hout et al. (2007), in this article, we extend bivariate logistic regression analysis to RR data that are assumed to be produced according to the simple model and crossed model. For both models, methodological aspects are supported by simulated experiments and, for the RR crossed design only, by two empirical analyses performed on real RR data. Hopefully, our contribution can support and further the applications of these RR models.

The outline of the article is as follows. First, we introduce the notation and the design for the simple and crossed models in the absence of covariates. Then, we discuss a bivariate logistic regression model under the aforesaid RR designs and provide the theory for ML estimation. Afterward, we present the results of a Monte Carlo simulation study carried out to investigate the finite-sample properties of the estimates of the logistic regression coefficients for the simple and crossed models. Next, we provide two applications based on real RR data collected with the crossed model. The first application is a study conceived to investigate two sensitive variables, cannabis use and cannabis legalization, using both RR and DQ survey modes; the second revisits the RR survey presented in Perri et al. (2016) related to induced abortion and illegal immigration status of female immigrants. Some final comments conclude the work.

Method

A Review of the Warner Model

We start with the design of the original related-question RR model conceived by Warner (1965) as a way of introducing the notation and making a bridge with the RR methods suggested by Lee et al. (2013). The basic idea is to randomize whether a respondent has to answer the sensitive statement (A) or its inverse (A^c ). In order to illustrate the Warner procedure, let us assume that each female respondent is provided with a deck of cards marked with the following two statements:

A: I have had an abortion.

A^c : I have never had an abortion.

Let p denote the proportion of cards marked with the statement A, and $1 - p$ the proportion with the statement A^c . The aim is to estimate the proportion (prevalence) of women in the population who have had an abortion. The respondents are instructed to shuffle the deck, select a card, and, without revealing to anyone the type of selected card, report a truthful “yes” or “no” response according to whether their true status matches the statement on the card. Since both the interviewer and the researcher are completely unaware of the statement to which the released response refers, the true status of the respondents about abortion remains undisclosed and, hence, their privacy is protected.

To set a general framework for the theoretical development, let, without loss of generality, $Y_{1}$ be a latent RR binary variable (i.e., a sensitive variable subject to RR) that models the response to the sensitive statement (i.e., statement A in the aforementioned example), with $Y_{1} = 1$ if the respondent has the sensitive attribute and $Y_{1} = 0$ otherwise. Let $P (Y_{1} = 1)$ be the unknown prevalence of the sensitive attribute in the population. Moreover, let $S_{1}$ be a binary variable based on a game of chance (i.e., on a randomization device) which models the statement to be answered. Hence, $S_{1} = 1$ if the device produces statement A, and $S_{1} = 0$ if the device redirects the respondent to statement A^c. Assume that $P (S_{1} = 1) = p$ and $P (S_{1} = 0) = 1 - p$, where the probability p is a design parameter fixed in advance by the researcher to ensure adequate levels of privacy protection and accuracy of the estimates. Finally, let $Y_{1}^{*}$ be a binary variable that models the observed answers to statements A or A^c, with $Y_{1}^{*} = 1$ if a “yes” response is observed and $Y_{1}^{*} = 0$ in case of a “no” response. We readily note that $Y_{1}^{*} = Y_{1}$ with probability p, and $Y_{1}^{*} = 1 - Y_{1}$ with probability $1 - p$. Then, under the Warner design, a probabilistic relationship between $Y_{1}$ and $Y_{1}^{*}$ can be established by computing the probability of answering “yes”:

\begin{array}{l} P (Y_{1}^{*} = 1) = P (Y_{1}^{*} = 1 | Y_{1} = 0) \times P (Y_{1} = 0) + P (Y_{1}^{*} = 1 | Y_{1} = 1) \times P (Y_{1} = 1) \\ = (1 - p) \times P (Y_{1} = 0) + p \times P (Y_{1} = 1) . \end{array}

Similarly, we can determine the probability $P (Y_{1}^{*} = 0)$ of observing a “no” response. From the expressions for $P (Y_{1}^{*} = 1)$ and $P (Y_{1}^{*} = 0)$, it is apparent that the RR latent variable $Y_{1}$ is a misclassified variable, where the probabilities $p_{j | k} = P (Y_{1}^{*} = j | Y_{1} = k)$ for $j, k \in \{0, 1\}$ denote conditional misclassification probabilities fixed by the randomization device. Now, let $λ_{1} = {(P (Y_{1}^{*} = 0), P (Y_{1}^{*} = 1))}^{t}$ denote the vector of the probabilities of the observable “no” and “yes” RRs, $π_{1} = {(P (Y_{1} = 0), P (Y_{1} = 1))}^{t}$ the vector of the probabilities of the true latent “no” and “yes” responses, and

P = (\begin{matrix} p_{0 | 0} & p_{0 | 1} \\ p_{1 | 0} & p_{1 | 1} \end{matrix}) = (\begin{matrix} p & 1 - p \\ 1 - p & p \end{matrix}),

the misclassification (or transition) matrix of $(Y_{1}, Y_{1}^{*})$ which links the true probabilities of the latent states $π_{1}$ to the expected observable proportions of responses $λ_{1}$ . Accordingly, the Warner design may be expressed in the matrix form:

λ_{1} = P π_{1} . \qquad (1)

The prevalence estimate of the sensitive attribute A can be obtained from equation (1). If P is nonsingular and ${\hat{λ}}_{1}$ denotes the vector of the observed proportions of “no” and “yes” responses in a sample selected at random from the population, then $π_{1}$ can be estimated by the moment estimator:

{\hat{π}}_{1} = P^{- 1} {\hat{λ}}_{1} . \qquad (2)
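As a minimal numerical sketch of the moment estimator, the following Python snippet builds the Warner transition matrix and solves the linear system for hypothetical counts. The function names, the counts (130 “yes” answers out of 400), and the value p = 0.8 are illustrative assumptions, not data from the surveys discussed in this article.

```python
import numpy as np

def warner_transition(p):
    """Transition matrix P with entries p_{j|k} = P(Y* = j | Y = k)."""
    return np.array([[p, 1.0 - p],
                     [1.0 - p, p]])

def warner_moment_estimate(n_no, n_yes, p):
    """Moment estimate of pi_1 = (P(Y = 0), P(Y = 1)) from observed counts."""
    if np.isclose(p, 0.5):
        # For p = 0.5 the two columns of P coincide and P cannot be inverted.
        raise ValueError("P is singular when p = 0.5")
    lam_hat = np.array([n_no, n_yes], dtype=float) / (n_no + n_yes)
    # Solve P pi = lam_hat rather than forming the inverse explicitly.
    return np.linalg.solve(warner_transition(p), lam_hat)

# Hypothetical sample: 270 "no" and 130 "yes" responses, design parameter p = 0.8.
pi_hat = warner_moment_estimate(n_no=270, n_yes=130, p=0.8)
print(pi_hat)
```

With these numbers the estimated prevalence equals (0.325 − 0.2) / 0.6, roughly 0.208. In small samples the moment estimate can fall outside the unit interval, which is one reason to prefer likelihood-based estimation.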

Let us now assume that a second sensitive characteristic, related to the first, has to be surveyed. For instance, the legal status of female immigrants in a country may be of interest to a researcher investigating induced abortion (see, e.g., Perri et al. 2016). Hence, the second sensitive attribute to be studied is immigration status, and the aim is to estimate the prevalence of women without legal status.

As in the previous example, a deck of cards marked with the following two statements is used to randomize the response:

B: I am an irregular immigrant.

B^c : I am a regular immigrant.

Hence, according to the previous setting, and without loss of generality, let $Y_{2}$, $S_{2}$, and $Y_{2}^{*}$ denote binary variables that model, respectively, the true response, the outcome of a second randomization device, and the observable misclassified responses. Moreover, let $π_{2} = {(P (Y_{2} = 0), P (Y_{2} = 1))}^{t}$, $P (S_{2} = 1) = q$, and $λ_{2} = {(P (Y_{2}^{*} = 0), P (Y_{2}^{*} = 1))}^{t}$. In a univariate setting, the Warner design produces an estimate of the prevalence of the second sensitive attribute in a similar fashion as shown in equation (2):

{\hat{π}}_{2} = Q^{- 1} {\hat{λ}}_{2},

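where $Q$ denotes the transition matrix of $(Y_{2}, Y_{2}^{*})$, obtained from $P$ by replacing $p$ with $q$, and ${\hat{λ}}_{2}$ is the vector of the observed proportions of “no” and “yes” responses to the second question.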

Indeed, when two (or more) correlated RR-sensitive variables are surveyed simultaneously in the same population, the emphasis may be placed on different prevalence estimates and on measures of association. In this spirit, Lee et al. (2013) proposed two methods for studying the association between two sensitive characteristics and estimating many different population parameters from a single sample and a pair of responses from each survey participant. Hence, using the notation already introduced for the two observable variables $Y_{1}^{*}$ and $Y_{2}^{*}$, let $π_{k l}^{*} = P (Y_{1}^{*} = k, Y_{2}^{*} = l)$ for $k, l \in \{0, 1\}$, and $π^{*} = {(π_{00}^{*}, π_{01}^{*}, π_{10}^{*}, π_{11}^{*})}^{t}$ be the vector of the probabilities of the observable responses. Next, considering both the latent binary variables $Y_{1}$ and $Y_{2}$, let $π_{k l} = P (Y_{1} = k, Y_{2} = l)$, and $π = {(π_{00}, π_{01}, π_{10}, π_{11})}^{t}$ be the vector of the probabilities of the unobservable true responses.

After these preliminary remarks, in the next two sections, we describe the Lee et al. (2013) simple and crossed models in terms of misclassification probabilities. For simplicity, we continue to refer to the two sensitive attributes previously delineated through statements A, A^c, B, and B^c. Specifically, we define the transition matrix associated with each of the two RR designs and express the designs in matrix form, so that they fit the framework outlined by Van den Hout et al. (2007) and can, hence, be incorporated in an RR bivariate logistic regression model.

Simple Model

The implementation of the simple model may be described as follows. Each respondent is provided with two randomization devices such as two decks of cards, say deck I and deck II, with deck I containing cards marked with the statements A and A^c , and deck II with cards of type B and B^c as in the previous section. Respondents are asked to shuffle the two decks, select a card from each deck, and report a simultaneous response to the statements on the selected cards. Hence, each respondent provides a couple of responses which may be (no, no), (no, yes), (yes, no), or (yes, yes). Note that each deck of cards acts as a Warner randomization device.

According to the results previously shown for the Warner design, the misclassification for $(Y_{1}, Y_{2})$ is described by the transition matrix $P_{s} = P \otimes Q$ (see Online Appendix [which can be found at http://smr.sagepub.com/supplemental/]), where $\otimes$ denotes the Kronecker product:

P_{s} = (\begin{matrix} p q & p (1 - q) & (1 - p) q & (1 - p) (1 - q) \\ p (1 - q) & p q & (1 - p) (1 - q) & (1 - p) q \\ (1 - p) q & (1 - p) (1 - q) & p q & p (1 - q) \\ (1 - p) (1 - q) & (1 - p) q & p (1 - q) & p q \end{matrix}) .

Hence, the misclassification design for the RR simple model may be expressed as:

π^{*} = P_{s} π,

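and $π$ can be estimated, provided that $P_{s}$ is nonsingular, by replacing $π^{*}$ with the vector ${\hat{π}}^{*}$ of the observed proportions of the four response pairs:

{\hat{π}}^{s} = P_{s}^{- 1} {\hat{π}}^{*} .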
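The Kronecker structure of the simple model can be checked numerically. The sketch below, under assumed values p = 0.8 and q = 0.7 and a hypothetical vector of true cell probabilities, builds the transition matrix with `numpy.kron`, maps the true probabilities to the observable ones, and recovers them by solving the linear system. The helper names are illustrative.

```python
import numpy as np

def warner_matrix(r):
    """2 x 2 Warner transition matrix for design parameter r."""
    return np.array([[r, 1.0 - r],
                     [1.0 - r, r]])

def simple_transition(p, q):
    """Transition matrix of the simple model, P_s = P (Kronecker) Q.

    Rows index the observed pair (Y1*, Y2*) and columns the true pair (Y1, Y2),
    both in the order 00, 01, 10, 11.
    """
    return np.kron(warner_matrix(p), warner_matrix(q))

p, q = 0.8, 0.7                                  # assumed design parameters
P_s = simple_transition(p, q)

pi_true = np.array([0.55, 0.15, 0.20, 0.10])     # hypothetical (pi00, pi01, pi10, pi11)
pi_star = P_s @ pi_true                          # probabilities of the observable pairs
pi_back = np.linalg.solve(P_s, pi_star)          # inversion recovers the true probabilities

print(np.round(P_s, 3))
print(np.round(pi_star, 4), np.round(pi_back, 4))
```

Each column of the matrix sums to one, as it must for a set of conditional probabilities. In practice the observable probabilities are replaced by sample proportions, so the recovered vector is an estimate and may have entries outside the unit interval.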

Crossed Model

The procedure for collecting data is the same as for the simple model, the only difference being the statements marked on the cards of the two decks. In fact, a proportion p of cards in deck I are marked with statement A and the remainder with statement B^c. Conversely, cards in deck II show the statements B and A^c in proportions q and $1 - q$, respectively. Thus, we have $Y_{1}^{*} = Y_{1}$ with probability p, $Y_{1}^{*} = 1 - Y_{2}$ with probability $1 - p$, $Y_{2}^{*} = Y_{2}$ with probability q, and $Y_{2}^{*} = 1 - Y_{1}$ with probability $1 - q$.

For this design, the transition matrix of $(Y_{1}, Y_{2})$ and $(Y_{1}^{*}, Y_{2}^{*})$ cannot be easily derived using the Kronecker product as in the simple model. A few more calculations (see Online Appendix [which can be found at http://smr.sagepub.com/supplemental/]) are required to yield:

P_{c} = (\begin{matrix} p q & 0 & 0 & (1 - p) (1 - q) \\ p (1 - q) & 1 & 0 & (1 - p) q \\ (1 - p) q & 0 & 1 & p (1 - q) \\ (1 - p) (1 - q) & 0 & 0 & p q \end{matrix}) .

Accordingly, the crossed model can be expressed as:

π^{*} = P_{c} π,

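and $π$ can be estimated, provided that $P_{c}$ is nonsingular, by replacing $π^{*}$ with the vector ${\hat{π}}^{*}$ of the observed proportions of the four response pairs:

{\hat{π}}^{c} = P_{c}^{- 1} {\hat{π}}^{*} .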
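The crossed-model transition matrix can also be obtained by brute-force enumeration of the card draws, which offers a simple check on the algebra. The sketch below is an illustration under assumed values p = 0.8 and q = 0.7; the function name is hypothetical. For every true pair (Y1, Y2) it loops over the two possible cards in each deck and accumulates the probability of the resulting observed pair.

```python
import itertools
import numpy as np

def crossed_transition(p, q):
    """Transition matrix P_c of the crossed model, built by enumeration.

    Rows index the observed pair (Y1*, Y2*) and columns the true pair (Y1, Y2),
    both in the order 00, 01, 10, 11.
    """
    P_c = np.zeros((4, 4))
    for y1, y2 in itertools.product((0, 1), repeat=2):
        col = 2 * y1 + y2
        # Deck I: statement A (prob. p) gives answer y1; statement B^c gives 1 - y2.
        deck1 = [(p, y1), (1.0 - p, 1 - y2)]
        # Deck II: statement B (prob. q) gives answer y2; statement A^c gives 1 - y1.
        deck2 = [(q, y2), (1.0 - q, 1 - y1)]
        for (w1, a1), (w2, a2) in itertools.product(deck1, deck2):
            P_c[2 * a1 + a2, col] += w1 * w2
    return P_c

p, q = 0.8, 0.7                      # assumed design parameters
P_c = crossed_transition(p, q)
print(np.round(P_c, 3))
print(np.linalg.det(P_c))
```

The enumeration gives p(1 − q) in the third row of the last column, and every column sums to one. Because the second and third columns are unit vectors, the determinant reduces to (p + q − 1)[pq + (1 − p)(1 − q)], so the matrix can be inverted whenever p + q differs from 1.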

ML Estimation for RR Bivariate Logistic Regression Model

We consider the multivariate logistic regression model proposed by Glonek and McCullagh (1995), and, following Van den Hout et al. (2007), we adapt it to the bivariate model with RR data collected using the simple and crossed models previously discussed. Preliminarily, it is worth remarking that, due to the randomization procedure which misclassifies the dependent variable and introduces artificial variability, it is not possible to employ standard regression estimation techniques as in the situation where collected data are not subject to RR. ML estimation requires that the likelihood function be modified in order to incorporate the randomization design. Hence, maximization can be achieved following the usual iterative procedures.

For two latent binary RR variables $Y_{1}$ and $Y_{2}$, the bivariate logistic regression model may be described in the context of the well-known family of generalized linear models through the link functions:

η_{Y_{1}} = log \frac{π_{1 +}}{1 - π_{1 +}}, η_{Y_{2}} = log \frac{π_{+ 1}}{1 - π_{+ 1}}, and η_{Y_{12}} = log \frac{π_{00} π_{11}}{π_{01} π_{10}},

where $π_{1 +} = π_{11} + π_{10}$ and $π_{+ 1} = π_{11} + π_{01}$. The log odds ratio $η_{Y_{12}}$ is used to model the dependence between $Y_{1}$ and $Y_{2}$. Now, if $X_{Y_{1}}$, $X_{Y_{2}}$, and $X_{Y_{12}}$ denote covariate vectors, then the logistic regression equations are given by:

η_{Y_{1}} = β_{Y_{1}}^{t} X_{Y_{1}}, η_{Y_{2}} = β_{Y_{2}}^{t} X_{Y_{2}}, and η_{Y_{12}} = β_{Y_{12}}^{t} X_{Y_{12}},

where $β = (β_{Y_{1}}^{t}, β_{Y_{2}}^{t}, β_{Y_{12}}^{t})^{t}$ denotes the r-dimensional vector of regression coefficients to be estimated. Given $η = (0, η_{Y_{1}}, η_{Y_{2}}, η_{Y_{12}})^{t}$, with the first element 0 representing the null contrast $log (π_{+ +}) = 0$, where $π_{+ +} = π_{00} + π_{01} + π_{10} + π_{11} = 1$, in a standard setting without any RR design, the link in the bivariate model between the linear predictor and the expected vector is:

η = C^{t} log (L π), \qquad (4)

where C is the contrast matrix and L is the marginal indicator matrix, given by:

C^{t} = (\begin{matrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & - 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & - 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & - 1 & - 1 & 1 \end{matrix}), L^{t} = (\begin{matrix} 1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 \end{matrix}) .

Note that equation (4) is a marginal model in the sense that it implies univariate logistic models for both $Y_{1}$ and $Y_{2}$ marginally (Glonek and McCullagh 1995). ML estimates of the vector $β$ may be obtained as described in Glonek and McCullagh (1995). Nonetheless, when the model has to be estimated in an RR framework, the method must be adapted to take into account the misclassification induced by the RR design. For the purposes of this article, and following Van den Hout et al. (2007), the link described in equation (4) becomes:

η = C^{t} log (L P_{d}^{- 1} π^{*}),

where $P_{d}$ denotes the transition matrix for the simple design ($d = s$) or the crossed design ($d = c$).
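To make the roles of C and L concrete, the following sketch evaluates the link for a hypothetical vector of cell probabilities and then repeats the calculation starting from the observable probabilities under a simple design. The matrices are transcribed from the display above; the probability vector, the design parameters (0.8, 0.7), and the function names are assumptions made for illustration.

```python
import numpy as np

# Contrast matrix C^t (4 x 9) and marginal indicator matrix L^t (4 x 9).
C_t = np.array([
    [1,  0, 0,  0, 0, 0,  0,  0, 0],
    [0, -1, 1,  0, 0, 0,  0,  0, 0],
    [0,  0, 0, -1, 1, 0,  0,  0, 0],
    [0,  0, 0,  0, 0, 1, -1, -1, 1],
], dtype=float)
L_t = np.array([
    [1, 1, 0, 1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 1, 0, 1, 0, 0, 0, 1],
], dtype=float)
L = L_t.T  # 9 x 4: total, four marginal sums, and the four cells

def link(pi):
    """eta = C^t log(L pi) for pi = (pi00, pi01, pi10, pi11)."""
    return C_t @ np.log(L @ pi)

def rr_link(pi_star, P_d):
    """eta = C^t log(L P_d^{-1} pi*), starting from observable probabilities."""
    return link(np.linalg.solve(P_d, pi_star))

pi = np.array([0.55, 0.15, 0.20, 0.10])          # hypothetical true probabilities
eta = link(pi)

def warner(r):
    return np.array([[r, 1.0 - r], [1.0 - r, r]])

P_s = np.kron(warner(0.8), warner(0.7))          # simple design, assumed (p, q)
eta_rr = rr_link(P_s @ pi, P_s)
print(np.round(eta, 4), np.round(eta_rr, 4))
```

The first component is log(1) = 0, the second and third are the logits of the marginal probabilities 0.30 and 0.25, and the last is the log odds ratio log(0.055 / 0.03). Applying the RR-adjusted link to the exact observable probabilities returns the same vector.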

Given a sample of n selected units, the bivariate logistic regression model for the ith observation, $i = 1, 2, \dots, n$, is formulated as $η_{i} = X_{i} β$, where $X_{i}$ is the matrix:

X_{i} = (\begin{matrix} 0 & 0 & 0 \\ X_{Y_{1 i}}^{t} & 0 & 0 \\ 0 & X_{Y_{2 i}}^{t} & 0 \\ 0 & 0 & X_{Y_{12 i}}^{t} \end{matrix}) .

Note that the elements of the first row of $X_{i}$ are 0 because the first element of $η_{i}$ is always 0. Moreover, let $Y_{1 i}^{*}$ and $Y_{2 i}^{*}$ be the two binary observed variables for the ith sample unit, and assume $y_{i}^{*}$ is a vector from a multinomial distribution with parameter $π^{*} = {(π_{00}^{*}, π_{01}^{*}, π_{10}^{*}, π_{11}^{*})}^{t}$. Assume also that the available data are given by independent observations $(y_{i}^{*}, X_{i})$, where $y_{i}^{*} \in \{{(1, 0, 0, 0)}^{t}, {(0, 1, 0, 0)}^{t}, {(0, 0, 1, 0)}^{t}, {(0, 0, 0, 1)}^{t}\}$. Note that if $Y_{1 i}^{*} = 0$ and $Y_{2 i}^{*} = 0$, then $y_{i}^{*} = {(1, 0, 0, 0)}^{t}$; if $Y_{1 i}^{*} = 1$ and $Y_{2 i}^{*} = 1$, then $y_{i}^{*} = {(0, 0, 0, 1)}^{t}$; and similarly for the others. Hence, ML estimates of $β$ are obtained by maximizing, in a two-step procedure (see, e.g., Van den Hout et al. 2007), the log-likelihood function:

l (β) = \sum_{i = 1}^{n} {(y_{i}^{*})}^{t} log (π^{*}) = \sum_{i = 1}^{n} {(y_{i}^{*})}^{t} log (P_{d} π) .
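The log-likelihood can be evaluated directly once the linear predictors are mapped back to the cell probabilities. The sketch below is not the two-step procedure referred to above: as a simplifying assumption it inverts the link in closed form, solving the quadratic equation that the odds ratio and the two marginals impose on $π_{11}$, and then sums the contributions $(y_{i}^{*})^{t} log (P_{d} π_{i})$. All function and argument names are hypothetical.

```python
import numpy as np

def joint_probs(eta1, eta2, eta12):
    """Map (eta_Y1, eta_Y2, eta_Y12) to pi = (pi00, pi01, pi10, pi11)."""
    a = 1.0 / (1.0 + np.exp(-eta1))      # pi_{1+} = P(Y1 = 1)
    b = 1.0 / (1.0 + np.exp(-eta2))      # pi_{+1} = P(Y2 = 1)
    psi = np.exp(eta12)                  # odds ratio
    if abs(psi - 1.0) < 1e-10:
        p11 = a * b                      # independence
    else:
        # Root of (psi - 1) x^2 - [1 + (a + b)(psi - 1)] x + psi a b = 0
        s = 1.0 + (a + b) * (psi - 1.0)
        p11 = (s - np.sqrt(s * s - 4.0 * psi * (psi - 1.0) * a * b)) / (2.0 * (psi - 1.0))
    return np.array([1.0 - a - b + p11, b - p11, a - p11, p11])

def rr_loglik(beta, X1, X2, X12, y_star, P_d):
    """RR log-likelihood: sum_i y_i*^t log(P_d pi_i).

    beta stacks the coefficients for Y1, Y2, and the association term;
    X1, X2, X12 are the corresponding n-row design matrices;
    y_star is an n x 4 array of one-hot observed response pairs.
    """
    k1, k2 = X1.shape[1], X2.shape[1]
    eta1 = X1 @ beta[:k1]
    eta2 = X2 @ beta[k1:k1 + k2]
    eta12 = X12 @ beta[k1 + k2:]
    total = 0.0
    for i in range(len(y_star)):
        pi_i = joint_probs(eta1[i], eta2[i], eta12[i])
        total += y_star[i] @ np.log(P_d @ pi_i)
    return total

# Intercept-only check with no randomization (P_d = identity).
ones = np.ones((2, 1))
beta0 = np.array([np.log(0.30 / 0.70), np.log(0.25 / 0.75), np.log(11.0 / 6.0)])
y_obs = np.array([[1, 0, 0, 0], [0, 0, 0, 1]], dtype=float)
ll = rr_loglik(beta0, ones, ones, ones, y_obs, np.eye(4))
print(ll)
```

In the intercept-only check the coefficients correspond to cell probabilities (0.55, 0.15, 0.20, 0.10), so the two observations contribute log(0.55) + log(0.10). The negative of this function could be handed to a general-purpose numerical optimizer; standard errors would then come from the inverse of the observed information.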

Simulation

In this section, we present the results of a Monte Carlo simulation study carried out to investigate the finite-sample performance of the bivariate logistic regression estimates under the simple and crossed models. For each configuration of the simulation experiments, we use sample sizes of $n = 300$ and $n = 500$, replicate each scenario 1,000 times, and compute, by averaging over the simulation runs, the bias of the estimators, the percentage relative bias (PRB), the asymptotic standard error (ASE), and the coverage probability (CP) of a 95 percent confidence interval for $β$.
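The performance measures can be computed from the replicate estimates as in the sketch below. The definitions used here are assumptions: bias as the mean estimate minus the true value, PRB as 100 times the bias divided by the true value (which agrees with the tabulated figures, e.g., a bias of 0.074 on a true value of −0.5 gives about −14.9), ASE as the mean of the estimated standard errors, and CP as the share of Wald intervals (estimate ± 1.96 standard errors) that contain the true value. The function name and the toy inputs are illustrative.

```python
import numpy as np

def summarize_replicates(estimates, std_errors, true_value, z=1.96):
    """Summarize Monte Carlo replicates for a single regression coefficient."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    mean_est = estimates.mean()
    bias = mean_est - true_value
    prb = 100.0 * bias / true_value          # percentage relative bias
    ase = std_errors.mean()                  # average estimated standard error
    # A Wald interval covers the truth when |estimate - truth| <= z * se.
    covered = np.abs(estimates - true_value) <= z * std_errors
    return {"estimate": mean_est, "bias": bias, "prb": prb,
            "ase": ase, "cp": covered.mean()}

# Toy input: three replicates of a coefficient whose true value is 0.3.
summary = summarize_replicates([0.2, 0.4, 0.6], [0.1, 0.1, 0.1], true_value=0.3)
print(summary)
```

In the toy input two of the three intervals contain 0.3, so the coverage is 2/3, and the mean estimate of 0.4 gives a bias of 0.1 and a PRB of about 33.3.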

We consider the case with three covariates, say $X_{1}$, $X_{2}$, and $X_{3}$, which are binary variables with $P (X_{1} = 1) = 0.45$, $P (X_{2} = 1) = 0.5$, and $P (X_{3} = 1) = 0.6$. Hence, we present the results for $X_{Y_{1 i}}^{t} = X_{Y_{2 i}}^{t} = (1, X_{1 i}, X_{2 i}, X_{3 i})$, $X_{Y_{12 i}}^{t} = (1)$, according to which $X_{i}$ is a $4 \times 9$ matrix. Finally, we assume $β = (β_{Y_{1, 0}}, β_{Y_{1, 1}}, β_{Y_{1, 2}}, β_{Y_{1, 3}}, β_{Y_{2, 0}}, β_{Y_{2, 1}}, β_{Y_{2, 2}}, β_{Y_{2, 3}}, β_{Y_{12, 0}})^{t} = (- 0.5, 0.3, - 0.65, 0.6, - 1, 0.2, 0.7, - 0.5, - 0.5)^{t}$. Based on $C^{t} log (L π_{i}) = X_{i} β$, we use a Newton–Raphson iteration to obtain $π_{i}$ and generate the two binary variables $Y_{1 i}$ and $Y_{2 i}$. Then, the binary outcomes $Y_{1 i}^{*}$ and $Y_{2 i}^{*}$ are generated according to the following situation: $P (Y_{1 i}^{*} = Y_{1 i}) = p$ and $P (Y_{2 i}^{*} = Y_{2 i}) = q$, where $(p, q) = (0.9, 0.9)$, $(0.8, 0.8)$, $(0.8, 0.7)$, and $(0.7, 0.7)$. Details on the choice of the initial guess for $β$ and $π$ to start the maximization procedure, and the stopping criteria for the convergence of the estimation algorithm, are reported in Online Appendix (which can be found at http://smr.sagepub.com/supplemental/).
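The data-generating steps can be sketched as follows. This is an illustrative reconstruction rather than the simulation code used for the tables: instead of the Newton–Raphson iteration it inverts the link in closed form (solving the quadratic that the odds ratio imposes on $π_{11}$), and it applies the randomization through the statement-selection rules of the two designs. The function names, the seed, and the return layout are assumptions.

```python
import numpy as np

def joint_probs(eta1, eta2, eta12):
    """Cell probabilities (pi00, pi01, pi10, pi11) from the three linear predictors."""
    a = 1.0 / (1.0 + np.exp(-eta1))
    b = 1.0 / (1.0 + np.exp(-eta2))
    psi = np.exp(eta12)
    if abs(psi - 1.0) < 1e-10:
        p11 = a * b
    else:
        s = 1.0 + (a + b) * (psi - 1.0)
        p11 = (s - np.sqrt(s * s - 4.0 * psi * (psi - 1.0) * a * b)) / (2.0 * (psi - 1.0))
    return np.array([1.0 - a - b + p11, b - p11, a - p11, p11])

def simulate_rr(n, beta1, beta2, beta12, p, q, design, rng):
    """Generate covariates, latent (Y1, Y2), and randomized (Y1*, Y2*)."""
    # Intercept plus three binary covariates with success probabilities 0.45, 0.5, 0.6.
    X = np.column_stack([np.ones(n), rng.random(n) < 0.45,
                         rng.random(n) < 0.5, rng.random(n) < 0.6]).astype(float)
    eta1, eta2 = X @ beta1, X @ beta2
    y1 = np.empty(n, dtype=int)
    y2 = np.empty(n, dtype=int)
    for i in range(n):
        cell = rng.choice(4, p=joint_probs(eta1[i], eta2[i], beta12))
        y1[i], y2[i] = divmod(cell, 2)      # cells are ordered 00, 01, 10, 11
    s1 = rng.random(n) < p                  # deck I shows its first statement (A)
    s2 = rng.random(n) < q                  # deck II shows its first statement (B)
    if design == "simple":
        y1_star = np.where(s1, y1, 1 - y1)  # otherwise statement A^c
        y2_star = np.where(s2, y2, 1 - y2)  # otherwise statement B^c
    elif design == "crossed":
        y1_star = np.where(s1, y1, 1 - y2)  # otherwise statement B^c
        y2_star = np.where(s2, y2, 1 - y1)  # otherwise statement A^c
    else:
        raise ValueError("design must be 'simple' or 'crossed'")
    return X, y1, y2, y1_star, y2_star

beta1 = np.array([-0.5, 0.3, -0.65, 0.6])
beta2 = np.array([-1.0, 0.2, 0.7, -0.5])
rng = np.random.default_rng(2024)           # arbitrary seed
X, y1, y2, y1_star, y2_star = simulate_rr(500, beta1, beta2, -0.5, 0.8, 0.7, "crossed", rng)
print(X.shape, y1_star.mean(), y2_star.mean())
```

Setting p = q = 1 switches the randomization off, so the observed responses coincide with the latent ones under either design; this provides a quick sanity check before running the full set of replications.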

The simulation findings for the crossed model and simple model are reported in Table 1 for $n = 500$ and in Table 2 for $n = 300$. Although the outcomes appear slightly different for the two RR designs, they share some common features. In general, the results show the good performance of the ML estimates, which improves as the sample size increases. All in all, the ASE increases as the RR design parameters p and q decrease. The ASE shows an evident monotonic behavior for the crossed model across the four simulation cases and under the two sample sizes. For the simple model, the increase in the ASE may not be so evident and, in some cases, the ASE can decrease slightly with p and q (see, e.g., $β_{Y_{2, 1}}, β_{Y_{2, 2}}$, and $β_{Y_{2, 3}}$ across case 3 and case 4 in Table 2). However, the behavior of the ASE is hardly surprising since, when the design parameters decrease, the degree of misclassification of the true responses increases and, consequently, estimates become less accurate. This is the price paid for increasing respondent cooperation and thereby reducing nonresponse rates and deliberately misleading responses. We note that the minimum ASE is achieved when $p = q = 1$, meaning that the responses are not randomized at all. Since the ASE increases as p and q decrease, the CP tends to worsen, taking values that tend to exceed the nominal level. Moreover, in all the considered cases, the ASE does not appear so regular, and it certainly fluctuates more under the simple model. However, by increasing the sample size, ceteris paribus, the ASE decreases and the coverage improves, approaching, as expected, the nominal 0.95 level. With regard to the bias, no definitive conclusions can be drawn, in the sense that it appears to fluctuate across the four simulation situations.
The PRB tends to decrease as the sample size increases, assuming values that are almost always negligible for the crossed model while they sometimes appear striking under the simple model.

Table 1.

Simulation Results for the Crossed Model and Simple Model for n = 500.

Parameters	Crossed Model					Simple Model
Parameters	Estimate	Bias	PRB	ASE	CP	Estimate	Bias	PRB	ASE	CP
Case 1: $(p, q) = (0.9, 0.9)$
$β_{Y_{1, 0}}$	−0.501	−0.001	0.111	0.218	0.957	−0.502	−0.002	0.366	0.248	0.952
$β_{Y_{1, 1}}$	0.306	0.006	2.016	0.208	0.958	0.306	0.006	1.964	0.239	0.950
$β_{Y_{1, 2}}$	−0.658	−0.008	1.192	0.209	0.952	−0.658	−0.008	1.281	0.239	0.951
$β_{Y_{1, 3}}$	0.602	0.002	0.306	0.216	0.944	0.606	0.006	1.030	0.248	0.950
$β_{Y_{2, 0}}$	−1.007	−0.007	0.724	0.237	0.960	−1.005	−0.005	0.532	0.275	0.968
$β_{Y_{2, 1}}$	0.197	−0.003	−1.714	0.225	0.958	0.188	−0.012	−6.227	0.261	0.959
$β_{Y_{2, 2}}$	0.712	0.012	1.717	0.228	0.951	0.714	0.014	1.953	0.266	0.959
$β_{Y_{2, 3}}$	−0.502	−0.002	0.387	0.226	0.944	−0.502	−0.002	0.375	0.262	0.948
$β_{Y_{12, 0}}$	−0.516	−0.016	3.298	0.257	0.955	−0.517	−0.017	3.321	0.366	0.957
Case 2: $(p, q) = (0.8, 0.8)$
$β_{Y_{1, 0}}$	−0.499	0.001	−0.104	0.250	0.960	−0.499	0.001	−0.103	0.339	0.960
$β_{Y_{1, 1}}$	0.308	0.008	2.680	0.238	0.958	0.302	0.002	0.572	0.325	0.956
$β_{Y_{1, 2}}$	−0.662	−0.012	1.899	0.239	0.950	−0.658	−0.008	1.197	0.326	0.958
$β_{Y_{1, 3}}$	0.604	0.004	0.718	0.248	0.956	0.605	0.005	0.841	0.339	0.959
$β_{Y_{2, 0}}$	−1.010	−0.010	1.012	0.277	0.967	−0.993	0.007	−0.695	0.381	0.975
$β_{Y_{2, 1}}$	0.198	−0.002	−0.956	0.261	0.962	0.186	−0.014	−6.774	0.359	0.970
$β_{Y_{2, 2}}$	0.711	0.011	1.524	0.266	0.958	0.703	0.003	0.475	0.370	0.968
$β_{Y_{2, 3}}$	−0.501	−0.001	0.268	0.262	0.957	−0.502	−0.002	0.373	0.361	0.959
$β_{Y_{12, 0}}$	−0.525	−0.025	5.060	0.309	0.954	−0.426	0.074	−14.888	0.697	0.979
Case 3: $(p, q) = (0.8, 0.7)$
$β_{Y_{1, 0}}$	−0.491	0.009	−1.879	0.256	0.956	−0.489	0.011	−2.285	0.337	0.969
$β_{Y_{1, 1}}$	0.303	0.003	1.047	0.243	0.958	0.287	−0.013	−4.427	0.324	0.960
$β_{Y_{1, 2}}$	−0.666	−0.016	2.432	0.244	0.951	−0.646	0.004	−0.566	0.325	0.961
$β_{Y_{1, 3}}$	0.604	0.004	0.714	0.253	0.951	0.600	0.000	−0.050	0.338	0.963
$β_{Y_{2, 0}}$	−0.995	0.005	−0.545	0.329	0.972	−0.892	0.108	−10.756	0.583	0.971
$β_{Y_{2, 1}}$	0.187	−0.013	−6.701	0.308	0.962	0.156	−0.044	−22.033	0.550	0.985
$β_{Y_{2, 2}}$	0.711	0.011	1.571	0.315	0.964	0.616	−0.084	−11.951	0.565	0.969
$β_{Y_{2, 3}}$	−0.509	−0.009	1.731	0.309	0.969	−0.470	0.030	−6.092	0.554	0.970
$β_{Y_{12, 0}}$	−0.518	−0.018	3.553	0.340	0.971	−0.264	0.236	−47.137	1.043	1.000
Case 4: $(p, q) = (0.7, 0.7)$
$β_{Y_{1, 0}}$	−0.471	0.029	−5.704	0.307	0.959	−0.446	0.054	−10.740	0.522	0.972
$β_{Y_{1, 1}}$	0.292	−0.008	−2.647	0.291	0.957	0.249	−0.051	−16.930	0.498	0.975
$β_{Y_{1, 2}}$	−0.670	−0.020	3.121	0.293	0.960	−0.609	0.041	−6.249	0.501	0.976
$β_{Y_{1, 3}}$	0.602	0.002	0.415	0.304	0.962	0.552	−0.048	−7.930	0.522	0.973
$β_{Y_{2, 0}}$	−0.978	0.022	−2.239	0.346	0.971	−0.891	0.109	−10.858	0.583	0.963
$β_{Y_{2, 1}}$	0.184	−0.016	−8.043	0.324	0.965	0.161	−0.039	−19.297	0.548	0.987
$β_{Y_{2, 2}}$	0.703	0.003	0.372	0.332	0.963	0.594	−0.106	−15.189	0.562	0.965
$β_{Y_{2, 3}}$	−0.505	−0.005	1.081	0.325	0.964	−0.436	0.064	−12.722	0.551	0.974
$β_{Y_{12, 0}}$	−0.506	−0.006	1.270	0.375	0.976	−0.155	0.345	−68.955	1.607	1.000

Note: PRB = percentage relative bias; ASE = asymptotic standard error; CP = coverage probability.

Table 2.

Simulation Results for the Crossed Model and Simple Model for n = 300.

Parameters	Crossed Model					Simple Model
Parameters	Estimate	Bias	PRB	ASE	CP	Estimate	Bias	PRB	ASE	CP
Case 1: $(p, q) = (0.9, 0.9)$
$β_{Y_{1, 0}}$	−0.502	−0.002	0.339	0.283	0.964	−0.495	0.005	−1.022	0.324	0.958
$β_{Y_{1, 1}}$	0.302	0.002	0.526	0.271	0.960	0.301	0.001	0.338	0.311	0.956
$β_{Y_{1, 2}}$	−0.667	−0.017	2.563	0.271	0.960	−0.670	−0.020	3.064	0.311	0.958
$β_{Y_{1, 3}}$	0.608	0.008	1.350	0.281	0.963	0.610	0.010	1.615	0.323	0.961
$β_{Y_{2, 0}}$	−1.014	−0.014	1.353	0.310	0.962	−1.005	−0.005	0.471	0.359	0.963
$β_{Y_{2, 1}}$	0.184	−0.016	−7.819	0.294	0.955	0.174	−0.026	−12.980	0.341	0.962
$β_{Y_{2, 2}}$	0.713	0.013	1.870	0.298	0.963	0.702	0.002	0.250	0.348	0.961
$β_{Y_{2, 3}}$	−0.505	−0.005	1.042	0.295	0.943	−0.507	−0.007	1.456	0.342	0.965
$β_{Y_{12, 0}}$	−0.520	−0.020	4.094	0.337	0.951	−0.481	0.019	−3.845	0.479	0.971
Case 2: $(p, q) = (0.8, 0.8)$
$β_{Y_{1, 0}}$	−0.497	0.003	−0.662	0.327	0.970	−0.471	0.029	−5.720	0.443	0.967
$β_{Y_{1, 1}}$	0.296	−0.004	−1.351	0.310	0.953	0.286	−0.014	−4.774	0.424	0.955
$β_{Y_{1, 2}}$	−0.669	−0.019	2.853	0.312	0.951	−0.648	0.002	−0.236	0.426	0.963
$β_{Y_{1, 3}}$	0.608	0.008	1.336	0.324	0.958	0.579	−0.021	−3.464	0.443	0.959
$β_{Y_{2, 0}}$	−0.998	0.002	−0.216	0.363	0.963	−0.959	0.041	−4.122	0.496	0.960
$β_{Y_{2, 1}}$	0.176	−0.024	−12.093	0.342	0.957	0.166	−0.034	−17.139	0.468	0.978
$β_{Y_{2, 2}}$	0.706	0.006	0.809	0.348	0.971	0.640	−0.060	−8.635	0.479	0.976
$β_{Y_{2, 3}}$	−0.515	−0.015	3.086	0.343	0.966	−0.468	0.032	−6.484	0.470	0.962
$β_{Y_{12, 0}}$	−0.497	0.003	−0.613	0.405	0.967	−0.294	0.206	−41.296	0.897	0.990
Case 3: $(p, q) = (0.8, 0.7)$
$β_{Y_{1, 0}}$	−0.477	0.023	−4.625	0.333	0.969	−0.470	0.030	−6.068	0.443	0.970
$β_{Y_{1, 1}}$	0.285	−0.015	−5.148	0.317	0.962	0.302	0.002	0.604	0.424	0.965
$β_{Y_{1, 2}}$	−0.670	−0.020	3.082	0.318	0.956	−0.648	0.002	−0.305	0.426	0.967
$β_{Y_{1, 3}}$	0.605	0.005	0.795	0.330	0.966	0.550	−0.050	−8.390	0.442	0.955
$β_{Y_{2, 0}}$	−0.965	0.035	−3.507	0.429	0.959	−0.819	0.181	−18.118	0.746	0.969
$β_{Y_{2, 1}}$	0.162	−0.038	−18.879	0.401	0.962	0.167	−0.033	−16.696	0.706	0.990
$β_{Y_{2, 2}}$	0.700	0.000	0.057	0.411	0.979	0.565	−0.135	−19.233	0.718	0.980
$β_{Y_{2, 3}}$	−0.516	−0.016	3.110	0.403	0.968	−0.401	0.099	−19.800	0.711	0.993
$β_{Y_{12, 0}}$	−0.487	0.013	−2.594	0.444	0.974	−0.113	0.387	−77.398	1.339	1.000
Case 4: $(p, q) = (0.7, 0.7)$
$β_{Y_{1, 0}}$	−0.443	0.057	−11.484	0.401	0.964	−0.403	0.097	−19.442	0.685	0.981
$β_{Y_{1, 1}}$	0.279	−0.021	−6.987	0.380	0.958	0.269	−0.031	−10.464	0.655	0.987
$β_{Y_{1, 2}}$	−0.662	−0.012	1.883	0.381	0.958	−0.548	0.102	−15.626	0.658	0.978
$β_{Y_{1, 3}}$	0.591	−0.009	−1.575	0.397	0.964	0.455	−0.145	−24.146	0.681	0.975
$β_{Y_{2, 0}}$	−0.925	0.075	−7.484	0.448	0.957	−0.770	0.230	−22.950	0.739	0.971
$β_{Y_{2, 1}}$	0.167	−0.033	−16.310	0.420	0.972	0.176	−0.024	−12.107	0.703	0.991
$β_{Y_{2, 2}}$	0.685	−0.015	−2.136	0.430	0.987	0.485	−0.215	−30.686	0.710	0.969
$β_{Y_{2, 3}}$	−0.520	−0.020	4.053	0.422	0.964	−0.398	0.102	−20.337	0.709	0.993
$β_{Y_{12, 0}}$	−0.469	0.031	−6.287	0.484	0.971	−0.023	0.477	−95.476	2.042	1.000

Note: PRB = percentage relative bias; ASE = asymptotic standard error; CP = coverage probability.

Comparing the two RR designs, the simple model appears to be less efficient, especially as the values of p and q and the sample size decrease. Estimates under the crossed model show smaller ASEs and tend to have smaller bias, especially for $n = 300$, where the differences are more evident. The CP also seems more reliable and closer to the nominal value. In conclusion, the crossed design seems to perform best, at least in the experimental settings considered in the simulation study. Hence, motivated by the apparent superiority of the crossed model, in the following sections we focus on the use of this RR design in two real surveys. A further reason is that, to the best of our knowledge, no application of the simple model has yet been published and, consequently, no RR data are available for empirical analyses.
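The summary measures reported in the simulation tables can be computed from the Monte Carlo replicates along the following lines. This is only a minimal sketch: the function name `summarize_replicates`, the toy replicates, and the choice of computing the ASE as the average of the estimated standard errors and the CP from Wald 95 percent intervals are assumptions made for illustration, not a description of the code used in the study.

```python
def summarize_replicates(true_value, estimates, std_errors, z=1.96):
    """Summarize Monte Carlo replicates of a single regression coefficient."""
    n_rep = len(estimates)
    mean_estimate = sum(estimates) / n_rep
    bias = mean_estimate - true_value
    prb = 100.0 * bias / true_value      # percentage relative bias
    ase = sum(std_errors) / n_rep        # assumption: average estimated standard error
    # Share of replicates whose Wald interval contains the true value
    covered = sum(
        1 for b, s in zip(estimates, std_errors)
        if b - z * s <= true_value <= b + z * s
    )
    return {
        "estimate": mean_estimate,
        "bias": bias,
        "prb": prb,
        "ase": ase,
        "cp": covered / n_rep,
    }


# Hypothetical replicates for a coefficient whose true value is -1.0
summary = summarize_replicates(-1.0, [-1.20, -0.90, -0.93], [0.30, 0.30, 0.30])
print(summary)
```

With this convention, a negative bias on a negative coefficient gives a positive PRB, which matches the sign pattern seen in the tables.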

Two Real Studies

We apply RR logistic regression to real data collected by means of the crossed model in order to evaluate the feasibility of the method in estimating the determinants of two sensitive attributes. Data from two small-scale surveys are analyzed. Both studies are face-to-face surveys with interviewer-administered paper-and-pencil questionnaires. The first survey considers the use of cannabis for recreational purposes and its legalization. Conceived as a pilot study, it was designed ad hoc to collect both RR and DQ data and to compare logistic regression estimates under the two question survey modes in order to check whether the impact of factors determining the response behaviors varies across the two survey approaches. The study has, hence, intrinsic value in terms of comparative validation and can contribute to the debate regarding whether DQ surveys are reasonably accurate and useful for estimating the determinants of illegal drug use (see, e.g., Kerkvliet 1994). The second survey is related to the study discussed in Perri et al. (2016) on induced abortion and illegal immigration. Here, only RR data are considered and analyzed.

It is worth pointing out right away that the two applications are not intended as sociological and demographic studies aimed at discovering facts and providing interpretations and motivations of the observed findings. Nonetheless, they may offer the experts in the research fields some useful elements for further investigation. The next two sections describe the survey designs and present the results of the bivariate logistic regression analyses.

Cannabis Use and Its Legalization

The survey was conducted in a municipality of about 5,000 inhabitants located in Southern Italy. The motivating idea of the study is twofold: (1) to estimate the prevalence of individuals who have used cannabis at least once in their life and of those who are in favor of its legalization (either for therapeutic or recreational purposes) and (2) to show whether the impact of individual determinants differs across data collection modes. To this end, data were collected on the same people with both the RR crossed model and the DQ survey format. Following the crossed model, the randomization device was implemented exactly as previously described by means of two decks of cards: deck I contained a proportion $p = 0.6$ of cards marked with the statement “I have used cannabis at least once in my life” and a proportion $1 - p = 0.4$ of cards with the statement “I am not in favor of cannabis legalization,” while deck II contained a proportion $q = 0.6$ of cards reporting the statement “I am in favor of cannabis legalization” and a proportion $1 - q = 0.4$ of cards with the statement “I have never used cannabis in my life.” The DQ survey was carried out by face-to-face interview, with two questions posed to the respondents: “Have you used cannabis at least once in your life?” and “Are you in favor of cannabis legalization?”

The fieldwork was carried out by a single interviewer who recruited a voluntary sample of respondents on the basis of personal contacts. Participants first took part in a face-to-face interview based on a short, standardized paper-and-pencil questionnaire collecting general sociodemographic information on gender, age, education, employment status, marital status, and number of children. Then, once a confidential atmosphere had been established, interviewees were provided with the two decks of cards in order to implement the RR crossed device. The interviewer asked the respondents to shuffle the cards, draw one card from each deck, read the statements on the selected cards without revealing them, and report with a “yes” or “no” whether their status matched the statements on the cards. Hence, each respondent provided just one of the possible pairs of responses: (no, no), (no, yes), (yes, no), or (yes, yes). Before embarking on the RR survey, the interviewer was instructed to check with an example whether the survey participants fully understood the rules of the randomization device. If doubts or questions were raised about the correct execution of the experiment, the interviewer was instructed to explain the rules again and to invite the participants to withdraw if any doubts remained after the third explanation or if they were unable to implement the technique properly. After the crossed model was run and the responses were collected, the interviewer posed the two sensitive questions directly. At the end of the period scheduled for data collection, the analyzable cases comprised 289 participants, aged 16–60 years, who gave all the information required during the face-to-face interview.

In order to perform our analyses, let Y ₁ denote the observed answer to the question “Have you used cannabis at least once in your life?” and Y ₂ the observed answer to the question “Are you in favor of cannabis legalization?” with $Y_{j} = 1$ in case of a “yes” response and $Y_{j} = 0$ otherwise, for $j = 1, 2$ . Likewise, under the RR crossed model, let $Y_{1}^{*}$ be the response to one of the two cards from deck I and $Y_{2}^{*}$ be the response to one of the two cards from deck II, with $Y_{j}^{*} = 1$ in case of a “yes” response and $Y_{j}^{*} = 0$ otherwise.
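To make the link between the latent pair $(Y_{1}, Y_{2})$ and the observed pair $(Y_{1}^{*}, Y_{2}^{*})$ concrete, the following sketch maps a latent joint distribution into the distribution of the observed “yes”/“no” pairs for the card design described above. It rests on assumptions that are ours for illustration: the two cards are drawn independently, respondents answer truthfully, and a “yes” means that the respondent's status matches the statement on the card. The function name `observed_probs` and the input values are hypothetical.

```python
from itertools import product


def observed_probs(pi, p, q):
    """Map latent cell probabilities {(y1, y2): prob} to observed {(r1, r2): prob}.

    Deck I shows the statement on attribute 1 with probability p and the
    negated statement on attribute 2 otherwise; deck II shows the statement
    on attribute 2 with probability q and the negated statement on
    attribute 1 otherwise.
    """
    lam = {cell: 0.0 for cell in product((0, 1), repeat=2)}
    for (y1, y2), prob in pi.items():
        for first_is_direct, w1 in ((True, p), (False, 1.0 - p)):
            for second_is_direct, w2 in ((True, q), (False, 1.0 - q)):
                r1 = y1 if first_is_direct else 1 - y2    # answer to the deck I card
                r2 = y2 if second_is_direct else 1 - y1   # answer to the deck II card
                lam[(r1, r2)] += prob * w1 * w2
    return lam


# Illustrative latent distribution (hypothetical input) with p = q = 0.6
pi_example = {(0, 0): 0.225, (0, 1): 0.304, (1, 0): 0.090, (1, 1): 0.381}
lam_example = observed_probs(pi_example, 0.6, 0.6)
print(lam_example)
```

Setting $p = q = 1$ removes the randomization, so the observed distribution coincides with the latent one; smaller values of $p$ and $q$ increase privacy protection at the cost of noisier responses.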

Prevalence estimation without employing covariates

Using both types of collected data, we first obtain prevalence estimates of the investigated behaviors without employing covariates. Specifically, under DQ responses on Y ₁ and Y ₂, we get the prevalence estimate of $π = (π_{00}, π_{01}, π_{10}, π_{11})$ as:

${\hat{π}}^{D Q} = (0.318, 0.401, 0.028, 0.253).$

Consequently, the prevalence estimate of individuals who have used cannabis at least once in their life, say $P (Y_{1} = 1)$ , is ${\hat{π}}_{1 +}^{D Q} = {\hat{π}}_{10}^{D Q} + {\hat{π}}_{11}^{D Q} = 0.281$ . Similarly, we obtain a direct estimate of the proportion of individuals in favor of legalization, say $P (Y_{2} = 1)$ , as ${\hat{π}}_{+ 1}^{D Q} = {\hat{π}}_{01}^{D Q} + {\hat{π}}_{11}^{D Q} = 0.654$ . Using RR data collected by means of the crossed model, in line with equation (3), we obtain:

${\hat{π}}^{c} = (0.225, 0.304, 0.090, 0.381),$

and, consequently, the RR estimates of $P (Y_{1} = 1)$ and $P (Y_{2} = 1)$ are ${\hat{π}}_{1 +}^{c} = 0.471$ and ${\hat{π}}_{+ 1}^{c} = 0.685$, respectively. The results, summarized in Table 3, immediately show that the estimates produced with the RR crossed model for the prevalence of individuals bearing one or both of the sensitive attributes are higher than the DQ estimates. Consequently, the RR prevalence of individuals who have neither used cannabis nor favor its legalization is lower than that under DQ. Note that the estimate of $P (Y_{1} = 1)$ under RR is nearly 1.68 times the DQ estimate, while the estimates of $P (Y_{2} = 1)$ are quite high and close to each other under the two survey approaches. The latter result is perhaps unsurprising given that supporting cannabis legalization is not an illegal act, and questions about it are seen as less intrusive of the personal sphere than those about cannabis use, which, besides being a socially disapproved behavior, is also, and above all, punishable by law. Results for the entire sample indicate that respondents tend to underreport socially undesirable behaviors when asked sensitive questions directly. As a consequence, DQ data may yield misleading conclusions based on lower prevalence estimates of the sensitive behavior. In contrast, the RR survey seems to reduce underreporting and, therefore, leads to higher prevalence estimates. Hence, according to the “more-is-better assumption” (e.g., Höglinger and Jann 2018; Lensvelt-Mulders et al. 2005), the crossed model seems to be an effective method for eliciting sensitive information and yields more valid prevalence estimates regarding the use and legalization of cannabis than a DQ survey. The “more-is-better assumption” appears legitimate in this case. 
Nonetheless, since the social desirability bias that affects the estimates might differ between subgroups of the population, it is not unusual to encounter situations where this assumption does not hold. Some examples of the latter will be discussed below.
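The marginal prevalences and the RR-to-DQ ratio quoted above follow directly from the two estimated joint vectors. The short sketch below reproduces the arithmetic; the helper name `marginals` is ours, and the cell order $(π_{00}, π_{01}, π_{10}, π_{11})$ is the one used in the text.

```python
def marginals(pi):
    """Return (P(Y1 = 1), P(Y2 = 1)) from (pi00, pi01, pi10, pi11)."""
    pi00, pi01, pi10, pi11 = pi
    return pi10 + pi11, pi01 + pi11


pi_dq = (0.318, 0.401, 0.028, 0.253)   # DQ estimate reported in the text
pi_rr = (0.225, 0.304, 0.090, 0.381)   # RR crossed-model estimate reported in the text

dq_use, dq_legal = marginals(pi_dq)
rr_use, rr_legal = marginals(pi_rr)
ratio_use = rr_use / dq_use            # RR estimate relative to DQ for cannabis use

print(round(dq_use, 3), round(dq_legal, 3))
print(round(rr_use, 3), round(rr_legal, 3))
print(round(ratio_use, 2))
```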

Table 3.

Prevalence Estimates Under DQ and RR Survey Approaches Without Covariates.

Prevalence	DQ	RR Crossed Model
$P (Y_{1} = 1) = π_{1 +}$	0.281	0.471
$P (Y_{2} = 1) = π_{+ 1}$	0.654	0.685
$P (Y_{1} = 1, Y_{2} = 1) = π_{11}$	0.253	0.381
$P (Y_{1} = 0, Y_{2} = 0) = π_{00}$	0.318	0.225

Note: DQ = direct questioning; RR = randomized response.

Table 4 reports the distribution of the entire sample by sociodemographic characteristics and gives evidence on the association between the six individual characteristics and the two sensitive attributes under investigation through the chi-squared independence test. We observe that the two genders are almost equally represented (46.71 percent of respondents are female and 53.29 percent are male) and that 46.02 percent of respondents are aged 35–60 years. Most interviewees (66.44 percent) claimed to have a medium or high level of education (nine or more years of school), while the remainder have a low level (less than nine years). Turning to employment status, about 50 percent of interviewees have a job. Just over half of the respondents claimed to be married or cohabiting, and just over half claimed to have no children (52.25 percent and 52.60 percent, respectively).

Table 4.

Sociodemographic Characteristics of Respondents for DQ Data

Variable	n	Percent	Y ₁: $χ^{2}$ (p-value)	Y ₂: $χ^{2}$ (p-value)
Gender (Ge)
Female ( $= 0$ )	135	46.71	14.165 $(.000)$ ***	4.743 $(.029)$ **
Male ( $= 1$ )	154	53.29	14.165 $(.000)$ ***	4.743 $(.029)$ **
Education (Edu)
Less than nine years of school ( $= 0$ )	97	33.56	16.593 $(.000)$ ***	15.298 $(.000)$ ***
Nine or more years of school ( $= 1$ )	192	66.44	16.593 $(.000)$ ***	15.298 $(.000)$ ***
Employment status (Emp)
Not working ( $= 0$ )	145	50.17	11.847 $(.000)$ ***	3.284 $(.070)$ *
Working ( $= 1$ )	144	49.83	11.847 $(.000)$ ***	3.284 $(.070)$ *
Age
16–34 years ( $= 0$ )	156	53.98	0.532 $(.466)$	0.135 $(.714)$
35–60 years ( $= 1$ )	133	46.02	0.532 $(.466)$	0.135 $(.714)$
Marital status (Ms)
Single/divorced/separated ( $= 0$ )	138	47.75	1.004 $(.316)$	0.096 $(.757)$
Married/cohabiting ( $= 1$ )	151	52.25	1.004 $(.316)$	0.096 $(.757)$
Children (Ch)
No ( $= 0$ )	152	52.60	2.393 $(.122)$	1.029 $(.311)$
Yes ( $= 1$ )	137	47.40	2.393 $(.122)$	1.029 $(.311)$

Note: Chi-squared independence test of Y ₁ versus Y ₂: $χ^{2} = 28.906$ and $p$-value $< .001$. DQ = direct questioning.

* $p$-value $< .1$. ** $p$-value $< .05$. *** $p$-value $< .001$.

Some associations appear statistically significant. Specifically, the group of respondents who reported having used cannabis ( $Y_{1} = 1$ ) is significantly different from respondents who did not ( $Y_{1} = 0$ ) on variables Ge, Edu, and Emp. Similarly, respondents who support cannabis legalization ( $Y_{2} = 1$ ) significantly differ from respondents who are not in favor ( $Y_{2} = 0$ ) on variables Ge, Edu, and, to a lesser extent, Emp.
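As an illustration of the chi-squared independence test used in Table 4, the sketch below recomputes the test of Y ₁ versus Y ₂ from the DQ data. Two assumptions are made here and are not stated in the article: the cell counts are rebuilt by rounding the reported DQ proportions multiplied by $n = 289$, and the statistic is computed with the Yates continuity correction. Under these assumptions the result agrees with the value given in the note to Table 4. The function name `yates_chi2_2x2` is hypothetical.

```python
import math


def yates_chi2_2x2(a, b, c, d):
    """Yates-corrected chi-squared test for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * max(0.0, abs(a * d - b * c) - n / 2.0) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = numerator / denominator
    # Upper tail of a chi-squared variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


n = 289
pi_dq = (0.318, 0.401, 0.028, 0.253)   # (pi00, pi01, pi10, pi11) under DQ
# Assumption: counts recovered by rounding n times the reported proportions
n00, n01, n10, n11 = (round(n * p) for p in pi_dq)

chi2, p_value = yates_chi2_2x2(n00, n01, n10, n11)
print(n00, n01, n10, n11)
print(round(chi2, 3), p_value < 0.001)
```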

To examine the possible effects of these characteristics on the two sensitive behaviors further, we first consider two marginal univariate logistic regression models for Y ₁ and Y ₂; the full model in Table 5 includes all six binary variables. From the full model, we note that cannabis use is more likely for males, highly educated people, and people with a job and less likely for people who have children or are married or cohabiting. Moreover, the probability of cannabis use increases with age. Similar results, except for the sign of the variable Ms, are observed for the determinants that affect Y ₂. After sequentially removing the nonsignificant background variables (Age, Ms, and Ch), we retained the three variables Ge, Edu, and Emp for the reduced model. For both response variables Y ₁ and Y ₂, DQ data produce statistically significant estimates for the effects of Ge and Edu, while the determinant Emp has a significant positive effect only on Y ₁, even though its effect becomes smaller and its significance lessens in the reduced model in Table 5.

Table 5.

Univariate Logistic Regression Analysis for DQ Data.

Variable	Parameters	Full Model				Reduced Model
		Y ₁		Y ₂		Y ₁		Y ₂
		Estimate	p-value	Estimate	p-value	Estimate	p-value	Estimate	p-value
Intercept	$β_{0}$	−2.645 (0.423)	.000***	−0.319 (0.300)	.288	−2.844 (0.397)	.000***	−0.365 (0.253)	.150
$X_{1} : G e$	$β_{1}$	0.983 (0.302)	.000***	0.548 (0.270)	.043**	1.013 (0.300)	.000***	0.558 (0.268)	.037**
$X_{2} : E d u$	$β_{2}$	1.309 (0.360)	.000***	1.031 (0.276)	.000***	1.341 (0.356)	.000***	1.022 (0.270)	.000***
$X_{3} : E m p$	$β_{3}$	0.654 (0.311)	.036**	0.130 (0.284)	.648	0.564 (0.294)	.055*	0.143 (0.273)	.600
$X_{4} : A g e$	$β_{4}$	0.123 (0.385)	.749	0.267 (0.362)	.461
$X_{5} : M s$	$β_{5}$	−0.294 (0.440)	.504	0.050 (0.399)	.900
$X_{6} : C h$	$β_{6}$	−0.256 (0.501)	.609	−0.392 (0.449)	.383
AIC		314.95		364		311.39		359.04

Note: Values in parentheses are the asymptotic standard error of the estimates. DQ = direct questioning.

* $p$-value $< .1$. ** $p$-value $< .05$. *** $p$-value $< .001$.
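To show how the coefficients in Table 5 translate into probabilities, the sketch below applies the inverse logit to the reduced-model DQ estimates for Y ₁. The helper names `expit` and `predict` are ours; the coefficients are copied from Table 5, and the covariate profiles are chosen only for illustration.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


# Reduced-model DQ estimates for Y1 (cannabis use), copied from Table 5
coef_y1 = {"intercept": -2.844, "Ge": 1.013, "Edu": 1.341, "Emp": 0.564}


def predict(coef, ge, edu, emp):
    """Predicted probability for a profile of binary covariates."""
    eta = coef["intercept"] + coef["Ge"] * ge + coef["Edu"] * edu + coef["Emp"] * emp
    return expit(eta)


p_reference = predict(coef_y1, 0, 0, 0)   # female, low education, not working
p_all_ones = predict(coef_y1, 1, 1, 1)    # male, medium/high education, working
or_gender = math.exp(coef_y1["Ge"])       # odds ratio for men versus women

print(round(p_reference, 3), round(p_all_ones, 3), round(or_gender, 2))
```

Under the DQ reduced model, the predicted probability of reported cannabis use thus rises from roughly 5 percent in the reference group to roughly 52 percent for employed men with a medium or high level of education.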

Bivariate logistic regression analysis for RR and DQ data

Table 6 reports the bivariate logistic regression analyses between cannabis use (Y ₁) and cannabis legalization (Y ₂) using the same two sets of covariates: one that includes six variables for the results of the full model and the other only the variables Ge, Edu, and Emp for the results of the reduced model. The results highlight interesting comparisons between the RR and DQ survey modes. We preliminarily remark that the interpretation of the regression coefficients estimated on RR data mimics that for DQ estimates.

First, looking at the DQ estimates, we observe that the effects and the significance of the determinants are nearly the same as in the univariate models. The effects of $Ge$ and $Edu$ are statistically significant on Y ₁ and Y ₂, while the determinant $Emp$ has a positive significant effect only on Y ₁. Additionally, the bivariate model reveals a significant association between Y ₁ and Y ₂ through the estimated odds ratio which, for the full model, is $\exp ({\hat{β}}_{Y_{12, 0}}) = \exp (1.776) = 5.906$ with a Wald 95 percent confidence interval $[2.596, 13.423]$. The effects estimated under the reduced model are essentially the same as those under the full model.
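The odds ratio and its Wald interval can be recovered from the estimate and the standard error of ${\hat{β}}_{Y_{12, 0}}$ reported in Table 6 by exponentiating the endpoints of the interval on the log scale. The sketch below uses the rounded table values, so the endpoints differ from the published ones in the third decimal place; the function name `odds_ratio_ci` is ours.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(log_or, se, level=0.95):
    """Odds ratio with a Wald confidence interval computed on the log scale."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return (
        math.exp(log_or),
        math.exp(log_or - z * se),
        math.exp(log_or + z * se),
    )


# DQ full model: association parameter and its standard error from Table 6
odds_ratio, lower, upper = odds_ratio_ci(1.776, 0.419)
print(round(odds_ratio, 3), round(lower, 3), round(upper, 3))
```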

For the RR crossed design and full model, we observe that the association between Y ₁ and Y ₂ is significant (${\hat{β}}_{Y_{12, 0}} = 1.630$, $p$-value $= .038$), while none of the covariates has a significant impact on Y ₁ and Y ₂. Indeed, among the covariates, only $Ge$ (${\hat{β}}_{Y_{2, 1}} = - 1.388$, $p$-value $= .079$) appears significant, and only at the 10 percent level in the reduced model. RR estimates from Table 6 suggest that men and more educated people are less likely to use cannabis. Men also seem less prone to support cannabis legalization. Looking at the sign of the regression coefficients across the two survey modes, we note that opposite conclusions stem from DQ data. We therefore conclude that DQ and RR data do not yield the same conclusions on the effects of the determinants of the two sensitive behaviors.

Table 6.

Bivariate Logistic Regression Analysis for RR and DQ Data.

Variable	Parameters	Full Model				Reduced Model
		DQ		RR Crossed Model		DQ		RR Crossed Model
		Estimate	p-value	Estimate	p-value	Estimate	p-value	Estimate	p-value
Intercept	$β_{Y_{1, 0}}$	−2.688(0.425)	$.000$ ***	−0.277(0.835)	.740	−2.884(0.398)	$.000$ ***	−0.652(0.727)	.370
$X_{1} : G e$	$β_{Y_{1, 1}}$	1.004(0.301)	$.001$ **	0.008(0.676)	.991	1.030(0.300)	$.001$ **	−0.065(0.635)	.919
$X_{2} : E d u$	$β_{Y_{1, 2}}$	1.377(0.361)	$.000$ ***	−0.120(0.782)	.878	1.359(0.356)	$.000$ ***	−0.311(0.746)	.677
$X_{3} : E m p$	$β_{Y_{1, 3}}$	0.705(0.312)	$.024$ **	1.241(0.763)	.104	0.590(0.294)	$.045$ **	1.162(0.726)	.109
$X_{4} : A g e$	$β_{Y_{1, 4}}$	0.082(0.383)	.830	−1.344(0.871)	.123
$X_{5} : M s$	$β_{Y_{1, 5}}$	−0.377(0.439)	.391	−1.047(1.012)	.301
$X_{6} : C h$	$β_{Y_{1, 6}}$	−0.174(0.500)	.727	1.187(1.047)	.257
Intercept	$β_{Y_{2, 0}}$	−0.300(0.300)	.316	0.182(0.896)	.839	−0.355(0.253)	.160	0.424(0.745)	.569
$X_{1} : G e$	$β_{Y_{2, 1}}$	0.552(0.270)	$.041$ **	−1.241(0.811)	.126	0.566(0.268)	$.034$ **	−1.388(0.790)	.079*
$X_{2} : E d u$	$β_{Y_{2, 2}}$	1.042(0.276)	$.000$ ***	0.876(0.816)	.283	1.029(0.270)	$.000$ ***	0.468(0.781)	.549
$X_{3} : E m p$	$β_{Y_{2, 3}}$	0.090(0.284)	.753	0.933(0.827)	.259	0.107(0.273)	.694	1.382(0.862)	.109
$X_{4} : A g e$	$β_{Y_{2, 4}}$	0.264(0.362)	.465	−1.199(1.098)	.275
$X_{5} : M s$	$β_{Y_{2, 5}}$	0.066(0.399)	.869	0.136(1.133)	.904
$X_{6} : C h$	$β_{Y_{2, 6}}$	−0.419(0.449)	.350	1.416(1.205)	.240
Intercept	$β_{Y_{12, 0}}$	1.776(0.419)	$.000$ ***	1.630(0.784)	.038 **	1.751(0.414)	$.000$ ***	1.077(0.647)	.096*
AIC		657.386		744.33		648.937		743.496

Note: Values in parentheses are the asymptotic standard error of the estimates. DQ = direct questioning; RR = randomized response.

* $p$-value $< .1$. ** $p$-value $< .05$. *** $p$-value $< .001$.
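The p-values in Table 6 can be checked against the reported estimates and standard errors. The sketch below assumes a two-sided Wald z-test based on the asymptotic normality of the maximum likelihood estimates; this assumption is ours, although it agrees with the RR values quoted in the text. The function name `wald_p_value` is hypothetical.

```python
import math


def wald_p_value(estimate, se):
    """Two-sided p-value of the Wald z-test for H0: coefficient = 0."""
    z = estimate / se
    # 2 * (1 - Phi(|z|)) written through the complementary error function
    return math.erfc(abs(z) / math.sqrt(2.0))


# RR crossed model, full model: association parameter (Table 6)
p_assoc = wald_p_value(1.630, 0.784)
# RR crossed model, reduced model: effect of Ge on Y2 (Table 6)
p_gender = wald_p_value(-1.388, 0.790)

print(round(p_assoc, 3), round(p_gender, 3))
```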

At first sight, this finding might not sound promising; on closer analysis, however, it is not so surprising given the modest size of the sample (n = 289) and considering that, in general, the RRT adds random noise to the responses, which reduces the statistical efficiency of the inferential procedures. This lack of efficiency represents the cost of enhancing respondent privacy protection and increasing cooperation. For the RR crossed design, this aspect clearly emerges from Table 6, where we observe that RR estimates are always affected by higher standard errors than DQ estimates. As a consequence, the higher standard errors are likely to reduce substantially the statistical power of the tests on the regression coefficients, and larger sample sizes would be required to achieve the same level of precision as DQ estimates. Undoubtedly, the statistical power of RR analysis depends on several factors such as the prevalence of the sensitive attribute, the RR design, and the randomization probabilities (see, e.g., Ulrich, Schröter, and Stringel 2012). That said, it is not easy to determine in advance the sample size required to yield a sufficiently high level of statistical power and reduce Type II error. To our knowledge, no statistical power analysis for the RR crossed and simple models has been carried out in the literature, so we do not have sufficient information to assess the extent of the problem. The issue, which is beyond the scope of this article, certainly deserves more attention and may be the object of future investigation. The lack of statistical power is not the sole reason why regression coefficients do not appear significant. 
A more epistemological and theoretical reason can be adduced: It is based on an underlying assumption of the RRT, according to which the indirect questioning survey modes should eliminate or at least reduce the effects of individual attributes of the respondent and/or situational characteristics of the interview. In line with this hypothesis, discussed and supported, for instance, in Wolter and Preisendörfer (2013), the covariates are not expected to be significant in predicting the response variables. Therefore, if that expectation were well grounded, the assumption would be confirmed by our work. Unfortunately, at the moment, there is insufficient empirical evidence in the literature to support this working hypothesis, and the same results in Wolter and Preisendörfer (2013) are not fully confirmatory.
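The loss of power associated with larger standard errors, mentioned above, can be illustrated with a simple normal approximation. The sketch below is not a power analysis for the crossed model. It only computes the approximate power of a two-sided Wald test for a hypothetical true coefficient of 1.0 and two hypothetical standard errors, chosen to be of the same order of magnitude as the DQ and RR values in Table 6.

```python
from statistics import NormalDist


def wald_power(beta, se, alpha=0.05):
    """Approximate power of a two-sided Wald z-test when the true coefficient is beta."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(beta) / se
    # Probability that the z-statistic falls in either rejection region
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)


# Hypothetical true effect of 1.0 with a DQ-like and an RR-like standard error
power_dq_like = wald_power(1.0, 0.30)
power_rr_like = wald_power(1.0, 0.70)
print(round(power_dq_like, 2), round(power_rr_like, 2))
```

In this toy setting, more than doubling the standard error lowers the power from about 0.9 to about 0.3, which shows how a real effect may easily fail to reach significance under RR with a sample of this size.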

As stated by Kerkvliet (1994), differences between the estimates of the regression coefficients with RR and DQ data can be used to evaluate the sensitivity (or fragility) of the two survey modes. Large differences (in the sign and the statistical significance of the coefficients) would tend to vitiate the usefulness of DQ data on drug use. Given the differences found in our analysis, we would have reason to support many of the criticisms directed at DQ, according to which DQ data are likely to produce biased conclusions on the determinants of drug use. Nonetheless, even if our results seem to point in that direction, in the light of the questions raised previously, we prefer to remain cautious and refrain from drawing any definitive conclusion on the validity of one method over the other. Certainly, with hindsight, a larger scale survey would have been useful to clarify the statistical performance of the crossed model in investigating the determinants of cannabis use and its legalization. However, following the Kerkvliet (1994) RR survey based on 215 students, and considering our limited resources, we felt that a sample size of around 300 units would be sufficient for a pilot study and to implement the RR bivariate logistic model.

To conclude the section, we display in Table 7 the point prevalence estimates and the 95 percent confidence intervals for $π_{1 +}$, $π_{+ 1}$, $π_{11}$, and $π_{00}$ obtained for subgroups of the population from the bivariate logistic regression model estimated in Table 6. To save space, we omit the estimates for $π_{10}$ and $π_{01}$, which can be easily computed from the estimates provided. As anticipated, we note that the “more-is-better assumption” does not hold in all of the subgroups considered. For instance, focusing on the estimates for $π_{1 +}$ and $π_{+ 1}$, we note that for men who are both highly educated and unemployed, DQ provides higher prevalence estimates than RR; the same holds for $π_{11}$. In three other cases out of the eight, higher DQ estimates are obtained only for cannabis legalization. While the single higher DQ estimate for cannabis use might be considered the exception that proves the rule, the explanation in the case of support for legalization may be different. It is worth pointing out that declaring a favorable opinion on cannabis legalization is certainly less sensitive than reporting cannabis use, and therefore respondents may not have the same reason to be unforthcoming. This expectation is confirmed for the entire population by the DQ and RR estimates of $π_{+ 1}$ reported in Table 3, which are close to each other. On the other hand, since favoring cannabis legalization may denote positive and desirable personal traits such as an open mind, liberal behaviors, and modern social views, some groups of the population may tend to overreport the attitude when it is surveyed by DQ, and, consequently, this would produce spuriously higher estimates. Given this working hypothesis, it is expected that RR estimates would be lower than DQ ones and the “less-is-better assumption” might be applied to support the validity of the RR crossed model. 
However, whatever assumption is adopted, it is hard to draw definitive conclusions because other factors can arise making evaluation more complex. For instance, RR estimates might be subject to bias induced by noncompliance with the RR design, which in a broad sense includes cheating and self-protective responding, misunderstanding the RR instructions, a desire to break the RR rules, carelessness, and so on (see, e.g., Heck, Hoffmann, and Moshagen 2018; Höglinger and Jann 2018).
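The point estimates in Table 7 can be reproduced from the reduced-model coefficients in Table 6. The sketch below does so for the DQ data and the reference subgroup ($Ge = Edu = Emp = 0$). It assumes that the bivariate model is parameterized through the two marginal logits and a constant log odds ratio; this is inferred from the agreement with the reported figures and is not restated in this section. The confidence intervals are not reproduced, since they require the covariance matrix of the estimates, which is not reported. All function names are ours.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def joint_from_marginals(p1, p2, odds_ratio):
    """Return (pi11, pi00) given the two marginals and the odds ratio."""
    if abs(odds_ratio - 1.0) < 1e-12:
        pi11 = p1 * p2                      # independence
    else:
        s = 1.0 + (p1 + p2) * (odds_ratio - 1.0)
        disc = s * s - 4.0 * odds_ratio * (odds_ratio - 1.0) * p1 * p2
        pi11 = (s - math.sqrt(disc)) / (2.0 * (odds_ratio - 1.0))
    pi00 = 1.0 - p1 - p2 + pi11
    return pi11, pi00


# DQ reduced model (Table 6), subgroup Ge = Edu = Emp = 0: only the intercepts matter
pi_use = expit(-2.884)        # P(Y1 = 1)
pi_legal = expit(-0.355)      # P(Y2 = 1)
pi11, pi00 = joint_from_marginals(pi_use, pi_legal, math.exp(1.751))

print(round(pi_use, 3), round(pi_legal, 3), round(pi11, 3), round(pi00, 3))
```

For other subgroups, the corresponding covariate coefficients are added to each intercept before applying the inverse logit.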

Table 7.

Prevalence Point Estimate and Wald 95 Percent Confidence Interval (CI) From Bivariate Logistic Regression Model for Cannabis Use and Its Legalization.

X ₁ $G e$	X ₂ $E d u$	X ₃ $E m p$		DQ		RR Crossed Model
X ₁ $G e$	X ₂ $E d u$	X ₃ $E m p$	Prevalence	Estimate	95 Percent CI	Estimate	95 Percent CI
0	0	0	$π_{1 +}$	0.053	[0.018, 0.088]	0.342	[0.260, 0.425]
			$π_{+ 1}$	0.412	[0.333, 0.491]	0.604	[0.499, 0.710]
			$π_{11}$	0.042	[0.019, 0.065]	0.261	[0.210, 0.312]
			$π_{00}$	0.577	[0.520, 0.634]	0.314	[0.261, 0.368]
1	0	0	$π_{1 +}$	0.135	[0.082, 0.188]	0.328	[0.243, 0.413]
			$π_{+ 1}$	0.552	[0.458, 0.646]	0.276	[0.196, 0.356]
			$π_{11}$	0.116	[0.079, 0.152]	0.140	[0.100, 0.180]
			$π_{00}$	0.428	[0.371, 0.485]	0.536	[0.478, 0.593]
0	1	0	$π_{1 +}$	0.179	[0.121, 0.236]	0.276	[0.204, 0.348]
			$π_{+ 1}$	0.662	[0.562, 0.762]	0.709	[0.603, 0.816]
			$π_{11}$	0.161	[0.119, 0.203]	0.234	[0.186, 0.283]
			$π_{00}$	0.320	[0.266, 0.374]	0.249	[0.199, 0.299]
1	1	0	$π_{1 +}$	0.379	[0.305, 0.453]	0.264	[0.186, 0.341]
			$π_{+ 1}$	0.776	[0.664, 0.888]	0.378	[0.289, 0.468]
			$π_{11}$	0.351	[0.296, 0.406]	0.150	[0.109, 0.191]
			$π_{00}$	0.196	[0.150, 0.242]	0.508	[0.450, 0.566]
0	0	1	$π_{1 +}$	0.092	[0.046, 0.137]	0.625	[0.541, 0.709]
			$π_{+ 1}$	0.438	[0.353, 0.524]	0.859	[0.750, 0.968]
			$π_{11}$	0.073	[0.043, 0.103]	0.568	[0.511, 0.625]
			$π_{00}$	0.543	[0.486, 0.600]	0.085	[0.053, 0.117]
1	0	1	$π_{1 +}$	0.220	[0.155, 0.286]	0.609	[0.508, 0.711]
			$π_{+ 1}$	0.579	[0.478, 0.680]	0.603	[0.502, 0.704]
			$π_{11}$	0.188	[0.143, 0.233]	0.429	[0.372, 0.486]
			$π_{00}$	0.389	[0.332, 0.445]	0.216	[0.169, 0.264]
0	1	1	$π_{1 +}$	0.282	[0.212, 0.352]	0.550	[0.473, 0.627]
			$π_{+ 1}$	0.686	[0.579, 0.793]	0.907	[0.793, 1.020]
			$π_{11}$	0.253	[0.203, 0.303]	0.520	[0.463, 0.578]
			$π_{00}$	0.285	[0.233, 0.337]	0.064	[0.036, 0.092]
1	1	1	$π_{1 +}$	0.524	[0.443, 0.605]	0.534	[0.442, 0.625]
			$π_{+ 1}$	0.794	[0.683, 0.905]	0.708	[0.600, 0.817]
			$π_{11}$	0.480	[0.423, 0.538]	0.432	[0.375, 0.489]
			$π_{00}$	0.163	[0.120, 0.206]	0.190	[0.145, 0.236]

Note: DQ = direct questioning; RR = randomized response.

Induced Abortion and Illegal Immigration

We perform a bivariate logistic regression analysis based on RR data to simultaneously evaluate the determinants of two sensitive topics, that is, induced abortion and illegal immigration, previously investigated by Perri et al. (2016) without making use of covariates.

According to Italian legislation on immigration, “irregular status” is defined as the lack of documents (residence permit or any other authorization) that allow a foreigner to reside legally in the country. Data on the subject were collected only through the RR crossed survey model, whose details, together with estimates of the prevalence of the two phenomena, are provided in Perri et al. (2016). In short, the survey was carried out in Calabria, a region in Southern Italy, and involved a convenience sample, spread across the entire region, of 868 foreign women from 69 different countries. A number of qualified female interviewers were employed to conduct a face-to-face survey and to contact foreign women at various meeting points such as places of worship or leisure, medical and assistance centers, phone centers, parks, public squares, and so on. The women contacted were first asked to provide sociodemographic information through a short, standardized one-page questionnaire. Then, each woman received the two decks of cards and was asked to implement the crossed model. Cards in deck I reported the two statements “I have had an abortion in Italy” and “I am legally present in Italy” in the proportions $p = 0.7$ and $1 - p = 0.3$, respectively. Similarly, cards in deck II reported the two statements “I am not legally present in Italy” and “I have never had an abortion in Italy” in the proportions $q = 0.68$ and $1 - q = 0.32$, respectively. All the women were assured by the interviewers that their privacy would be adequately protected by the survey design and that their answers would not expose their confidential details.

Prevalence estimation without employing covariates

Let Y ₁ and Y ₂ denote the two latent sensitive variables indicating abortion and immigrant status, respectively. Moreover, let $Y_{1}^{*} = 1$ denote a “yes” answer to the statement marked on the card selected from deck I, and $Y_{2}^{*} = 1$ denote a “yes” answer from deck II. Without using any covariate, the RR estimate of the vector $π$ , according to equation (3), is given by:

${\hat{π}}^{c} = (0.797, 0.022, 0.100, 0.081),$

from which it follows that ${\hat{π}}_{1 +}^{c} = 0.181$ and ${\hat{π}}_{+ 1}^{c} = 0.103$ , respectively. We observe that the prevalence estimate of induced abortion is almost double the prevalence for illegal female immigrants, and this may be a sign of the more sensitive nature of the latter trait.
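As a check on the arithmetic, and assuming that the components of the estimated vector are ordered as $(π_{00}, π_{01}, π_{10}, π_{11})$ with the first subscript referring to abortion (an ordering inferred here from the reported marginals rather than stated in this section), the two marginal prevalences and their ratio follow as:

```latex
\begin{align*}
\hat{\pi}^{c}_{1+} &= \hat{\pi}^{c}_{10} + \hat{\pi}^{c}_{11} = 0.100 + 0.081 = 0.181,\\
\hat{\pi}^{c}_{+1} &= \hat{\pi}^{c}_{01} + \hat{\pi}^{c}_{11} = 0.022 + 0.081 = 0.103,\\
\hat{\pi}^{c}_{1+} / \hat{\pi}^{c}_{+1} &= 0.181 / 0.103 \approx 1.76.
\end{align*}
```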

Bivariate logistic regression analysis for RR data

Bivariate regression analysis between abortion ($Y_{1}$) and irregular immigrant status ($Y_{2}$) is performed using the following binary sociodemographic variables: age (Age; 15–34 years = 0; 35 years or more = 1), education (Edu; 13 years or more = 1), employment status (Emp; not working = 0, working = 1), marital status (Ms; single/separated/divorced = 0, married/cohabiting = 1), number of unwanted pregnancies (Nup; 1 or more = 1), contraceptive use (Cu; no = 0, yes = 1), and time spent in Italy (Time; 6.5 years or more = 1). A number of models have been estimated using different combinations of covariates for $Y_{1}$ and $Y_{2}$. For brevity and for the sake of illustration, we shall only comment on the results for the model presented in Table 8. Note first that there are 794 subjects in the validation data set.

Table 8.

Bivariate Logistic Regression Estimates Under the RR Crossed Model for Induced Abortion and Irregular Immigrant Status.

Variable	Parameters	Estimate	p-value
Intercept	$β_{Y_{1, 0}}$	−1.635(0.495)	.001**
$X_{1} : A g e$	$β_{Y_{1, 1}}$	0.608(0.362)	.093*
$X_{2} : E d u$	$β_{Y_{1, 2}}$	−0.102(0.497)	.838
$X_{3} : E m p$	$β_{Y_{1, 3}}$	−0.482(0.462)	.297
$X_{4} : M s$	$β_{Y_{1, 4}}$	−0.214(0.462)	.634
$X_{5} : N u p$	$β_{Y_{1, 5}}$	1.669(0.344)	.000***
$X_{6} : C u$	$β_{Y_{1, 6}}$	−0.275(0.357)	.441
Intercept	$β_{Y_{2, 0}}$	−1.822(0.601)	.002**
$X_{2} : E d u$	$β_{Y_{2, 1}}$	−0.616(0.805)	.444
$X_{3} : E m p$	$β_{Y_{2, 2}}$	−0.384(0.664)	.563
$X_{4} : M s$	$β_{Y_{2, 3}}$	−0.146(0.654)	.823
$X_{7} : T i m e$	$β_{Y_{2, 4}}$	0.308(0.491)	.530
Intercept	$β_{Y_{12, 0}}$	3.635(1.130)	.001**
AIC		2,056.984

Note: Values in parentheses are the asymptotic standard error of the estimates. Estimates are obtained on the basis of 794 complete cases. RR = randomized response.

*p-value < .1. **p-value < .05. ***p-value < .001.
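The p-values in Table 8 can be related to the estimates and their asymptotic standard errors with a short sketch. It assumes that the reported p-values come from two-sided Wald tests based on the standard normal distribution, which is an assumption consistent with the rows checked below rather than something stated in the table note; the function names are hypothetical.

```python
import math


def wald_p_value(estimate, std_error):
    """Two-sided p-value of the Wald statistic z = estimate / std_error."""
    z = estimate / std_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


def significance_marker(p_value):
    """Markers following the thresholds in the note to Table 8."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.05:
        return "**"
    if p_value < 0.1:
        return "*"
    return ""


# (estimate, asymptotic standard error) copied from Table 8.
table8_rows = {
    "Age (Y1)": (0.608, 0.362),
    "Emp (Y1)": (-0.482, 0.462),
    "Nup (Y1)": (1.669, 0.344),
    "Cu (Y1)": (-0.275, 0.357),
}

for name, (estimate, std_error) in table8_rows.items():
    p_value = wald_p_value(estimate, std_error)
    print(f"{name}: p = {p_value:.3f} {significance_marker(p_value)}")
```

Because the table reports estimates and standard errors rounded to three decimals, recomputed p-values may differ from the printed ones in the last digit. Under the thresholds of the table note, the Age coefficient (p-value close to .093) falls in the single-asterisk band.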

Older women and those who have experienced unwanted pregnancies are much more likely to have had a voluntary abortion, while the propensity decreases for educated women, working women, those who have a partner, and those who utilize contraception. Except, perhaps, for the variable Age, which in the demographic literature is known to exert a quadratic effect on abortion (with the abortion rates decreasing with age), all these results are reasonable, coherent, and in line with the literature on the subject. Also reasonable are the conclusions for the determinants of the immigrant status. The propensity to be irregular decreases with years of education, and working women or those with a partner are more likely to have regular immigration status. Highly educated women are presumed to have sufficient financial resources for their stay in Italy, have a job, or be studying, conditions that, according to Italian law, allow non-Italian citizens the right of residence in the country. Having a partner suggests a more stable family relationship, which is often found in legal immigrants. Moreover, women who are married to men legally residing in Italy have the right to stay in the country, for a long-term, fixed-term, or indefinite stay, upon the issuance of an entry visa for family reunification. The length of the stay in Italy at the time of the interview has a positive effect on the irregular status: Women who have been in Italy for a long time are more likely to have irregular status. Although this aspect may sound strange, it is not completely unexpected and is probably linked to the large presence in the surveyed population of the so-called overstayers, that is, foreigners who entered Italy legally and remained in the country after their right to stay had expired.
A large number of the women surveyed (37 percent) are from Eastern Europe, and all citizens of an EU27 Member State have the right to enter the country without any formal permit, other than valid travel documents (usually an identity card or passport) and, preferably, a health insurance policy that covers medical expenses in case of need, for up to three months. But, according to Italian law, after this period, non-Italian citizens are required to register at the municipality of residence and will receive a residence permit only if they have a job, are currently studying in the country, or have sufficient financial resources to cover their stay. Our explanation based on the overstayers seems to be confirmed by adding the variable Age in the covariate set for the irregular immigrant status (Y ₂). For the new model (not presented here), the estimated coefficients for Age and Time are both positive, and this supports the idea that irregular presence is likely to be attributable mostly to overstaying.

Regarding the statistical performance of the RR survey, we observe that the association between the two sensitive behaviors is significant (${\hat{β}}_{Y_{12, 0}} = 3.635$, p-value = .001), while, among the covariates, RR data produce an estimated effect that is statistically significant at the 5 percent level only for Nup. The latter outcome may appear unusual to survey practitioners, but, unlike in the first survey, we are no longer inclined to identify the small sample size as the main cause for the statistical performance of the regression coefficient estimates. The sample size ($n = 794$) is now nearly three times the size of the first survey, and, although this value might still be considered inadequate, concerns about poor statistical power are less compelling. The possible explanation then moves toward the thesis that the effect of the individual characteristics inducing respondents not to reveal sensitive behaviors in DQ survey mode is reduced when using the RRT approach. However, rather than expressing a final opinion on the question, it seems wiser to investigate the matter with ad hoc additional studies conceived for the purpose, keeping well in mind the focus of the question and the problem to be faced.

According to the estimated regression model in Table 8, we derived for all the subgroups of the population the point estimates and the Wald 95 percent confidence intervals for the prevalence of women who have had an abortion ( ${\hat{π}}_{1 +}$ ), the prevalence of female immigrants with irregular status ( ${\hat{π}}_{+ 1}$ ), and the prevalence of female immigrants with both of the sensitive characteristics ( ${\hat{π}}_{11}$ ) and that of women with neither of the two characteristics $({\hat{π}}_{00})$ . For the sake of brevity, Table 9 reports the estimates for selected subgroups of women. The complete results are available from the authors upon request. Without going into detailed interpretation (we leave that to experts on the phenomena), we limit ourselves to observing that estimates of induced abortion can take a large spectrum of values ranging from ${\hat{π}}_{1 +} = 0.063$ to ${\hat{π}}_{1 +} = 0.655$ , with expected peaks for women with unwanted pregnancies. The estimates for the presence of illegal immigrants are undoubtedly lower, ranging from ${\hat{π}}_{+ 1} = 0.049$ to ${\hat{π}}_{+ 1} = 0.180$ , and this may denote not only the limited diffusion of the phenomenon in the population under study, but also, given the high sensitivity of the topic, a tendency to deceive that, notwithstanding the protection guaranteed by the RR crossed model, still generates underreporting due to fear of sanctions.

Table 9.

Prevalence Point Estimate and Wald 95 Percent Confidence Interval (CI) From Bivariate Logistic Regression Model for Induced Abortion and Irregular Immigrant Status.

X ₁	X ₂	X ₃	X ₄	X ₅	X ₆	X ₇	$π_{1 +}$		$π_{+ 1}$		$π_{11}$		$π_{00}$
$A g e$	$E d u$	$E m p$	$M s$	$N u p$	$C u$	$T i m e$	${\hat{π}}_{1 +}$	95 Percent CI	${\hat{π}}_{+ 1}$	95 Percent CI	${\hat{π}}_{11}$	95 Percent CI	${\hat{π}}_{00}$	95 Percent CI
0	0	0	0	0	0	0	0.163	[0.125, 0.201]	0.139	[0.105, 0.173]	0.103	[0.082, 0.124]	0.801	[0.773, 0.828]
1	0	0	0	0	0	0	0.264	[0.217, 0.311]	0.139	[0.108, 0.171]	0.123	[0.100, 0.145]	0.720	[0.689, 0.751]
0	1	0	0	0	0	0	0.150	[0.113, 0.186]	0.080	[0.054, 0.106]	0.064	[0.047, 0.081]	0.834	[0.808, 0.860]
0	0	1	0	0	0	0	0.108	[0.076, 0.139]	0.099	[0.069, 0.129]	0.065	[0.048, 0.082]	0.858	[0.834, 0.882]
0	0	0	1	0	0	0	0.136	[0.101, 0.171]	0.123	[0.090, 0.155]	0.086	[0.066, 0.105]	0.827	[0.801, 0.853]
0	1	1	1	0	0	0	0.081	[0.054, 0.108]	0.049	[0.028, 0.070]	0.033	[0.020, 0.045]	0.903	[0.882, 0.924]
0	0	0	0	1	0	0	0.508	[0.451, 0.566]	0.139	[0.111, 0.168]	0.135	[0.111, 0.158]	0.487	[0.452, 0.522]
1	0	0	0	1	0	0	0.655	[0.597, 0.714]	0.139	[0.112, 0.166]	0.137	[0.113, 0.161]	0.342	[0.310, 0.375]
0	0	0	0	0	1	0	0.129	[0.096, 0.162]	0.139	[0.104, 0.174]	0.090	[0.070, 0.110]	0.822	[0.795, 0.848]
0	1	1	1	0	1	0	0.063	[0.038, 0.087]	0.049	[0.028, 0.070]	0.029	[0.017, 0.040]	0.917	[0.898, 0.936]
0	0	0	0	1	1	0	0.440	[0.384, 0.496]	0.139	[0.110, 0.168]	0.133	[0.109, 0.156]	0.554	[0.519, 0.588]
1	0	0	0	1	1	0	0.591	[0.532, 0.649]	0.139	[0.112, 0.167]	0.136	[0.112, 0.160]	0.406	[0.372, 0.440]
1	1	1	1	1	1	0	0.394	[0.346, 0.442]	0.049	[0.031, 0.067]	0.047	[0.032, 0.061]	0.604	[0.570, 0.638]
0	0	0	0	0	0	1	0.163	[0.127, 0.200]	0.180	[0.141, 0.220]	0.121	[0.098, 0.144]	0.778	[0.749, 0.807]
0	0	0	0	1	0	1	0.508	[0.450, 0.568]	0.180	[0.148, 0.212]	0.174	[0.147, 0.200]	0.485	[0.450, 0.520]
1	0	0	0	1	0	1	0.655	[0.594, 0.716]	0.180	[0.150, 0.211]	0.177	[0.150, 0.204]	0.342	[0.309, 0.374]
0	1	1	1	1	0	1	0.318	[0.271, 0.365]	0.065	[0.044, 0.087]	0.061	[0.044, 0.078]	0.678	[0.646, 0.710]
0	1	1	1	0	1	1	0.063	[0.038, 0.087]	0.065	[0.041, 0.090]	0.035	[0.022, 0.048]	0.907	[0.887, 0.927]
1	0	0	0	1	1	1	0.591	[0.530, 0.652]	0.180	[0.149, 0.212]	0.176	[0.149, 0.202]	0.405	[0.371, 0.439]
1	1	1	1	1	1	1	0.394	[0.344, 0.444]	0.065	[0.045, 0.086]	0.062	[0.046, 0.079]	0.603	[0.569, 0.637]
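The point estimates in Table 9 can be approximately reproduced from the coefficients in Table 8. The sketch below assumes a bivariate logistic parametrization with two marginal logits and a constant log odds ratio between the two sensitive variables (taken here to be the role of the intercept ${\hat{β}}_{Y_{12, 0}}$); this is an assumption about the model form, and the function names are hypothetical. The joint probability is obtained from the two marginals and the odds ratio by solving the usual quadratic equation for a 2 × 2 table. The confidence intervals are not reproduced, since they require the covariance matrix of the estimates, which is not reported here.

```python
import math

# Coefficients copied from Table 8.
BETA_Y1 = {"Intercept": -1.635, "Age": 0.608, "Edu": -0.102, "Emp": -0.482,
           "Ms": -0.214, "Nup": 1.669, "Cu": -0.275}
BETA_Y2 = {"Intercept": -1.822, "Edu": -0.616, "Emp": -0.384,
           "Ms": -0.146, "Time": 0.308}
LOG_ODDS_RATIO = 3.635  # assumed to be the log odds ratio between Y1 and Y2


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def linear_predictor(beta, profile):
    """Intercept plus the coefficients of the covariates set to 1."""
    return beta["Intercept"] + sum(
        coef * profile.get(name, 0)
        for name, coef in beta.items() if name != "Intercept"
    )


def joint_prevalence(profile):
    """Marginal and joint prevalences for a covariate profile (0/1 values)."""
    a = expit(linear_predictor(BETA_Y1, profile))   # P(Y1 = 1)
    b = expit(linear_predictor(BETA_Y2, profile))   # P(Y2 = 1)
    psi = math.exp(LOG_ODDS_RATIO)                  # odds ratio, here > 1
    # pi11 solves psi = pi11 * pi00 / (pi10 * pi01) given the marginals.
    s = 1.0 + (a + b) * (psi - 1.0)
    pi11 = (s - math.sqrt(s * s - 4.0 * psi * (psi - 1.0) * a * b)) / (
        2.0 * (psi - 1.0))
    return {"pi1+": a, "pi+1": b, "pi11": pi11, "pi00": 1.0 - a - b + pi11}


baseline = joint_prevalence({})                      # all covariates at 0
older_nup = joint_prevalence({"Age": 1, "Nup": 1})   # Age = 1 and Nup = 1

for label, result in (("baseline", baseline), ("Age=1, Nup=1", older_nup)):
    print(label, {key: round(value, 3) for key, value in result.items()})
```

For the two profiles shown, the values agree with the corresponding rows of Table 9 up to rounding of the published coefficients, which lends some support to the assumed parametrization without confirming it.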

Conclusion

This article aims at the evaluation of the simple and crossed models, two different RR procedures proposed by Lee et al. (2013) to simultaneously elicit more valid information than DQ on two sensitive characteristics, and to estimate their prevalence in the population. Indeed, substantive questions in the social and behavioral sciences go beyond determining the prevalence of deviant behaviors, stigmatizing traits, or incriminating attitudes and explore associations between several variables. In line with this, the prime concern of the article lies in the correlation and prediction involving one or more covariates.

Some methodological advances have been produced and supported by a simulation study and by two real surveys based on RR data. Specifically, we combined data from the RR simple and crossed models with other individual information to estimate the determinants of two sensitive behaviors. Since the RRT approach leads to data misclassification, adjusted methods for data analysis are required. Hence, to serve this purpose and utilize RR data, we adapted the multivariate logistic regression model, initially proposed by Glonek and McCullagh (1995) and subsequently discussed in Van den Hout et al. (2007), to the case where sensitive response variables are intentionally misclassified for the purpose of protecting the privacy of survey participants. The ML estimation of the bivariate logistic regression models is introduced, and a simulation study is performed to assess the accuracy of the estimates. The simulation also sheds light on the performance of the two considered RR designs with the balance in favor of the crossed model. To complete the article, an application has been included which discusses how the crossed model has been implemented in two real surveys and presents some bivariate logistic regression analyses on the determinants of the sensitive behaviors investigated in the surveys. It is worth observing that the performed analyses do not claim to be exhaustive or to fully explain the phenomena by providing sociological motivations and implications; rather, they aim at drawing researchers’ attention to a number of open questions to be disentangled in future research.

In the first of the two surveys, concerning cannabis use and its legalization, both RR and DQ data are used to estimate the determinants of the two sensitive topics. Comparing prevalence estimates, it emerges that RR estimates are in general higher than DQ. A few exceptions have been found for certain subgroups of respondents depending on the sensitivity of the surveyed themes. Consequently, in keeping with the well-known “more-is-better assumption,” the RR survey is more likely to produce more valid results than DQ. Furthermore, regression analyses performed under the two survey modes give the opportunity to evaluate whether the inferential conclusions are the same across data types and whether there are differences in the effects of determinants according to the questioning mode. Differences are, therefore, emphasized and a study of sensitivity is conducted by looking at the sign and significance of the logistic regression coefficients. The output of the analysis shows that, under the two data collection methods, the sign of the coefficients is different as is their statistical significance. Specifically, RR estimates turn out to be nonsignificant, except for one covariate, and two possible explanations are provided: (1) low statistical power of the inferential procedure due to the modest size (n = 298) of the surveyed sample and (2) empirical evidence of the hypothesis according to which RR procedures eliminate, or at least reduce, the effect of respondent characteristics.

In the second real survey, only RR data are used to investigate the determinants of induced abortion and irregular immigration status, and therefore, comparative analyses cannot be produced between RR and DQ data as in the first study. However, also in this survey, the estimated regression coefficients are, with the exception of one covariate, nonsignificant, although the sample size is undoubtedly larger (n = 794) than in the first study.

To conclude, we shall summarize in a few points the salient aspects that characterize the article: (1) producing feasible methodological and empirical advances in the use of RR simple and crossed models; (2) confirming, with the exception of a few and somewhat expected cases, that more valid and reliable estimates can be produced by using RR data rather than DQ data; and (3) providing empirical evidence for the underlying idea behind RR surveys according to which there may be differences in the determinants that affect sensitive behaviors across questioning modes. Accordingly, at least for the considered surveys, the RR crossed model seems to significantly reduce the effect of the individual characteristics of the respondents on the sensitive response variables.

Supplemental Material

SupplementaryMaterial_SMR_20191215 for A Logistic Regression Extension for the Randomized Response Simple and Crossed Models: Theoretical Results and Empirical Evidence by Shu-Hui Hsieh and Pier Francesco Perri in Sociological Methods & Research

Footnotes

Acknowledgment

The authors are grateful to an associate editor and two referees for their helpful comments that improved the presentation.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research of S. H. Hsieh was supported by the Ministry of Science and Technology (MOST) of Taiwan, ROC (105-2118-M-001-009). The research of P.F. Perri was supported by the Ministerio de Economía y Competitividad (Spain), grant ID MTM2015-63609-R.

ORCID iD

Shu-Hui Hsieh

Supplemental Material

Supplemental material for this article is available online.

References

Chaudhuri. 2011. Randomized Response and Indirect Questioning Techniques in Surveys. Boca Raton, FL: Chapman & Hall/CRC.

Chaudhuri and Christofides T. C. 2013. Indirect Questioning in Sample Surveys. Heidelberg, Germany: Springer.

Chaudhuri, Christofides T. C., and Rao C. R. 2016. Handbook of Statistics 34—Data Gathering, Analysis and Protection of Privacy through Randomized Response Techniques: Qualitative and Quantitative Human Traits. Amsterdam, the Netherlands: Elsevier.

Chaudhuri and Mukerjee. 1988. Randomized Response: Theory and Techniques. New York: Marcel Dekker.

Christofides T. C. 2005. “Randomized Response Technique for Two Sensitive Characteristics at the Same Time.” Metrika 62:53–63.

Corstange. 2009. “Sensitive Questions, Truthful Answers? Modeling the List Experiment with LISTIT.” Political Analysis 17:45–63.

Cruyff M. J. L. F., Böckenholt, Van der Heijden P. G. M., and Frank L. E. 2016. “A Review of Regression Procedures for Randomized Response Data, Including Univariate and Multivariate Logistic Regression, the Proportional Odds Model and Item Response Model, and Self-protective Responses.” Pp. 287–315 in Handbook of Statistics 34—Data Gathering, Analysis and Protection of Privacy through Randomized Response Techniques: Qualitative and Quantitative Human Traits, edited by Chaudhuri, Christofides T. C., and Rao C. R. Amsterdam, the Netherlands: Elsevier.

Elffers, Van der Heijden, and Hezemans. 2003. “Explaining Regulatory Non-compliance: A Survey Study of Rule Transgression for Two Dutch Instrumental Laws, Applying the Randomized Response Method.” Journal of Quantitative Criminology 19:409–39.

Ewemooje O. S., Amahia G. N., and Abedola F. B. 2017. “Estimating Prevalence of Induced Abortion and Multiple Sexual Partners Using Improved Randomized Response Technique for Two Sensitive Attributes.” Communication in Statistics: Case Studies, Data Analysis and Applications 3:21–28.

Fox J. A. 2016. Randomized Response and Related Methods: Surveying Sensitive Data. Thousand Oaks, CA: Sage Publication.

Fox J. A. and Tracy P. E. 1986. Randomized Response: A Method for Sensitive Surveys. Newbury Park, CA: Sage Publication.

Glonek G. F. V. and McCullagh. 1995. “Multivariate Logistic Models.” Journal of the Royal Statistical Society B 57:533–46.

Greenberg B. G., Abul-Ela, Simmons W. R., and Horvitz D. G. 1969. “The Unrelated Question Randomized Response Model: Theoretical Framework.” Journal of the American Statistical Association 64:520–39.

Heck D. W., Hoffmann, and Moshagen. 2018. “Detecting Nonadherence without Loss in Efficiency: A Simple Extension of the Crosswise Model.” Behavior Research Methods 50:1895–905.

Höglinger and Jann. 2018. “More Is Not Always Better: An Experimental Individual-Level Validation of the Randomized Response Technique and the Crosswise Model.” PLoS One 13(8):e0201770. Retrieved 14 August 2018 (https://doi.org/10.1371/journal.pone.0201770).

Hsieh S. H., Lee S. M., C. S., and S. H. 2016. “An Alternative to Unrelated Randomized Response Techniques with Logistic Regression Analysis.” Statistical Methods & Applications 25:601–21.

Hsieh S. H., Lee S. M., and Shen P. S. 2010. “Logistic Regression Analysis of Randomized Response Data with Missing Covariates.” Journal of Statistical Planning and Inference 140:927–40.

Jann, Jerke, and Krumpal. 2012. “Asking Sensitive Questions Using the Crosswise Model: An Experimental Survey Measuring Plagiarism.” Public Opinion Quarterly 76:32–49.

Kerkvliet. 1994. “Estimating a Logit Model with Randomized Data: The Case of Cocaine Use.” Australian Journal of Statistics 36:9–20.

Korndörfer, Krumpal, and Schmukle S. C. 2014. “Measuring and Explaining Tax Evasion: Improving Self-Reports Using the Crosswise Model.” Journal of Economic Psychology 45:18–32.

Krumpal. 2012. “Estimating the Prevalence of Xenophobia and Anti-Semitism in Germany: A Comparison of the Randomized Response Technique and Direct Questioning.” Social Science Research 41:1387–403.

Lee C. S., Sedory S. A., and Singh. 2013. “Estimating at Least Seven Measures of Qualitative Variables from a Single Sample Using Randomized Response Technique.” Statistics and Probability Letters 83:399–409.

Lensvelt-Mulders G. J. L. M., Hox J. J., Van der Heijden P. G. M., and Mass C. J. M. 2005. “Meta-Analysis of Randomized Response Research. Thirty-five Years of Validation.” Sociological Methods & Research 33:319–48.

Lensvelt-Mulders G. J. L. M., Van der Heijden P. G. M., Laudy, and Van Gils. 2006. “A Validation of a Computer-Assisted Randomized Response Survey to Estimate the Prevalence of Fraud in Social Security.” Journal of the Royal Statistical Society A 169:305–18.

Maddala G. S. 1983. Limited-dependent and Qualitative Variables in Econometrics. Cambridge, London: Cambridge University Press.

Perri P. F., Pelle, and Stranges. 2016. “Estimating Induced Abortion and Foreign Irregular Presence Using the Randomized Response Crossed Model.” Social Indicators Research 129:601–18.

Scheers N. J. and Dayton C. M. 1998. “Covariate Randomized Response Models.” Journal of the American Statistical Association 83:969–74.

Tian G. L. and Tang M. L. 2014. Incomplete Categorical Data Design: Non-Randomized Response Techniques for Sensitive Questions in Surveys. Boca Raton, FL: Chapman & Hall/CRC.

Tian G. L., J. W., Tang M. L., and Geng. 2007. “A New Non-Randomized Model for Analysing Sensitive Questions with Binary Outcomes.” Statistics in Medicine 26:4238–52.

Van den Hout, Van der Heijden P. G. M., and Gilchrist. 2007. “The Logistic Regression Model with Response Variables Subject to Randomized Response.” Computational Statistics and Data Analysis 50:6060–69.

Ulrich, Schröter, and Stringel. 2012. “Asking Sensitive Questions: A Statistical Power Analysis of Randomized Response Models.” Psychological Methods 17:623–41.

Warner S. L. 1965. “Randomized Response: A Survey Technique for Eliminating Evasive Answer Bias.” Journal of the American Statistical Association 60:63–9.

Wolter and Preisendörfer. 2013. “Asking Sensitive Questions: An Evaluation of the Randomized Response Technique Versus Direct Questioning Using Individual Data Validation.” Sociological Methods & Research 42:321–53.
