A Two-stage Multilevel Randomized Response Technique With Proportional Odds Models and Missing Covariates

Abstract

Surveys of income are complicated by the sensitive nature of the topic. The problem researchers face is how to encourage participants to respond and to provide truthful responses in surveys. To correct biases induced by nonresponse or underreporting, we propose a two-stage multilevel randomized response (MRR) technique to investigate the true level of income and to protect personal privacy. For a wide range of applications, we present a proportional odds model for two-stage MRR data and apply inverse probability weighting and multiple imputation methods to deal with covariates on some subjects that are missing at random. A simulation study is conducted to investigate the effects of missing covariates and to evaluate the performance of the proposed methods. The practicality of the proposed methods is illustrated with the regular monthly income data collected in the Taiwan Social Change Survey. Furthermore, we provide an estimate of personal regular monthly mean income.

Keywords

inverse probability weighting, missing at random, multilevel randomized response technique, multiple imputation, Taiwan Social Change Survey

Surveys of income are complicated by the sensitive nature of the topic. Some researchers have even ranked income as one of the hardest-to-ask questions (see, e.g., Lillard, Smith, and Welch 1986; Riphahn and Serfling 2002; Yu and Li 2011; Zweimüller 1992). In general, a sensitive question is one that concerns illegal or socially undesirable behavior, or information considered private or shameful within a given culture. Individuals may be reluctant to disclose their earnings or other money income directly, which can lead them either to misreport their income (response bias) or to refuse to take part in the study (nonresponse bias). Response and nonresponse bias affect the validity of survey results and make a reliable estimate of income difficult to obtain. The problem researchers face is how to encourage participants to respond and how to elicit truthful responses in surveys. Hence, we consider a two-stage multilevel randomized response (MRR) technique to investigate the true level of income as well as to protect personal privacy.

The randomized response (RR) technique was proposed by Warner (1965) as an interview technique to elicit sensitive information while protecting respondents’ privacy and, hence, to reduce response and nonresponse bias. Using this RR technique, the respondents are divided into two mutually exclusive groups by means of a randomization device, such as a spinner, dice, playing cards, random numbers, or a computer, to enable the respondents to answer the sensitive questions without revealing their true status about the stigmatized or private attribute to the interviewer. Since then, different developments and variants of the RR technique of Warner (1965) have been proposed by different authors (see, e.g., Christofides 2003; Franklin 1989; Greenberg et al. 1969; Hsieh, Lee, and Tu 2018; Mangat 1994; Moors 1971; Lensvelt-Mulders et al. 2005). Among them, Lensvelt-Mulders et al. (2005) discussed two meta-analyses on 32 comparative RR technique studies in which the RR technique may yield a more reliable estimate compared with self-administered questionnaires or face-to-face interviews, especially when dealing with highly sensitive topics. The RR technique has been used successfully in several sensitive research areas such as homosexual behaviors and AIDS, drug abuse history, abortion experience, income and tax evasion, and so on (see, e.g., Esponda, Huerta, and Guerrero 2016; Houston and Tran 2001; Tian et al. 2009).

Despite the wide applicability of the RR technique and several methodological advances, we find surprisingly few applications. Indeed, our extensive search yields only a handful of published studies that use the RR method to answer a multiple-choice question by using an appropriate regression model. In a logistic regression model for RR data, Scheers and Dayton (1988) showed the relationship between sensitive questions and explanatory variables for a related question design (Warner 1965) and an unrelated question design (Greenberg et al. 1969). Corstange (2009) proposed a method to estimate the parameters of a hidden logistic regression model. Lensvelt-Mulders et al. (2006) extended the logistic regression procedure to incorporate personal characteristics of respondents so as to make it possible to weight a sample toward population characteristics. Ronning (2005) discussed a probit model for RR data. In a multivariable logistic regression model for RR data, Van den Hout, Van der Heijden, and Gilchrist (2007) elaborated the multivariable logistic regression model as presented by Glonek and McCullagh (1995) for two RR variables. Cruyff, Van den Hout, and Van der Heijden (2008) discussed the analysis of a summary of multivariate data. Böckenholt and Van der Heijden (2007) used a multivariate approach to estimate self-protective responses by using an item-randomized-response model. However, survey data as entered are often not measured perfectly. Hsieh, Lee, and Shen (2010) discussed a logistic regression model for RR data with missing covariates. When the missing covariate is of paramount importance, a more advanced method is required to avoid bias and imprecision in the parameter estimates than when the covariate is of less significance.

As far as we know, there has been very little research regarding the analysis of RR data with missing covariates. In this work, we propose a proportional odds model (POM) for a two-stage MRR variable and fit the model to income data to study its relationship with a set of missing covariates. In the MRR technique proposed by Hsieh et al. (2018), the outcome variable is the absolute difference between two integers, one corresponding to her or his sensitive true states (e.g., income) on an ordinal scale and the other produced by a random number generator provided by the interviewer. The two-stage MRR technique is proposed by using both the direct question (DQ) and MRR technique where an MRR is applied in the second stage. The POM, which is used to fit the two-stage MRR data, reveals the relationship between the covariates and the probability of sensitive true states for ordinal categories. It is important to include covariates that can explain some of the observed between-subject variability for RR data in the regression models. One active research area in practical problems has been the study of regression models with missing covariates. Although the naive complete-case (CC) method can yield consistent estimates if covariates are missing completely at random (MCAR), which means that the missingness mechanism is independent of both observed data and missing data (Rubin 1976), this approach yields inconsistent estimates when the missingness mechanism depends on the observed data, which may be missing at random (MAR; Little 1992).¹ Little’s study focused almost exclusively on multivariate normal models, whereas Horton and Laird (1999) focused exclusively on the maximum likelihood (ML) methods for generalized linear models (GLMs) with MAR categorical covariates. Ibrahim et al. (2005) provided a detailed overview and comparisons for various paradigms for inference in GLMs with categorical or continuous as well as MAR or nonignorable missing covariates. 
Note that nonignorable missingness means that the missingness mechanism depends on the unobserved values themselves, so that the data are not MAR. Under the MAR assumption, Lee, Gee, and Hsieh (2011) proposed semiparametric methods to estimate the parameters of a POM for ordinal response data with missing covariates. These approaches include the conditional estimation method, joint conditional method, and weighted method.

We develop alternative estimation procedures to accommodate MAR missing covariate data in the POM for a dependent variable subject to the two-stage MRR technique. Specifically, we apply the inverse probability weighting (IPW) and nonparametric multiple imputation (MI) methods to carry out the estimation of the covariate effects in the presence of missing covariate data. Lee, Hwang, and de Dieu Tapsoba (2016) used these methods to estimate the population size under a capture–recapture model with missing covariate data. We also prove the asymptotical equivalence of estimators based on the IPW and nonparametric MI methods under the POM for a dependent variable subject to the two-stage MRR technique with covariate data MAR.

The second section describes the motivating example and presents the POM for a dependent variable subject to the two-stage MRR technique and the ML framework. The third section summarizes the effects of missing covariate data when using a naive CC analysis method and develops alternative estimation approaches based on the IPW method and two types of nonparametric MI. The fourth section reports an extensive simulation study conducted to evaluate the finite-sample performance of the proposed estimators. The fifth section illustrates the practical use of the proposed approaches by using the regular monthly income data from the Taiwan Social Change Survey (TSCS). Some conclusions are given in the sixth section. Technical details are given in Supplementary Material (which can be found at http://smr.sagepub.com/supplemental/).

The POM Framework

Motivating Example

Income is a sensitive topic for most people. In the traditional DQ methods, respondents with either a higher income or a lower income are more likely to say “don’t know” or to refuse to reply when encountering the income question, and hence, it lowers the response accuracy. There has been very little work done using the RR technique in surveys that include questions about income.² One study directly addressed this issue by the forced response design (Boruch 1971), which was conducted in a nationwide survey to assess the level of noncompliance with social security regulations by the Dutch Department of Social Affairs in 2004 (Cruyff, Böckenholt et al. 2008). In 2012, the MRR technique was investigated and applied in the TSCS, which was administered in face-to-face interviews by the Center for Survey Research at Academia Sinica. Survey responses were representative of the general population of Taiwanese aged 18 years old or older.

The survey included questions on income level, and the results obtained by using both the DQ and MRR techniques were combined. Motivated in part by a preliminary investigation of the two-stage MRR design, the TSCS had two questions as follows in order to ask for information about income:

Q ₁: Is your regular monthly income (including your salary, compensation, and bonus) from your job NT$30,000 or more?

Respondents were instructed to say “Yes” or “No.” Under the assumption of truthful response by all participants, if their answer to the first question Q ₁ was “Yes,” they needed to answer the second question Q ₂. The second question Q ₂ was designed according to the MRR design as follows:

Q ₂: Which one of the following numbers best indicates your regular monthly income?

Number 7: income between NT$30,000 and NT$59,999,

Number 6: income between NT$60,000 and NT$79,999,

Number 0: income greater than or equal to NT$80,000.

The respondent was instructed to keep that number in mind and to pick a card at random from a well-shuffled deck of 40 playing cards numbered from 1 to 5, with the probability distribution of the randomization device $(P_{1}, P_{2}, P_{3}, P_{4}, P_{5}) = (0.2, 0.1, 0.2, 0.4, 0.1)$ . That is, there were 8 cards marked number 1, 4 cards marked number 2, 8 cards marked number 3, 16 cards marked number 4, and 4 cards marked number 5. Finally, the respondent told the interviewer the absolute difference (from 1 to 6) between her or his income category number and the number on the chosen card. For privacy protection, each respondent was instructed not to reveal to the interviewer the number that corresponds to the answer to Q ₂.
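The randomization step described above can be illustrated with a short Python sketch. It is only an illustration under the deck composition stated in the text; the function names are hypothetical and are not part of the TSCS protocol.

```python
import random
from collections import Counter

# Deck composition taken from the text: 40 cards marked 1..5.
DECK = [1] * 8 + [2] * 4 + [3] * 8 + [4] * 16 + [5] * 4

# Income category numbers offered in Q2.
CATEGORY_NUMBERS = (7, 6, 0)


def card_probabilities(deck):
    """Return the probability of drawing each card value from the deck."""
    counts = Counter(deck)
    return {value: counts[value] / len(deck) for value in sorted(counts)}


def reported_difference(category_number, card):
    """Absolute difference that the respondent reports to the interviewer."""
    return abs(category_number - card)


def simulate_report(category_number, rng):
    """Draw one card at random and return the reported difference."""
    return reported_difference(category_number, rng.choice(DECK))


probs = card_probabilities(DECK)
# Reports that each category number can produce over all card values.
possible_reports = {
    c: sorted({reported_difference(c, r) for r in probs})
    for c in CATEGORY_NUMBERS
}
```

Under these assumptions, `probs` reproduces the stated distribution of the device, and `possible_reports` shows that every report lies between 1 and 6. Most reports are compatible with more than one income category, which is what protects the respondent.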

To the best of our knowledge, there have been no studies regarding the analysis of data from the two-stage MRR design in regression models. For a wide range of applications, we develop the ML framework to estimate the parameters of the POM for a two-stage MRR variable of the income on an ordinal scale.

Model Setting

Assume that the ordinal outcome variable, Y, takes on $m + 1$ values coded $0, 1, 2, \dots, m$ to denote regular monthly income less than $A_{1}$ , in $[A_{1}, A_{2}), \dots, [A_{m - 1}, A_{m})$ , and greater than or equal to $A_{m}$ , respectively. In the motivating example, we take $m = 3$ , and let $Y = 0$ be regular monthly income less than NT$30,000, $Y = 1$ between NT$30,000 and NT$59,999, $Y = 2$ between NT$60,000 and NT$79,999, and $Y = 3$ greater than or equal to NT$80,000. Let $(X, Z)$ be a vector of observed covariates such as age, education, working experience, and so on. Consider a simple random sample of size n. The POM is then expressed as follows:

P (Y_{i} \leq l | X_{i}, Z_{i}) = H (α_{l} + β_{1}^{T} X_{i} + β_{2}^{T} Z_{i}), i = 1, 2, . . ., n, l = 0, 1, 2, . . ., m - 1,

where $H (u) = \{1 + \exp (- u)\}^{- 1}$ . Following Lee et al. (2011), let $T_{i l} = I (Y_{i} \leq l)$ , where $I (\cdot)$ is an indicator function. The above model can then be rewritten as

P (T_{i l} = 1 | X_{i}, Z_{i}) = H (Θ^{T} X_{i, l}), i = 1, 2, . . ., n, l = 0, 1, 2, . . ., m - 1.

Here, $T_{i 0} \leq T_{i 1} \leq . . . \leq T_{i, m - 1}$ , $Θ = (α_{0}, α_{1}, \dots, α_{m - 1}, β_{1}^{T}, β_{2}^{T})^{T}$ , $X_{i, l} = (h_{i}^{{(l)}^{T}}, X_{i}^{T}, Z_{i}^{T})^{T}$ , and $h_{i}^{(l)}$ is the $m \times 1$ vector with 1 on the $(l + 1) th$ row and 0 on the rest. Therefore, the $i th$ respondent’s probability mass function of the income on each scale can be expressed as follows:

\begin{array}{l} π_{i, m}^{Y} (Θ) = P (Y_{i} = m | X_{i}, Z_{i}) = 1 - H (Θ^{T} X_{i, m - 1}), \\ π_{i, l}^{Y} (Θ) = P (Y_{i} = l | X_{i}, Z_{i}) = H (Θ^{T} X_{i, l}) - H (Θ^{T} X_{i, l - 1}), l = 1, . . ., m - 1, \\ π_{i,0}^{Y} (Θ) = P (Y_{i} = 0 | X_{i}, Z_{i}) = H (Θ^{T} X_{i,0}) . \end{array}

We define the vector by $H_{i} (Θ) = {(H (Θ^{T} X_{i, m - 1}), H (Θ^{T} X_{i, m - 2}), . . ., H (Θ^{T} X_{i,1}), H (Θ^{T} X_{i,0}))}^{T}$ to denote the cumulative probabilities of the income on an ordinal scale $(m - 1, m - 2, . . .,1,0)$ . Let $π_{i}^{Y} (Θ) = {(π_{i, m}^{Y} (Θ), π_{i, m - 1}^{Y} (Θ), . . ., π_{i,0}^{Y} (Θ))}^{T}$ denote the vector of probabilities of the income on an ordinal scale $(m, m - 1, m - 2, . . .,1,0)$ . To make this more intuitive, we conceptualize the linkage between $π_{i}^{Y} (Θ)$ and $H_{i} (Θ)$ in C and e ₁, where C is a matrix of $(m + 1) \times m$ and e ₁ is an $(m + 1) \times 1$ vector as follows:

C = \begin{bmatrix} - 1 & 0 & 0 & 0 & \dots & 0 & 0 \\ 1 & - 1 & 0 & 0 & \dots & 0 & 0 \\ 0 & 1 & - 1 & 0 & \dots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \dots & 1 & - 1 \\ 0 & 0 & 0 & 0 & \dots & 0 & 1 \end{bmatrix} \quad \text{and} \quad e_{1} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \end{bmatrix} .

We can then have $π_{i}^{Y} (Θ) = e_{1} + C H_{i} (Θ)$ , which is an $(m + 1) \times 1$ vector.
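As a minimal sketch of the relations above, the following Python code computes the cumulative and category probabilities of the POM and checks numerically that the representation $e_{1} + C H_{i} (Θ)$ gives the same values. It assumes scalar covariates x and z; the function names are hypothetical, and the parameter values are those used later in the simulation study.

```python
import math


def H(u):
    """Logistic function H(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))


def cumulative_probs(alphas, beta1, beta2, x, z):
    """P(Y <= l | x, z) for l = 0, ..., m-1 under the POM (scalar x and z)."""
    return [H(a + beta1 * x + beta2 * z) for a in alphas]


def category_probs(alphas, beta1, beta2, x, z):
    """P(Y = l | x, z) for l = 0, ..., m, obtained by differencing."""
    cum = cumulative_probs(alphas, beta1, beta2, x, z)
    probs = [cum[0]]
    probs += [cum[l] - cum[l - 1] for l in range(1, len(cum))]
    probs.append(1.0 - cum[-1])
    return probs


def category_probs_via_C(alphas, beta1, beta2, x, z):
    """Same probabilities in the order (m, m-1, ..., 0), as e1 + C H_i."""
    m = len(alphas)
    h = cumulative_probs(alphas, beta1, beta2, x, z)[::-1]  # (H_{m-1}, ..., H_0)
    e1 = [1.0] + [0.0] * m
    out = []
    for row in range(m + 1):
        value = e1[row]
        if row < m:
            value -= h[row]      # diagonal entry -1 of C
        if row >= 1:
            value += h[row - 1]  # subdiagonal entry +1 of C
        out.append(value)
    return out


alphas = (-0.2, 1.0, 2.0)
pi_y = category_probs(alphas, -0.8, -math.log(2), x=1, z=0)
pi_y_rev = category_probs_via_C(alphas, -0.8, -math.log(2), x=1, z=0)
```

Here `pi_y` is ordered as $(π_{i,0}^{Y}, \dots, π_{i,m}^{Y})$ and `pi_y_rev` as $(π_{i,m}^{Y}, \dots, π_{i,0}^{Y})$ , so one list is the reverse of the other.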

In this study, it was observed that each respondent answered the first dichotomous question Q ₁, and Y_i can only be obtained when Y_i is equal to 0 based on the answer of “No”. For $Y_{i} \neq 0$ , her or his answer to the Q ₁ was “Yes”, and she or he needed to answer the second polytomous question Q ₂ under the two-stage MRR design. Therefore, the Y_i s generally cannot be observed, and they are latent ordinal variables. Under this two-stage MRR design, we defined the two variables ${\tilde{Y}}_{i}$ and R_i to denote the value assigned to the $i th$ respondent’s regular monthly income greater than or equal to A ₁ and the number generated from the randomization device, respectively. Here, ${\tilde{Y}}_{i}$ is the number corresponding to the level of regular monthly income Y_i of the $i th$ respondent, who was instructed to keep it in mind, such as

{\tilde{Y}}_{i} = \begin{cases} 0, & \text{if } Y_{i} = m, \\ L + 1, & \text{if } Y_{i} = m - 1, \\ \vdots & \vdots \\ L + m - 2, & \text{if } Y_{i} = 2, \\ L + m - 1, & \text{if } Y_{i} = 1. \end{cases}

The primary assumption made here is that ${\tilde{Y}}_{i} = L + m - 1$ when $Y_{i} = 1$ is less sensitive than ${\tilde{Y}}_{i} = 0$ when $Y_{i} = m$ (see Hsieh et al. 2018). R_i is the number on the card that the $i th$ respondent picked at random from a well-shuffled deck of playing cards numbered from 1 to L, with the probability distribution of the randomization device $P_{1}, P_{2}, \dots, P_{L}$ and $\sum_{t = 1}^{L} P_{t} = 1$ . Finally, the $i th$ respondent told the interviewer the absolute difference between her or his regular monthly income category number and the number on the chosen card, that is, $D_{i} = | {\tilde{Y}}_{i} - R_{i} | = j$ , $j = 1, 2, \dots, L + m - 2$ . However, when the $i th$ respondent’s answer to the first question Q ₁ was “No,” she or he did not answer the second question Q ₂ under the two-stage MRR design and, hence, $Y_{i} = 0$ and D_i is set to 0. Therefore, the observed response to $(Q_{1}, Q_{2})$ is one of the following $L + m - 1$ combinations: $\{(\text{No}, 0), (\text{Yes}, 1), (\text{Yes}, 2), \dots, (\text{Yes}, L + m - 2)\}$ .

Following Hsieh et al. (2018), for $D_{i} = j$ , $j = 1, 2, . . ., L + m - 2$ , we let $d_{i} = (d_{i,1}, d_{i,2}, . . ., d_{i, L + m - 2})$ be a response vector, where $d_{i, j} = I (D_{i} = j)$ is an indicator variable, which is 1 if $D_{i} = j$ ; 0 otherwise, for $i = 1, 2, . . ., n$ . Under the assumption that these $D_{i} = j$ reports are made truthfully and $(P_{1}, P_{2}, . . ., P_{L})$ is set by the researcher, the probability of $d_{i, j} = 1$ is given by

P (d_{i, j} = 1 | X_{i}, Z_{i}) = π_{i, j}^{D} (Θ) = \begin{cases} π_{i, m}^{Y} (Θ) P_{j} + \sum_{k = 2}^{j + 1} π_{i, m - (k - 1)}^{Y} (Θ) P_{L - j + (k - 1)}, & j = 1, \dots, m - 2, \\ π_{i, m}^{Y} (Θ) P_{j} + \sum_{k = 2}^{m} π_{i, m - (k - 1)}^{Y} (Θ) P_{L - j + (k - 1)}, & j = m - 1, \dots, L, \\ \sum_{k = j - L + 2}^{m} π_{i, m - (k - 1)}^{Y} (Θ) P_{L - j + (k - 1)}, & j = L + 1, \dots, L + m - 2. \end{cases}

We then define the parameter vector by $π_{i}^{D} (Θ) = (π_{i,1}^{D} (Θ), π_{i,2}^{D} (Θ), \dots, π_{i, L + m - 2}^{D} (Θ))^{T}$ to denote the vector $(P (d_{i, 1} = 1 | X_{i}, Z_{i}), P (d_{i,2} = 1 | X_{i}, Z_{i}), \dots, P (d_{i, L + m - 2} = 1 | X_{i}, Z_{i}))^{T}$ . Hsieh et al. (2018) conceptualized the linkage between d_i and the true level of income Y_i through a matrix $P$ . Here, Y_i takes a value on the ordinal scale $1, \dots, m - 1$ , or m. P is a transition probability matrix of dimension $(L + m - 2) \times m$ as follows:

P = [\begin{matrix} P_{1} & P_{L} & 0 & 0 & \dots & 0 \\ P_{2} & P_{L - 1} & P_{L} & 0 & \dots & 0 \\ P_{3} & P_{L - 2} & P_{L - 1} & P_{L} & \dots & 0 \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ P_{L - 1} & P_{2} & P_{3} & P_{4} & \dots & P_{m} \\ P_{L} & P_{1} & P_{2} & P_{3} & \dots & P_{m - 1} \\ 0 & 0 & P_{1} & P_{2} & \dots & P_{m - 2} \\ 0 & 0 & 0 & P_{1} & \dots & P_{m - 3} \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ 0 & 0 & 0 & 0 & \dots & P_{1} \end{matrix}] .

Under the two-stage MRR design, $D_{i} = 0$ is obtained based on the answer of “No” from the first question Q ₁. Note that $d_{i} = 0$ means $D_{i} = 0$ with $π_{i,0}^{Y} (Θ) = π_{i,0}^{D} (Θ)$ and $\sum_{j = 1}^{L + m - 2} π_{i, j}^{D} (Θ) = 1 - π_{i,0}^{Y} (Θ)$ . We then define $π_{i}^{Y^{*}} (Θ) = {(π_{i, m}^{Y} (Θ), π_{i, m - 1}^{Y} (Θ), . . ., π_{i,1}^{Y} (Θ))}^{T} = e + C_{m} H_{i} (Θ)$ , which is rows 1 through m of $π_{i}^{Y} (Θ)$ . Here, e is rows 1 through m of e ₁, and C _m is rows 1 through m of C . By using the transition probability matrix P , we can express $π_{i}^{D} (Θ)$ , which is an $(L + m - 2) \times 1$ vector, as follows:

π_{i}^{D} (Θ) = P π_{i}^{Y^{*}} (Θ) = P [e + C_{m} H_{i} (Θ)] = p_{1} + W H_{i} (Θ),

where $p_{1} = P e$ is the first column of matrix P and $W = P C_{m}$ . The covariance matrix of d_i is an $(L + m - 2) \times (L + m - 2)$ matrix, denoted by $Σ_{i} (Θ)$ , where the $(r, s)$ element of $Σ_{i} (Θ)$ is ${[p_{1} + W H_{i} (Θ)]}_{r} {[1 - (p_{1} + W H_{i} (Θ))]}_{s}$ when $r = s$ ; $- {[p_{1} + W H_{i} (Θ)]}_{r} {[p_{1} + W H_{i} (Θ)]}_{s}$ when $r \neq s$ , $r, s = 1, . . ., L + m - 2$ .
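The construction of the transition matrix and of $π_{i}^{D} (Θ)$ can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the card probabilities are those of the motivating example, and the vector `pi_y_star` is a made-up set of income probabilities chosen only for the demonstration.

```python
def transition_matrix(P, m):
    """Build the (L+m-2) x m matrix linking (Y=m, m-1, ..., 1) to D=1..L+m-2.

    P is the list (P_1, ..., P_L). Column c (0-based) corresponds to
    Y = m - c, whose memorised number is 0 when c == 0 and L + c otherwise.
    """
    L = len(P)
    mat = [[0.0] * m for _ in range(L + m - 2)]
    for c in range(m):
        y_tilde = 0 if c == 0 else L + c
        for card in range(1, L + 1):
            j = abs(y_tilde - card)          # reported difference, 1..L+m-2
            mat[j - 1][c] += P[card - 1]
    return mat


def mat_vec(mat, vec):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


def covariance(pi_d):
    """Covariance of the indicator vector d_i: diag(pi) - pi pi^T."""
    n = len(pi_d)
    return [
        [pi_d[r] * (1 - pi_d[r]) if r == s else -pi_d[r] * pi_d[s]
         for s in range(n)]
        for r in range(n)
    ]


P_CARDS = [0.2, 0.1, 0.2, 0.4, 0.1]   # (P_1, ..., P_5) from the motivating example
T = transition_matrix(P_CARDS, m=3)
pi_y_star = [0.1, 0.2, 0.3]           # hypothetical (pi_3, pi_2, pi_1); pi_0 = 0.4
pi_d = mat_vec(T, pi_y_star)
Sigma = covariance(pi_d)
```

Each column of `T` sums to 1, so the entries of `pi_d` sum to $1 - π_{i,0}^{Y} (Θ)$ , in agreement with the identity stated above.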

Estimation

To derive some properties of the ML estimator of $Θ$ , we can express the likelihood function for $Θ$ as follows:

L (Θ) = \prod_{i = 1}^{n} [\prod_{j = 1}^{L + m - 2} π_{i, j}^{D} {(Θ)}^{I (D_{i} = j)}] {[π_{i,0}^{D} (Θ)]}^{I (D_{i} = 0)} .
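For illustration, the logarithm of this likelihood can be evaluated with the following Python sketch. It assumes scalar covariates x and z and a parameter vector ordered as $(α_{0}, \dots, α_{m - 1}, β_{1}, β_{2})$ ; the function names are hypothetical. The sketch only evaluates the log-likelihood and does not solve the estimating equations.

```python
import math


def H(u):
    """Logistic function H(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))


def report_probs(theta, x, z, P):
    """Return [pi_0^D, pi_1^D, ..., pi_{L+m-2}^D] for one respondent.

    theta = (alpha_0, ..., alpha_{m-1}, beta_1, beta_2); x and z are scalars;
    P = (P_1, ..., P_L) is the distribution of the randomization device.
    """
    m = len(theta) - 2
    L = len(P)
    eta = theta[m] * x + theta[m + 1] * z
    cum = [H(a + eta) for a in theta[:m]] + [1.0]
    pi_y = [cum[0]] + [cum[l] - cum[l - 1] for l in range(1, m + 1)]  # P(Y = l)
    pi_d = [0.0] * (L + m - 1)
    pi_d[0] = pi_y[0]                       # "No" to Q1, so D = 0
    for y in range(1, m + 1):
        y_tilde = 0 if y == m else L + m - y
        for card, p in enumerate(P, start=1):
            pi_d[abs(y_tilde - card)] += pi_y[y] * p
    return pi_d


def log_likelihood(theta, data, P):
    """Sum of log pi^D_{i, D_i}; data is an iterable of (d, x, z) triples."""
    return sum(math.log(report_probs(theta, x, z, P)[d]) for d, x, z in data)


P_CARDS = (0.2, 0.1, 0.2, 0.4, 0.1)
theta0 = (-0.2, 1.0, 2.0, -0.8, -math.log(2))
toy_data = [(0, 1, 0), (3, 0, 1)]          # hypothetical (D_i, X_i, Z_i) records
ll = log_likelihood(theta0, toy_data, P_CARDS)
```

A function of this form could be passed to a general-purpose numerical optimizer; the estimator studied in this article is instead defined through the estimating function given next.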

Let $H_{i}^{(1)} (Θ) = (H (Θ^{T} X_{i, m - 1}) [1 - H (Θ^{T} X_{i, m - 1})], \dots, H (Θ^{T} X_{i,1}) [1 - H (Θ^{T} X_{i,1})], H (Θ^{T} X_{i,0}) [1 - H (Θ^{T} X_{i,0})])^{T}$ and $X_{i} = (X_{i, m - 1}, X_{i, m - 2}, \dots, X_{i,0})$ . To estimate $Θ$ , we consider the following unbiased estimating function (Godambe 1960):

U_{n} (Θ) = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} \left[\frac{\partial π_{i}^{D} (Θ)}{\partial Θ}\right] Σ_{i}^{- 1} (Θ) \{d_{i} - π_{i}^{D} (Θ)\} = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} Ψ_{i} (Θ),

where

Ψ_{i} (Θ) = X_{i} \, \mathrm{diag} [H_{i}^{(1)} (Θ)] W^{T} Σ_{i}^{- 1} (Θ) \{d_{i} - [p_{1} + W H_{i} (Θ)]\} .

The ML estimator of $Θ$ based on the full data, denoted by ${\hat{Θ}}_{F}$ , is the solution of $U_{n} (Θ) = 0$ . The asymptotic properties of ${\hat{Θ}}_{F}$ and the covariance matrix of ${\hat{Θ}}_{F}$ are stated in Supplementary Material (which can be found at http://smr.sagepub.com/supplemental/). However, when researchers want to use the one-stage MRR design to collect data, they could estimate $Θ$ by using the function $U_{n} (Θ)$ of the two-stage MRR technique, which yields very similar results.

Methods for Handling Missing Covariates in POM

Developing methods for regression analysis with missing covariates has been an active research area in the past decades (for a review, see Little 1992; Little and Rubin 2002). A closely related problem arises when the covariate X is missing and a surrogate variable for X is available. For this problem, several estimation methods have been proposed (see, e.g., Breslow and Cain 1988; Hsieh, Lee, and Shen 2009; Lee et al. 2011; Wang et al. 1997, 2002). Lee et al. (2011) proposed semiparametric methods to estimate the parameters of a POM for ordinal response data with missing covariates. These approaches include the conditional estimation method, joint conditional method, and weighted method.

We consider the problem of estimating the parameters of the POM for the two-stage MRR data with missing covariates. To simplify the presentation, we assume that X_i is univariate, but the structure can be easily extended to the multivariate case. Suppose X_i is a covariate that may be MAR. Let $V_{i} = (Z_{i}, S_{i})$ be a covariate vector of the $i th$ subject that is always observed, where S_i is a surrogate variable for X_i . With regard to the missing data, let $δ_{i}$ indicate whether X_i is observed $(δ_{i} = 1)$ or not $(δ_{i} = 0)$ . The selection probability $P (δ_{i} = 1 | D_{i}, X_{i}, V_{i}) = π (D_{i}, V_{i})$ then does not depend on X_i under the MAR mechanism.

We can only observe $(D_{i}, X_{i}, V_{i})$ or $(D_{i}, V_{i})$ , $i = 1, . . ., n$ . Based on the validation data set ( $δ_{i} = 1$ ) that consists of $(D_{i}, X_{i}, V_{i})$ , the CC estimator, denoted by ${\hat{Θ}}_{C}$ , is the solution of the following estimating equations:

U_{c n} (Θ) = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} δ_{i} Ψ_{i} (Θ) = 0 .

Because it discards the incomplete cases, the CC analysis method has two potential disadvantages: (a) loss of efficiency and (b) the potential to yield inconsistent estimates when the validation data set is not a random subsample of the original cases. The estimator ${\hat{Θ}}_{C}$ is consistent in the case of MCAR, that is, when the selection probability does not depend on D or X. However, when the missingness mechanism is MAR, the naive CC method may lead to unreliable inference, because $E \{δ_{i} Ψ_{i} (Θ)\} = E \{E [δ_{i} Ψ_{i} (Θ) | D_{i}, X_{i}, V_{i}]\} = E \{π (D_{i}, V_{i}) Ψ_{i} (Θ)\}$ , $i = 1, \dots, n$ , which is not zero in general.

Next, we propose an IPW method and nonparametric MI method to deal with missing covariate data under the two-stage MRR design.

IPW Method

When the missingness mechanism is MAR, Flanders and Greenland (1991) and Zhao and Lipsitz (1992) suggested a weighted method, which uses the inverse of the probability that data are observed as the weight of each respondent. To simplify the notation, we define $π_{i} = π (D_{i}, V_{i})$ and employ a similar approach in the following estimating function:

U_{w n} (Θ; π) = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} \frac{δ_{i}}{π_{i}} Ψ_{i} (Θ) .

In practice, the selection probabilities $π_{i} s$ are generally unknown and have to be estimated from the data. Additionally, Wang et al. (1997) and Hsieh et al. (2010) showed that using estimates of the $π_{i} s$ can result in an efficiency improvement of the estimation even when the selection probabilities are known. When V _i is categorical, the nonparametric estimator of $π_{i}$ is given by

{\hat{π}}_{i} = \frac{\sum_{s = 1}^{n} δ_{s} I (D_{s} = D_{i}, V_{s} = V_{i})}{\sum_{r = 1}^{n} I (D_{r} = D_{i}, V_{r} = V_{i})}, i = 1, \dots, n,

where $I (\cdot)$ is an indicator function. As in Hsieh et al. (2010), we consider a nonparametric model for the selection probability $π_{i}$ to estimate the parameter vector $Θ$ by using the estimating function $U_{w n} (Θ; \hat{π}) = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} \frac{δ_{i}}{{\hat{π}}_{i}} Ψ_{i} (Θ)$ . The IPW estimator of $Θ$ , denoted by ${\hat{Θ}}_{W}$ , is the solution of $U_{w n} (Θ; \hat{π}) = 0$ . However, subject i in the estimating function $U_{w n} (Θ; \hat{π})$ is not independent of the other subjects because of ${\hat{π}}_{i}$ . If $π_{i}$ is estimated nonparametrically, then, as Wang et al. (1997) pointed out, the IPW estimator is equivalent to the mean-score estimator, which solves the following equations:

\frac{1}{\sqrt{n}} \sum_{i = 1}^{n} [\frac{δ_{i}}{{\hat{π}}_{i}} Ψ_{i} (Θ) + (1 - \frac{δ_{i}}{{\hat{π}}_{i}}) {\hat{Ψ}}_{i}^{*} (Θ)] = 0,

where ${\hat{Ψ}}_{i}^{*} (Θ) = \frac{\sum_{s = 1}^{n} δ_{s} Ψ_{s} (Θ) I (D_{s} = D_{i}, V_{s} = V_{i})}{\sum_{r = 1}^{n} δ_{r} I (D_{r} = D_{i}, V_{r} = V_{i})}$ . The asymptotic properties of ${\hat{Θ}}_{W}$ and the covariance matrix of ${\hat{Θ}}_{W}$ are stated in Supplementary Material (which can be found at http://smr.sagepub.com/supplemental/).
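The nonparametric estimator of the selection probabilities and the resulting weights can be sketched in Python as follows. The sketch assumes that D and V are categorical and hashable (for example, integers or tuples); the function names and the small data set are hypothetical.

```python
from collections import defaultdict


def estimate_selection_probs(D, V, delta):
    """Cell-wise proportion of complete cases, with cells defined by (D_i, V_i)."""
    total = defaultdict(int)
    observed = defaultdict(int)
    for d, v, obs in zip(D, V, delta):
        total[(d, v)] += 1
        observed[(d, v)] += obs
    return [observed[(d, v)] / total[(d, v)] for d, v in zip(D, V)]


def ipw_weights(delta, pi_hat):
    """Weight delta_i / pi_hat_i; incomplete cases receive weight 0."""
    return [obs / p if obs else 0.0 for obs, p in zip(delta, pi_hat)]


# Hypothetical data: reported difference D, always-observed V = (Z, S),
# and the indicator delta of whether X was observed.
D = [0, 0, 0, 1, 1, 2]
V = [(1, 0), (1, 0), (1, 0), (0, 1), (0, 1), (1, 1)]
delta = [1, 0, 1, 1, 1, 1]

pi_hat = estimate_selection_probs(D, V, delta)
weights = ipw_weights(delta, pi_hat)
```

In this toy example the first cell contains three respondents, two of whom are complete cases, so each of those two receives weight 1.5 and stands in for the respondent whose covariate is missing.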

MI Method

The MI method was proposed by Rubin (1987) and Rubin and Schenker (1986). This method involves repeatedly generating random values of the missing data X from the conditional distribution $F (x | D_{i}, V_{i})$ . We propose a nonparametric MI method (Wang and Chen 2009) and use the empirical conditional distribution.

\hat{F} (x | D_{i}, V_{i}) = \frac{\sum_{r = 1}^{n} δ_{r} I (D_{r} = D_{i}, V_{r} = V_{i}) I (X_{r} \leq x)}{\sum_{k = 1}^{n} δ_{k} I (D_{k} = D_{i}, V_{k} = V_{i})} .

When the covariate X_i is missing, we impute its value by generating random observations from the empirical conditional distribution $\hat{F} (x | D_{i}, V_{i})$ . This imputation procedure is then repeated several times. Let M be the number of replications. The MI approach is summarized as follows:

Step 1: To impute the missing value X_i (i.e., $δ_{i} = 0$ ), generate data, denoted by ${\tilde{X}}_{q i}$ , $q = 1, 2, . . ., M$ , from the empirical conditional distribution $\hat{F} (x | D_{i}, V_{i})$ constructed based on all observed data.

Step 2: Given each imputation, the estimating score ${\tilde{Ψ}}_{q i} (Θ)$ is obtained by using ${\tilde{X}}_{q i}$ to impute the missing value X_i in $Ψ_{i} (Θ)$ . Let $U_{q i} (Θ) = [δ_{i} Ψ_{i} (Θ) + (1 - δ_{i}) {\tilde{Ψ}}_{q i} (Θ)]$ . The estimating function ${\tilde{U}}_{q n} (Θ) = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} U_{q i} (Θ)$ is then used to estimate the regression parameters.

Step 3: Let ${\hat{Θ}}_{q}$ , which is an estimator of $Θ$ , be the solution of ${\tilde{U}}_{q n} (Θ) = 0$ . The covariance matrix of ${\hat{Θ}}_{q}$ is estimated by using the sandwich estimator ${\hat{V}}_{q} = G_{q}^{- 1} ({\hat{Θ}}_{q}) M_{q} ({\hat{Θ}}_{q}) G_{q}^{- T} ({\hat{Θ}}_{q})$ , where $G_{q} ({\hat{Θ}}_{q}) = \frac{- \partial {\tilde{U}}_{q n} (Θ)}{\partial Θ} |_{Θ = {\hat{Θ}}_{q}}$ and $M_{q} ({\hat{Θ}}_{q}) = \frac{1}{n} \sum_{i = 1}^{n} U_{q i} ({\hat{Θ}}_{q}) U_{q i}^{T} ({\hat{Θ}}_{q})$ .

Step 4: Repeat steps 2 and 3 M times.

After M imputations, we use the average of these estimators ${\hat{Θ}}_{1}, . . ., {\hat{Θ}}_{M}$ , ${\hat{Θ}}_{M_{1}} = \frac{1}{M} \sum_{q = 1}^{M} {\hat{Θ}}_{q}$ , as the first-type MI (MI₁) estimator of $Θ$ . Our numerical experience indicates that setting $M = 20$ worked quite well in our simulation experiments reported in the next section.
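The imputation step and the averaging that defines the MI₁ estimator can be sketched in Python as follows. This is a simplified illustration under the assumption that D and V are discrete; the function names are hypothetical, and the model-fitting step that produces each ${\hat{Θ}}_{q}$ is deliberately left out.

```python
import random
from collections import defaultdict


def impute_covariate(D, V, X, delta, M, seed=0):
    """Return M completed copies of X, filling missing entries by random draws.

    A missing X_i is replaced by a value drawn uniformly from the observed
    X_r with (D_r, V_r) == (D_i, V_i), that is, from the empirical
    conditional distribution. A cell without any observed donor raises
    a ValueError.
    """
    rng = random.Random(seed)
    donors = defaultdict(list)
    for d, v, x, obs in zip(D, V, X, delta):
        if obs:
            donors[(d, v)].append(x)
    completed = []
    for _ in range(M):
        copy = []
        for d, v, x, obs in zip(D, V, X, delta):
            if obs:
                copy.append(x)
            else:
                pool = donors.get((d, v))
                if not pool:
                    raise ValueError(f"no observed donor in cell {(d, v)}")
                copy.append(rng.choice(pool))
        completed.append(copy)
    return completed


def average_estimates(estimates):
    """MI1-style point estimate: component-wise mean of the M estimates."""
    M = len(estimates)
    return [sum(col) / M for col in zip(*estimates)]


# Hypothetical data; None marks a missing covariate value.
D = [1, 1, 1, 2, 2]
V = ["a", "a", "a", "b", "b"]
X = [0, 1, None, 1, None]
delta = [1, 1, 0, 1, 0]
completed = impute_covariate(D, V, X, delta, M=3)
```

Each completed data set would then be analyzed as in Steps 2 and 3, and the M resulting estimates combined with `average_estimates`.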

Next, we consider the second-type MI (MI₂) estimator of $Θ$ . We use the average of the M estimating scores ${\tilde{Ψ}}_{q i} (Θ)$ , ${\bar{Ψ}}_{i} (Θ) = \frac{\sum_{q = 1}^{M} {\tilde{Ψ}}_{q i} (Θ)}{M}$ , to construct the estimating function $U_{m n} (Θ) = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} \{δ_{i} Ψ_{i} (Θ) + (1 - δ_{i}) {\bar{Ψ}}_{i} (Θ)\}$ . The MI₂ estimator of $Θ$ , denoted by ${\hat{Θ}}_{M_{2}}$ , is the solution of $U_{m n} (Θ) = 0$ . Moreover, we show that the two MI estimators and the IPW estimator are all asymptotically equivalent. The proofs and the covariance matrices of ${\hat{Θ}}_{M_{1}}$ and ${\hat{Θ}}_{M_{2}}$ are stated in Supplementary Material (which can be found at http://smr.sagepub.com/supplemental/).

Overall, when V and X are discrete, the two nonparametric MI methods are similar to the multiple hot deck imputation method (Rubin and Schenker 1986). Multiple hot deck imputation is a method for handling missing data in which each missing value is replaced by an observed value from a similar unit (Little and Rubin 2002; Rubin 1987).³ In this study, although the main results are presented for the case where V and X are discrete, they can be extended to the continuous case by using the approach of Wang and Wang (1997). Nonparametric kernel techniques are required for extending our approach because the nuisance components involve the estimators of the selection probability, and the conditional distribution depends on $(D, V)$ .

Simulation Study

A simulation study was carried out to evaluate the finite-sample performance of the full data, CC, IPW, MI₁, and MI₂ methods. In the full-data analysis, $(D, X, V)$ was observed for all respondents, and its results were used as the benchmark. The number of replications is 1,000. The sample size is $n = 1,000$ or $1,500$ . For each estimation method in Table 1, we computed the bias, that is, the average of the estimates of a parameter minus its true value (Bias); the sample standard deviation of the estimates (SD); the average of the asymptotic standard errors (SE); and the sample coverage probability (CP), that is, the proportion of 95 percent Wald-type confidence intervals covering the true parameter value.⁴
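The four summary measures can be computed from the replicate results as in the following Python sketch. The function name is hypothetical, and the critical value 1.96 for the 95 percent Wald-type interval is an assumption of the sketch.

```python
import statistics


def summarize(estimates, std_errors, true_value, z=1.96):
    """Return Bias, SD, SE, and CP for one parameter across replications."""
    bias = statistics.fmean(estimates) - true_value
    sd = statistics.stdev(estimates)          # sample SD (divisor n - 1)
    se = statistics.fmean(std_errors)         # average asymptotic SE
    covered = [
        abs(est - true_value) <= z * s        # Wald interval covers the truth
        for est, s in zip(estimates, std_errors)
    ]
    cp = sum(covered) / len(covered)
    return {"Bias": bias, "SD": sd, "SE": se, "CP": cp}


# Hypothetical replicate output for a parameter whose true value is 1.0.
summary = summarize([0.9, 1.1, 1.0, 1.5], [0.1, 0.1, 0.1, 0.1], true_value=1.0)
```

With these four made-up replicates, three of the four intervals cover the true value, so the coverage probability is 0.75.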

Table 1.

Simulation Results for Various Values of Selection Probability.

			Scenario 1: $π (D, V)$ , $mr$ = 30 percent				Scenario 2: $π (V)$ , $mr$ = 36 percent
Parameter	Measure	Full Data	CC	IPW	MI₁	MI₂	CC	IPW	MI₁	MI₂
${\hat{α}}_{0}$	Bias	−.0034	.0680	−.0050	−.0055	−.0056	−.0062	−.0046	−.0048	−.0049
	SD	.1161	.1439	.1201	.1203	.1203	.1539	.1222	.1222	.1221
	SE	.1174	.1447	.1219	.1201	.1201	.1553	.1236	.1207	.1207
	CP	.9650	.9240	.9590	.9550	.9560	.9620	.9620	.9590	.9590
${\hat{α}}_{1}$	Bias	.0003	.0059	−.0014	−.0017	−.0021	−.0032	−.0005	−.0005	−.0010
	SD	.1621	.1916	.1658	.1660	.1660	.2074	.1694	.1695	.1694
	SE	.1575	.1903	.1626	.1606	.1606	.2035	.1648	.1615	.1614
	CP	.9440	.9490	.9440	.9380	.9380	.9500	.9380	.9350	.9350
${\hat{α}}_{2}$	Bias	.0085	.1378	.0071	.0069	.0064	.0156	.0084	.0086	.0079
	SD	.2135	.2665	.2179	.2183	.2182	.2759	.2209	.2210	.2209
	SE	.2112	.2629	.2162	.2143	.2143	.2701	.2187	.2153	.2152
	CP	.9500	.9310	.9480	.9440	.9440	.9440	.9510	.9440	.9460
${\hat{β}}_{1}$	Bias	−.0051	.0129	−.0036	−.0033	−.0027	−.0108	−.0049	−.0051	−.0043
	SD	.1362	.1666	.1600	.1614	.1613	.1824	.1733	.1744	.1742
	SE	.1404	.1710	.1598	.1530	.1529	.1827	.1686	.1559	.1558
	CP	.9540	.9490	.9470	.9320	.9320	.9460	.9420	.9180	.9180
${\hat{β}}_{2}$	Bias	−.0012	−.0355	−.0009	−.0007	−.0006	−.0003	−.0011	−.0012	−.0010
	SD	.1438	.1712	.1437	.1438	.1437	.1844	.1442	.1443	.1443
	SE	.1375	.1672	.1385	.1381	.1380	.1786	.1388	.1382	.1382
	CP	.9390	.9430	.9460	.9440	.9440	.9500	.9410	.9430	.9430

Note: $Θ = (α_{0}, α_{1}, α_{2}, β_{1}, β_{2})^{T} = (-0.2, 1, 2, -0.8, -\log 2)^{T}$ and sample size $n = 1{,}000$. In the full-data analysis, $(D, X, V)$ is observed for all respondents, so mr is 0. SE denotes the average of asymptotic standard errors. mr = missing rate; CC = complete-case; ML = maximum likelihood; IPW = inverse probability weighting; MI = multiple imputation.

The case of two binary covariates X and Z was considered. X and Z were generated independently with $P(X = 1) = 0.45$ and $P(Z = 1) = 0.6$. U was generated from the normal distribution $N(0, 0.25)$. Given U, the binary surrogate covariate of X was defined by $S = I[(X + U) \geq 0.5]$. Here, we present results for $L = 5$ and $m = 3$. To obtain the outcome variable, we first generated the value R of the randomization device, from 1 to $L = 5$, with probability distribution $(P_{1}, P_{2}, P_{3}, P_{4}, P_{5}) = (0.2, 0.1, 0.2, 0.4, 0.1)$. The ordinal response Y was generated with $m + 1 = 4$ categories (coded as 0, 1, 2, 3) from $P(Y \leq j | X, Z) = H(α_{j} + β_{1} X + β_{2} Z)$, where $j = 0, 1, 2$ and $Θ = (α_{0}, α_{1}, α_{2}, β_{1}, β_{2})^{T} = (-0.2, 1, 2, -0.8, -\log 2)^{T}$. Given Y, $\tilde{Y}$ was coded as follows:

\tilde{Y} = \begin{cases} 0, & \text{if } Y = 3, \\ 6, & \text{if } Y = 2, \\ 7, & \text{if } Y = 1. \end{cases}

The outcome variable D was then the absolute difference between $\tilde{Y}$ and R. We considered two types of selection probabilities: one that depends on $(D, V)$ and one that does not depend on D. To explore the effect of the selection probability, the binary indicator $δ$ was simulated from $P(δ = 1 | D, V) = H[γ_{0} + γ_{1} I(D \leq 3) + γ_{2} S + γ_{3} Z]$, where $(γ_{0}, γ_{1}, γ_{2}, γ_{3})$ was set to $(0.5, 0.5, -1, 1)$ and $(0.5, 0, -1, 1)$, labeled scenarios 1 and 2, respectively.
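The data-generating steps above can be sketched as follows. This is a minimal Python sketch under three stated assumptions that are not spelled out in this section: H is taken to be the logistic distribution function, $N(0, 0.25)$ is read as variance 0.25 (standard deviation 0.5), and $D = 0$ is assigned when $Y = 0$, as in the empirical analysis. The names `simulate_one` and `missing_rate` are hypothetical.

```python
import math
import random


def H(u):
    """Logistic distribution function (assumed form of H)."""
    return 1.0 / (1.0 + math.exp(-u))


ALPHA = (-0.2, 1.0, 2.0)
BETA1, BETA2 = -0.8, -math.log(2)
CARD_PROBS = (0.2, 0.1, 0.2, 0.4, 0.1)   # P(R = 1), ..., P(R = 5)
Y_TILDE = {3: 0, 2: 6, 1: 7}             # coding of Y-tilde given Y


def simulate_one(rng, gamma):
    """Generate one respondent's (D, X, Z, S, delta)."""
    x = int(rng.random() < 0.45)
    z = int(rng.random() < 0.60)
    u = rng.gauss(0.0, 0.5)              # assumption: 0.25 is the variance
    s = int(x + u >= 0.5)

    # Ordinal Y from the cumulative probabilities P(Y <= j | X, Z).
    lin = BETA1 * x + BETA2 * z
    v = rng.random()
    y = 3
    for j, a in enumerate(ALPHA):
        if v <= H(a + lin):
            y = j
            break

    r = rng.choices((1, 2, 3, 4, 5), weights=CARD_PROBS)[0]
    d = 0 if y == 0 else abs(Y_TILDE[y] - r)   # assumption: D = 0 when Y = 0

    g0, g1, g2, g3 = gamma
    p_observed = H(g0 + g1 * (d <= 3) + g2 * s + g3 * z)
    delta = int(rng.random() < p_observed)
    return {"D": d, "X": x if delta else None, "Z": z, "S": s, "delta": delta}


def missing_rate(gamma, n=20_000, seed=2024):
    """Monte Carlo estimate of the proportion of respondents with X missing."""
    rng = random.Random(seed)
    return sum(1 - simulate_one(rng, gamma)["delta"] for _ in range(n)) / n


rate_scenario_1 = missing_rate((0.5, 0.5, -1.0, 1.0))
rate_scenario_2 = missing_rate((0.5, 0.0, -1.0, 1.0))
```

Under these assumptions the simulated missing rates should fall near the 30 percent and 36 percent reported for scenarios 1 and 2, which gives a quick check on the reading of the design.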

The simulation results are given in Table 1. The efficiency of all estimators increased with the sample size. In scenario 1, the selection probability depended on $(D, V)$, and the missing rate (mr) was about 30 percent. The CC estimator was seriously biased, most visibly for $α_{0}$ and $α_{2}$ (biases of .0680 and .1378). The IPW, MI₁, and MI₂ estimators exhibited small bias and were very similar in terms of SD and SE. This is expected because these three estimators are asymptotically equivalent, as shown in the Supplementary Material (which can be found at http://smr.sagepub.com/supplemental/).

In scenario 2, the selection probability did not depend on D, and mr was 36 percent. Here the CC method yielded estimates of $Θ$ with small bias. However, when we used the selection probability and the conditional distribution that depend on $(D, V)$, the IPW, MI₁, and MI₂ methods still outperformed the CC approach in terms of SE. The performance of the IPW, MI₁, and MI₂ methods was very similar, except for the estimation of $β_{1}$: compared with the IPW method, the MI₁ and MI₂ methods had slightly smaller SEs and a CP of about 92 percent for $β_{1}$. This illustrates that a good understanding of the principle of the MI methods is crucial for developing variance estimators for the parameters of a POM with missing covariates.⁵ Although not reported here, the results for $n = 1{,}500$ were similar to those for $n = 1{,}000$.

Empirical Analysis

The proposed methods are applied to the regular monthly income data from the TSCS, which was administered in face-to-face interviews by the Center for Survey Research at Academia Sinica in 2012. The TSCS, which is the first nationally representative survey in Taiwan, was established in 1985. Since 1990, the annual TSCS, which consists of two independent survey modules, has been conducted continuously. This long-running series of surveys aims to track the long-term trends of social change and to provide nationally representative survey data that cover political, economic, social, and other aspects of Taiwan. To facilitate time series comparisons, the TSCS devotes one of the two annual survey modules to repeating major research topics every five years. For this example, we use the 2012 TSCS (round 6, year 3), which consists of two modules: Social Stratification and Gender. The survey had 4,206 respondents across both modules, but only 2,470 (58.7 percent) respondents were working for pay during the survey period: 1,256 respondents in the first module (Social Stratification) and 1,214 respondents in the second module (Gender). This data set consists of 1,417 males and 1,053 females.

Using the two-stage MRR technique, respondents were asked about their average regular monthly income from their job. The 2012 TSCS used the following two questions, Q₁ and Q₂, to obtain this information:

Q₁ : Is your regular monthly income (including your salary, compensation, and bonus) from your job NT$30,000 or more?

Q₂ : Which one of the following numbers best indicates your regular monthly income?

Number 7: income between NT$30,000 and NT$59,999,

Number 6: income between NT$60,000 and NT$79,999,

Number 0: income greater than or equal to NT$80,000.

All respondents were required to answer the first question, Q₁. When the $i$th subject’s answer to Q₁ was “Yes,” she or he was asked to report the absolute difference $D_{i}$ between the number ${\tilde{Y}}_{i}$ indicating her or his income level in Q₂ and the number $R_{i}$ (from 1 to 5) generated by the randomization device. The randomization device was a deck of forty cards, with eight 1’s, four 2’s, eight 3’s, sixteen 4’s, and four 5’s.⁶ Here, ${\tilde{Y}}_{i}$ is defined as follows:

{\tilde{Y}}_{i} = \begin{cases} 0, & \text{if } Y_{i} = 3, \text{ meaning the income is greater than or equal to NT\$80,000}, \\ 6, & \text{if } Y_{i} = 2, \text{ meaning the income is between NT\$60,000 and NT\$79,999}, \\ 7, & \text{if } Y_{i} = 1, \text{ meaning the income is between NT\$30,000 and NT\$59,999}. \end{cases}

From $({\tilde{Y}}_{i}, R_{i})$, $D_{i}$ can be 1, 2, 3, 4, 5, or 6. In this study, we define $D = 0$ when the answer to Q₁ was “No,” meaning $Y = 0$ because the income is less than NT$30,000, and hence Q₂ was not answered. D was used as the outcome variable, and the observed frequencies of $D = 0, 1, 2, 3, 4, 5$, and 6 were 1,057, 86, 265, 388, 301, 184, and 189, respectively.
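To see why a reported D does not reveal the income category, the conditional distribution of D given Y implied by the card deck can be tabulated. The following Python sketch uses only the card counts and the coding of $\tilde{Y}$ given above; the function name `d_given_y` is an assumption for this example.

```python
from fractions import Fraction

CARDS = {1: 8, 2: 4, 3: 8, 4: 16, 5: 4}   # card value -> number of cards (40 in total)
Y_TILDE = {1: 7, 2: 6, 3: 0}              # income category Y -> reported code


def d_given_y():
    """Return {y: {d: P(D = d | Y = y)}} implied by the deck."""
    total = sum(CARDS.values())
    table = {0: {0: Fraction(1)}}          # Q1 answered "No": D is set to 0
    for y, code in Y_TILDE.items():
        row = {}
        for r, count in CARDS.items():
            d = abs(code - r)
            row[d] = row.get(d, Fraction(0)) + Fraction(count, total)
        table[y] = row
    return table


D_TABLE = d_given_y()
for y in sorted(D_TABLE):
    cells = ", ".join(f"D={d}: {p}" for d, p in sorted(D_TABLE[y].items()))
    print(f"Y={y}: {cells}")
```

In this tabulation the values $D = 2, 3, 4, 5$ can arise from all three income categories above NT$30,000, whereas $D = 6$ arises only from $Y = 1$ and $D = 1$ only from $Y = 2$ or $Y = 3$.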

In addition, three explanatory variables, years of working, gender, and education, were included in the POM. Let X denote the answer to the question $Q_{X}$: “For how many years in total have you been working, starting from your first job to the current (last) job?” coded as 1 if working experience is greater than or equal to 14.5 years and 0 otherwise. However, the first module of the 2012 TSCS did not contain $Q_{X}$, so 1,256 participants did not answer it because of the questionnaire design. In the second module, 24 participants refused to answer $Q_{X}$. There were 1,190 subjects in the validation data set, which consists of $(D, X, V)$, where $V = (Z_{1}, Z_{2}, S)$, and the mr of X was 51.8 percent. The surrogate S for the missing X is defined as 1 if the potential working experience (EXP) is greater than or equal to 14.5 years and 0 otherwise. The EXP of female respondents was calculated as age minus 6 years minus years of education, and the EXP of male respondents was further reduced by two years because of mandatory military service. Let $Z_{1}$ denote gender, which is 1 if male and 0 otherwise. Let $Z_{2}$ denote education, which is 1 for a bachelor’s degree, 2 for a master’s or doctorate degree, and 0 otherwise. We then consider the following POM:

P(T_{ij} = 1 | X_{i}, Z_{1i}, Z_{2i}, S_{i}) = H(α_{j} + β_{1} X_{i} + β_{2} Z_{1i} + β_{3} DZ_{1i} + β_{4} DZ_{2i}),

$i = 1, 2, \ldots, n$, $j = 0, 1, 2$, where $(DZ_{1i}, DZ_{2i})$ are dummy variables for education $Z_{2i}$, equal to $(0, 0)$ if $Z_{2i} = 0$, $(1, 0)$ if $Z_{2i} = 1$, and $(0, 1)$ if $Z_{2i} = 2$. Note that the values of $X_{i}$, $Z_{1i}$, $DZ_{1i}$, and $DZ_{2i}$ are 0 or 1. Hence, for $k = 1, 2, 3$, or 4, a negative estimate of $β_{k}$ indicates that respondents with the value 1 (or status 1) were more likely to have a higher average regular monthly income compared with respondents with the value 0 (or status 0). In contrast, a positive estimate of $β_{k}$ indicates that respondents with the value 1 (or status 1) were less likely to have a higher average regular monthly income compared with respondents with the value 0 (or status 0).
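The construction of the surrogate S and of the education dummies described above can be written out as a short Python sketch. The function names and the argument layout are assumptions for illustration; the 14.5-year cutoff, the subtraction of 6 years and of years of education, and the two-year reduction for men come from the text.

```python
def potential_experience(age, years_of_education, male):
    """EXP: age minus 6 minus years of education, minus 2 more for men."""
    exp = age - 6 - years_of_education
    if male:
        exp -= 2          # mandatory military service
    return exp


def surrogate(age, years_of_education, male, cutoff=14.5):
    """S = 1 if potential working experience is at least the cutoff, else 0."""
    return int(potential_experience(age, years_of_education, male) >= cutoff)


def education_dummies(z2):
    """Map Z2 in {0, 1, 2} to the dummy pair (DZ1, DZ2)."""
    return {0: (0, 0), 1: (1, 0), 2: (0, 1)}[z2]


# Toy respondents (made-up values).
s_man_40 = surrogate(age=40, years_of_education=16, male=True)      # EXP = 16
s_woman_30 = surrogate(age=30, years_of_education=16, male=False)   # EXP = 8
```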

Estimation of Regression Parameters

The analysis results are given in Table 2. We used estimates of the selection probability that are a function of D and V. The results of all methods were similar, except that the CC method had the largest asymptotic SEs. The three variables, years of working, gender, and education, had highly significant negative effects. That is, the odds of falling in a lower income category were smaller for respondents with status 1 on each explanatory variable than for respondents with status 0. Hence, ${\hat{β}}_{1} < 0$ indicates that the respondents with working experience greater than or equal to 14.5 years were more likely to have a higher average regular monthly income compared with the respondents with working experience less than 14.5 years. Similarly, ${\hat{β}}_{2} < 0$ indicates that the male respondents were more likely to have a higher average regular monthly income compared with the female respondents. In addition, ${\hat{β}}_{3} < 0$ indicates that the respondents with a bachelor’s degree were more likely to have a higher average regular monthly income compared with the respondents with non-university education. Finally, ${\hat{β}}_{4} < 0$ indicates that the respondents with a master’s or doctorate degree were more likely to have a higher average regular monthly income compared with the respondents with non-university education.

Table 2.

Analysis Results of Average Regular Monthly Income Data by Using POM.

Variable	Parameter	CC	IPW	MI₁	MI₂
Intercept $_{0}$	$α_{0}$	1.6569 (.1664)	1.6620 (.1224)	1.5958 (.1164)	1.5954 (.1163)
Intercept $_{1}$	$α_{1}$	3.5892 (.2439)	3.5546 (.1724)	3.4265 (.1672)	3.4261 (.1672)
Intercept $_{2}$	$α_{2}$	4.8426 (.3786)	5.2863 (.3110)	5.0669 (.2957)	5.0660 (.2949)
X	$β_{1}$	−1.0271 (.1437)	−1.0091 (.1132)	−0.9728 (.1038)	−0.9724 (.1038)
Z ₁	$β_{2}$	−1.1599 (.1270)	−1.0772 (.0890)	−1.0360 (.0882)	−1.0360 (.0882)
$D Z_{1}$	$β_{3}$	−1.1234 (.1426)	−1.3111 (.1020)	−1.2848 (.1002)	−1.2845 (.1001)
$D Z_{2}$	$β_{4}$	−3.2825 (.3427)	−3.2091 (.2361)	−3.0959 (.2291)	−3.0955 (.2289)

Note: The value in parentheses is the asymptotic standard error (ASE) of the estimator. $DZ_{1}$ and $DZ_{2}$ denote the dummy variables for the education variable $Z_{2}$ (1 = bachelor’s degree; 2 = master’s or doctorate degree; 0 = otherwise). CC = complete-case; ML = maximum likelihood; IPW = inverse probability weighting; MI = multiple imputation.
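The sign interpretation of the coefficients in Table 2 can be made concrete by converting them to odds ratios. The following Python sketch assumes that H is the logistic distribution function, so that $\exp(β_{k})$ is the ratio of the odds of falling in a lower income category for status 1 versus status 0; the intervals are simple Wald-type intervals on the log-odds scale, and the helper name `odds_ratio` is hypothetical.

```python
import math

# MI2 estimates and ASEs copied from Table 2: name -> (estimate, ASE).
MI2 = {
    "X (experience >= 14.5 years)": (-0.9724, 0.1038),
    "Z1 (male)": (-1.0360, 0.0882),
    "DZ1 (bachelor's degree)": (-1.2845, 0.1001),
    "DZ2 (master's or doctorate degree)": (-3.0955, 0.2289),
}


def odds_ratio(beta, ase, z=1.96):
    """Odds ratio exp(beta) with a 95 percent Wald-type interval."""
    return math.exp(beta), math.exp(beta - z * ase), math.exp(beta + z * ase)


ODDS_RATIOS = {name: odds_ratio(b, s) for name, (b, s) in MI2.items()}
for name, (ratio, lower, upper) in ODDS_RATIOS.items():
    print(f"{name}: OR = {ratio:.3f}, 95% CI ({lower:.3f}, {upper:.3f})")
```

An odds ratio below 1 with an interval that excludes 1 corresponds to the negative, significant effects described in the text.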

Figure 1 plots the fitted POM, based on the MI₂ method, for the effects of years of working by gender and education. The predicted probability of the lowest income category is highest for a female respondent with non-university education and working experience less than 14.5 years. On the other hand, a male respondent who had a master’s or doctorate degree had a significantly different income than a male respondent with non-university education. The income gap between the male respondents and female respondents was also apparent. Overall, for each educational level, the female respondents had a lower average regular monthly income compared with the male respondents.

Figure 1.

Predicted probabilities for income. (A) Working experience less than 14.5 years. (B) Working experience greater than or equal to 14.5 years.

Estimation of Personal Mean Income

We conceptualize income as a measure of per capita income. To this end, we produced income data according to $m + 1$ categories, with an upper open-ended income category. The number of residents living on this income was also reported based on the two-stage MRR technique. Nonetheless, the categorical measurement does not allow for the direct calculation of per capita personal income. To address this issue, the midpoint of an individual’s income category interval was used as per capita personal income. Because the upper open-ended income category has no midpoint, three common options are (a) to adopt the lower bound of the upper income category as per capita personal income, (b) to use an arbitrary income value, and (c) to estimate a midpoint for the category based on the Pareto curve. The last option is believed to be better because it is based on data rather than on an arbitrary value chosen by the researcher. Income (as well as the reference points of median income and the poverty line) is made equivalent for economies of scale by dividing family income by the square root of family size.

For all analyses, we let the midpoints of the first m income category intervals (coded as $0, 1, \ldots, m - 1$) be $η_{0}, \ldots, η_{m-2}$, and $η_{m-1}$, respectively, and calculate the midpoint of the upper income category interval, denoted by $η_{m}$, by using a Pareto distribution of family income per the standard methodology of Parker and Fenwick (1983). Hence, $η = (η_{m}, η_{m-1}, \ldots, η_{0})^{T}$ denotes the midpoints of the income on the ordinal scale $(m, m-1, \ldots, 1, 0)$. We estimate the mean income by using an estimator $\hat{Θ} \in \{{\hat{Θ}}_{C}, {\hat{Θ}}_{W}, {\hat{Θ}}_{M_{1}}, {\hat{Θ}}_{M_{2}}\}$. Based on $η$ and $π_{i}^{Y}(\hat{Θ}) = e_{1} + C H_{i}(\hat{Θ})$, the estimator of the $i$th individual’s mean income is $μ_{i}(\hat{Θ}) = η^{T} π_{i}^{Y}(\hat{Θ})$. We apply a Taylor series expansion of $μ_{i}(\hat{Θ})$ at $Θ$ to have

\begin{aligned} μ_{i}(\hat{Θ}) &= μ_{i}(Θ) + \frac{\partial μ_{i}(Θ)}{\partial Θ} (\hat{Θ} - Θ) \\ &= μ_{i}(Θ) + η^{T} C \, \mathrm{diag}(H_{i}^{(1)}(Θ)) X_{i}^{T} (\hat{Θ} - Θ). \end{aligned}

The delta method is used to approximate the variance of $μ_{i} (\hat{Θ})$ as follows:

\mathrm{Var}(μ_{i}(\hat{Θ})) = η^{T} C \, \mathrm{diag}(H_{i}^{(1)}(Θ)) X_{i}^{T} Δ X_{i} \, \mathrm{diag}(H_{i}^{(1)}(Θ)) C^{T} η / n,

where $Δ \in \{Δ_{C}, Δ_{W}, Δ_{M_{1}}, Δ_{M_{2}}\}$ is the asymptotic variance of $\sqrt{n}(\hat{Θ} - Θ)$.
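The delta-method calculation can be illustrated with a small Python sketch. It assumes that H is the logistic distribution function and writes the mean income in scalar form, $μ = η_{m} + \sum_{j} (η_{j} - η_{j+1}) P(Y \leq j)$, with η listed in ascending category order (the reverse of the ordering used in the text). The covariance matrix used here is diagonal, built from the MI₂ ASEs in Table 2, because the full matrix $Δ / n$ is not reported; with this simplification the result is not expected to reproduce the published ASEs. All function names are hypothetical.

```python
import math


def H(u):
    """Logistic distribution function (assumed form of H)."""
    return 1.0 / (1.0 + math.exp(-u))


def mean_income(theta, x, eta):
    """Mean income; theta = (alpha_0..alpha_{m-1}, beta_1..beta_p)."""
    m = len(eta) - 1
    lin = sum(b * xi for b, xi in zip(theta[m:], x))
    cum = [H(theta[j] + lin) for j in range(m)]
    return eta[m] + sum((eta[j] - eta[j + 1]) * cum[j] for j in range(m))


def gradient(theta, x, eta):
    """Analytic gradient of mean_income with respect to theta."""
    m = len(eta) - 1
    lin = sum(b * xi for b, xi in zip(theta[m:], x))
    w = []
    for j in range(m):
        c = H(theta[j] + lin)
        w.append((eta[j] - eta[j + 1]) * c * (1.0 - c))   # d mu / d alpha_j
    return w + [sum(w) * xi for xi in x]                  # d mu / d beta_k


def delta_method_se(theta, x, eta, cov_theta):
    """Square root of g' Cov(theta_hat) g, where Cov(theta_hat) = Delta / n."""
    g = gradient(theta, x, eta)
    k = len(g)
    var = sum(g[a] * cov_theta[a][b] * g[b] for a in range(k) for b in range(k))
    return math.sqrt(var)


THETA = [1.5954, 3.4261, 5.0660, -0.9724, -1.0360, -1.2845, -3.0955]
ASE = [0.1163, 0.1672, 0.2949, 0.1038, 0.0882, 0.1001, 0.2289]
ETA = (24_000, 45_000, 70_000, 100_000)
# Illustrative assumption: covariances between the estimators are ignored.
COV = [[ASE[a] ** 2 if a == b else 0.0 for b in range(7)] for a in range(7)]

se_example = delta_method_se(THETA, (1, 1, 0, 1), ETA, COV)
```

Supplying the full estimated covariance matrix in place of `COV` would give the delta-method standard error described by the formula above.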

We use $η = (100{,}000, 70{,}000, 45{,}000, 24{,}000)^{T}$ (in NT$) to analyze the regular monthly income data in the 2012 TSCS by using the POM under the two-stage MRR technique, where NT$100,000 was obtained by adding NT$20,000 to the lower bound of the highest regular monthly income interval, NT$80,000, and NT$24,000 was the average of the basic wage of about NT$18,000 and the upper bound of the lowest regular monthly income interval, NT$30,000. Table 3 shows the estimates of personal mean regular monthly income for each subpopulation. The results of all the estimation methods were similar, except that the CC estimation method had a larger asymptotic SE. The asymptotic SEs of all the estimators increased with the education level. The results show that, for individuals with working experience less than 14.5 years, the gap in earnings between the male respondents and female respondents was from NT$9,000 to NT$15,000 for university or higher education. The earnings of both the female respondents and the male respondents increased with the education level and years of working. However, the earnings of the female respondents with working experience greater than or equal to 14.5 years were slightly lower than those of the male respondents with working experience less than 14.5 years. The regular monthly income gaps between the high-experience female respondents and low-experience male respondents were around NT$400 to NT$800.

Table 3.

Results of Personal Mean Regular Monthly Income.

$(D Z_{1}, D Z_{2})$	Method	$Z_{1} = 0$				$Z_{1} = 1$
		$X = 0$		$X = 1$		$X = 0$		$X = 1$
		Mean	ASE	Mean	ASE	Mean	ASE	Mean	ASE
$(0, 0)$	CC	28,271	659	33,736	924	34,704	1,252	44,126	1,395
	IPW	28,195	469	33,418	618	33,890	862	42,550	877
	MI₁	28,515	482	33,802	641	34,256	858	42,860	891
	MI₂	28,516	482	33,802	641	34,260	858	42,860	892
$(1, 0)$	CC	34,432	1,041	43,737	1,662	45,169	1,648	57,726	2,374
	IPW	35,627	785	44,941	1,205	45,660	1,155	57,314	1,628
	MI₁	36,176	807	45,484	1,227	46,174	1,176	57,775	1,619
	MI₂	36,177	806	45,482	1,228	46,176	1,175	57,774	1,620
$(0, 1)$	CC	57,354	4,381	71,529	4,796	73,349	4,690	85,791	3,667
	IPW	55,016	2,706	67,662	3,098	68,528	3,037	80,798	2,961
	MI₁	55,290	2,726	67,827	3,069	68,651	3,027	80,740	2,865
	MI₂	55,292	2,726	67,825	3,069	68,655	3,027	80,740	2,865

Note: ASE denotes the asymptotic standard error of an estimator. CC = complete-case; ML = maximum likelihood; IPW = inverse probability weighting; MI = multiple imputation.
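As a consistency check on how the entries of Table 3 arise, the mean income of a subpopulation can be recomputed from the MI₂ estimates in Table 2 and the midpoints η. The following Python sketch assumes that H is the logistic distribution function; the function names are hypothetical, and small differences from the tabled means are to be expected because the coefficients are rounded to four decimals.

```python
import math

ETA = (24_000, 45_000, 70_000, 100_000)          # NT$ midpoints for Y = 0, 1, 2, 3
ALPHA = (1.5954, 3.4261, 5.0660)                 # MI2 intercepts from Table 2
BETA = (-0.9724, -1.0360, -1.2845, -3.0955)      # MI2 slopes for X, Z1, DZ1, DZ2


def category_probs(x, z1, dz1, dz2):
    """P(Y = 0), ..., P(Y = 3) from the fitted cumulative probabilities."""
    lin = BETA[0] * x + BETA[1] * z1 + BETA[2] * dz1 + BETA[3] * dz2
    cum = [1.0 / (1.0 + math.exp(-(a + lin))) for a in ALPHA] + [1.0]
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]


def mean_income(x, z1, dz1, dz2):
    """Expected regular monthly income: midpoints weighted by category probabilities."""
    return sum(e * p for e, p in zip(ETA, category_probs(x, z1, dz1, dz2)))


# Women, non-university education, experience below 14.5 years.
mean_low = mean_income(x=0, z1=0, dz1=0, dz2=0)
# Men, master's or doctorate degree, experience of at least 14.5 years.
mean_high = mean_income(x=1, z1=1, dz1=0, dz2=1)
print(round(mean_low), round(mean_high))
```

The two values should come out within a few NT$ of the corresponding MI₂ entries in Table 3 (28,516 and 80,740).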

Conclusion

Income is a sensitive topic for most people, and some researchers have even ranked income as one of the hardest-to-ask questions. The literature documents a large variety of imputation methods for missing values in income (e.g., Aßmann et al. 2017; Frick and Grabka 2005). Nonetheless, it is possible to develop and apply statistical modeling approaches that provide more accurate measures of income. Motivated by this awareness, we provide a two-stage MRR technique to investigate the true level of income as well as to protect the privacy of income information. To the best of our knowledge, it is the first time that the two-stage MRR design is applied in a large-scale survey. The goal is to contribute our work to the applied statistics literature and to propose a regression model that can be used for a wide range of applications. We do this through a POM for a two-stage MRR variable and fit the model to the income data to study its relationship with a set of covariates. Regression models with missing data have been actively studied in practical applications for decades. It has been shown that a naive CC analysis method yields inaccurate results for estimating regression coefficients in general. As shown in our simulation study, the large bias due to missing data cannot be ignored.

We have presented three estimation approaches, IPW and two types of nonparametric MI, to account for two-stage MRR data with covariates that are missing at random (MAR). The two MI estimation approaches are easy to implement and give consistent results, but they are more computationally intensive than the IPW method because they involve repeatedly imputing the missing data. The two MI estimation methods have been shown to be asymptotically equivalent, but we recommend using the second MI estimation method to lessen the computational burden because it is an easy-to-use procedure. The IPW approach is asymptotically equivalent to the two MI estimation approaches and can be calculated very quickly. An important feature of the IPW approach is that it is unnecessary to make an additional assumption for the nuisance components such as the selection probability. However, if the missing rate is high, we cannot obtain enough information about questions of interest, which can result in divergence when estimating the model parameters. The other problem of the IPW method is that very small estimated selection probabilities give very large weights. To deal with this issue, it is customary to collapse classes by certain criteria (e.g., Eltinge and Yansaneh 1997; Haziza and Beaumont 2007; Little 1986; Thomsen 1973).

Although the information of $m + 1$ income categories can be obtained via the two-stage MRR technique, in practice, we suggest $m = 3$ or $m = 4$ to reduce the response errors in operation. However, the number of income categories affects the accuracy of the estimates of income distribution because the income is categorized. To address this issue, we suggest extending the two-stage MRR technique to a three-stage MRR technique to obtain the information of more income categories. Moreover, because the proportion of a higher income category is small, one possible extension is to replace the POM with a multinomial probit model. We can compare the difference between the POM and the multinomial probit model and address some fitting issues in future research. Finally, we believe that our study may help to open up viable alternatives to complex survey sampling designs.⁷

Supplemental Material

SupplementaryMaterial_SMR_041719 - A Two-stage Multilevel Randomized Response Technique With Proportional Odds Models and Missing Covariates

SupplementaryMaterial_SMR_041719 for A Two-stage Multilevel Randomized Response Technique With Proportional Odds Models and Missing Covariates by Shu-Hui Hsieh, Shen-Ming Lee and Chin-Shang Li in Sociological Methods & Research

Footnotes

Acknowledgments

The authors are grateful to an associate editor and two referees for their helpful comments that improved the presentation.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research of S.H. Hsieh and S.M. Lee was supported by the Ministry of Science and Technology (MOST) of Taiwan (106-2118-M-001-005 and 107-2118-M-035-004-MY2).

ORCID iDs

Shu-Hui Hsieh

Shen-Ming Lee

Supplemental Material

Supplemental material for this article is available online.

References

Aßmann, Würbach, Goßmann, Geissler, and Bela. 2017. “Nonparametric Multiple Imputation for Questionnaires with Individual Skip Patterns and Constraints: The Case of Income Imputation in the National Educational Panel Study.” Sociological Methods & Research 46:864–97.

Anthony, B. A. 2002. “Performing Logistic Regression on Survey Data with the New SURVEYLOGISTIC Procedure.” Paper presented at the Twenty-seventh Annual SAS User Group International Conference, Cary, NC.

Böckenholt and Van der Heijden, P. G. M. 2007. “Item Randomized-response Models for Measuring Noncompliance: Risk-return Perceptions, Social Influences, and Self-protective Responses.” Psychometrika 72:245–62.

Boruch, R. F. 1971. “Assuring Confidentiality of Responses in Social Research: A Note on Strategies.” The American Sociologist 6:308–11.

Breslow, N. E., and Cain, K. C. 1988. “Logistic Regression for Two-stage Case-control Data.” Biometrika 75:11–20.

Christofides, T. C. 2003. “A Generalized Randomized Response Technique.” Metrika 57:195–200.

Corstange. 2009. “Sensitive Questions, Truthful Answers? Modeling the List Experiment with LISTIT.” Political Analysis 17:45–63.

Cruyff, M. J. L. F., Böckenholt, Van den Hout, and Van der Heijden, P. G. M. 2008. “Accounting for Self-protective Responses in Randomized Response Data from a Social Security Survey Using the Zero-inflated Poisson Model.” The Annals of Applied Statistics 2:316–31.

Cruyff, M. J. L. F., Van den Hout, and Van der Heijden, P. G. M. 2008. “The Analysis of Randomized-response Sum Score Variables.” Journal of the Royal Statistical Society: Series B 70:21–30.

Eltinge, J. L., and Yansaneh, I. S. 1997. “Diagnostics for Formation of Nonresponse Adjustment Cells, with an Application to Income Nonresponse in the U.S. Consumer Expenditure Survey.” Survey Methodology 23:33–40.

Esponda, Huerta, and Guerrero, V. M. 2016. “A Statistical Approach to Provide Individualized Privacy for Surveys.” PLoS One 1–14. doi: 10.1371/journal.pone.0147314.

Flanders, W. D., and Greenland. 1991. “Analytic Methods for Two-stage Case-control Studies and Other Stratified Designs.” Statistics in Medicine 10:739–47.

Franklin, L. A. 1989. “A Comparison of Estimators for Randomized Response Sampling with Continuous Distributions from a Dichotomous Population.” Communications in Statistics - Theory and Methods 18:489–505.

Frick, J. R., and Grabka, M. M. 2005. “Item Non-response on Income Questions in Panel Surveys: Incidence, Imputation and the Impact on Inequality and Mobility.” Allgemeines Statistisches Archiv 89:49–61.

Glonek, G. F. V., and McCullagh. 1995. “Multivariate Logistic Models.” Journal of the Royal Statistical Society: Series B 57:533–46.

Godambe, V. P. 1960. “An Optimum Property of Regular Maximum Likelihood Estimation.” Annals of Mathematical Statistics 31:1208–12.

Greenberg, B. G., Abul-Ela, Simmons, W. R., and Horvitz, D. G. 1969. “The Unrelated Question Randomized Response Model: Theoretical Framework.” Journal of the American Statistical Association 64:520–39.

Haziza and Beaumont, J. F. 2007. “On the Construction of Imputation Classes in Surveys.” International Statistical Review 72:147–48.

Horton, N. J., and Laird, N. M. 1999. “Maximum Likelihood Analysis of Generalized Linear Models with Missing Covariates.” Statistical Methods in Medical Research 8:37–50.

Houston and Tran. 2001. “A Survey of Tax Evasion using the Randomized Response Technique.” Advances in Taxation 13:69–94.

Hsieh, S. H., Lee, S. M., and Shen, P. S. 2009. “Semiparametric Analysis of Randomized Response Data with Missing Covariates in Logistic Regression.” Computational Statistics and Data Analysis 53:2673–92.

Hsieh, S. H., Lee, S. M., and Shen, P. S. 2010. “Logistic Regression Analysis of Randomized Response Data with Missing Covariates.” Journal of Statistical Planning and Inference 140:927–40.

Hsieh, S. H., Lee, S. M., and S. H. 2018. “Randomized Response Techniques for a Multi-level Attribute using a Single Sensitive Question.” Statistical Papers 59:291–306.

Ibrahim, J. G., Chen, M. H., Lipsitz, S. R., and Herring, A. H. 2005. “Missing Data Methods for Generalized Linear Models: A Comparative Review.” Journal of the American Statistical Association 100:332–46.

Lee, S. M., Gee, M. J., and Hsieh, S. H. 2011. “Semiparametric Methods in the Proportional Odds Model for Ordinal Response Data with Missing Covariates.” Biometrics 67:788–98.

Lee, S. M., Hwang, W. H., and de Dieu Tapsoba. 2016. “Estimation in Closed Capture-recapture Models When Covariates Are Missing at Random.” Biometrics 72:1294–1304.

Lensvelt-Mulders, G. J. L. M., Hox, J. J., Van der Heijden, P. G. M., and Maas, C. J. M. 2005. “Meta-analysis of Randomized Response Research: Thirty-five Years of Validation.” Sociological Methods & Research 33:319–48.

Lensvelt-Mulders, G. J. L. M., Van der Heijden, P. G. M., Laudy, and Van Gils. 2006. “A Validation of a Computer-assisted Randomized Response Survey to Estimate the Prevalence of Fraud in Social Security.” Journal of the Royal Statistical Society: Series A 169:305–18.

Lillard, Smith, J. P., and Welch. 1986. “What Do We Really Know about Wages? The Importance of Nonreporting and Census Imputation.” Journal of Political Economy 94:489–506.

Little, R. J. A. 1986. “Survey Nonresponse Adjustments for Estimates of Means.” International Statistical Review 54:139–57.

Little, R. J. A. 1992. “Regression with Missing X’s: A Review.” Journal of the American Statistical Association 87:1227–37.

Little, R. J. A., and Rubin, D. B. 2002. Statistical Analysis with Missing Data. 2nd ed. Hoboken, NJ: Wiley.

Lui and Koirala. 2013. “Fitting Proportional Odds Models to Educational Data with Complex Sampling Design in Ordinal Logistic Regression.” Journal of Modern Applied Statistical Methods 12:26.

Mangat, N. S. 1994. “An Improved Randomized Response Strategy.” Journal of the Royal Statistical Society: Series B 56:93–95.

Moors, J. J. 1971. “Optimization of the Unrelated Question Randomized Response Model.” Journal of the American Statistical Association 66:627–29.

Parker, R. N., and Fenwick. 1983. “The Pareto Curve and Its Utility for Open-ended Income Distributions in Survey Research.” Social Forces 61:872–85.

Perri, P. F., Cobo, and Rueda. 2017. “A Mixed-mode Sensitive Research on Cannabis Use and Sexual Addiction: Improving Self-reporting by Means of Indirect Questioning Techniques.” Quality & Quantity 52:1593–1611.

Righi, Falorsi, and Fasulo. 2014. “Methods for Variance Estimation under Random Hot Deck Imputation in Business Surveys.” Rivista di statistica ufficiale (Italian National Institute of Statistics, Rome, Italy) 16:45–64.

Riphahn, R. T., and Serfling. 2002. “Item Non-response on Income and Wealth Questions.” IZA Discussion Paper No. 573, Bonn, Germany.

Ronning. 2005. “Randomized Response and the Binary Probit Model.” Economics Letters 86:221–28.

Rubin, D. B. 1976. “Inference and Missing Data.” Biometrika 63:581–92.

Rubin, D. B. 1978. “Multiple Imputation in Sample Surveys: A Phenomenological Bayesian Approach to Nonresponse.” Proceedings Survey Research Method Section of the American Statistical Association, 20–34.

Rubin, D. B. 1987. Multiple Imputation for Nonresponse in Surveys. New York: Wiley.

Rubin, D. B., and Schenker. 1986. “Multiple Imputation for Interval Estimation from Simple Random Samples with Ignorable Nonresponse.” Journal of the American Statistical Association 81:366–74.

Scheers, N. J., and Dayton, C. M. 1988. “Covariate Randomized Response Models.” Journal of the American Statistical Association 83:969–74.

Thomsen. 1973. “A Note on the Efficiency of Weighting Subclass Means to Reduce the Effects of Nonresponse when Analyzing Survey Data.” Statistisk Tidskrift 11:278–83.

Tian, G. L., Yuen, K. C., Tang, M. L., and Tan, M. T. 2009. “Bayesian Non-randomized Response Models for Surveys with Sensitive Questions.” Statistics and Its Interface 2:13–25.

Van den Hout, Van der Heijden, P. G. M., and Gilchrist. 2007. “The Logistic Regression Model with Response Variables Subject to Randomized Response.” Computational Statistics and Data Analysis 51:6060–69.

Wang, C. Y., Chen, J. C., Lee, S. M., and S. T. 2002. “Joint Conditional Likelihood Estimator in Logistic Regression with Missing Covariate Data.” Statistica Sinica 12:555–74.

Wang and Chen, S. X. 2009. “Empirical Likelihood for Estimating Equations with Missing Values.” The Annals of Statistics 37:490–517.

Wang, C. Y., and Wang. 1997. “Semiparametric Methods in Logistic Regression with Measurement Error.” Statistica Sinica 7:1103–20.

Wang, C. Y., Wang, Zhao, and S. T. 1997. “Weighted Semiparametric Estimation in Regression Analysis with Missing Covariate Data.” Journal of the American Statistical Association 92:512–25.

Warner, S. L. 1965. “Randomized Response: A Survey Technique for Eliminating Evasive Answer Bias.” Journal of the American Statistical Association 60:63–69.

Yu, R. R., and Li, L. A. 2011. “Imputation of Non-ignorable Nonresponse for Income Analysis of a Panel Study on Taiwan.” Quality and Quantity 45:875–84.

Zhao and Lipsitz. 1992. “Designs and Analysis of Two-stage Studies.” Statistics in Medicine 11:769–82.

Zweimüller. 1992. “Survey Non-response and Biases in Wage Regressions.” Economics Letters 39:105–9.
