Abstract
This article is an empirical contribution to the evaluation of the randomized response technique (RRT), a prominent procedure to elicit more valid responses to sensitive questions in surveys. Based on individual validation data, we focus on two questions: First, does the RRT lead to higher prevalence estimates of sensitive behavior than direct questioning (DQ)? Second, are there differences in the effects of determinants of misreporting according to question mode? The data come from 552 face-to-face interviews with subjects who had been convicted by a court for minor criminal offences in a metropolitan area in Germany. For the first question, the answer is negative. For the second, it is positive, that is, effects of individual and situational determinants of misreporting differ between the two question modes. The effect of need for social approval, for example, tends to be stronger in RRT than in DQ mode. Interviewer experience turns out to be positively related to answer validity in DQ and negatively in RRT mode. Our findings support a skeptical position toward RRT, shed new light on long-standing debates within survey methodology, and stimulate theoretical reasoning about response behavior in surveys.
Introduction
Since its beginnings in the 1930s, modern survey methodology has continuously been preoccupied with the problem of “sensitive questions.” How can researchers prevent respondents from answering such questions evasively or untruthfully? Numerous studies (e.g., Hyman 1944; Jones and Forrest 1992; Kreuter, Presser, and Tourangeau 2008; Tourangeau and Yan 2007; van der Heijden et al. 2000) have shown that respondents tend to underreport undesirable or negatively connoted behaviors (such as criminal offences, drug abuse, or unconventional sexual activities) and overreport desirable ones (such as voting, church attendance, or altruistic behavior). Misreporting leads to two problems: First, prevalence estimates of the sensitive behavior are systematically biased. Second, analyses of associations between independent variables and the sensitive behavior can be distorted if the extent of misreporting varies systematically with the predictors under investigation (Bernstein, Chadha, and Montjoy 2001; Ganster, Hennessey, and Luthans 1983).
A well-known method promising more valid estimates of sensitive topics is the randomized response technique (RRT; see Fox and Tracy 1986 for an overview). With this technique, first introduced by Warner (1965), a randomization device decides how the interviewee answers the sensitive question. The randomization device is managed exclusively by the respondents, and neither the interviewers nor the researchers know whether a “yes” to a sensitive, undesirable characteristic originates from the randomization process or is a real “true” answer to the question. Many different versions of RRTs have been developed. The idea of all RRTs is that—by guaranteeing anonymity—influences inducing respondents not to admit to sensitive behaviors in direct questioning (DQ) are eliminated. However, empirical evidence on the performance of RRT is mixed. Although two meta-analyses by Lensvelt-Mulders et al. (2005) find a mean positive effect of RRT on response validity compared to conventional questioning techniques, some central issues remain unresolved.
First, despite the aforementioned meta-analyses and a huge amount of literature on the subject, it is still controversial whether RRT provides any benefit to response validity at all. Recently, Holbrook and Krosnick (2010:328) even questioned “whether this technique has ever worked properly to achieve its goals.” Most studies on the performance of RRT are based on the “more is better assumption,” presuming that higher prevalence estimates of undesirable behaviors are more valid. However, without information about the true prevalence of a sensitive characteristic in a sample, it remains unclear how significant the response bias really is. There are only very few validation studies on RRTs, which check respondents’ survey answers against confirmed outside records. All researchers agree that such individual validation studies are the best way to shed light on the matter of the validity and usefulness of RRT (Lensvelt-Mulders et al. 2005; Umesh and Peterson 1991).
Second, the mean positive effect from the meta-analyses is associated with a pronounced variability of results on RRT performance. Several studies find positive effects of RRT, compared to DQ, on answer validity (e.g., de Jong, Pieters, and Fox 2010; Reckers, Wheeler, and Wong-On-Wing 1997; Tezcan and Omran 1981; van der Heijden et al. 2000; Wimbush and Dalton 1997), while others find no effect at all (e.g., Lamb and Stem 1978; Ostapczuk, Musch, and Moshagen 2009; Tracy and Fox 1981) and still others even find negative effects (Beldt, Daniel, and Garcha 1982; Coutts and Jann 2011; Holbrook and Krosnick 2010). Furthermore, the results seem to vary unsystematically rather than being influenced in consistent ways by study characteristics such as the RRT variant or the substantive topic under investigation. Lensvelt-Mulders et al. (2005:323) express this point clearly when they summarize that “a thorough look at the literature on RRTs reveals that 35 years of research have not led to a consensus or a description of best practices.”
Third, to our knowledge, no study has ever published multivariate models of determinants of response behavior in RRT mode using individual validation data. The most recent validation study by van der Heijden et al. (2000) also performed only bivariate analyses. The motivation to analyze determinants of response behavior in multivariate models is twofold: On one hand, theoretical knowledge of the exact reasons why respondents answer evasively is not yet fully developed (Umesh and Peterson 1991:131-32). The standard argument is that respondents mainly distort their answers for reasons of social desirability (SD). Since RRT assures anonymity, SD effects should be lower in RRT than in DQ mode. This hypothesis, however, has not yet been tested using individual validation data. The second motivation is the aforementioned insight that—besides biased prevalence estimates—the results of empirical analyses on relationships between determining factors and sensitive items can be distorted. Even if RRT is not able to improve prevalence estimates of sensitive behaviors, it would be a success if it could at least reduce the impact of other factors causing response editing in DQ mode.
This article will present findings from an individual validation study comparing RRT with DQ. It is based on 552 face-to-face interviews with subjects who had been convicted by a court for minor criminal offences in Germany. The sensitive validation question was whether the respondents had ever been convicted under criminal law.
The analyses will focus on two issues. The first concerns the size of the response bias in each question mode: Does RRT lead to higher prevalence estimates than DQ and thus improve data validity? The second pertains to determinants of response behavior. What attributes of the respondent and situational characteristics of the interview affect the tendency to misreport in DQ and RRT? Are there differences between the two question modes? Does misreporting depend less strongly on other predictors of response bias in RRT than in DQ mode?
By answering these questions, the article also aims to shed some light on topics that have been fundamental to research on RRT over the past few decades: Why is the evidence on the performance of RRT so inconclusive? And should RRT be maintained as a tool for posing sensitive questions in surveys at all?
The next section reviews developments and results of the literature on RRT, including some theoretical aspects of explaining respondents’ misreporting in surveys. This is followed by a description of the design of our validation study. Subsequently, the empirical findings are presented. The article closes with a discussion of our results and conclusions.
Conceptual, Empirical, and Theoretical Aspects of RRT
Initiated by the invention of RRT by Warner (1965), a large body of literature has focused on the development of more refined RRT designs and statistical or methodological aspects of the technique. Furthermore, a considerable number of empirical studies have been conducted in order to assess the performance of the technique. Less attention, however, has been devoted to theoretical aspects of RRT and potential response biases.
Methodology of RRT
Particularly in the first two decades of research on RRT, many studies proposed alternatives to Warner’s original method (e.g., Folsom et al. 1973; Greenberg et al. 1971; Kuk 1990; Moors 1971; Reinmuth and Geurts 1975). These new variants aimed to improve the practicability of RRT procedures, investigated statistical properties of RRT estimators, and developed RRT designs for quantitative dependent variables. The currently most prominent variant, which we also used in our validation study, is the “forced response RRT,” introduced by Boruch (1971). As in all RRTs, the forced response method first asks the respondent to use a randomization device (e.g., dice, coin flip, or playing cards). The result of the randomization device is only known to the respondent; the interviewer does not know the outcome. Subsequently, the respondent is asked either to give a predetermined answer—“yes” or “no”—regardless of the true value, or to answer the sensitive question truthfully (e.g., consumption of illegal drugs). When dice are used, for instance, the design might tell the respondent: “If you have thrown a 1, please answer ‘no’; if you have thrown a 6, please answer ‘yes’; if you have thrown a 2, 3, 4 or 5, please answer the question truthfully: Have you ever taken illegal drugs, ‘yes’ or ‘no’?” With this procedure, it cannot be established for certain whether a “yes” indicates a confession of drug consumption or is a predetermined answer. However, as the probabilities of the outcome of the randomization device are known (in the above example, 4/6 for the “real” question, 1/6 for a “forced yes,” and 1/6 for a “forced no”), an estimation of the prevalence of the sensitive behavior is possible. An unbiased estimate of the sensitive item π and its sampling variance are calculated as follows (Tourangeau and Yan 2007:872):
$$\hat{\pi} = \frac{\lambda - p_{\text{yes}}}{p_{\text{question}}} \qquad (1)$$

$$\widehat{\operatorname{Var}}(\hat{\pi}) = \frac{\lambda (1 - \lambda)}{n \, p_{\text{question}}^{2}} \qquad (2)$$

where λ is the observed proportion of “yes” answers, n the number of respondents, p_yes the probability given by the randomization device to respond with a “forced yes,” and p_question = 1 − p_yes − p_no the probability of being instructed to answer the sensitive question truthfully. The forced response RRT has several positive features: It is relatively easy to administer for both respondents and interviewers, the design parameters (p values) can be tailored to the specific demands in a field application, and its statistical properties (efficiency) are good (Fox and Tracy 1986; Lensvelt-Mulders, Hox, and van der Heijden 2005).
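The forced response estimator can be illustrated with a minimal Python sketch. It assumes the standard moment estimator, in which the forced “yes” probability is subtracted from the observed share of “yes” answers and the result is divided by the probability of receiving the real question. The function name `forced_response_estimate` and the counts are hypothetical and serve only as an illustration of the dice design described above.

```python
from math import sqrt

def forced_response_estimate(n_yes, n, p_yes, p_no):
    """Return (prevalence estimate, standard error) for a forced response design."""
    p_question = 1.0 - p_yes - p_no          # probability of answering truthfully
    lam = n_yes / n                          # observed share of "yes" answers
    pi_hat = (lam - p_yes) / p_question      # moment estimator of the prevalence
    var_hat = lam * (1.0 - lam) / (n * p_question ** 2)
    return pi_hat, sqrt(var_hat)

# Dice design from the text: 6 -> forced "yes", 1 -> forced "no", 2-5 -> truthful answer.
# The counts (300 "yes" answers among 600 respondents) are invented for illustration.
pi_hat, se = forced_response_estimate(n_yes=300, n=600, p_yes=1 / 6, p_no=1 / 6)
print(f"estimated prevalence: {pi_hat:.3f} (SE {se:.4f})")
```

Because the estimate is a linear transformation of the observed share, it can fall below 0 or above 1 in small samples, and its variance is inflated by the factor 1/p_question² relative to a direct question with the same share of “yes” answers.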
A more recent result of methodological RRT research was the development of regression models for RRT data in order to conduct analyses of determinants of sensitive behaviors surveyed by RRT procedures (Maddala 1983; van der Heijden and van Gils 1996). Due to the artificial error variance in the dependent variable, traditional regression techniques (e.g., logistic regression for binary responses) are not suitable for RRT data. This was considered to be a major weakness of RRTs because elaborate analyses of determinants of sensitive topics would not be possible (even de Jong et al. 2010 and Lara et al. 2004 recently expressed this objection). Meanwhile, convenient regression modules for binary RRT data are available in standard statistical software like Stata (Jann 2005, 2008).
Empirical RRT Research
There is a discrepancy between the ample methodological and statistical knowledge about RRTs and the empirical knowledge about the simple matter of whether the technique really pays off, that is, is actually successful in reducing response bias. Umesh and Peterson (1991:106 and 121) summarize the results of research as follows—an assessment that is still up to date 20 years later: “Although the large number of published statistical extensions are intellectually interesting, expectations about the validity of RRM, and hence its practical utility, may have been overblown. [ … ] An underresearched topic is the validity of RRMs. Regardless of their statistical elegance, if the methods do not provide valid estimates, they are of no practical value.” Examining the literature, we can identify different types of empirical RRT studies. Table 1 is an attempt to classify them.
Table 1. Classification of Empirical RRT Studies and Examples.
Note: RRT = randomized response technique.
Substantive applications of RRT refer to what the technique was originally intended for: to gather valid data about sensitive topics in surveys. Ironically, such applications are rare. Most empirical research has concentrated on methodological comparison studies. On one hand, there are studies based on the “more is better” assumption (or, in the case of socially desirable behaviors, “less is better”). Here, higher (lower) RRT estimates of the sensitive topic, as compared to DQ or other techniques, are treated as being more valid. One of the two meta-analyses by Lensvelt-Mulders et al. (2005) summarizes the results of 32 studies that use the “more is better” assumption and finds a mean positive effect of RRT compared to traditional questioning methods. On the other hand, there are two types of validation studies. Validations on the aggregate level compare survey estimates with the known prevalence of a sensitive behavior in a population. Individual validation studies compare respondents’ answers with outside records available for every respondent. Both types of validation studies can be further distinguished according to whether comparisons with DQ are carried out or not. For assessing the performance of RRT, “individual validation studies are doubtlessly the gold standard from a methodological point of view” (Lensvelt-Mulders et al. 2005:341).
Focusing our attention on these validation studies, we count seven of them in the literature. Table 2 presents a synopsis of these studies. The study by Folsom (1974) remained unpublished, and we were unable to retrieve it. One of the seven studies is on the aggregate level, while the others provide individual-level validations.¹
Table 2. Direct Questioning Versus Randomized Response: Validation Studies.
Notes: DQ = direct questioning; RRT = randomized response technique.
a. UQT = unrelated question technique (Horvitz et al. 1967), RG = two-step procedure by Reinmuth and Geurts (1975), FR = forced response technique, LC = RRT design by Liu and Chow (1976a, 1976b), Kuk = procedure by Kuk (1990).
b. Percentage. Differences that are not significant at p < .10 are printed in italics. For five comparisons, exact prevalence estimates could not be calculated because ordinal or quantitative RRT procedures were used. In these cases, only the direction and the significance of the mode effects are reported.
c. Here, DQ is not the result of direct questioning but the externally validated value in the sample.
As the table shows, the results in favor of RRT are not convincing. In fact, only the most recent validation study by van der Heijden et al. (2000) arrives at significantly higher prevalence estimates for RRT vis-à-vis DQ. The conclusion that RRT has positive effects, which Lensvelt-Mulders et al. (2005) draw from their meta-analysis of validation studies, rests mainly on this study by van der Heijden et al. According to the meta-analysis, the mean deviation of respondents’ answers from the true values over all studies is 0.42; in DQ mode, the mean deviation is 0.49 and in RRT mode 0.38. Regardless of whether RRT is better than DQ, these figures demonstrate that RRTs will normally be far from eliminating misreporting completely. Generally, the findings on RRT performance show an inconsistent pattern. The technique sometimes works as intended, sometimes not, and we do not know much about when and why the technique succeeds or fails. To date, it has not been possible to determine which factors clearly influence the effectiveness of RRT.
Theoretical Aspects of Misreporting in Surveys
Apart from the small number of validation studies and the limited empirical knowledge of factors influencing response behavior in the RRT setting, there is also a theoretical desideratum. Although many claims can be found in the literature, most of them referring to the SD response bias (see below), the exact reasons behind misreporting are still controversial. It is not possible to develop a comprehensive theory within the constraints of this article, but we will present some considerations that may be helpful.
One way to provide a conceptual framework for response behavior in surveys is the application of rational choice theory (Esser 1986, 1991; Rasinski et al. 1999; Stocké 2004). The rational choice approach assumes that response behavior in surveys is an individual behavior like any other behavior and can thus be understood by arguments of subjective expected utility theory. According to this theory, a respondent will misreport if the subjectively expected utility from giving a true answer is lower than the expected utility of an edited and wrong answer.
A crucial problem of the rational choice heuristic is identifying the costs and benefits for the respondents. Undoubtedly, the most prominent cost and benefit factors pertain to the well-known concept of SD. Respondents misreport in order to avoid social disapproval by the interviewer and others, and to present themselves in a socially favorable light. In the literature, two main components of SD are differentiated: need for social approval and trait desirability (Phillips and Clancy 1972; Stocké 2004, 2007b). The need for social approval (SD need) refers to a personality trait: some people strive more strongly than others for approval from their social environment. Trait desirability (SD belief) is the perceived desirability of an attitude or behavior. For example, admitting to marijuana consumption might be seen as undesirable by some persons, as less undesirable by others, and perhaps even as desirable by others still.
The hypotheses regarding SD-motivated response editing and the effect of RRT are obvious: The more a respondent who actually has committed the “undesirable behavior” strives for social approval and the more he or she believes that this behavior is undesirable, the more he or she will be prone to deny it. These effects should be lower in RRT mode because the procedure guarantees anonymity and thus cancels out costs that make respondents misreport in DQ mode (see Wolter 2012 for a detailed elaboration of this theoretical argument). Stocké (2004) has articulated the proposition that SD effects arise only in a three-way interaction between SD need, SD belief, and anonymity of the survey situation. This means that respondents would only misreport if they have a pronounced need for approval and believe that something is undesirable and the interview situation is not anonymous. Despite the “popularity” of SD considerations, empirical literature on SD effects shows inconclusive findings. Many studies could not confirm the supposed effects of SD need, SD beliefs, anonymity of the survey situation, and their proposed interactions (Burris, Johnson, and O’Rourke 2003; Johnson, Fendrich, and Mackesy-Amiti 2012; King and Bruner 2000; Moorman and Podsakoff 1992).
This leads to the question of what other factors beyond SD cause misreporting. Based on rational choice theory, we can propose a “loss hypothesis.” Respondents vary in their stakes of what they stand to lose by admitting to a sensitive (in our case: a delinquent) behavior. The higher the potential losses, the stronger the tendency of misreporting will be (Bernstein et al. 2001).
Applying this general idea to sociodemographic characteristics of the respondent (gender, age, education), we may expect that delinquent women, delinquent older people, and delinquent respondents with higher educational credentials will confess their delinquency less often than delinquent men, delinquent younger people, and delinquent respondents with lower educational merits. The societal norm not to engage in criminal activities is presumably stronger for women, older persons, and highly educated subjects; and, additionally, older people as well as those with more schooling are on average socially better situated. There is some, but again inconclusive empirical evidence in the literature that the misreporting bias with respect to criminal behavior is more pronounced for these three groups (Johnson et al. 2012; Skarbek-Kozietulska, Preisendörfer, and Wolter 2012; van der Heijden et al. 2000).
Respondents who have committed severe criminal offences, as compared to minor ones, also have more to lose by sticking to the truth in the survey situation. When a respondent has to confess serious criminal behavior, the interaction with the interviewer tends to become critical and the interviewee is clearly at risk of losing face. Furthermore, the presence of a third person during the interview reduces anonymity and thus might be connected with additional risks and endanger response validity. Contrary to this simple expectation, however, theoretical and empirical work indicates that the presence of third parties is not associated with stable effects (Aquilino, Wright, and Supple 2000; Silver, Abramson, and Anderson 1986; Tourangeau and Yan 2007). Depending on the identity of the third person, his or her knowledge of the respondent’s true value, and the strength of the relationship between respondent and third person, either negative or positive effects on answer validity can be predicted.
As opposed to risks and potential losses, benefits should have a positive effect on the general cooperation of respondents. Survey research nowadays increasingly offers monetary incentives to stimulate the willingness of people to participate in questionnaires and interviews. Whereas the success of this strategy with respect to survey participation and response rates has been documented by numerous studies (e.g., Diekmann and Jann 2001; Stadtmüller 2009), its effect on response validity is controversial. Monetary incentives may strengthen the desire of the respondent to be a “good subject” and not to reveal “bad guy behavior.”
A well-established observation in survey research is that distortions in face-to-face interviews often have more to do with the interviewer’s discomfort and worries that asking the question will be problematic than with discomfort of the interviewee (Bradburn and Sudman 1979:chap. 4; Hox and de Leeuw 2002; Schnell and Kreuter 2005). This leads to the expectation that interviewers need experience with surveys in general and also experience with a particular survey to become more relaxed and less anxious about asking sensitive questions. We assume that general interviewer experience is less important than experience in an ongoing survey that contains sensitive questions. The hypothesis is that an interviewer will get more valid answers the more interviews he or she has already conducted successfully in an ongoing survey.
Finally, when subjects do not give a valid answer to a sensitive question on misbehavior in the past, it does not necessarily mean that they “lie.” They simply may have forgotten the “unpleasant event”—supported by psychological processes of suppression and self-deception (Groves et al. 2004:213-18; Paulhus and Reid 1991). Such memory problems presumably become more influential the longer the criminal behavior dates back. We will examine this proposition by taking into account the length of time between criminal behavior and interview in our following empirical analyses.
Study Design and Methods
Study Design and RRT Procedure
A face-to-face survey with interviewer-administered paper-and-pencil questionnaires was carried out in a German metropolitan area among subjects who had been convicted by a court for criminal offences in the last few years prior to the interview. The validation data were taken from court records. These records included information about the address of the subjects, their age, and the type of deviant behavior they had been convicted for. Only persons who had committed “minor” offences such as shoplifting, repeated fare dodging on public transportation, driving under the influence of alcohol, drug abuse, or social welfare fraud were part of the sample. A design and a title were chosen for the survey so that it appeared to be a conventional population survey. It was managed and implemented by the researchers themselves, without involving a commercial survey institute. The validation question and other relevant variables (SD need, SD belief, etc.) were hidden among filler items which were not of primary interest for the study. The survey used a double-blind design, that is, neither the respondents nor the interviewers had information about the sample composition and the real aim of the study. Of course, such a design raises ethical and data protection concerns. To address these, we took numerous precautionary measures in close cooperation with German data protection authorities. The study featured an experimental design: 40 percent of contacts were randomly assigned to DQ mode and 60 percent to RRT mode. Approximately 75 interviewers were recruited. All interviewers participated in a half-day interviewer training session.
The survey started in February 2009 and lasted until October 2010. Undoubtedly, this is a relatively long field period. The difficulties of surveys among special populations such as petty criminals have been observed in other studies (Locander, Sudman, and Bradburn 1976; van der Heijden et al. 2000). Respondents often cannot be localized, and their willingness to participate is low. Our study also had to grapple with these problems. Alarmed by low cooperation rates in the first months of the field period, we decided to offer an incentive of 20€ for those participating in the study.
The final response rates of the survey are documented in Table 3. Of the 3,372 cases contacted by an advance letter,² there were 647 “wrong addresses” and 479 cases “not approached by the interviewer.” These latter cases were due to interviewers who decided to quit their job after initial experiences with the survey. When calculated over all cases contacted by letter (total sample), the response rate is 17.2 percent; when calculated over all cases approached (net sample), it is 25.8 percent. Only one person broke off the interview. A small group of 27 cases were “potential fakes.” To identify such fakes (or cases where someone else was interviewed instead of the validation contact person), the respondent’s date of birth according to the questionnaire was checked against the validation records. Only cases for which the two years of birth were identical were retained for analysis. If we exclude “break offs” and “potential fakes,” the response rate of the net sample falls slightly to 24.6 percent.
Table 3. Response Rates of the Survey.
Note: DQ = direct questioning; RRT = randomized response technique.
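The response rates reported above can be cross-checked with a few lines of arithmetic. The sketch below uses only figures stated in the text. The number of completed interviews (580) is not stated directly but derived here as the sum of analyzable cases, break-offs, and potential fakes, which is an assumption about how the rates were computed.

```python
contacted = 3372          # cases contacted by advance letter (total sample)
wrong_addresses = 647
not_approached = 479
analyzable = 552          # interviews retained for analysis
break_offs = 1
potential_fakes = 27

# Net sample: cases actually approached by an interviewer.
net_sample = contacted - wrong_addresses - not_approached

# Derived, not stated in the text: all interviews before exclusions.
interviews = analyzable + break_offs + potential_fakes

rate_total = 100 * interviews / contacted
rate_net = 100 * interviews / net_sample
rate_net_analyzable = 100 * analyzable / net_sample

print(net_sample, interviews)
print(round(rate_total, 1), round(rate_net, 1), round(rate_net_analyzable, 1))
```

Under this reading, the three computed rates round to the 17.2, 25.8, and 24.6 percent given in the text.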
Table 3 does not show the effect of the 20-€ incentive on the willingness to cooperate. Overall, 39 percent of those contacted by letter were approached without and 61 percent with this incentive. The response rate in the net sample (analyzable validation data) without the incentive was 17.5 percent as compared to 29.1 percent in the incentive situation. This indicates that the incentive contributed to a higher response rate.
The analyzable data include 552 cases; 219 of them (39.7 percent) are DQ and 333 (60.3 percent) RRT interviews. This corresponds almost exactly to the 40/60 distribution of the experimental design, that is, dropouts are not related to question mode. Apart from the aforementioned difficulties regarding low response rates and some unreliable interviewers, there were no particular problems during the field period. Neither interviewers nor respondents articulated doubts or suspicions concerning the aims of the study and/or the sample composition.
There were two versions of the questionnaire, one for DQ and one for RRT. Both versions were absolutely identical up to the point where the sensitive topic about the respondent’s delinquent/criminal behavior began. The validation question was part of a “criminality module,” starting with questions on fear of crime, attitudes with regard to criminality, and victimization experience. Then, the sensitive issue of the person’s own delinquency was introduced with the sentence: “In addition to becoming a victim of criminal offences, people sometimes get into trouble with the law themselves.” Four sensitive questions on the interviewee’s own delinquent activities followed. The final one of these was the validation question which is the main focus of this article: “Have you ever been—by penalty order or in a court case—convicted under criminal law of a minor or more serious offence? By ‘convicted under criminal law’ we mean that the issue was handled by a public prosecutor.” In DQ, respondents were asked to answer this question directly and verbally with “yes” or “no.”
As already mentioned above, our RRT design was a forced response procedure. Its methodological details are described in the online Appendix (which can be found at http://smr.sagepub.com/supplemental/). As a randomization device, we used 16 playing cards with differently colored symbols (10 cards had a black, 3 a red, and 3 a blue symbol). The respondents were asked to shuffle the 16 cards, draw one randomly without showing it to the interviewer, and follow the instructions on an answer list handed over by the interviewer. With probability p_yes = .1875 (3/16), respondents were assigned the answer “yes,” with probability p_no = .1875 (3/16) “no,” and with probability p_question = .625 (10/16) they were instructed to answer the question truthfully. In our RRT format, the probability of being directed to answer the sensitive question is slightly lower than the mean value of p = .67 reported by Lensvelt-Mulders et al. (2005:335) for 38 RRT studies.
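How the card design translates into observed answers can be sketched with a small simulation. The deck composition follows the text. Everything else is an assumption made for illustration: respondents are taken to follow forced instructions faithfully, `deny_prob` is a hypothetical probability that a respondent with the sensitive characteristic answers “no” when asked to respond truthfully, and the prevalence of 1.0 mirrors a sample in which everyone has the characteristic.

```python
import random

# 16 cards: 10 lead to the real question, 3 to a forced "yes", 3 to a forced "no".
DECK = ["truth"] * 10 + ["yes"] * 3 + ["no"] * 3

def design_probabilities(deck):
    """Probability of each instruction implied by the deck composition."""
    return {kind: deck.count(kind) / len(deck) for kind in ("truth", "yes", "no")}

def simulate_yes_share(n, prevalence, deny_prob, seed=0):
    """Share of "yes" answers among n simulated respondents."""
    rng = random.Random(seed)
    yes = 0
    for _ in range(n):
        card = rng.choice(DECK)
        if card == "yes":
            yes += 1                                   # forced "yes"
        elif card == "truth":
            has_trait = rng.random() < prevalence
            if has_trait and rng.random() >= deny_prob:
                yes += 1                               # honest admission
        # a forced "no" card never adds a "yes"
    return yes / n

probs = design_probabilities(DECK)
lam = simulate_yes_share(n=100_000, prevalence=1.0, deny_prob=0.4)
pi_hat = (lam - probs["yes"]) / probs["truth"]
print(probs, round(lam, 3), round(pi_hat, 3))
```

With a denial probability of 0.4, the expected share of “yes” answers is .1875 + .625 × 0.6 = .5625, so the estimator recovers roughly 0.6 rather than the true prevalence of 1.0. The randomization itself does not correct for respondents who answer “no” despite being asked for a truthful answer.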
Subjects assigned to RRT mode did not have the possibility to switch to DQ mode. When a respondent hesitated to follow the RRT procedure, the interviewer asked him or her to do so because only this would guarantee reliable data enabling comparisons between different population groups. This instruction worked well. Ultimately, none of the respondents refused to employ the RRT procedure.
Variables
In addition to the dependent variable, data on a set of independent variables were collected for empirical analyses of potential determinants of response behavior. Table 4 gives an overview. The variable of crucial interest is the answer to the validation question of whether the respondent has ever been convicted under criminal law. As we knew from the court records, everyone in the sample had actually been convicted. This means that a “no” answer to the question indicates misreporting.
Table 4. Dependent and Independent Variables.
Notes: DQ = direct questioning; RRT = randomized response technique; SD = social desirability.
Items for measuring SD need and SD belief are given in Note 3. See Wolter (2012) for more information about scale construction and scale statistics.
Besides the experimental variable “question mode” (RRT vs. DQ), we will take into account three sociodemographic variables: gender, age, and education. Age is measured in years (decades). Education refers to years of general schooling. The theoretically important SD variables cover “SD need” and “SD belief.” SD need (need for social approval) is captured by a subset of items from the German version of the Crowne–Marlowe SD scale (Crowne and Marlowe 1960; Stocké 2007a). SD belief (trait desirability) pertains to the degree to which respondents think that being convicted under criminal law is disapproved of in “our society.”³
The variable “seriousness of offence” is an attempt to distinguish between more and less severe crimes of which the respondents had been convicted. This variable is based on a rating conducted by a convenience sample of external raters who had been asked to rank nine different groups of criminal offences regarding their seriousness. Fare dodging on public transportation was rated as the least serious crime and bodily injury as the most severe one. The mean values of the ratings were assigned to every respondent according to the offence of which he or she had been convicted.
Furthermore, we have two dummy variables: The first registers whether a third person was present during the interview, the second whether the respondent received the incentive of 20€ for participating in the interview. “Interviewer experience” is the running number of interviews an interviewer has conducted in the current survey. To account for potential memory problems, a final covariate counts the number of months elapsed between the date of conviction of the respondent and the date of interview.
Methods
Apart from descriptive statistics, our empirical analyses will present prevalence estimates and inferential statistics for the validation question, and regression models on determinants of response behavior. The formulas for calculating prevalence estimates and standard errors for the forced response RRT have already been given above (equations 1 and 2). To calculate z scores, we use the equation:
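The equation is not reproduced in this version of the text. A common form for comparing two independent estimates, and for testing one estimate against a fixed value, is sketched below. It is offered as an assumption rather than as the authors' exact formula, and the symbols are chosen here for illustration.

```latex
% Difference between the two question modes (independent subsamples assumed)
z = \frac{\hat{\pi}_{RRT} - \hat{\pi}_{DQ}}
         {\sqrt{\widehat{SE}(\hat{\pi}_{RRT})^{2} + \widehat{SE}(\hat{\pi}_{DQ})^{2}}}

% Test of a single estimate against a known value \pi_0 (here \pi_0 = 1)
z = \frac{\hat{\pi} - \pi_0}{\widehat{SE}(\hat{\pi})}
```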
To estimate binary logistic regressions with RRT data, we refer to the model developed by Maddala (1983) and implemented in Stata by Jann (2005). The software routine takes the probabilities of forced responses (p_yes, p_no) into account; if these probabilities are zero, the model corresponds to a conventional logistic regression equation. By defining a variable that sets the forced response probabilities to .1875 for RRT cases and 0 for DQ cases, respondents from both question modes can be analyzed simultaneously. By running separate models or including interaction effects between question mode and predictors, we can test whether predictors differ in their effect on response behavior depending on the mode (RRT vs. DQ).
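The logic of such a model can be sketched in a few lines. The code below is an illustrative Python rendering of a logistic likelihood adjusted for forced responses. It is not the Stata routine cited above, and all function names are hypothetical. It assumes that the probability of an observed “yes” equals p_yes plus (1 − p_yes − p_no) times the usual logistic probability.

```python
from math import exp, log


def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + exp(-z))


def prob_observed_yes(xb, p_yes, p_no):
    """Probability of an observed "yes" given the linear predictor xb.

    With p_yes = p_no = 0 (DQ cases) this is the ordinary logistic model.
    """
    p_question = 1.0 - p_yes - p_no
    return p_yes + p_question * logistic(xb)


def log_likelihood(beta, rows):
    """Sum the log-likelihood over respondents from both question modes.

    rows: iterable of (y, x, p_yes, p_no), where y is 1 for "yes" and 0 for
    "no", and x is a list of covariate values including the constant.
    """
    total = 0.0
    for y, x, p_yes, p_no in rows:
        xb = sum(b * xi for b, xi in zip(beta, x))
        p = prob_observed_yes(xb, p_yes, p_no)
        total += log(p) if y == 1 else log(1.0 - p)
    return total
```

Each row carries its own forced response probabilities (.1875 for RRT cases, 0 for DQ cases), which is what allows both modes to enter one likelihood. Maximizing `log_likelihood` over `beta` with any numerical optimizer would yield the coefficient estimates.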
Empirical Results
Beginning with the dependent variable, Table 5 shows prevalence estimates for the validation question and accompanying statistics. Only one person refused to answer the validation question. In all, 57.5 percent answered the question truthfully in DQ and 59.6 percent in RRT mode. This means that there is virtually no difference between DQ and RRT. Although the RRT estimate of true answers is slightly higher than the DQ estimate, the confidence intervals and the z score confirm that the difference is far from significant. In addition to this “no effect of RRT,” we have to point out that, in accordance with existing literature, there is a huge response bias in both question modes. One-sided z tests show that the sample estimates of the validation question in both modes differ significantly from the true value of 100 percent. Therefore, neither DQ nor RRT is able to yield valid estimates of the validation item. One should also note the higher standard error of the RRT estimate, although the number of RRT cases was higher. This pattern arises from the additional error variance that the randomization procedure introduces into the RRT estimates.
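The size of this inflation can be illustrated with a rough calculation. The sketch assumes the standard forced response variance (equations 1 and 2 are given elsewhere in the article), an equal number of cases n in both modes, and a common true-answer rate of .596. The symbol λ for the expected share of “yes” answers is introduced here for illustration.

```latex
\begin{align*}
\lambda &= p_{yes} + p_{question}\,\pi = .1875 + .625 \times .596 = .56 \\
\mathrm{Var}_{RRT}(\hat{\pi}) &= \frac{\lambda(1-\lambda)}{n\,p_{question}^{2}}
  = \frac{.56 \times .44}{.390625\,n} \approx \frac{.631}{n} \\
\mathrm{Var}_{DQ}(\hat{\pi}) &= \frac{\pi(1-\pi)}{n}
  = \frac{.596 \times .404}{n} \approx \frac{.241}{n} \\
\frac{SE_{RRT}}{SE_{DQ}} &\approx \sqrt{.631 / .241} \approx 1.6
\end{align*}
```

Under these assumptions, the RRT standard error is roughly 1.6 times the DQ standard error at equal sample size, so a moderately larger RRT subsample does not offset the loss of precision.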
Table 5. Proportion of Respondents Admitting Conviction by Question Mode.
Notes: DQ = direct questioning; RRT = randomized response technique.
Standard errors in parentheses.
***p < .001.
A comparison with the validation study by van der Heijden et al. (2000) might be useful for classifying and evaluating our result. The authors of this study found rates of true answers of 19 to 25 percent in different DQ modes and 43 to 49 percent in different RRT modes. These values are clearly lower than ours. 4 Percentages of about 60 percent true answers to the question on conviction under criminal law can be regarded as comparatively high. Nonetheless, such values demonstrate once more the presence of considerable misreporting on sensitive questions in surveys.
Table 6 shows descriptive statistics of our independent variables: mean values and standard deviations for the whole sample and for the DQ and RRT subsamples, and significance tests for differences in the distributions between DQ and RRT mode. Only the measure for SD need has a significantly different mean between the two subsamples (t = 2.541, p < .05); respondents in DQ mode score higher on this scale than RRT respondents. However, the difference is small and should not be taken as a substantive result. 5
Table 6. Descriptive Statistics of Independent Variables.
Notes: DQ = direct questioning; RRT = randomized response technique; STD = standard deviation; SD = social desirability.
The column “Diff.” shows the result of significance tests for differences between the distributions of the variables in DQ and RRT mode. T-tests and robust F-tests (Levene statistic) were performed for metric variables, χ2 tests for categorical variables. Variances of variables did not show significant differences between question modes (DQ vs. RRT).
*p < .05.
Looking at the sociodemographic variables, it can be seen in Table 6 that 25 percent of the respondents were women, the average age was about 40, and the mean education was 11 years of general schooling. The finding that women are underrepresented in the sample is due to the fact that women generally have a lower propensity to commit crimes. The value of 25 percent corresponds exactly to the official proportion of women among all suspects of criminal offences, as published by German police authorities (Bundeskriminalamt 2009:72). Concerning SD need and SD belief, respondents score “high,” that is, near the right end of the scales. This means most interviewees strive for social approval and tend to believe that being convicted under criminal law is judged as undesirable in society. Nevertheless, the standard deviations show that subjects differ in their SD scores. The covariate “seriousness of offence” roughly follows a normal distribution with a mean of 5.5 (on a scale from 1 = low to 9 = high). In 28 percent of all interviews, a third person was present during the interview. The majority of our respondents, namely 72 percent, received the 20-€ incentive for participating in the study. With respect to the variable “interviewer experience,” we see that on average every interviewer conducted 14 interviews. The high standard deviation of this variable indicates that interviewers differed considerably in their success. Because the distribution is skewed, the natural logarithm (ln) of this variable will be used in the regression analyses. Finally, the average time between conviction and interview was about 39 months.
Results of binary logistic regressions estimating the effects of our covariates on the probability of a true response (i.e., admitting to a conviction under criminal law) are summarized in Table 7. The table presents three models. The first pertains to all respondents and contains—in addition to the combined effects of the other independent variables over both experimental groups—a dummy for the question mode (RRT vs. DQ). The second model confines the analysis to respondents in DQ, the third to respondents in RRT mode. Based on a joint model for both groups which additionally includes all interaction effects between question mode and the other covariates, the final column of Table 7 registers whether there are significant interaction effects, that is, whether the effects of the covariates significantly differ between DQ and RRT mode.
Table 7. Factors Affecting True Answers to the Validation Question.
Notes: DQ = direct questioning; RRT = randomized response technique; SD = social desirability.
Unstandardized regression coefficients and their standard errors in parentheses.
+p < .1. *p < .05. **p < .01.
A first and rudimentary inspection of Table 7 demonstrates that misreporting varies systematically with several of the predictors analyzed. This basically supports the above-mentioned assumption of “artifacts” in conventional analyses of relationships between individual and situational attributes and sensitive items.
According to the model for all respondents, we observe the following: Women and older subjects less often confess their criminal behavior; the effect of education is not significant. The two SD variables show the assumed negative effects; however, only SD belief proves to be statistically significant. Also in line with our expectations are the significantly negative coefficients of seriousness of offence and presence of third persons during the interview. No remarkable effects in the joint model for all subjects are connected with the incentive variable, interviewer experience, and the time lag between conviction and interview.
When we compare the influences of our covariates in the two question modes, DQ and RRT, we do not find confirmation for the general expectation that RRT eliminates or at least reduces response biases associated with individual attributes of the respondent and/or situational characteristics of the interview. For some covariates (e.g., gender) this expectation tends to hold, but for others (e.g., SD need or third-party presence) it is more the other way around.
Women are less likely to tell the truth in both question modes. The significant gender effect in DQ becomes smaller and loses significance in RRT mode. However, the interaction effect between gender and question mode does not reach significance at the 10 percent level. The age effect is negative in DQ and RRT, although it is no longer significant in these group-specific models. As in the model for all respondents, education remains without influence in the DQ and RRT models.
The interaction effect between SD need and question mode is significant: the effect of SD need proves to be stronger in RRT mode than in DQ mode. This fits neither theoretical expectations nor the basic rationale of most techniques applied to elicit more valid answers on sensitive questions. Need for social approval is normally seen as a main source of misreporting, and its influence should clearly be lower when anonymizing techniques like RRTs are employed. The second SD variable, SD belief, has negative effects in both question modes (though not significant in RRT mode). While this is congruent with theory with respect to DQ, one would have expected a positive interaction between SD belief and RRT. Taken together, our results regarding SD effects (especially those of SD need) raise doubts about the validity of the popular SD argumentation. On the one hand, we do not observe a significant effect of SD need in DQ; on the other hand, it is present and negative in RRT mode. While this contradicts theoretical arguments (see however our proposed explanation in the Discussion section), it is a further contribution to the inconclusive state of empirical literature on SD effects. 6
The effect of the seriousness of the offence of which a respondent had been convicted is negative in both question modes, but is significant only in RRT. Respondents who have committed more serious crimes are generally more inclined to misreport than those who have been convicted of “petty” crimes. Although this effect seems plausible, its causality is ambiguous: Based on the models in Table 7, it is not clear whether the actual offence has caused the response behavior or whether persons generally more prone to misreporting have a higher probability of committing severe crimes (or both). 7
The presence of a third person during the interview increases the probability of misreporting in RRT mode; in DQ mode, the effect is also negative but not significant. This effect tends to be more pronounced in RRT, although the DQ and RRT coefficients are not significantly different. That is, RRT is not able to eliminate misreporting induced by the presence of third persons; on the contrary, it tends to increase it. A similar pattern, but now with positive regression coefficients, shows up for the incentive variable. Here, we have to concede that it is hard to find a plausible argument why an incentive should contribute to more valid answers primarily in the RRT constellation.
The influences of interviewer experience turn out to be counterintuitive, too. Coefficients in Table 7 yield a significantly positive effect in DQ (i.e., the longer an interviewer is active in the current survey, the more valid answers he or she gets to the validation question), but a significantly negative effect in RRT (i.e., more experienced interviewers get less valid answers than “fresh” ones who have just started their job as an interviewer in our survey). Therefore, RRT apparently works better when administered by inexperienced interviewers than by experienced ones. 8 Two factors could account for this pattern: the decreasing impact of the interviewer training over time and the increasing degree of routinization when the same questionnaire is processed repeatedly. Concerning RRT, it seems plausible to assume that at the beginning, the procedure is handled slowly and carefully by an interviewer, which favors its success. However, the more often an interviewer runs through the procedure, the less carefully he or she will present and explain it to the respondent. In contrast, routinization seems to be good in DQ mode. We may speculate that interviewers following the instructions of the interviewer training very carefully get less valid answers in conventional DQ interviews. Of course, this reasoning is ad hoc, but, in a more general vein, it directs our attention to the importance of interviewer behavior for response validity and the need to investigate evolving interviewer habits during a survey. 9
The final variable in Table 7, the time lag between conviction and interview, has no significant influence in either DQ or RRT mode. Since the direction of the effects is different in DQ and RRT, however, the interaction “time lag × question mode” turns out to be significant at the 10 percent level. Again, this is a finding that does not fit into a theoretical framework.
Before we continue to discuss the results in a concluding section, let us return to the problem of inconclusiveness of empirical RRT research in general: Why does the technique apparently work in some applications but not in others? Does RRT only pay off in special constellations and for special subgroups of respondents? Our regression models demonstrate that response behavior in both DQ and RRT is indeed dependent on parameter values of the predictors investigated. To further illustrate this point, selected point estimates and 95 percent confidence intervals of true answers to the validation question, calculated from the regression models, are shown in Table 8.
Table 8. Probabilities of True Answers for Selected Constellations of Covariates.
Notes: DQ = direct questioning; RRT = randomized response technique; SD = social desirability.
Prevalence estimates (p.e.) are calculated from a model for all respondents (“Total” in Table 7) which additionally included all interaction effects between question mode and the other covariates. If SD need, seriousness of offence, education, and age are varied (with the terminology high/low), estimates were calculated at the mean plus/minus one standard deviation. If covariates are not mentioned, estimates are calculated at the mean for metric variables and at zero for dummy variables.
The first three rows of Table 8 represent constellations where DQ estimates of truthful responses are higher than RRT estimates. For example, in the first row, male respondents with a high SD need (all other metric variables are at their mean and other dummies at zero) answer truthfully with an estimated proportion of 0.68 in DQ and 0.40 in RRT. Confidence intervals overlap, so the difference is not significant. When in rows 2 and 3 parameter values are added that are not in favor of RRT, the gap between DQ and RRT widens. The constellation in row 3 yields estimates of 0.67 versus 0.14 and this is a significant difference.
The last three rows of Table 8 describe, conversely, constellations where RRT performs better than DQ. All three comparisons yield significant differences between the two estimates. For the bottom row, when highly educated women who received a 20-€ incentive were interviewed by an inexperienced interviewer, only 0.27 answer truthfully in DQ, but 0.85 in RRT mode.
Empirical results as illustrated in Table 8 confirm a pronounced sensitivity of RRT performance to respondent and situational characteristics. Obviously, the success of RRT varies systematically depending on the interview situation and the actors involved. This may lead us to the conclusion that the question of whether RRT generally works or not can probably never be answered. One simply cannot expect the technique to function in a similar way for all respondents. A further conclusion pertains to the inconclusiveness of previous RRT research. Our results show that the impact of factors determining response behavior varies by question mode. Interpreted inversely, this also means that the impact of question mode varies by parameter values of other determinants of response behavior. If this is true, differences in the composition of samples across RRT studies are likely to result in different effects of RRT. This may at least partly explain the heterogeneity of past research on the effectiveness of RRT.
Discussion
The aim of this study was to provide empirical evidence regarding the usefulness of RRT to elicit more valid answers to sensitive questions. Using individual validation data, we found that—all in all—RRT did not have a positive effect on answer validity. This contradicts the result of two meta-analyses by Lensvelt-Mulders et al. (2005). In a future publication, an update of their meta-analysis of individual validation studies should investigate whether the result of a positive RRT effect still holds when our results are included.
The main strength of our study is that it refers to individual validation data. Such data are the best way to assess the degree of misreporting and the effectiveness of techniques such as RRT. However, because outside information on sensitive behavior of people is difficult to acquire, individual validation studies are rare. Compared to the small number of previous RRT validation studies (mostly dating from the 1960s and 1970s), our study is the first to present multivariate regression models of misreporting in the context of RRT.
Nevertheless, there are at least two weaknesses. First, our validation data did not cover “negatively validated” cases, that is, subjects who are known not to have been convicted under criminal law. Their response behavior would have been of special interest regarding the forced response RRT: Employing this procedure, “innocent” respondents can be forced to answer “yes” to the sensitive question, a property that is often seen as a drawback of the forced response format (Edgell, Himmelfarb, and Duchan 1982; Lensvelt-Mulders and Boeije 2007). A second problem is the low response rate of our survey. This is a (presumably negative) feature which we share with other studies aiming at comparable samples of “petty criminals.” It is worth mentioning at this point, however, that the use of a monetary incentive increased the response rate by about 12 percentage points and that, for the total group of respondents, this incentive did not affect the results concerning the validity of answers to the sensitive question. We also conducted additional empirical analyses (not shown here) to explore the selectivity of our final sample. The results do not indicate evident selectivity. For example, no significant effects of the seriousness of offence could be found for different aspects of the dropout process (no contact with subject, refusal, etc.). Finally, it should be borne in mind that even if our sample was biased by selectivity, this bias would affect the DQ and RRT subsamples equally.
In addition to the finding of no overall effect of RRT, an important result of our study is that the effects of individual and situational determinants of misreporting may differ between DQ and RRT. Interviewer experience turned out to be positively related to answer validity in DQ and negatively in RRT, an unexpected finding on which we already speculated in the previous section. A similar but less pronounced pattern emerged for SD need and for the time which elapsed between conviction and interview.
SD and third-party presence have always been viewed as essential factors that cause misreporting in surveys. Our findings that these two factors tend to affect response validity in RRT to a greater extent than in DQ mode certainly need further consideration. With respect to third-party effects, the following explanation seems to be plausible: Because the third person is in most cases not informed about the details of the RRT procedure, a “yes” answer to the sensitive question is critical. The third person does not know that it can be a “forced yes,” but tends to qualify it as a confession of the undesirable behavior. This means that the anonymizing function of RRT is thwarted and forced response produces an additional bias.
Concerning the role and influence of SD need, we think that SD literature to date has not sufficiently taken into consideration that misreporting and lying in a survey situation is itself “undesirable.” This especially applies to face-to-face interviews, when an interviewer is physically present. Interpreted in terms of the rational choice approach (introduced above), misreporting involves potential costs for the respondent: Lying may be revealed, for example, by a respondent turning red in the face or through questions following later on in the interview. If this interpretation holds, the role of RRT appears in a new light and has the opposite effect to that intended. Respondents with a high SD need who want to misreport can do so more safely under the umbrella of RRT. The technique anonymizes the interview situation and therefore reduces the “danger” of being caught out lying by the interviewer. Again, this explanation is ad hoc and preliminary, but the simple idea that “lying” does not come without risk may also explain the inconclusiveness of general research into SD effects.
What implications does all this have for the future of RRT? Holbrook and Krosnick (2010:336) recently concluded that existing research and their own empirical findings “call[s] into question interpretations of all past RRT studies and raise[s] serious questions about whether the RRT has practical value for increasing survey reporting accuracy.” Undoubtedly, this is a radical point of view which could also cite our study as further evidence. However, we believe that such a view might be overly hasty: First, two meta-analyses summarizing research up to 2003/2004 (Lensvelt-Mulders et al. 2005) report an overall positive effect of RRT. Second, it follows clearly from our results that DQ is not better than RRT and, conversely, RRT does not come off worse. RRT represents at least an equivalent alternative to DQ (when the slightly higher standard errors of estimates are neglected). The objection that the procedure takes too much interview time finds no support in our survey. RRT interviews were on average less than four minutes longer than DQ interviews. Moreover, our analyses reveal that RRT can succeed in reducing systematic misreporting linked to certain respondent characteristics. Our recommendation is that further research should not concentrate so much on the question of whether RRT generally improves answer validity but rather focus on the circumstances (for which respondents and in which interview situations) under which the technique works best.
Footnotes
Authors’ Note
A fellowship at the Institute for Advanced Study at the University of Konstanz (Germany) presented the opportunity for the second author to complete this article.
Acknowledgments
For helpful comments, we thank Andreas Diekmann, Ulf Liebe, and anonymous reviewers of SMR.
Declaration of Conflicting Interests
The author(s) declared no conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the German Research Foundation (DFG). It is part of the project “Asking Sensitive Questions” (grant PR 237/6) within the DFG priority program “Survey Methodology” (SPP 1292).
Notes
References