Abstract
This study evaluated three types of bias (total bias, measurement bias, and selection bias [SB]) in three sequential mixed-mode designs of the Dutch Crime Victimization Survey: telephone, mail, and web, where nonrespondents were followed up face-to-face (F2F). In the absence of true scores, all biases were estimated as mode effects against two different types of benchmarks. In the single-mode benchmark (SMB), effects were evaluated against an F2F reference survey. In an alternative analysis, a “hybrid-mode benchmark” (HMB) was used, where effects were evaluated against a mix of the measurements of a web survey and the SB of an F2F survey. A special reinterview design made available additional auxiliary data that were exploited in estimation for a range of survey variables. Depending on the benchmark perspective, the empirical findings recommend either a telephone, mail, or web design with an F2F follow-up (SMB) or a design involving only mail and/or web without an F2F follow-up (HMB).
Introduction
Sequential mixed-mode designs have become an important alternative in international survey research. Sequential designs provide nonrespondents or non-covered persons in a single-mode survey (e.g., web) with another opportunity to respond by offering at least one other mode as a response option (e.g., face-to-face [F2F]). This follow-up can increase response and coverage while achieving cost efficiency by sequencing inexpensive before expensive modes (De Leeuw 2005; Dillman, Smyth, and Christian 2009; Lynn 2013).
Since there is a direct trade-off of mode-specific errors in the total survey error (TSE) of sequential mixed-mode designs, mixed-mode surveys may offer an optimal error balance (Bethlehem and Biffignandi 2011:235; De Leeuw 2005; Groves et al. 2009:176). Selection errors (i.e., coverage error or nonresponse error) and measurement errors have received particular attention in this context, because they are often systematic (i.e., they cause bias) and differ in size across modes (Biemer and Lyberg 2003:59; De Leeuw 2005, 2008; De Leeuw, Dillman, and Hox 2008; Klausch, Hox, and Schouten 2013; Kreuter, Presser, and Tourangeau 2008). In practice, survey designers need to estimate these biases to monitor survey accuracy and to decide about the impact of changes to fieldwork or questionnaire designs on accuracy. In doing so, two basic research questions (RQs) are of concern:
RQ1: Is the mixed-mode survey needed or would a single-mode survey suffice in terms of accuracy?
Given equal budget and time constraints, the design featuring lowest survey error is preferred (Biemer 2010; Biemer and Lyberg 2003:44; Groves and Lyberg 2010; Kreuter, Müller, and Trappmann 2010). To evaluate the accuracy of a mixed-mode design, the expected total bias (TB) needs to be estimated for different candidate mixed-mode designs and compared to the bias of single-mode surveys without the mixed-mode follow-up. The mixed-mode survey is needed only if the TB of the single-mode survey is decreased by the follow-up.
RQ2: What are the major systematic sources of error in single-mode surveys and how are they impacted by the mixed-mode follow-up?
The primary motivation for sequential mixed-mode surveys is the reduction in selection bias (SB). If the size of SB before and after the follow-up is estimated, the actual need and success of reducing SB can be evaluated. However, if one mode in a mixed-mode design measures with less accuracy, reductions in SB may be offset by increases in measurement bias (MB) of the mixed-mode estimate. To prevent this problem, questionnaires of the mode evoking higher MB can be redesigned once the size of bias is known (Dillman et al. 2009:326).
In this article, we answer these RQs for the case of the Dutch Crime Victimization Survey (CVS), a national survey conducted by Statistics Netherlands, based on a large-scale experiment with three mixed-mode designs: web, mail, and telephone followed up by F2F, reanalyzing data collected by Schouten et al. (2013). We estimate TB of the modes before and after the mixed-mode follow-up (RQ1) and decompose it into SB and MB components (RQ2). A major problem in the evaluation of survey bias is its estimation. Past research has discussed a series of approaches to this problem. The next section reviews this research and outlines the approach of the present study.
Background
Past research has mainly discussed three different approaches to evaluating bias in mixed-mode surveys: the record-check approach, the relative mode-effect approach, and the benchmark survey approach. Exact bias estimation is only possible using the record-check approach, which supposes that true scores are available from an external database (Biemer and Lyberg 2003:289; Groves 2006). A growing number of case studies evaluate mode differences in bias using record checks (Kirchner and Felderer 2013; Körmendi 1988; Kreuter, Müller, and Trappmann 2013; Kreuter et al. 2008; Olson 2006; Sakshaug, Yan, and Tourangeau 2010; Tourangeau, Groves, and Redline 2010) and the approach is also applied to mixed-mode surveys (Fowler et al. 2002; Kreuter et al. 2010; Link and Mokdad 2006; Sakshaug et al. 2010; Voogt and Saris 2005). In practice, however, records are hardly ever available for the survey variables of interest and involve various other practical problems. Records may not coincide with the study time period and may themselves contain errors or be incomplete, survey questions may suffer from specification problems compared to information encoded in records, and there may be problems arising from incomplete matches of respondents to records (Biemer and Lyberg 2003:291; Körmendi 1988; Miller and Groves 1985). Furthermore, records are seldom available to all researchers due to privacy limitations and sometimes can concern very specific subpopulations, such as students (Kreuter et al. 2008).
Vannieuwenhuyze and Loosveldt (2013) suggested evaluating mixed-mode surveys by “relative mode effects” (Vannieuwenhuyze, Loosveldt, and Molenberghs 2010, 2014). Relative effects are defined as differences in TB, MB, and SB between mode-specific response groups in a mixed-mode design, referred to as overall, measurement, and selection effects. However, relative mode effects cannot be interpreted in an absolute sense. For example, a relative selection effect in a sequential mixed-mode design does not suggest that SB is reduced or increased by the follow-up, but only that different “types” of respondents are reached.
This problem is addressed when biases are evaluated against a “preferred” single-mode survey whose measurements are considered valid and which is also considered optimal on SB (Biemer 1988; Biemer and Lyberg 2003:287; De Leeuw 2005; De Leeuw et al. 2008; Körmendi 1988; Vannieuwenhuyze 2014; Vannieuwenhuyze et al. 2010). This mode is called the “single-mode benchmark” (SMB). A relevant application of SMBs is the redesign of repeated cross-sectional surveys to mixed-mode designs. To assure comparability, the change in bias relative to established SMB time series is an important concern (addressing RQ1). Furthermore, preventing shifts after assessing selection and measurement effects against the SMB may be possible (RQ2).
Using the SMB approach, Schouten et al. (2013) and Klausch, Schouten, and Hox (2014) defined differences in bias between an SMB and other modes as “single-mode effects”. The primary difference to the relative mode effect approach is that effects are defined between single-mode surveys and a reference survey (the SMB) and not between mode-specific response groups in a mixed-mode design. In an empirical study of the CVS, the authors used an F2F survey as SMB against which telephone, mail, and web modes were compared. However, in evaluating sequential mixed-mode surveys (RQs 1 and 2) against an SMB, the impact of the follow-up mode on single-mode effects needs to be considered as well, which has not been accomplished so far. The bias of a mixed-mode survey against the benchmark is called a “mixed-mode effect”. This extension is provided in the present study.
A limitation of the SMB approach is that single surveys may not be fully appropriate as benchmark. For the case of F2F, for example, we assume that due to generally high response rates SB is acceptable, but F2F measurements may be considered too erroneous to be used as benchmark (e.g., due to social desirability bias; Tourangeau, Rips, and Rasinski 2000:257; Tourangeau and Yan 2007). In particular, self-administered modes, such as “web,” may be more precise for sensitive questions (Kreuter et al. 2008). In this case, a combined benchmark of the SB of F2F and measurements of web may thus appear superior to the SMB. We call this combined benchmark the “hybrid-mode benchmark” (HMB). Even though the SMB has a long-standing tradition in evaluating survey errors (Biemer and Lyberg 2003:291), the HMB may be more immediate to survey practice when assuring comparability to single-mode time series is not central (e.g., when questions are sensitive or interviewer effects are strong in the SMB). The present article is the first to consider an HMB besides the SMB.
Previous literature also established the general conditions under which unbiased estimation of measurement and selection effects is possible (Klausch et al. 2014; Vannieuwenhuyze and Loosveldt 2013; Vannieuwenhuyze et al. 2014). The primary difficulty in estimation is the confounding of both effects in the overall mode effect. To disentangle the effects, exogenous auxiliary information needs to be available, conditional on which measurements and the selection mechanism into mode-specific response groups are independent (Imbens 2004; Morgan and Winship 2007; Pearl 2009; Rubin 2005). In mixed-mode surveys, finding variables that allow both unconfoundedness and exogeneity appears difficult, however. To address this problem, Schouten et al. (2013) and Klausch et al. (2014) introduced a reinterview design. In the reinterview, repeated measures of survey target variables as well as other auxiliary information are collected. Conditional on the repeated measures, the unconfoundedness assumption appears more plausible (cf. fourth section). The present study reanalyzes these data for the case of sequential mixed-mode surveys using both an SMB and HMB.
The Crime Victimization Survey Case Study
The case study was conducted in the context of the CVS, administered by Statistics Netherlands in 2011 in an experiment conducted independently from the regular CVS (Klausch et al. 2013, 2014; Schouten et al. 2013). In this section, we argue for possible choices of benchmarks (SMB and HMB) in the case study and provide details on the fieldwork.
Traditionally, the F2F mode has been considered an ideal mode for the CVS. Whereas the CVS was an F2F survey at its inception, it has been redesigned multiple times in past decades using different mixed-mode protocols. However, predecessors of the CVS were F2F, and the measurements from F2F are still regarded as accurate or desirable by many. The social control and additional explanations provided by interviewers may increase validity of measurement, whereas high response rates suggest small SB. Arguably, F2F surveys often take on this “benchmark role” in survey research for these and similar reasons. Setting F2F as SMB appeared plausible from this point of view.
For this reason, a split-ballot mode experiment was conducted, where in parallel a single-mode F2F survey was compared to three single-mode designs (telephone, mail, and web; Table 1). In addition, the experiment entailed a sequential mixed-mode component reapproaching the nonrespondents in telephone, mail, and web by F2F. This procedure may yield mixed-mode estimates that are similar to the F2F benchmark and provide inexpensive designs compared to F2F alone. Evaluating the success of this procedure is the objective of the present case study (following RQs 1 and 2).
Table 1. Response Rates and Mixture Weights in the CVS Mixed-mode Experiment (Weighted).
Note: CVS = Crime Victimization Survey; F2F = face-to-face; Prop. = proportion; resp. = respondents; SMB = single-mode benchmark.
a. The telephone response rates are taken against the net sample of all sampled units (including persons without known telephone number).
Each mode-specific survey was based on a probability sample drawn from the national register (a person sampling frame). The sample size was chosen such that the minimal observable total single-mode bias against F2F is equal to the required precision of the CVS at the national level. The regular CVS is much larger because of detailed publications for a range of subpopulations. The fieldwork was conducted from April to June 2011. First, all respondents received mailed prenotifications. In the two self-administered conditions, these letters contained a paper questionnaire with return envelope or information on how to access the survey online. In the interviewer modes, the contact attempt by an interviewer on the phone or in person was announced. The fieldwork period of this first wave was four weeks and, subsequently, the nonresponse follow-up in F2F was administered. The F2F single-mode survey remained active for the same period as telephone, mail, and web.
The F2F survey showed the largest response rates of the four modes (64.5 percent, Table 1). The response rates in telephone, mail, and web were lower than in F2F, but the F2F follow-up increased the response rates in all modes (up to 65.2, 67.3, and 59.7 percent, respectively). Since the single-mode response rate in the web mode was lower than in the telephone or mail conditions, the relative proportion of F2F respondents in the web mixed-mode sample was substantially higher (51.4 percent vs. 25.4 and 26.7 percent, fourth row in Table 1). These relative proportions (called π) suggest that the estimates from the mixed-mode web sample may be impacted more strongly by the follow-up than in the mixed-mode mail or telephone conditions. This parameter is, therefore, relevant in estimation and interpretation of results, discussed in more detail in the fourth and fifth sections.
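The link between the response rates and the share of follow-up respondents can be checked with simple arithmetic. The sketch below assumes that this share equals one minus the ratio of the single-mode to the mixed-mode response rate, and it uses the weighted web rates reported in this study (29.0 percent before and 59.7 percent after the follow-up). The function name is hypothetical.

```python
def followup_share(single_rate, mixed_rate):
    """Share of mixed-mode respondents recruited by the follow-up (assumed formula)."""
    if not 0 < single_rate <= mixed_rate:
        raise ValueError("expected 0 < single_rate <= mixed_rate")
    return 1.0 - single_rate / mixed_rate


# Web condition: 29.0 percent before and 59.7 percent after the F2F follow-up.
web_share = followup_share(29.0, 59.7)
print(f"{web_share:.1%}")
```

Under this assumption the web rates give a follow-up share of about 51.4 percent, which agrees with the proportion reported for the web mixed-mode sample.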
The high response rate of F2F gives reason for choosing F2F as selection benchmark. In addition, Klausch, Hox, and Schouten (2015) showed that representativeness of F2F on sociodemographic background variables was high in the CVS experiment. Nevertheless, the F2F mode may not always be an “optimal” mode of measurement. For the case of the CVS, it can be argued that the questionnaire contains a large number of attitudinal and sensitive questions (cf. Online Appendix), which may be particularly susceptible to stronger MB due to the presence of an interviewer (cf. second section). Measurement in the anonymous situation of self-administration (e.g., web) may be more appropriate. However, web surveys are traditionally looked at critically with regard to their selective properties due to, for example, lower response rates (29.0 percent in the present study) and incomplete population coverage. Therefore, a web-F2F HMB was evaluated as an alternative to the F2F SMB, which assumes F2F as the benchmark for selection and web as the benchmark for measurement.
It should be noted that the choice of web as measurement benchmark potentially may be as flawed as choosing F2F or any other mode as measurement benchmark. Whereas web may indeed suffer from less social desirability bias, it may produce other measurement problems. For example, interviewers may act as an instance of social control and provide motivation to respondents, keeping satisficing and response effects low (Klausch et al. 2013). Web interviews lack this advantage due to self-administration. Further disadvantages of web surveys, such as respondent ability (e.g., problems with eyesight and literacy), have been described in the literature as sources of measurement error and satisficing (Couper 2000; Couper, Traugott, and Lamias 2001; Fricker et al. 2005; Krosnick 1991). In practice, the choice of benchmark is, therefore, based on a plausibility argument: in the absence of any other information on measurement accuracy of a particular mode and survey (questionnaire), it has to be plausibly argued for or against a particular mode as measurement benchmark. In this case study, we present results for both an F2F measurement benchmark (SMB) and an HMB with web as measurement benchmark. We note that for the CVS, a government survey, socially desirable responding to authorities may be a substantial problem on a great number of questions. This conjecture lets the web mode appear a logical alternative choice as measurement benchmark. Furthermore, we do not consider questions in the analyses that typically are associated with strong measurement problems in web surveys, such as open-ended questions or questions with many answer categories. Nevertheless, the measurement benchmark is an assumption made by the analyst and we therefore compare the results of the SMB and HMB in the case study, discussing differences in conclusions about the optimal design depending on measurement benchmark choice. In the discussion, we also return to the important aspect of choosing benchmarks for the evaluations of biases.
Finally, as noted in the second section, the respondents in the mixed-mode design also received a follow-up in the F2F mode (administered in parallel to the nonresponse follow up in F2F). The response rates across the full reinterview (second wave in F2F) are shown in Table 1 (last row). How these data are used to estimate MB and SB is discussed in more detail in the next section.
Definition and Estimation of Single- and Mixed-mode Effects
In this section, we explain how single- and mixed-mode bias components required to answer the RQs were estimated using the SMB and HMB benchmark. We first define all biases following the TSE framework, but in a mixed-mode context some extensions of the TSE notations are necessary (Biemer 2010; Biemer and Lyberg 2003; Groves and Lyberg 2010; Groves et al. 2009:48). Subsequently, we explain estimation of the biases against the SMB and HMB.
Defining Single- and Mixed-mode Bias of the Sample Mean
To simplify notation, we define the bias of a mixed-mode telephone–F2F survey, but the definitions for the other modes follow likewise. We are interested in the bias of the estimator of the response sample mean of a continuous or discrete survey variable Y. Let
where the binary random variable S_tel represents the response mechanism of telephone and S_tel = 1 indicates the group of respondents. The TB of
Of course, μ is not observed without external validation data. When using the SMB and HMB approach, the population mean, therefore, will be substituted by an estimator that is least biased based on the respective benchmark assumptions (see Estimation of single- and mixed-mode effects against the SMB subsection).
The TB can now be decomposed into its systematic components, where we distinguish SB and MB. In limiting our elaboration to these biases, we also assume that other sources of bias discussed under the TSE framework, such as specification and data processing error, are negligible or at least equal across modes. Then measurement and selection bias can be said to add up to total bias. Single-mode SB and MB follow as (addressing RQ2)
and
It can be seen that, following the TSE framework, SB is defined as the difference between the sample mean of the true score and the population mean, whereas MB represents the difference between the sample mean of the true score and the mean of the measured answers. MB and SB add up to TB.
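As a compact restatement of the verbal definitions above, the single-mode decomposition for telephone can be sketched as follows. The notation is an assumption made for illustration and need not match the article's own equations: T denotes the unobserved true score of Y, μ = E(T) is the population mean, and the sign of MB is chosen so that the two components sum to TB.

```latex
\begin{align*}
TB_{tel} &= E(Y_{tel} \mid S_{tel} = 1) - \mu \\
SB_{tel} &= E(T \mid S_{tel} = 1) - \mu \\
MB_{tel} &= E(Y_{tel} \mid S_{tel} = 1) - E(T \mid S_{tel} = 1) \\
TB_{tel} &= SB_{tel} + MB_{tel}
\end{align*}
```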
In the mixed-mode survey, the answers of follow-up respondents in F2F are added to the single-mode response set. The mean from the pooled mixed-mode survey can then be regarded as the mean of a mixture distribution of Y_tel and Y_f2f,
where the mixture constant π is defined by the expected proportion of single-mode respondents introduced in the third section (cf. Table 1 for sample estimates from the case study). The total bias of the mixed-mode mean now can be expressed as (addressing RQ1)
where
represents the TB of the follow-up mode. It can be seen that the relative impact of single-mode biases is reduced by the size of the response proportion π, but the sign and size of bias of the follow-up sample is crucial for the overall mixed-mode TB (for a numerical example, see Bethlehem and Biffignandi 2011:259). The mixed-mode SB and MB depend on the (weighted) follow-up mode bias in the same manner, that is (addressing RQ2)
where
and
where
In the next section, we discuss how these biases can be estimated. We address estimation against an SMB and HMB in turn.
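The mixture logic described above can be illustrated numerically. The following is a minimal sketch, assuming that the bias of the pooled mean is a weighted average of the single-mode bias and the follow-up bias. The weight is written here as the share of follow-up respondents, which may be parameterized differently from the article's π. All bias values are hypothetical and are not estimates from the study.

```python
def mixed_mode_bias(single_bias, followup_bias, followup_share):
    """Bias of the pooled mean as a weighted mix of the two response groups."""
    if not 0.0 <= followup_share <= 1.0:
        raise ValueError("followup_share must lie in [0, 1]")
    return (1.0 - followup_share) * single_bias + followup_share * followup_bias


# Hypothetical biases on a proportion scale (invented for illustration).
single_mb, followup_mb = 0.08, 0.00   # measurement bias: single mode, follow-up
single_sb, followup_sb = 0.01, -0.01  # selection bias: single mode, follow-up
share = 0.514                         # follow-up share reported for web

mm_mb = mixed_mode_bias(single_mb, followup_mb, share)
mm_sb = mixed_mode_bias(single_sb, followup_sb, share)
mm_tb = mixed_mode_bias(single_mb + single_sb, followup_mb + followup_sb, share)
print(round(mm_mb, 4), round(mm_sb, 4), round(mm_tb, 4))
```

Because the same weight applies to every component, the mixed-mode measurement and selection components in this sketch again sum to the mixed-mode total. A follow-up with zero measurement bias shrinks the single-mode measurement bias in proportion to the follow-up share.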
Estimation of Single- and Mixed-mode Effects Against the SMB
Setting a benchmark implies choosing two components. First, a mode is chosen whose measurements substitute for the true scores of Y. These observed scores are assumed to be close or equal to the true scores (cf. second section). Second, a mode is chosen that evokes acceptable selection bias. While acknowledging that the selection benchmark may not be free of SB (because it suffers from unit nonresponse itself), it is considered the mode with the best selection properties for the variable at hand. In principle, the measurement and selection benchmark modes may differ, in which case the combined benchmark is called hybrid. The SMB is the special case when both components are taken from the same mode.
In the present study, the F2F survey represents the SMB. In doing so, we set the F2F answers as the true score to assess comparability of single- and mixed-mode estimates of telephone (or the other modes, RQ1) as well as reasons for incomparability (RQ2). Thus, the population parameter μ is substituted by the F2F mean
and
Here, we use the operator
The bias estimates, equations (11) and (12), can also be called total mode effects (alternatively, “overall mode effects” or “mode system effects”; Biemer 1988; De Leeuw 2005; Schouten et al. 2013; Vannieuwenhuyze and Loosveldt 2013), and in answering RQ1 it is important to distinguish between the single- (equation (11)) and the mixed-mode effect (equation (12)). The primary difficulty in estimating the single-mode selection and measurement effect components of the total effect (addressing RQ2) is that after substitution of
is not observed in a simple comparative mixed-mode design. This can be seen when illustrating the missing data pattern for the SMB and mixed-mode samples (Figure 1a).

Figure 1. Missing data pattern of a simple comparative mixed-mode design (a) and a within-subject design (b) for estimating bias components against a single-mode benchmark (SMB; example of a telephone–face-to-face (F2F) mixed-mode design with F2F benchmark).
In the SMB sample, respondents provide answers on
In doing so, we assume that potential outcomes and unit nonresponse at the reinterview are missing at random (MAR) given observed outcomes Y_tel and
After imputation, the potential outcomes
which is indicated by shaded area B in Figure 1. These potential outcomes were imputed as part of the estimation procedure under the MAR assumption discussed above.
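Once F2F potential outcomes are available for the telephone respondents (observed at the reinterview or imputed), the effects against the SMB reduce to differences of means. The sketch below illustrates this under simplifying assumptions: it ignores survey weights and imputation uncertainty, the function and argument names are hypothetical, and the toy answers are invented.

```python
from statistics import fmean


def smb_effects(y_mode, y_f2f_potential, y_benchmark):
    """Total, selection, and measurement effect of one mode against an F2F benchmark.

    y_mode:          answers given in the focal mode by its respondents
    y_f2f_potential: F2F answers of the same respondents (reinterview or imputed)
    y_benchmark:     answers of the respondents to the F2F benchmark survey
    """
    total = fmean(y_mode) - fmean(y_benchmark)
    selection = fmean(y_f2f_potential) - fmean(y_benchmark)
    measurement = fmean(y_mode) - fmean(y_f2f_potential)
    return total, selection, measurement


# Toy 0/1 answers, invented for illustration only.
tel = [1, 0, 0, 1, 0, 0, 0, 1]
tel_in_f2f = [1, 1, 0, 1, 0, 1, 0, 1]
f2f = [1, 1, 0, 1, 1, 0, 0, 1, 1, 0]

total, selection, measurement = smb_effects(tel, tel_in_f2f, f2f)
print(total, selection, measurement)
```

By construction, the selection and measurement effects in this sketch sum to the total effect, mirroring the additive decomposition of TB into SB and MB.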
Estimation of Single- and Mixed-mode Effects Against the HMB
As discussed in the third section, we chose to use the web mode as a measurement benchmark for the HMB case, while keeping F2F as selection benchmark. As in the SMB case, web measurements now substitute for the true score, but the selection mechanism of F2F is still deemed optimal, so that the population parameter μ is substituted by the potential outcome

Figure 2. Missing data pattern of an extended within-subject design with two different mixed-mode samples (telephone and web) for use in the hybrid-mode benchmark (HMB) case (nonrespondents in the single-mode benchmark [SMB] sample and double-nonrespondents in the mixed-mode samples omitted).
To estimate single-mode selection and measurement effects of the telephone sample against the HMB, the potential outcomes
Practical Implementation of the Multiple Imputation Procedure
In the context of the missing data pattern shown in Figure 2, auxiliary information
In MICE, prediction models with an appropriate link function for each imputed variable were specified. Since the CVS variables were measured on polytomous, dichotomous, or interval scales, we applied multinomial, logistic, and normal regression models, respectively.
A crucial element of the multivariate imputation is the selection of predictor variables. Given the large number of possible predictors in the data set, guarding against overspecified models was important. To restrict the number of predictors, we applied the following procedure: Mode-specific potential outcomes of Y at occasion 1 were predicted by their repeated measure
Eight sociodemographic background characteristics were available as additional predictors from the national population register for all units. Any of these background characteristics exceeding a small association of V > .15 was included as predictor for any Y variable.
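The screening rule for background characteristics can be sketched in a few lines. The example assumes that V refers to Cramér's V, which the text does not state explicitly, and it works on cross-tabulations of one background variable with one target variable. The variable names and counts are invented for illustration.

```python
from math import sqrt


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    if n == 0:
        return 0.0
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


def select_predictors(tables, threshold=0.15):
    """Names of background variables whose V with the target exceeds the threshold."""
    return [name for name, table in tables.items() if cramers_v(table) > threshold]


# Invented cross-tabulations of two background variables with one target variable.
tables = {
    "age_group": [[30, 10], [10, 30]],
    "gender": [[20, 20], [21, 19]],
}
selected = select_predictors(tables)
print(selected)
```

In this toy example only the first variable passes the threshold, so only it would enter the imputation model for that target variable.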
Fifty data sets were multiply imputed. The proportion of missing data was high in the present study because potential outcomes were imputed across four samples. However, the fraction of missing information, a model-based estimate of missingness (Rubin 1987; van Buuren 2012:41), was below 50 percent in most cases. This fraction suggests that when using 50 multiply imputed data sets, estimates of total variance are precise. To estimate the within-imputation variances, all effects were bootstrapped by 1,000 iterated draws within each imputed data set. The within- and between-imputation variances were pooled to the total variance using Rubin’s rules (Schafer 1997:109-10). Two-sided significance tests were executed using t-tests with adjusted degrees of freedom.
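The pooling step can be sketched as follows. This is a minimal illustration of Rubin's rules for a scalar estimate, assuming one point estimate and one within-imputation variance per imputed data set. The degrees of freedom use the classic large-sample formula, which may differ from the adjustment applied in the study, and the input values are invented.

```python
from statistics import fmean, variance


def pool_rubin(estimates, within_variances):
    """Pool m point estimates and their within-imputation variances (Rubin's rules)."""
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = fmean(estimates)      # pooled point estimate
    w = fmean(within_variances)   # average within-imputation variance
    b = variance(estimates)       # between-imputation variance (m - 1 denominator)
    t = w + (1.0 + 1.0 / m) * b   # total variance
    # Classic large-sample degrees of freedom (assumption: the study may use another).
    df = (m - 1) * (1.0 + w / ((1.0 + 1.0 / m) * b)) ** 2 if b > 0 else float("inf")
    return q_bar, t, df


# Invented estimates from three imputed data sets.
q_bar, total_var, df = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(q_bar, total_var, df)
```

In the study's setting, the within-imputation variances would come from the 1,000 bootstrap draws per imputed data set, and m would be 50.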
Another relevant issue in estimation was the treatment of item nonresponse due to “don’t know” (DK) answers or refusals. In principle, there is no unique way of handling this problem, because DK answers may be regarded either as missing information or as a substantive answer (i.e., as an answering category). The results presented are based on imputed item nonresponse as part of the missing data correction. However, alternative results (not shown here) that treated DK as a separate answering category did not differ strongly.
Results
In this section, we answer RQ1 and RQ2 for the case of the CVS experiment, discussing the F2F SMB and the web/F2F HMB separately. We note again that these analyses are based on different benchmark assumptions and research interests, as exposed in the previous sections. After presenting results, we discuss implications for the CVS under both perspectives.
Effects Against the F2F SMB
RQ1 (Is the mixed-mode survey needed or would a single-mode survey suffice in terms of accuracy?) requires estimating the total effect of using a different focal mode (web, mail, telephone) than the F2F benchmark and assessing the impact of the mixed-mode design on the single-mode effect. Of the 30 included CVS variables, 2 variables were measured on a dichotomous scale and 23 variables were measured on polytomous answering scales. The polytomous variables were dichotomized in line with reporting by Statistics Netherlands (cf. Online Appendix for an overview), so that the dichotomized target statistic is a proportion (e.g., the proportion “agree or completely agree”). Four further variables are summary scales measured at the interval level, and one represents a count of victimizations in the past year.
Table 2 presents a summary of the significant single- and mixed-mode effects on these variables. A majority of variables showed total effects against F2F in web (20) or mail (16), but fewer variables were affected in telephone (7). The counts beneath these numbers inform about the impact of the mixed-mode follow-up. We distinguish four possibilities: A mixed-mode effect is insignificant after the follow-up (
Table 2. Count of Significant and Nonsignificant Single-mode Total Effects and the Change Induced by the F2F Follow-up (Mixed-mode Effects) for the SMB and the HMB Case.
Note: F2F = face-to-face; HMB = hybrid-mode benchmark; SMB = single-mode benchmark.
Significance tests at p < .05.
Table 2 does not allow an assessment of the size of effects, however. For this purpose, and answering RQ2 subsequently, we employ three scatterplots of single-mode against mixed-mode effects showing estimates for the 25 variables available on a dichotomous scale (Figure 3, upper row). The lower row of scatterplots presents t-statistics for each variable and may be evaluated against critical values from the t-distribution. Critical values for a two-sided test (p < .05) are provided by horizontal (single-mode) and vertical (mixed-mode) dashed lines. The diagonal line in all plots has slope one (i.e., it is not a regression line). Deviations from the line, therefore, imply change in effects by the follow-up.

Figure 3. Scatterplots of single-mode against mixed-mode effects for the single-mode benchmark (SMB) case (upper row: unstandardized effects; lower row: standardized effects (t-statistics), where dashed lines indicate critical values; p < .05).
Consider first the plot of single- against mixed-mode total effects (upper left hand). Single-mode total effects of web and mail are substantially larger than for telephone in many cases. Moreover, the seven significant single-mode total effects for telephone (Table 2) are found to be of smaller magnitude than for mail and web. Second, the impact of the mixed-mode follow-up is apparent for both web and mail, but not telephone, as estimates are moved toward zero mixed-mode effects (i.e., the horizontal axis). This effect is particularly pronounced for web, suggesting that the mode profits more strongly from the F2F follow-up in reducing total effects.
RQ2 asks about the sources of the total effects we identified (What are the major systematic sources of error in single-mode surveys and how are they impacted by the mixed-mode follow-up?). This question is addressed by the middle (selection effects) and right plots (measurement effects). It is immediately clear that selection effects were very small and that measurement effects were the dominant component in creating effects between the SMB and the three focal modes. It is important to emphasize that the F2F follow-up appeared effective in reducing measurement effects because the follow-up is conducted in the measurement benchmark mode. The follow-up mode measurement effect (equation (10)) was indeed small in all cases (not shown here). The web mixed-mode sample showed a substantially higher share (51.4 percent) of F2F follow-up respondents, suggesting that single-mode measurement effects were reduced by roughly this factor, whereas mail and telephone were impacted less strongly (26.7 percent and 25.4 percent, respectively; cf. Table 1).
Effects Against the HMB
The HMB takes web measurements as benchmark while allowing for the selection mechanism of F2F. Significance tests of the total effects against the HMB are provided on the right-hand side of Table 2. In addition to the three mixed-mode designs, effects for the single-mode F2F survey are shown against the HMB. It is apparent that for telephone and F2F, a large number of variables (21 and 23, respectively) showed significant total mode effects, whereas for mail fewer variables (10) reached significance. There were no significant total effects for web. With respect to web, it should be noted that only single-mode selection effects could have caused a total effect, given that web measurements are used as benchmark.
The F2F follow-up to the three modes was mainly ineffective or harmful. For web, it was particularly harmful, increasing total effects in 10 cases. Similarly, for mail it increased total effects on six variables while reducing them on only two. For telephone, it was mainly ineffective in reducing the total effect against the hybrid web-F2F benchmark.
For a more detailed picture, we consider the scatterplots of the three single- and mixed-mode effects for the HMB case (Figure 4). Since the selection benchmark did not change, selection effects against F2F were small and insignificant, as in the SMB case. However, telephone showed large measurement effects against the web measurement benchmark and the F2F follow-up was ineffective. Mail showed smaller single-mode measurement effects against web or no effects. However, it can be seen that the F2F follow-up increased measurement and total effects on many mail variables (cf. points below the diagonal line), reflecting that the F2F follow-up was not beneficial to the mail single-mode bias.

Figure 4. Scatterplots of single-mode against mixed-mode effects for the hybrid-mode benchmark (HMB) case (upper row: unstandardized effects; lower row: standardized effects (t-statistics), where dashed lines indicate critical values; p < .05).
In addition, it can be seen that web, as measurement benchmark, does not exhibit single-mode measurement effects. Measurement effects are caused by the F2F follow-up, however, reflecting that F2F measurements showed a bias against the web benchmark. For this reason, the mixed-mode total effect of web is determined by the follow-up measurement effect of F2F against the HMB (i.e.,
Evaluation of Effects by Variable Groups
Finally, we consider measurement and total effects by the type of variables and statistics reported in the abovementioned analyses (Table 3). Significance of measurement and total effects is evaluated separately. It can be seen that in some cases a total effect did not imply a measurement effect (e.g., 11 of the 17 significant web total effects imply a measurement effect). For these cases, a clear conclusion about the source of TB against the benchmark cannot be drawn (i.e., a selection effect may represent an alternative explanation). However, for the majority of variables, a total effect did imply a measurement effect, reflecting the observations from Figures 3 and 4. In addition, considering the clear measurement effect pattern of both figures, it is plausible that measurement effects also underlie the insignificant cases, where a total effect is observed but remains within observable differences.
Table 3. Count of Significant Single-mode Measurement/Total Effect Estimates against the SMB and HMB by Variable Types.
Note: For full details on all items, see the Online Appendix. incl. = including.
a. Likert scale items: percentage (completely) agree (five answering categories).
b. Frequency scale items: percentage (frequently or sometimes; three answering categories).
c. Insecurity feelings: two items (percentage yes), two items (percentage frequently or sometimes).
d. Percentage very satisfied or satisfied.
e. Percentage victim in the past 12 months (aggregated across multiple items).
f. Count of victimization past 12 months (aggregated across multiple items).
g. Score on 10-point scale from very low (1) to very high (10).
h. Aggregated summary indices based on multiple social quality and neighbourhood problems items.
Significance tests at p < .05.
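The kind of tally summarized in Table 3 can be sketched as follows. This is a minimal illustration, not the article's procedure: it assumes that each variable has one t-statistic for the total effect and one for the measurement effect, and that significance at p < .05 is judged against a two-sided normal critical value of 1.96. The function name and the input values are hypothetical.

```python
CRITICAL_T = 1.96  # assumed two-sided critical value for p < .05


def count_significant(t_total, t_measurement, critical=CRITICAL_T):
    """Return (n_total, n_measurement, n_both) counts of significant effects."""
    if len(t_total) != len(t_measurement):
        raise ValueError("both lists need one t-statistic per variable")
    sig_total = [abs(t) > critical for t in t_total]
    sig_meas = [abs(t) > critical for t in t_measurement]
    # Variables where a significant total effect coincides with a
    # significant measurement effect.
    n_both = sum(a and b for a, b in zip(sig_total, sig_meas))
    return sum(sig_total), sum(sig_meas), n_both


if __name__ == "__main__":
    # Hypothetical t-statistics for four variables.
    print(count_significant([2.5, -3.0, 1.0, 2.1], [2.2, -1.0, 0.5, 2.0]))
```

A total effect that is significant without a significant measurement effect (the second variable in the hypothetical input) corresponds to the cases for which the source of the effect cannot be determined clearly.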
We found that, regardless of mode, benchmark, type of variable, and answering scale, all CVS variables may have been subject to a measurement effect. In particular, the two groups of questions on the social quality and problems of the neighborhood appeared susceptible to measurement effects. However, not all questions of any given group showed clear measurement (or total) effects. We may conclude that measurement effects appear to be a general but still question-dependent phenomenon in the CVS. However, the strong presence of web and mail measurement effects against F2F, and of telephone and F2F effects against the HMB (web measurements), is again evident.
The only characteristics that were not affected by measurement effects across modes were two “victimization” variables (victim of crime past 12 months/count of victimization), which represent key statistics in the CVS. The insensitivity to effects is generally a positive result. However, it should be noted that not all standard victimization questions could be included in the reinterview design and the two index variables are based on shortened versions of the standard statistic. For this reason, the present findings about victimization should be interpreted with care.
Conclusions for the CVS
In drawing conclusions for the CVS, it is important to recall the objectives of the SMB and HMB. In the SMB case, both measurement and selection mechanisms are taken from the same mode (F2F, in this study). F2F can be considered a historical benchmark for the CVS, which was an F2F survey upon first introduction. This study revealed that when using the web or the mail mode instead of F2F, it is impossible to avoid a strong change in statistics for a large number of CVS variables (RQ1). The reason for these effects was an increase in MB in web and mail relative to F2F. For telephone, measurement effects were also present, but on a much smaller scale and in smaller number. In conclusion, when F2F is the desired benchmark estimate, use of single-mode web and mail should be avoided. Use of telephone may be viable when accepting some smaller systematic changes.
The classical motivation of a sequential F2F follow-up is reducing single-mode SB (RQ2). Selection effects against F2F, however, were not identified at a statistically significant level in this study. Still, estimates from the web and mail mixed-mode surveys were often closer to the SMB than the single-mode estimate alone, because the F2F follow-up provided measurements that were very similar to the single-mode F2F benchmark, as can be expected. To achieve this effect, an F2F follow-up would be chiefly desirable for both modes, but not for reducing single-mode selection effects. However, there remain a number of relatively large total effects, suggesting that this procedure cannot fully compensate for the MB created by web and mail. The response proportion π determines the strength of follow-up mode impact. In future research, it is important to evaluate the mix of biases and the role of π further.
The objective of the HMB is to optimize a benchmark with respect to both measurements and selection mechanisms. Under the assumption that web measurements are superior to F2F, for example, due to anonymous answering, we used web as measurement benchmark instead of F2F. This change strongly affects conclusions about the CVS. Since we did not identify any selection effects against F2F, the optimal mode would now be a single-mode web survey. Furthermore, using telephone or F2F would imply increased MB and should be avoided. However, the mail mode showed only smaller effects against the HMB. These were primarily limited to a single group of questions (“neighborhood problems,” Table 3). In many cases, mail may, therefore, yield estimates similar to those of web. However, for both mail and web, the F2F follow-up would only introduce MB. Therefore, a mixed-mode design involving F2F should be avoided. Moreover, given these findings, web and mail may themselves be compatible for sequential mixed-mode surveys. In the absence of selection effects, this may seem unnecessary. However, web yielded a low response rate (29.0 percent), which may be raised by including a mail follow-up, for example (Millar and Dillman 2011). The mode effects in this design could not be evaluated in the present study.
In sum, these results reflect theoretical and empirical arguments in the literature that self-administered and interviewer modes often form a dichotomy with respect to measurement bias (De Leeuw 1992, 2008; Klausch et al. 2013), but also suggest that there may be exceptions to the rule and that mode effects remain question-dependent phenomena. A question-specific evaluation of effects is, therefore, necessary for any mixed-mode survey. In the CVS, a ‘non-interviewer mode only’ redesign was eventually chosen based on these results and practical considerations.
Discussion
Evaluating TB, MB, and SB of mixed-mode designs before their introduction or redesign is a problem of great concern (RQ1 and RQ2). In the absence of true scores, we suggested using as benchmarks the measurements and selection mechanisms that are defined as optimal, yielding either an SMB or an HMB. It is important to distinguish between single-mode and mixed-mode effects against the SMB and HMB. Evaluating single-mode effects is relevant to assess the need for mixed-mode designs to reduce bias, while mixed-mode effects inform about the success of the procedure. In doing so, selection and measurement effects indicate the sources of the observed total mode effects.
A crucial first step in the evaluation of mode effects is the choice of measurement and selection benchmarks, which lead to either an SMB, if both elements are taken from the same mode, or an HMB, if elements are taken from different modes. In the third section, we provided some discussion of how the choice of benchmarks was motivated in the present case study. However, in general terms, the choice strongly depends on the particular survey, questionnaires, and properties of the population, and thus may be different in other contexts. In the absence of any further information on mode-specific quality of measurement or selection, a choice of benchmark needs to be based on heuristic arguments, such as the degree of sensitivity of questions (e.g., choose a self-administered mode as measurement benchmark if sensitivity is high) and response rates (e.g., choose modes with high response rates as selection benchmark). In addition, comparing conclusions under different choices of benchmarks, as in the present study, may give analysts an idea of the importance and implications of benchmark choices.
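How selection and measurement effects can account for a total mode effect may be easier to see in a small sketch. It uses a common additive decomposition on sample means and is an illustration under assumptions, not a reproduction of the article's estimators, which are defined in its earlier equations. The function name, argument names, and toy data are hypothetical, and the benchmark-mode answers of focal-mode respondents are treated as already available (in practice they are potential outcomes that must be estimated, e.g., from reinterview data).

```python
from statistics import mean


def decompose(y_focal_on_focal, y_bench_on_focal, y_bench_on_bench):
    """Split a total mode effect into measurement and selection parts.

    y_focal_on_focal: focal-mode answers of focal-mode respondents
    y_bench_on_focal: benchmark-mode answers of the same respondents
                      (a potential outcome, assumed known here)
    y_bench_on_bench: benchmark-mode answers of benchmark respondents
    """
    measurement = mean(y_focal_on_focal) - mean(y_bench_on_focal)
    selection = mean(y_bench_on_focal) - mean(y_bench_on_bench)
    total = mean(y_focal_on_focal) - mean(y_bench_on_bench)
    return {"total": total, "measurement": measurement, "selection": selection}


if __name__ == "__main__":
    # Toy binary answers (1 = agree); all values are made up for illustration.
    effects = decompose([1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 0, 1])
    print(effects)
```

By construction, the measurement and selection parts sum to the total effect. For a hybrid benchmark, the third argument would hold answers in the measurement benchmark mode from respondents selected by the selection benchmark mode.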
Given the large number of potential sources of measurement and selection bias, however, heuristic arguments alone may be insufficient. A relevant path for future research is, therefore, the development of methodology for choosing selection and measurement benchmarks. For example, alternative indicators of representativity, such as “R-indicators” and the fraction of missing information (Schouten, Cobben, and Bethlehem 2009; Wagner 2012), and measurement quality, such as answering behaviors and satisficing (Krosnick 1991) or complex measurement models (Klausch et al. 2013; Revilla 2010; Saris and Gallhofer 2007), may be useful to support benchmark mode choice.
A second step in the evaluation of mode effects is estimation of total effects and their disentanglement into selection and measurement components. In the SMB case, total mode effects may be estimated from simple comparative designs. However, evaluating measurement and selection effects, as well as effects against an HMB, requires estimating potential outcomes (cf. Estimation of single- and mixed-mode effects against the SMB subsection). This objective necessitates additional auxiliary data. We suggested using reinterview data for this purpose. Our design is related to other reinterview methods for bias estimation including the basic question approach (Kersten and Bethlehem 1984), the callback approach (Elliott, Little, and Lewitzky 2000; Hansen and Hurwitz 1946; Keeter et al. 2000), and test–retest designs (Biemer and Lyberg 2003:291), but also features important differences. The basic question and callback approaches involve reinterviews of nonrespondents to estimate nonresponse bias. Test–retest designs normally aim at estimating measurement error. In doing so, measurement independence is often assumed. Our design allows estimating both measurement and selection effects against benchmarks. Furthermore, by modeling potential outcomes at the first occasion, we explicitly allowed for the possibility that reinterview measurements and selection mechanisms can change between initial interview and follow-up. Change may occur if follow-up F2F respondents provide different answers than in the benchmark mode (e.g., due to experienced response burden). An alternative explanation is substantial change in statistics across time, but such change is often likely to be small in the study period of sequential mixed-mode surveys (two months, in the present study).
An advantage of the design is that it is tailored for use in parallel to sequential mixed-mode surveys and can be implemented even for ongoing surveys without affecting the standard fieldwork (i.e., the benchmark sample is independent and the reinterview does not impact the standard mixed-mode fieldwork). Furthermore, our design does not require the follow-up to be conducted in the measurement and selection benchmark mode. Although in the current case study, F2F served as both the SMB and follow-up mode, estimation would still be possible with a different follow-up or benchmark mode. For example, a design using telephone as follow-up to web, while F2F remains the SMB, can also be evaluated if the reinterview is conducted by telephone.
Another advantage of our design is that it allows for a structured view on design decisions in mixed-mode surveys. We demonstrated how single- and mixed-mode designs can be evaluated against SMB and HMB and how a top-down evaluation can be performed (i.e., from total mode effects to their selection and measurement components). Such a top-down approach supports data collection and questionnaire designers. A decisive role is played by the size of the experimental samples and the effect sizes that are required to be observable. Future research should evaluate minimum mode-specific sample size requirements, also considering costs of the experimental design.
Our results and design should be judged against a number of limitations that point to paths for further research. First, the size of effects should be evaluated against costs and budgets. For example, F2F is the most expensive mode in data collection, whereas web is inexpensive. The web-F2F mixed-mode design could reduce bias compared to single-mode web (SMB case), but the F2F follow-up response (cf. Table 1) may cause substantial additional costs over single-mode web. Further research, therefore, needs to weigh mode effects against budget constraints (Vannieuwenhuyze 2014).
Second, the reinterview measurements,
Third, our results may depend on the specific estimation procedure (multiple imputation) and related modeling decisions. For example, we included
The evaluation of bias in mixed-mode surveys will continue to be of concern for methodologists and practitioners. In this respect, further development of our method is desirable. If the strong prevalence of measurement effects should prove a problem in many mixed-mode surveys, developing adjustment methodology for measurement effects is necessary. Such methodology could use Bayesian approaches that yield multiple imputations of potential outcomes in mixed-mode surveys. Developing and evaluating these methods appears urgent in light of our empirical results.
Footnotes
Acknowledgment
The authors would like to thank Stef van Buuren, Shahab Jolani, and Gerko Vink for their helpful discussions on the missing data problem described in this article. The comments of one anonymous reviewer greatly helped to improve the manuscript.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The PhD research during which this article was written by Thomas Klausch at Utrecht University was financed partly by Statistics Netherlands.
