Evaluating Relative Mode Effects in Mixed-Mode Surveys:

Abstract

In order to investigate the advantage of mixed-mode (MM) surveys, selection effects between the modes should be evaluated. Selection effects refer to differences in respondent compositions on the target variables between the modes. However, estimation of selection effects is not an easy task because they may be completely confounded with measurement effects between the modes (differences in measurement error). Publications concerning the estimation of these mode effects are scarce. This article presents and compares three methods that allow measurement effects and selection effects to be evaluated separately. The first method starts from existing publications that avoid the confounding problem by introducing a set of mode-insensitive variables into the analysis model. However, this article will show that this method involves unrealistic assumptions in most practical research. The second and the third methods make use of an MM sample extended by comparable single-mode data. The assumptions, advantages, and disadvantages of all three methods are discussed. Each method will further be illustrated using a set of six variables relating to opinions about surveys among the Flemish population. The results show large differences between the methods.

Keywords

mixed-mode, selection effects, measurement effects, mode effects, opinion about surveys

Introduction

Mixed-mode (MM) surveys are surveys in which data from different respondents are collected by different data collection modes (e.g., Computer-Assisted Personal Interviewing [CAPI], Computer-Assisted Telephone Interviewing [CATI], Mail Self-Administered Questionnaires [MSAQs], or Web Self-Administered Questionnaires [WSAQs]). Such survey designs are increasingly used to collect data from large populations (Börkan 2010; de Leeuw 2005; Hayashi 2007; Roberts 2007) because they may offer two advantages over single-mode surveys (Biemer 2001; de Leeuw 2005; Dillman, Phelps, et al. 2009; Voogt and Saris 2005). First, MM surveys may reduce nonresponse error: sample members with particular mode preferences may be willing or able to respond to an MM survey but not to a single-mode survey. In this case, the MM survey offers greater (external) validity than the single-mode survey. Second, MM surveys may lower the total survey cost or increase statistical power because some respondents use a cheaper mode. In this case, the MM survey increases the (external) reliability of the data compared with the single-mode survey.

Both reasons show that, to determine whether MM surveys are advantageous over single-mode surveys, differences in selection error between the modes should be evaluated. Selection error occurs when the sample is not fully representative of the population on the target variables. The difference in selection error between the modes is called the selection effect; selection effects thus refer to differences on the target variables between the respondents allocated to the different modes of the MM design. If selection effects are absent, a single-mode design exists that provides equally representative data at lower cost than the MM design.

However, evaluating selection effects in MM data is difficult because they may be completely confounded with another type of mode effect, namely measurement effects (de Leeuw 2005; Dillman, Smyth, and Christian 2009; Voogt and Saris 2005; Weisberg 2005). Measurement effects are differences in measurement error accompanying different survey modes (Voogt and Saris 2005; Weisberg 2005). Put differently, measurement effects occur when the same respondents would give different answers in different modes. As a consequence, differences between the respondents of the different mode groups may be due either to differences in respondent characteristics (i.e., a selection effect) or to differences in measurement (i.e., a measurement effect). Ideally, no measurement effects should occur, because they may counteract the beneficial selection effects and preclude comparability with other survey data.

The existing literature pays relatively little attention to disentangling selection effects from measurement effects. This article aims to fill this gap by presenting and illustrating three procedures to evaluate both mode effects separately. The third section describes these methods and examines their underlying assumptions, advantages, and disadvantages. One of these methods relies solely on an MM data set, whereas the other two make use of an additional single-mode comparative sample. In the fourth section, we compare the mode effects estimates of all three methods on six items relating to opinions about surveys. We start, however, with an overview of some workable formal definitions of mode effects.

Defining Mode Effects

The existing literature on MM surveys does not provide any formal definitions of mode effects. This section fills this gap by providing some possible formalizations. For simplicity, we restrict the number of survey modes to two, referred to as mode $a$ and mode $b$.

The problem of MM surveys relates to the statistical literature on causal inference (Morgan and Winship 2009; Pearl 2009; Weisberg 2010) and can be represented by graphs (Pearl 2009). Figure 1a summarizes the process underlying survey data in the ideal situation where the responses of all respondents are observed in both modes. $X$ denotes the observed set of target variables and depends on two sources. First, $X$ depends on the survey administration mode, denoted by $A$. The effect of $A$ on $X$ is the measurement effect between the two modes, because it means that population members’ outcomes may differ between the modes. We further assume that people’s responses do not vary for a given mode; this assumption is part of the so-called stable-unit-treatment-value assumption in the causal inference framework (Rubin 1991). Second, $X$ may differ according to the mode group $G_{δ}$ to which a population member would be allocated if he or she were a sample member of the MM survey design $δ$. The subscript $δ$ is a reminder that the mode group variable is design-specific. For example, some respondents might respond by a different mode in a concurrent MM design than in a sequential MM design. The relation between $G_{δ}$ and $X$ reflects differences between the two groups and thus implies a selection effect.

Figure 1.

Process underlying data in a mixed-mode context. (a) When all modes are observed for all respondents. (b) In a mixed-mode data set.

The marginal measurement effect on a function $f(X)$ of $X$ (e.g., the probability function or the mean) can now be defined as the difference between both modes of administration:

$$M(f(X)) = f(X | A = a) - f(X | A = b).$$

However, within MM data, the researcher might only be interested in a measurement effect conditional on the mode group $G_{δ}$ rather than in the marginal measurement effect. For instance, one mode may be taken as the standard because the researcher believes that this mode does not involve measurement error. If mode $b$ is the standard mode, the researcher might only be interested in the measurement effect for the population members in mode group $a$. The conditional measurement effects are thus defined as

$$M_{a}(f(X)) = f(X | A = a, G_{δ} = a) - f(X | A = b, G_{δ} = a),$$

and

$$M_{b}(f(X)) = f(X | A = a, G_{δ} = b) - f(X | A = b, G_{δ} = b).$$

Depending on the interest of the researcher, one of these three formulas can be chosen to evaluate the measurement effect.

Likewise, the selection effect can be defined in three ways. The marginal selection effect can be defined as the difference between both mode groups:

$$S(f(X)) = f(X | G_{δ} = a) - f(X | G_{δ} = b).$$

However, it makes more sense to condition the selection effect on the mode of administration $A$, because survey data are usually observed through a single mode for each respondent. The conditional selection effects are

$$S_{a}(f(X)) = f(X | A = b, G_{δ} = a) - f(X | A = b, G_{δ} = b),$$

and

$$S_{b}(f(X)) = f(X | A = a, G_{δ} = a) - f(X | A = a, G_{δ} = b).$$

Once again, the analyst can choose any of these three definitions according to his or her needs.
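The definitions can be made concrete with a small Python sketch. It takes $f$ to be the mean and assumes a hypothetical population in which both potential answers of every member are known, which is only possible in the ideal situation of Figure 1a; all names and values are invented for illustration. The marginal selection effect $S$ is left out because $f(X | G_{δ})$ additionally requires a choice of mode.

```python
from statistics import mean

# Hypothetical population (invented values). For each member we know the mode
# group g under design delta and BOTH potential answers: x_a (if asked by mode a)
# and x_b (if asked by mode b). Real MM data reveal only one of the two.
POPULATION = [
    {"g": "a", "x_a": 4, "x_b": 3},
    {"g": "a", "x_a": 5, "x_b": 4},
    {"g": "a", "x_a": 3, "x_b": 3},
    {"g": "b", "x_a": 3, "x_b": 2},
    {"g": "b", "x_a": 2, "x_b": 2},
    {"g": "b", "x_a": 4, "x_b": 3},
    {"g": "b", "x_a": 3, "x_b": 1},
]


def f(mode, group=None):
    """Mean of X | A = mode, optionally restricted to G_delta = group."""
    return mean(p["x_" + mode] for p in POPULATION
                if group is None or p["g"] == group)


def mode_effects():
    """Marginal measurement effect plus the four conditional effects."""
    return {
        "M": f("a") - f("b"),
        "M_a": f("a", "a") - f("b", "a"),
        "M_b": f("a", "b") - f("b", "b"),
        "S_a": f("b", "a") - f("b", "b"),  # both groups measured by mode b
        "S_b": f("a", "a") - f("a", "b"),  # both groups measured by mode a
    }


effects = mode_effects()
observed = f("a", "a") - f("b", "b")  # the only difference plain MM data show
print(effects)
print("observed overall mode effect:", observed)
```

With these invented values, $M_{a} = 2/3$, $S_{a} = 4/3$, and $M_{b} = S_{b} = 1$, so both pairs add up to the observed difference of 2 even though their components differ.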

Ordinary MM data, however, do not allow selection effects and measurement effects to be estimated when only the variables $X$, $A$, and $G_{δ}$ are considered. As illustrated in Figure 1b, the survey design implies that the allocated mode group $G_{δ}$ fully determines the mode of administration $A$ for every respondent: by definition, all respondents in mode group $a$ complete the survey by mode $a$ rather than mode $b$, and vice versa. This perfect relation is represented by the double arrow. As a consequence, a collinearity problem arises, and selection effects and measurement effects are completely confounded (Pearl 2009). It follows directly from the definitions above that the overall mode effect in MM data (i.e., the observed difference between the respondents of both modes) can be expressed in terms of the conditional measurement and selection effects:

$$\begin{aligned} f(X | A = a, G_{δ} = a) - f(X | A = b, G_{δ} = b) &= M_{a}(f(X)) + S_{a}(f(X)) \\ &= M_{b}(f(X)) + S_{b}(f(X)). \end{aligned}$$

In this equation, however, the conditional measurement and selection effects are unknown, because they require estimating $f(X | A = b, G_{δ} = a)$ or $f(X | A = a, G_{δ} = b)$. In the causal inference literature, these quantities are called counterfactuals because they are never observed: they refer to the potential outcomes respondents would have had if a different mode had been used (Holland 1988; Rosenbaum and Rubin 1983; Rubin 1974).
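The decomposition follows from adding the definitions term by term. For the first pair, the counterfactual $f(X | A = b, G_{δ} = a)$ appears once with each sign and cancels:

```latex
\begin{aligned}
M_{a}(f(X)) + S_{a}(f(X))
  &= \left[ f(X | A = a, G_{δ} = a) - f(X | A = b, G_{δ} = a) \right] \\
  &\quad + \left[ f(X | A = b, G_{δ} = a) - f(X | A = b, G_{δ} = b) \right] \\
  &= f(X | A = a, G_{δ} = a) - f(X | A = b, G_{δ} = b).
\end{aligned}
```

In the same way, $f(X | A = a, G_{δ} = b)$ cancels in $M_{b}(f(X)) + S_{b}(f(X))$. Because only the sum is observed, any split of the observed difference into a measurement part and a selection part fits the MM data equally well.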

Three Methods to Evaluate Selection Effects and Measurement Effects

This section discusses three methods to disentangle selection effects from measurement effects. The first method, MM Calibration, follows the existing MM literature, which usually controls for mode-insensitive variables in order to neutralize selection effects. However, we will argue that the assumptions of this method are probably unrealistic in most practical situations. The second method, Extended MM Comparison, avoids the assumptions of the first method by comparing MM data with comparable single-mode data. The third method, Extended MM Calibration, estimates the mode effects by predicting the mode group to which the respondents in the single-mode data set would have belonged had they been sample members of the MM survey.

Method 1: MM Calibration

Existing research into mode effects generally tries to avoid the confounding of mode effects in ordinary MM data sets by making both mode groups comparable on a set of variables $Z$ (e.g., Fricker et al. 2005; Greenfield, Midanik, and Rogers 2000; Hayashi 2007; Heerwegh and Loosveldt 2011; Holbrook, Green, and Krosnick 2003; Jäckle, Roberts, and Lynn 2010; Lugtig et al. 2011), using techniques from the nonresponse and causal inference literature (Little and Rubin 2002; Morgan and Winship 2009; Pearl 2009; Rubin 1976; Schafer and Graham 2002). It is assumed that, after controlling for $Z$, the remaining differences between the mode groups are caused by measurement effects, because both mode groups then contain comparable sets of respondents. Mode groups can be made comparable by, for instance, matching respondents in both mode groups on $Z$, weighting the mode groups on $Z$, or including $Z$ in the analysis model explaining $X$. We refer to these techniques as MM Calibration because they all try to calibrate the mode groups of an MM sample.

Introducing $Z$ leads to the model in Figure 2. This model allows selection effects and measurement effects to be evaluated separately because the selection effects are completely channeled through $Z$; the remaining differences between the respondents of both mode groups are thus measurement effects. However, this holds only under two assumptions, which are reflected by the absence of some arrows (Pearl 2009).

Figure 2.

Analysis model for mixed-mode data controlled for mode-insensitive variables $Z$.

First, there is no arrow between $G_{δ}$ and $X$. This means that, after controlling for $Z$, the mode group $G_{δ}$ should relate to $X$ only through $A$; put differently, given $Z$ and $A$, $G_{δ}$ and $X$ should be independent. This assumption is called the ignorable treatment assignment assumption (Rosenbaum and Rubin 1983), because controlling for $Z$ renders both mode groups comparable with respect to their respondent composition. The less this assumption holds, the more likely it is that the remaining differences between the respondents are still partly caused by selection effects in addition to measurement effects. Moreover, as long as both are present, it remains unclear to what extent these remaining differences are caused by measurement effects or by selection effects.

Second, $Z$ must be mode-insensitive, which means that respondents always give the same answers to these variables, regardless of the mode by which they complete the survey. We call this the mode-insensitivity assumption; it is reflected by the absence of an arrow between $A$ and $Z$. The less this assumption holds, the more likely it is that part of the relation between $G_{δ}$ and $X$ through $Z$ is caused by measurement effects in addition to selection effects. Once again, it then remains unclear to what extent the relation between $G_{δ}$ and $X$ is caused by measurement effects or by selection effects.

In most practical situations, these assumptions are probably problematic because it is very difficult to find a set of mode-insensitive variables that explains the relation between $G_{δ}$ and $X$ to a satisfactory extent. Registry data or paradata may satisfy the mode-insensitivity assumption, but they offer only a limited solution because the ignorable treatment assignment assumption must still hold. Moreover, it remains unclear in which way and to what extent violations of the two assumptions affect the outcomes. As a consequence, existing research has generally failed to uncover both measurement and selection effects.

Method 2: Extended MM Comparison

A main problem of MM Calibration is the perfect relation between $G_{δ}$ and $A$, which occurs because comparable respondents (i.e., respondents from the same mode group $G_{δ}$) are always measured by the same mode. Disentangling mode effects in MM data may therefore require adapting the survey design to break the link between $G_{δ}$ and $A$. Vannieuwenhuyze, Loosveldt, and Molenberghs (2010) provide a possible solution by adding to the survey design a subsample that is approached by one single mode only, for example mode $a$. This subsample is called the comparative sample because it is compared with the MM data. This solution extends the original model with a variable $D$, which indicates whether a respondent is a member of the single-mode (comp) subsample or the MM (mm) subsample (see Figure 3). Variable $D$ is called an instrumental variable (Bowden and Turkington 1990; Morgan and Winship 2009; Pearl 2009). It affects $A$ because membership of the single-mode sample automatically implies that data are collected by mode $a$. As a consequence, the relation between $G_{δ}$ and $A$ is no longer perfect.

Figure 3.

Analysis model for mixed-mode data combined with comparative single-mode data.

However, the use of an instrumental variable also relies on two assumptions, which are reflected by missing arrows in the graph (Vannieuwenhuyze et al. 2010). First, there is no direct relation between $D$ and $X$. This means that the measurement of $X$ depends only on the mode of data collection, not on the sample to which a respondent belongs. If there were a relation between $D$ and $X$, the measurement error of mode $a$ would differ between the MM sample and the comparative sample. We assume instead that the measurement error of mode $a$ is equal in both samples, and we refer to this assumption as the measurement equivalence assumption.

Second, there is no direct relation between $D$ and $G_{δ}$. This assumption corresponds to the statement that the single-mode sample and the MM sample represent the same population, and we refer to it as the representativity assumption. The representativity assumption is in fact equivalent to assuming that the MM sample offers greater reliability than the single-mode sample but equal validity. This assumption is often hard to check but is reasonable in some situations (e.g., the example in the fourth section and in Vannieuwenhuyze et al. 2010). Its plausibility can be increased further by adjusting the comparison of the two samples for the mode-insensitive variables $Z$ (as shown in Figure 3).

Together, the measurement equivalence assumption and the representativity assumption allow the conditional selection effect $S_{b}(μ)$ and measurement effect $M_{b}(μ)$ to be estimated, where $μ$ denotes the mean of an ordinal scale variable (Vannieuwenhuyze et al. 2010; Vannieuwenhuyze and Molenberghs 2010). Vannieuwenhuyze et al. provide methods to make inferences about these estimates using the Delta method. Note, however, that these methods rely on normal approximations to the estimated proportions and therefore require sufficiently large samples; otherwise, inferences can be flawed. With small samples, resampling methods can be used instead.
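A minimal Python sketch of the point estimates follows, assuming that the comparative sample uses mode $a$, that both assumptions hold, and that no additional propensity weighting is applied. Under these assumptions the comparative mean equals $(1 - π_{b})\,μ(X | A = a, G_{δ} = a) + π_{b}\,μ(X | A = a, G_{δ} = b)$, where $π_{b}$ is the share of mode group $b$ among MM respondents, so the counterfactual mean can be solved for. The function names are hypothetical, and the percentile bootstrap merely stands in for the resampling alternative mentioned above; it is not the Delta-method procedure of Vannieuwenhuyze et al. (2010).

```python
import random
from statistics import mean


def extended_mm_comparison(mm, comparative):
    """Point estimates of M_b(mu) and S_b(mu).

    mm: list of (group, score) pairs for MM respondents, group in {"a", "b"}.
    comparative: scores of the single-mode (mode a) comparative sample.
    Assumes measurement equivalence and representativity.
    """
    group_a = [x for g, x in mm if g == "a"]
    group_b = [x for g, x in mm if g == "b"]
    pi_b = len(group_b) / len(mm)   # share of mode group b among MM respondents
    mu_aa = mean(group_a)           # f(X | A=a, G=a), observed in the MM data
    mu_bb = mean(group_b)           # f(X | A=b, G=b), observed in the MM data
    mu_comp = mean(comparative)     # f(X | A=a) under both assumptions
    # mu_comp = (1 - pi_b) * mu_aa + pi_b * mu_ab; solve for the counterfactual mu_ab
    mu_ab = (mu_comp - (1 - pi_b) * mu_aa) / pi_b
    return {"M_b": mu_ab - mu_bb, "S_b": mu_aa - mu_ab}


def bootstrap_ci(mm, comparative, reps=2000, level=0.95, seed=1):
    """Percentile bootstrap intervals; mm must contain both mode groups."""
    rng = random.Random(seed)
    draws = {"M_b": [], "S_b": []}
    while len(draws["M_b"]) < reps:
        # Resample MM respondents as a whole so that pi_b varies as well.
        mm_star = rng.choices(mm, k=len(mm))
        if {g for g, _ in mm_star} != {"a", "b"}:
            continue  # a resample missing one mode group gives no estimate
        comp_star = rng.choices(comparative, k=len(comparative))
        for key, value in extended_mm_comparison(mm_star, comp_star).items():
            draws[key].append(value)
    lo = int((1 - level) / 2 * reps)
    hi = int((1 + level) / 2 * reps) - 1
    intervals = {}
    for key, values in draws.items():
        values.sort()
        intervals[key] = (values[lo], values[hi])
    return intervals


# Invented example data on a 1-5 scale.
MM = [("a", 4), ("a", 4), ("a", 3), ("a", 5),
      ("b", 3), ("b", 3), ("b", 2), ("b", 4), ("b", 3), ("b", 3)]
COMPARATIVE = [4, 4, 3, 4, 3, 4, 4, 4, 4, 3]

print(extended_mm_comparison(MM, COMPARATIVE))
print(bootstrap_ci(MM, COMPARATIVE, reps=500))
```

With these invented numbers, $π_{b} = 0.6$ and the counterfactual mean is 3.5, so $M_{b}(μ) = S_{b}(μ) = 0.5$.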

Method 3: Extended MM Calibration

Unfortunately, the Extended MM Comparison method cannot cope with MM designs involving more than two modes if only one single-mode comparative sample is available. In that case, the method only allows the mode of the comparative sample to be compared with the combination of the other modes (see Vannieuwenhuyze et al. 2010). It is impossible to disentangle the mode effects between the comparative mode and any single noncomparative mode, or between the noncomparative modes themselves, even though such effects may exist.

The third method, which also uses the comparative sample, can cope with more than two modes in the MM sample. It starts from the observation that the values of $G_{δ}$ are unobserved in the comparative sample (denoted by the open circle for $G_{δ}$ in Figure 3). Provided that the representativity assumption holds, mode effects would be easy to measure if we knew whether each respondent in this sample would choose mode $a$ or $b$ if sampled for the MM sample. The mode groups of the comparative sample can be treated as latent classes, which can be predicted from the values of the mode-insensitive variables $Z$, because the relationship between $G_{δ}$ and $Z$ is observed in the MM data set. The predicted probability that $G_{δ}$ is $b$ can be used to calculate weights in the comparative sample, so that the composition of this sample reflects the composition of mode group $b$. Thus, we can obtain $f(X | A = a, G_{δ} = b)$ from the comparative data set, and the selection effect $S_{b}(f(X))$ and measurement effect $M_{b}(f(X))$ can be calculated and evaluated. We call this method Extended MM Calibration because it calibrates the comparative sample against one of the mode groups in the MM data set.
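A minimal Python sketch of this weighting step, assuming that the $Z$ variables are collapsed into a single categorical key and that $P(G_{δ} = b | Z)$ is estimated by the share of mode group $b$ within each key (a saturated model). The function names are hypothetical. Weighting each comparative respondent by this predicted probability recovers $f(X | A = a, G_{δ} = b)$ only if the ignorable treatment assignment, mode-insensitivity, measurement equivalence, and representativity assumptions all hold.

```python
from collections import defaultdict
from statistics import mean


def prob_group_b(mm):
    """Estimate P(G_delta = b | Z = z) as the share of mode group b in each Z cell.

    mm: list of (z, group, score) for MM respondents; z is any hashable cell label.
    A saturated cell model is used for simplicity; with many Z variables one
    would fit a classification model (e.g. a logit) instead.
    """
    counts = defaultdict(lambda: [0, 0])  # z -> [respondents in group b, all respondents]
    for z, g, _ in mm:
        counts[z][0] += int(g == "b")
        counts[z][1] += 1
    return {z: n_b / n for z, (n_b, n) in counts.items()}


def extended_mm_calibration(mm, comparative):
    """Estimate M_b and S_b for means.

    comparative: list of (z, score) for the single-mode (mode a) sample.
    """
    p_b = prob_group_b(mm)
    # Comparative respondents whose Z cell never occurs in the MM data get no prediction.
    pairs = [(p_b[z], x) for z, x in comparative if z in p_b]
    # Weighting by P(G=b | z) mimics the composition of mode group b.
    mu_ab = sum(w * x for w, x in pairs) / sum(w for w, _ in pairs)
    mu_aa = mean(x for _, g, x in mm if g == "a")
    mu_bb = mean(x for _, g, x in mm if g == "b")
    return {"M_b": mu_ab - mu_bb, "S_b": mu_aa - mu_ab}


# Invented example: one Z variable with two cells.
MM = [("young", "a", 4), ("young", "b", 3), ("young", "b", 3), ("young", "b", 3),
      ("old", "a", 5), ("old", "a", 5), ("old", "a", 5), ("old", "b", 2)]
COMPARATIVE = [("young", 4), ("young", 4), ("old", 5), ("old", 5)]

print(extended_mm_calibration(MM, COMPARATIVE))
```

Here mode group $b$ is three-quarters young, so the weighted comparative mean (4.25) leans toward the young cell, giving $M_{b} = 1.5$ and $S_{b} = 0.5$ for these invented data.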

Like the Extended MM Comparison method, Extended MM Calibration relies on the measurement equivalence assumption and the representativity assumption, using the comparative data to draw conclusions about mode effects in the MM data. This method also requires the ignorable treatment assignment assumption and the mode-insensitivity assumption of the $Z$ variables, because it predicts the mode group of the comparative sample members using their $Z$ values. As before, the less these assumptions hold, the more bias this method possibly introduces into the mode effects estimates. Although this method requires more assumptions than MM Calibration, it might obtain better mode effects estimates because it uses a richer data set.

An Illustration Using a Survey on Surveys

Data

Mixed-Mode and Comparative Dataset

The illustrating data stem from a survey about respondents’ opinions about surveys, organized in 2004 in Flanders, Belgium, by the Survey Methodology Research Group of the Centre for Sociological Research, KU Leuven. In this survey, a sequential MM design was used consisting of a mail questionnaire as the main mode and a face-to-face (FTF) interview as the follow-up mode (Storms and Loosveldt 2005).

The sample consisted of 960 Flemish persons aged between 18 and 80, sampled from the national register. The mail questionnaire was presented as a survey about opinion polls in Flanders and asked respondents for their opinions on surveys. Because cash incentives were not allowed, a €5 gift voucher was offered as an incentive for returning the questionnaire. The survey started on October 18, 2004; a first reminder was sent by mail two weeks later, and a second reminder, accompanied by a new questionnaire, was sent four weeks after the first. The mail phase lasted two months. In a second phase, nonrespondents to the mail survey were recontacted by an interviewer for an FTF interview. Sample members were not told about this FTF follow-up during the initial mail phase.

Besides the MM data set, the survey design also included a small comparative sample of persons who were only invited to participate in an FTF interview. This sample consisted of 240 persons, and the questionnaire and data collection strategy were identical to those of the second phase of the MM design, except that these sample members did not receive the mail questionnaire first.

Because the aim of this article is to compare techniques for separating selection effects from measurement effects rather than to make inferences about the population, the analyses include only those respondents who answered all the variables listed below; partial responses are thus treated as nonresponse. Counting only full responses, the initial mail phase of the MM design reached a response rate of 52.8 percent, which the FTF follow-up increased to 66.6 percent, a relatively high response rate for a general population survey. The comparative sample had an even higher response rate of 69.5 percent. An overview of all response rates is given in Table 1.

Table 1.

Response Frequencies and Response Rates.

	Mixed Mode	Comparative
Mail	474	n/a
Face-to-face	124	155
Partial response	89	7
Nonresponse	211	61
Not eligible	62	17
Response rate^a	.666	.695

Note: ^aResponse/(total − not eligible).
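The response rates follow from the counts with the formula in the table note. The Python sketch below reproduces them; the dictionary encoding of Table 1 is an arbitrary choice, and the mail count of the comparative sample is set to zero because that sample was not offered the mail mode.

```python
# Counts from Table 1 (the comparative sample was not offered the mail mode).
TABLE_1 = {
    "mixed_mode": {"mail": 474, "ftf": 124, "partial": 89,
                   "nonresponse": 211, "not_eligible": 62},
    "comparative": {"mail": 0, "ftf": 155, "partial": 7,
                    "nonresponse": 61, "not_eligible": 17},
}


def response_rate(counts, modes=("mail", "ftf")):
    """Full responses in `modes` / (total - not eligible); partial responses count as nonresponse."""
    eligible = sum(counts.values()) - counts["not_eligible"]
    return sum(counts[m] for m in modes) / eligible


for name, counts in TABLE_1.items():
    print(name, sum(counts.values()), round(response_rate(counts), 3))
print("mail phase only", round(response_rate(TABLE_1["mixed_mode"], ("mail",)), 3))
```

This prints sample totals of 960 and 240, matching the sample sizes given in the text, response rates of .666 and .695, and a rate of .528 after the mail phase alone.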

Variables

We analyzed mode effects on the means of six target variables, each measuring one dimension of a short scale representing respondents’ opinions about surveys (Loosveldt and Storms 2008). The scale includes four dimensions: survey enjoyment, survey value, survey cost, and survey privacy. The dimensions and the corresponding items are listed in Table 2. For all six items, respondents could indicate their agreement on a 5-point Likert-type scale ranging from completely disagree to completely agree. In the mail questionnaire, these answer categories were listed horizontally in a table, without a “don’t know”/“no opinion” option. In the FTF interviews, the response categories were read out by the interviewer and presented vertically on a showcard, again without “don’t know” and “no opinion” options. For the analyses, all items were recoded so that high values indicate positive opinions about surveys and low values negative opinions.

Table 2.

Opinions About Surveys Items and Their Means in the Three Sample Groups.

Var.	Description	MM Mail	MM FTF	Comp. Sample
Survey value
$X_{1}$	Surveys are useful ways of gathering information	3.63	3.64	3.83
Survey costs
$X_{2}$	Most surveys are a waste of people’s time	3.06	3.06	3.42
$X_{3}$	Surveys stop people doing more important things	3.28	3.29	3.60
Survey enjoyment
$X_{4}$	Surveys are boring for the persons who have to answer the question	3.01	2.93	3.25
$X_{5}$	I do not like participating in surveys	2.95	2.59	3.24
Survey privacy
$X_{6}$	Surveys are an invasion of privacy	3.46	3.44	3.68

Note: MM = mixed-mode.

The respondents were asked to rate these statements on a 5-point Likert-type scale: Completely disagree—disagree—neither agree nor disagree—agree—completely agree.

Source: Loosveldt and Storms (2008).

The particular topic of the survey and the opinion questions might cause selection effects and measurement effects on the means because “employing an instrument to measure its own performance is immediately contradictory” (Goyder 1986:28). First, there might be selection effects, as nonrespondents to the mail questionnaire in the MM sample are likely to be more negative about surveys (Loosveldt and Storms 2008). The mail group data are consistent with this expectation: the later a mail questionnaire was returned, the lower the mean opinion score on all six opinion variables (table not shown).

Second, we also expect measurement effects, as respondents interviewed FTF will probably tend to report more positive opinions about surveys (Dillman, Phelps, et al. 2009; Loosveldt and Storms 2008). The mere presence of the interviewer may lead respondents to give socially desirable answers, so the positive answers obtained in the FTF follow-up may not reflect the respondents’ real opinions. Compared with the mail survey, FTF interviews thus carry a serious risk of measurement error, which may result in a measurement effect (Dillman, Smyth, and Christian 2009; Voogt and Saris 2005). Again, the data suggest the presence of measurement effects: as Table 2 shows, the mean opinion score in the comparative group is larger than the mean scores in both MM groups for all six variables. If measurement effects were absent (and the comparative and MM samples represented the same population), we would expect the mean of the comparative group to fall between the means of the MM groups.
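The last expectation can be written out. Let $π$ be the share of the mail group among MM respondents. Under the representativity and measurement equivalence assumptions, the comparative (FTF) mean is a mixture of the two mode groups measured by FTF, and if the mail group had no measurement effect ($M_{b}(μ) = 0$, with $a$ = FTF and $b$ = mail), its FTF mean would equal its observed mail mean:

```latex
\begin{aligned}
μ_{\text{comp}}
  &= π \, μ(X | A = \text{FTF}, G_{δ} = \text{mail})
     + (1 - π) \, μ(X | A = \text{FTF}, G_{δ} = \text{FTF}) \\
  &= π \, μ(X | A = \text{mail}, G_{δ} = \text{mail})
     + (1 - π) \, μ(X | A = \text{FTF}, G_{δ} = \text{FTF})
     \quad \text{if } M_{b}(μ) = 0.
\end{aligned}
```

The second line is a weighted average of the two observed MM means and must therefore lie between them; as long as both assumptions hold, a comparative mean above both MM means is hard to reconcile with the absence of measurement effects.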

The set of mode-insensitive variables $Z$ includes age $\times$ gender, educational level, ownership of a personal e-mail address, activity status, and the numbers of adults (>18 years), adolescents (between 12 and 18 years), and children (<12 years) in the household. Age was divided into six categories, each spanning 10 years except the last (18 to 27, 28 to 37, 38 to 47, 48 to 57, 58 to 67, and 68 to 80). Educational level has six categories: no qualification, primary school, lower secondary, upper secondary, college (nonuniversity), and university. Activity status comprises eight categories: full-time employed, part-time employed (>50 percent), part-time employed (<50 percent), unemployed, retired, homemaker, disabled, and “other.” The household counts were also categorized: 1, 2, 3, 4, and 5 or more adults, and 0, 1, and 2 or more adolescents or children.
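One way to code these categories is sketched below in Python; the function names and the string labels of the categories are hypothetical choices, not the coding used for the analyses.

```python
# Age bands used for Z (inclusive bounds); the last band spans 13 years.
AGE_BANDS = [(18, 27), (28, 37), (38, 47), (48, 57), (58, 67), (68, 80)]


def age_band(age):
    """Map an age in years to its band label, e.g. 30 -> '28-37'."""
    for lo, hi in AGE_BANDS:
        if lo <= age <= hi:
            return f"{lo}-{hi}"
    raise ValueError(f"age {age} is outside the sampled range 18-80")


def age_by_gender(age, gender):
    """Cross-classified age x gender cell, e.g. ('28-37', 'female')."""
    return (age_band(age), gender)


def top_coded(count, top):
    """Household count with an open top category: adults use top=5, adolescents and children top=2."""
    return str(count) if count < top else f"{top}+"


print(age_by_gender(30, "female"), top_coded(6, 5), top_coded(0, 2))
```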

Discussion of the Assumptions

Mode-Insensitivity of the $Z$ Variables

Measurement effects between an FTF interview and a mail questionnaire are unlikely for variables such as gender, age, the number of household members, or ownership of an e-mail address. This is less obvious for variables such as educational level and activity status: faced with an interviewer, respondents might tend to overstate their educational attainment or to report themselves as employed because they find admitting a low level of education or unemployment embarrassing (Lee and Renzetti 1990; Tourangeau and Yan 2007). Nevertheless, we include these variables because the evidence of mode-sensitivity is weak and because they have often been included in previous studies.

Ignorable Treatment Assignment Assumption

The ignorable treatment assignment assumption is hard to check. Some insight can, however, be gained from a logit regression of $G_{δ}$ on $Z$. The weaker the association between $Z$ and $G_{δ}$, the less likely it is that $Z$ fully explains the selection effect, and the less likely it is that the ignorable treatment assignment assumption holds. The less this assumption holds, the greater the downward bias in the mode effects estimates of the MM Calibration and Extended MM Calibration methods. In our data set, the regression model yielded a generalized coefficient of determination (Nagelkerke 1991) of .217, which indicates a poor fit. As a consequence, weighting or imputing the data using $Z$ might be insufficient to make the compositions of the mode groups comparable, which may result in considerable underestimation of all mode effects. Note also that this fit may even be overestimated if the $Z$ variables are not mode-insensitive, as assumed.
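For reference, Nagelkerke’s (1991) coefficient rescales the Cox and Snell $R^{2} = 1 - (L_{0}/L_{1})^{2/n}$ by its maximum $1 - L_{0}^{2/n}$, where $L_{0}$ and $L_{1}$ are the likelihoods of the intercept-only and fitted models. The Python sketch below computes it for a binary mode group. As a simplifying assumption, it uses a model saturated in a single categorical key for $Z$, whose maximum likelihood fit is simply the cell proportion, so it is not necessarily the specification behind the value reported above; the function names are hypothetical.

```python
import math
from collections import Counter


def binomial_loglik(successes, total):
    """Binomial log-likelihood at the ML estimate successes/total (0 * log 0 taken as 0)."""
    return sum(k * math.log(k / total) for k in (successes, total - successes) if k)


def nagelkerke_r2(z, g):
    """Nagelkerke R^2 for binary g (1 = mail, 0 = FTF) given categorical z.

    The model is saturated in z, so its fitted probabilities are the cell shares.
    """
    n = len(g)
    ll_null = binomial_loglik(sum(g), n)
    cell_n = Counter(z)
    cell_mail = Counter(zi for zi, gi in zip(z, g) if gi == 1)
    ll_model = sum(binomial_loglik(cell_mail[c], cell_n[c]) for c in cell_n)
    r2_cox_snell = 1 - math.exp(2 * (ll_null - ll_model) / n)
    r2_max = 1 - math.exp(2 * ll_null / n)  # zero (R^2 undefined) if g is constant
    return r2_cox_snell / r2_max


# Invented examples: z fully determines g, and z is unrelated to g.
print(nagelkerke_r2(["x", "x", "y", "y"], [1, 1, 0, 0]))
print(nagelkerke_r2(["x", "x", "y", "y"], [1, 0, 1, 0]))
```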

Measurement Equivalence Assumption

The measurement equivalence assumption cannot be checked, because this would require comparing FTF responses between the MM data and the comparative data for identical respondents, which is impossible because each respondent belongs to only one of the two data sets. Hence, this assumption has to be taken on trust. The design itself suggests a possible violation: respondents in the FTF follow-up phase might be annoyed at being recontacted and, as a result, be more prone to satisficing or become more negative about surveys.

Representativity Assumption

The representativity assumption cannot be checked directly because $X | A = \text{FTF}$ is not observed for the mail group in the MM sample. The assumption can be violated in two ways. First, it requires all respondents of the MM design to also participate in the FTF comparative design; it is violated when, for example, some sample members fill in the mail questionnaire but would refuse an FTF interview. Second, it also requires respondents of the FTF comparative design to participate in the MM design. However, sample members might be annoyed by the follow-up phase of the MM design and refuse to participate, whereas they might have cooperated if they had been approached FTF immediately.

Nevertheless, some arguments can be put forward in favor of this assumption (Vannieuwenhuyze et al. 2010). First, there is a theoretical argument. The main objective of the MM design was to collect data from as many sample units as possible at the lowest possible cost while maximizing representativeness, and for that reason the sequential mail-CAPI design was chosen. A mail survey is considered a cost-effective and simple means of obtaining data from a large sample, while the more expensive FTF follow-up is used to increase the response rate and the representativeness of the initial mail phase, as FTF interviews generally yield lower nonresponse (de Leeuw 2008; Roy and Berger 2005). We can therefore assume on theoretical grounds that the MM data are more reliable than, but equally valid as, the comparative data, which is equivalent to the representativity assumption.

Second, if both data sets contain a similar set of population members, both samples should have similar response rates. This is confirmed by the data: the difference between the response rates is only 2.9 percentage points and is not significant (SE $= .051$, $p = .567$). This observation lends some support to the representativity assumption, although the argument does not hold in the other direction: equal response rates do not guarantee that the samples are comparable.

Finally, both data sets can also be compared on their composition with respect to the mode-insensitive variables $Z$. The two data sets differ significantly only in the distribution of educational level ($χ^{2} = 13.90$, df = 5, $p = .016$): the MM data set comprises more respondents without a qualification, whereas the comparative data set contains more respondents who had completed primary school. No significant differences were found for the other variables in $Z$. Nevertheless, within the Extended MM methods, we adapted the composition of the MM data set to that of the comparative group using propensity score weighting based on all variables in $Z$ (Rosenbaum and Rubin 1983; Sato and Matsuyama 2003). Such an adaptation might improve the comparability of the two samples, although its effect might be minor.
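The reported p-value can be checked from the statistic and its degrees of freedom. The Python sketch below computes the Pearson statistic for a samples-by-categories table of counts and the upper-tail probability of the chi-square distribution via the standard series for the regularized lower incomplete gamma function. The function names are hypothetical, and in practice a library routine such as SciPy’s `scipy.stats.chi2_contingency` would be used instead.

```python
import math


def chi2_homogeneity(table):
    """Pearson chi-square statistic and df for an r x c table of counts.

    Rows are samples (e.g. MM and comparative), columns are categories of one Z
    variable. Assumes no empty row or column.
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, (len(table) - 1) * (len(col_tot) - 1)


def chi2_sf(stat, df):
    """Upper-tail p-value of the chi-square distribution, 1 - P(df/2, stat/2)."""
    s, x = df / 2, stat / 2
    if x <= 0:
        return 1.0
    # Series: P(s, x) = x^s e^(-x) / Gamma(s) * sum_k x^k / (s (s+1) ... (s+k)).
    term = total = 1 / s
    k = 1
    while term > 1e-15 * total:
        term *= x / (s + k)
        total += term
        k += 1
    return 1 - math.exp(s * math.log(x) - x - math.lgamma(s)) * total


print(chi2_homogeneity([[10, 20], [20, 10]]))  # invented counts
print(round(chi2_sf(13.90, 5), 3))
```

For the educational-level comparison, `chi2_sf(13.90, 5)` evaluates to about .016, in line with the p-value reported above.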

Results and Discussion of the Illustration

In this section, we report the estimated mode effects on the means for all three methods. This illustration is restricted to the mode effects $M_{b}$ and $S_{b}$, where $a$ is the FTF mode and $b$ the mail mode, because the other definitions cannot be calculated by the Extended MM methods. For the MM Calibration method, we use a weighting approach. We first estimated the propensity scores $p$ of belonging to the mail group given $Z$ (Rosenbaum and Rubin 1983). These propensity scores can be used to transform the composition of the FTF group into that of the mail group by assigning the FTF respondents normalized weights $p / (1 - p)$ (Sato and Matsuyama 2003). Despite this transformation to the mail group composition, we still observe the values of $X \mid A = \mathrm{FTF}$ for this group. So, we can derive $P(X \mid A = \mathrm{FTF}, G_{δ} = \mathrm{mail})$ from the weighted FTF group. The selection effect ($S_{b}$) is then calculated by comparing the unweighted FTF respondents with the weighted FTF respondents. The measurement effect ($M_{b}$), on the other hand, is obtained by comparing the unweighted mail respondents with the weighted FTF respondents. These effects are tested with a two-sample t-test.
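The weighting approach of the MM Calibration method can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a logistic propensity model fitted by Newton-Raphson, effects on the mean only, and simulated data; the function names are hypothetical, and the two-sample t-tests are omitted. Because the simulated mode assignment is ignorable given the single covariate `z`, the estimate of $M_{b}$ should land near the simulated FTF shift of 0.3.

```python
import numpy as np

def propensity_scores(Z, in_group, n_iter=25):
    """Logistic-regression estimate of P(in_group = 1 | Z) via Newton-Raphson.
    Z is an (n,) or (n, k) array of mode-insensitive covariates; an intercept is added."""
    X = np.column_stack([np.ones(len(in_group)), Z])
    y = in_group.astype(float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta += np.linalg.solve(hessian, X.T @ (y - p))
    return 1.0 / (1.0 + np.exp(-X @ beta))

def mm_calibration_means(x, Z, is_mail):
    """Hypothetical helper: M_b and S_b on the mean, with a = FTF and b = mail."""
    p = propensity_scores(Z, is_mail)               # P(mail group | Z)
    ftf = ~is_mail
    w = p[ftf] / (1.0 - p[ftf])                     # odds weights: FTF -> mail composition
    w = w / w.mean()                                # normalize the weights to mean 1
    ftf_as_mail = np.average(x[ftf], weights=w)     # estimate of E(X | A=FTF, G=mail)
    m_b = ftf_as_mail - x[is_mail].mean()           # weighted FTF vs unweighted mail
    s_b = x[ftf].mean() - ftf_as_mail               # unweighted FTF vs weighted FTF
    return m_b, s_b

# Simulated data only (not the survey data): z drives both the mail/FTF split
# and the outcome, and the FTF mode shifts answers upward by 0.3.
rng = np.random.default_rng(1)
n = 20000
z = rng.normal(size=n)
is_mail = rng.random(n) < 1.0 / (1.0 + np.exp(-(0.3 + 0.8 * z)))
x = 3.0 + 0.5 * z + 0.3 * (~is_mail) + rng.normal(scale=0.8, size=n)
m_b, s_b = mm_calibration_means(x, z, is_mail)
print(f"M_b = {m_b:.3f}, S_b = {s_b:.3f}")
```

In this simulation the mail group has higher `z` and hence higher true opinions, so $S_{b}$ comes out negative. If `z` were dropped from the propensity model, ignorability would fail and the selection effect would leak into $M_{b}$, which is the weakness of the method discussed below.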

Comparison of the Methods

The mode effect estimates can be found in Table 3. Bearing in mind that all opinion items range from one to five, the estimates of the MM Calibration method (method 1) are small, and, as a consequence, neither the selection effects nor the measurement effects are statistically significant, except for the measurement effect on $X_{5}$. The MM Calibration method would therefore lead to the conclusion that, on the one hand, a mail survey and an FTF survey generally measure opinions about surveys similarly, and that, on the other hand, nonrespondents to the mail phase of the MM sample have the same opinions about surveys as the respondents in this phase. Nevertheless, this failure to find significant mode effects is possibly caused by a violation of the ignorable treatment assignment assumption, rather than by the absence of real population mode effects.

Table 3.

Mode Effect Estimates on the Means of the Three Methods.

Effect: estimate (p value)	Method 1	Method 2	Method 3
Measurement effects
$X_{1}$	−.021 (.786)	.224 (.006)	.169 (.016)
$X_{2}$	.007 (.937)	.427 (< .001)	.361 (< .001)
$X_{3}$	.050 (.576)	.365 (< .001)	.288 (< .001)
$X_{4}$	−.048 (.615)	.320 (.003)	.206 (.028)
$X_{5}$	−.224 (.034)	.458 (< .001)	.246 (.021)
$X_{6}$	−.077 (.394)	.242 (.015)	.172 (.047)
Selection effects
$X_{1}$	.030 (.786)	−.237 (.038)	−.182 (.046)
$X_{2}$	−.002 (.987)	−.406 (.004)	−.340 (.002)
$X_{3}$	−.042 (.729)	−.342 (.007)	−.265 (.009)
$X_{4}$	−.033 (.793)	−.519 (< .001)	−.405 (.001)
$X_{5}$	−.141 (.290)	−.855 (< .001)	−.643 (< .001)
$X_{6}$	.057 (.631)	−.248 (.059)	−.178 (.089)

Note: This table shows estimates for mode effects on the means using the mixed-mode calibration method (method 1), the extended mixed-mode comparison method (method 2), and the extended mixed-mode calibration method (method 3). Selection effects are defined in equation (2c) and measurement effects in equation (1c), where $a$ = FTF and $b$ = mail. The p values refer to two-sided tests of the estimates. For a description of the variables $X_{1}$ to $X_{6}$ , we refer to Table 2.

In contrast to the MM Calibration method, the Extended MM Comparison method (method 2) yields much larger mode effects, which are in line with our expectations. Because the directions of these effects were expected, one-sided tests can be used, and these show that all mode effects are significant at the 5 percent level. The positive measurement effects mean that an FTF interview generally measures a more positive attitude than a mail questionnaire. The negative selection effects indicate that the respondents who answered the mail questionnaire are more positive toward surveys than the respondents who were reached in the FTF follow-up phase. Remember, however, that these inferences rest on normal approximations, which might be problematic because the sample sizes are rather small.

Finally, the mode effect estimates of the Extended MM Calibration method (method 3) are considerably larger than those of the MM Calibration method (method 1), but smaller than those of the Extended MM Comparison method (method 2). The latter observation probably arises because, unlike the Comparison method, the Extended MM Calibration method requires the ignorable treatment assignment assumption, and violations of this assumption may result in underestimated mode effects. As in the Extended MM Comparison method, however, all measurement effects are positive and all selection effects are negative, in line with expectations. Further, all estimates are significant at the 5 percent level in one-sided tests. This method therefore leads to the same conclusions about the existence of selection effects and measurement effects as the Extended MM Comparison method.
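The step from the two-sided p values in Table 3 to the one-sided conclusions for methods 2 and 3 can be made explicit. A minimal sketch, assuming a symmetric reference distribution (normal or t), so that for an estimate with the expected sign (positive measurement effects, negative selection effects, which holds for every entry of methods 2 and 3) the one-sided p value is half the two-sided one. Entries reported as “< .001” are entered at their upper bound .001; the names `TWO_SIDED` and `one_sided` are illustrative.

```python
# Two-sided p values copied from Table 3 (order X1..X6); "< .001" entered as .001.
TWO_SIDED = {
    "method 2": {"measurement": [0.006, 0.001, 0.001, 0.003, 0.001, 0.015],
                 "selection":   [0.038, 0.004, 0.007, 0.001, 0.001, 0.059]},
    "method 3": {"measurement": [0.016, 0.001, 0.001, 0.028, 0.021, 0.047],
                 "selection":   [0.046, 0.002, 0.009, 0.001, 0.001, 0.089]},
}

def one_sided(p_two_sided):
    """Halve a two-sided p value; valid only when the estimate has the
    hypothesized sign and the reference distribution is symmetric."""
    return p_two_sided / 2.0

for method, effects in TWO_SIDED.items():
    for kind, ps in effects.items():
        worst = max(one_sided(p) for p in ps)
        print(f"{method}, {kind}: largest one-sided p = {worst:.4f}, "
              f"all < .05: {worst < 0.05}")
```

Every one-sided value for methods 2 and 3 falls below .05, whereas the two-sided values for the selection effect on $X_{6}$ (.059 and .089) do not, which is why the one-sided framing matters for that item.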

Interpretation of the Results

Let us now interpret and discuss the mode effect estimates and draw conclusions. Mode effect estimates can be interpreted directly from their definitions in equations (1c) and (2c). The measurement effect on item $X_{1}$, for example, equals .224 for the Extended MM Comparison method. This means that, on average, the mail group respondents would rate the usefulness of surveys as a tool for gathering information .224 higher in an FTF survey than in a mail survey. The selection effect on this item, on the other hand, equals −.237. This means that the mail group rates this item on average .237 higher than the FTF group when the item is surveyed by an FTF interview in both groups.
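Under the reading of equations (1c) and (2c) used in this interpretation, with $a$ = FTF, $b$ = mail, and expectations taken within the mode groups $G_{δ}$, the two effects on the mean can be written as below. This is inferred from the verbal interpretation rather than copied from the original equations; adding the first two lines shows that, under this reading, the effects sum to the raw difference between the FTF and mail respondents.

$$
\begin{aligned}
M_{b} &= E(X \mid A = a, G_{\delta} = b) - E(X \mid A = b, G_{\delta} = b)\\
S_{b} &= E(X \mid A = a, G_{\delta} = a) - E(X \mid A = a, G_{\delta} = b)\\
M_{b} + S_{b} &= E(X \mid A = a, G_{\delta} = a) - E(X \mid A = b, G_{\delta} = b)
\end{aligned}
$$

For $X_{1}$ under method 2, for instance, $.224 + (-.237) = -.013$: a sizable measurement effect and an almost equally large selection effect of opposite sign would nearly cancel in a naive comparison of the two mode groups, which illustrates how the two effects can be confounded.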

Further, the items about survey enjoyment ($X_{2}$ and $X_{3}$) and survey costs ($X_{4}$ and $X_{5}$) are clearly more susceptible to mode effects than the items on survey value and survey privacy, a pattern found with both methods 2 and 3. The mode effects on item $X_{5}$ are especially large, possibly because this item explicitly asks for the respondents’ own opinion (it is phrased in the first person), whereas the statements of the other items are more general (see Table 2).

The presence of selection effects in the Extended MM methods points to an advantage of using a sequential mail-FTF design over a single-mode mail or FTF design. If the assumptions of the Extended MM Comparison method hold, the MM design should provide more reliable data with respect to the composition of respondents compared to single-mode FTF data within the same cost constraints. Likewise, the MM data should have a more valid respondent composition than data obtained in a single-mode mail design. In order to determine the exact benefit of the MM design compared with single-mode designs, the relative costs of the modes should be taken into account. This exercise, however, is beyond the scope of this article.

A better composition of respondents in the MM sample, however, does not necessarily mean that the overall data quality of the MM design is better as well. This is only the case if measurement effects are minor. In our example, we found significant measurement effects which might undermine the advantage of the MM design. Nonetheless, the counteracting effect of these measurement effects is difficult to determine because this would require the unobserved “true” opinions of the respondents. An alternative starting point might be to take the mail questionnaire data as a benchmark, because this mode is argued to introduce the smallest measurement error. However, as such analyses require additional assumptions about generalizations of the mode effects estimates, we leave these to future studies.

To conclude this illustration, we want to remark that the estimated mode effects cannot simply be generalized to any MM design combining a mail survey and an FTF interview. Indeed, the definitions of the mode effects $S_{b} (f (X))$ and $M_{b} (f (X))$ not only depend on the mode but also on the design $δ$ through $G_{δ}$ . A concurrent design in which both modes are offered simultaneously instead of sequentially, for example, will probably yield different mode effects, even if it includes the very same survey modes.

General Discussion

The aim of this article was to compare and illustrate three methods that can be used to disentangle measurement effects and selection effects in MM data. The MM Calibration method, which is usually reported in the existing literature, starts from a MM sample and matches the different mode groups on a set of mode-insensitive variables. The Extended MM Comparison method and the Extended MM Calibration method start from MM data extended by a comparable single-mode sample. All three methods have their advantages and disadvantages, which will be discussed in this section. An overview of their characteristics can be found in Table 4.

Table 4.

Comparison of the Three Methods to Evaluate Mode Effects.

Characteristic	Equation	Method 1	Method 2	Method 3
Assumptions
(1) Ignorable treatment assignment		Yes	No	Yes
(2) Mode-insensitivity of $Z$		Yes	No	Yes
(3) Measurement equivalence		No	Yes	Yes
(4) Representativity assumption		No	Yes	Yes
(5) Normal approximation of proportion estimates		No	Yes	No
Data
(6) Uses comparative data set		No	Yes	Yes
If comparative mode $= a$, possibility to estimate . . .
(7) $M(f(X))$	(1a)	Yes	No	No
(8) $M_{a}(f(X))$	(1b)	Yes	No	No
(9) $M_{b}(f(X))$	(1c)	Yes	Yes	Yes
(10) $S(f(X))$	(2a)	Yes	No	No
(11) $S_{a}(f(X))$	(2b)	Yes	No	No
(12) $S_{b}(f(X))$	(2c)	Yes	Yes	Yes
If > 2 modes, possibility to estimate effects between . . .
(13) The standard and one nonstandard mode		Yes	No	Yes
(14) Two nonstandard modes		Yes	No	No

Note: MM = mixed-mode. The methods are the mixed-mode calibration method using only the MM sample (method 1), the extended mixed-mode comparison method using both the MM sample and the comparative group (method 2), and the extended mixed-mode calibration method using both the MM sample and the comparative group (method 3).

Because of its simplicity, the MM Calibration method is very flexible: it only requires a MM data set, and it allows mode effects to be evaluated between all modes. The Extended MM methods require additional comparable single-mode data that meet specific requirements, and they do not allow mode effects to be evaluated between all modes. The Extended MM Calibration method only allows mode effects to be measured separately between the comparative mode and each other mode, while the Extended MM Comparison method only allows mode effects to be measured between the comparative mode and a combination of the other modes. Further, the Extended MM methods cannot calculate the mode effects as defined in equations (1a), (1b), (2a), and (2b), because all these definitions require an estimate of $f(X \mid A = b, G_{δ} = a)$. As a consequence, the Extended MM methods always take the mode of the comparative sample as the standard mode. In some situations, however, this choice is undesirable, namely when this mode entails the largest measurement error. In the survey on surveys, for example, we expect the FTF interview to add social desirability bias to the responses, so it would seem natural to select the “unbiased” mail questionnaire as the standard mode. The Extended MM methods, however, preclude this choice and force us to select the FTF mode as the standard.

However, even though MM Calibration is a very flexible way to disentangle mode effects, it may come with serious problems because it starts from unrealistic assumptions. In practice, it is very difficult to find a set of mode-insensitive variables that explains the selection effect satisfactorily, although this is required by the ignorable treatment assignment assumption. The survey on surveys example in this article illustrates how the MM Calibration method affects the mode effect estimates. Only one of the twelve mode effect estimates for the six opinion items was significant using this method, even though we can reasonably assume the survey topic to be extremely susceptible to mode effects. By contrast, the Extended MM Comparison and Extended MM Calibration methods did yield significant selection and measurement effects on almost all variables. This suggests that MM Calibration is insufficient to detect measurement effects and selection effects in MM data. The Extended MM Comparison and the Extended MM Calibration methods seem to be valuable alternatives, provided that a comparative single-mode sample is available. We should remark, however, that the survey on surveys data probably form an atypical example of regular MM survey data. The validity of the assumptions must always be evaluated thoroughly when using other MM data. Further research should examine the stringency of the assumptions and the applicability of the methods under different circumstances using, for example, simulation studies and sensitivity analyses.


Acknowledgments

The authors would like to thank Geert Molenberghs for his helpful suggestions. This article received the 2011 European Survey Research Association (ESRA) award for the best paper by an early career researcher.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors received financial support from the Flemish Research Council (FWO) for the data collection.

References

Biemer, P. P. 2001. “Nonresponse Bias and Measurement Bias in a Comparison of Face to Face and Telephone Interviewing.” Journal of Official Statistics 17:295–320.

Börkan. 2010. “The Mode Effect in Mixed-Mode Surveys: Mail and Web Surveys.” Social Science Computer Review 28:371–80.

Bowden, R. J. and D. A. Turkington. 1990. Instrumental Variables. Cambridge, UK: Cambridge University Press.

de Leeuw, E. D. 2005. “To Mix or Not to Mix Data Collection Modes in Surveys.” Journal of Official Statistics 21:233–55.

de Leeuw, E. D. 2008. “Choosing the Method of Data Collection.” Pp. 113–35 in International Handbook of Survey Methodology, edited by E. D. de Leeuw, J. J. Hox, and D. A. Dillman. New York, NY: Lawrence Erlbaum.

Dillman, D. A., Phelps, Tortora, Swift, Kohrell, Berck, and B. L. Messer. 2009. “Response Rate and Measurement Differences in Mixed-Mode Surveys Using Mail, Telephone, Interactive Voice Response (IVR) and the Internet.” Social Science Research 38:1–18.

Dillman, D. A., J. D. Smyth, and L. M. Christian. 2009. Internet, Mail and Mixed-Mode Surveys: The Tailored Design Method. 3rd ed. Hoboken, NJ: John Wiley.

Fricker, Galesic, Tourangeau, and Yan. 2005. “An Experimental Comparison of Web and Telephone Surveys.” Public Opinion Quarterly 69:370–92.

Goyder. 1986. “Surveys on Surveys: Limitations and Potentialities.” Public Opinion Quarterly 50:27–41.

Greenfield, T. K., L. T. Midanik, and J. D. Rogers. 2000. “Effects of Telephone Versus Face-to-Face Interview Modes on Reports of Alcohol Consumption.” Addiction 95:277–84.

Hayashi. 2007. “The Possibility of Mixed-Mode Surveys in Sociological Studies.” International Journal of Japanese Sociology 16:51–63.

Heerwegh and Loosveldt. 2011. “Assessing Mode Effects in a National Crime Victimization Survey Using Structural Equation Models: Social Desirability Bias and Acquiescence.” Journal of Official Statistics 27:49–63.

Holbrook, A. L., M. C. Green, and J. A. Krosnick. 2003. “Telephone Versus Face-to-Face Interviewing of National Probability Samples with Long Questionnaires: Comparisons of Respondent Satisficing and Social Desirability Response Bias.” Public Opinion Quarterly 67:79–125.

Holland, P. W. 1988. “Causal Inference, Path Analysis, and Recursive Structural Equations Models.” Sociological Methodology 18:449–84.

Jäckle, Roberts, and Lynn. 2010. “Assessing the Effect of Data Collection Mode on Measurement.” International Statistical Review 78:3–20.

Lee, R. M. and C. M. Renzetti. 1990. “The Problems of Researching Sensitive Topics: An Overview and Introduction.” American Behavioral Scientist 33:510–28.

Little, R. J. A. and D. B. Rubin. 2002. Statistical Analysis with Missing Data. 2nd ed. London, UK: John Wiley.

Loosveldt and Storms. 2008. “Measuring Public Opinions about Surveys.” International Journal of Public Opinion Research 20:74–89.

Lugtig, G. J. L. M. Lensvelt-Mulders, Frerichs, and Greven. 2011. “Estimating Nonresponse Bias and Mode Effects in a Mixed-Mode Survey.” International Journal of Market Research 53:669–86.

Morgan, S. L. and Winship. 2009. Counterfactuals and Causal Inference: Methods and Principles for Social Research. New York, NY: Cambridge University Press.

Nagelkerke, N. J. D. 1991. “A Note on a General Definition of the Coefficient of Determination.” Biometrika 78:691–92.

Pearl. 2009. Causality: Models, Reasoning and Inference. 2nd ed. New York, NY: Cambridge University Press.

Roberts. 2007. Mixing Modes of Data Collection in Surveys: A Methodological Review. Retrieved March 2, 2010 (http://eprints.ncrm.ac.uk/418/1/MethodsReviewPaperNCRM-008.pdf).

Rosenbaum, P. R. and D. B. Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70:41–55.

Roy and P. D. Berger. 2005. “E-mail and Mixed Mode Database Surveys Revisited: Exploratory Analysis of Factors Affecting Response Rates.” Database Marketing & Customer Strategy Management 12:153–71.

Rubin, D. B. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology 66:688–701.

Rubin, D. B. 1976. “Inference and Missing Data.” Biometrika 63:581–92.

Rubin, D. B. 1991. “Practical Implications of Modes of Statistical Inference for Causal Effects and the Critical Role of the Assignment Mechanism.” Biometrics 47:1213–34.

Sato and Matsuyama. 2003. “Marginal Structural Models as a Tool for Standardization.” Epidemiology 14:680–86.

Schafer, J. L. and J. W. Graham. 2002. “Missing Data: Our View of the State of the Art.” Psychological Methods 7:147–77.

Storms and Loosveldt. 2005. Procesevaluatie Van Het Veldwerk Van Een Mixed Mode Survey Naar Het Surveyklimaat In Vlaanderen [Process Evaluation of the Fieldwork of a Mixed-Mode Survey on the Survey Climate in Flanders]. Leuven: KUL, Centrum voor sociologisch onderzoek.

Tourangeau and Yan. 2007. “Sensitive Questions in Surveys.” Psychological Bulletin 133:859–83.

Vannieuwenhuyze, J. T. A., Loosveldt, and Molenberghs. 2010. “A Method for Evaluating Mode Effects in Mixed Mode Surveys.” Public Opinion Quarterly 74:1027–45.

Vannieuwenhuyze, J. T. A. and Molenberghs. 2010. A SAS Macro to Disentangle Mode Effects on Proportions and Means of a Categorical Variable in an Extended Mixed-Mode Dataset. Retrieved August 20, 2010 (http://perswww.kuleuven.be/jorrevannieuwenhuyze).

Voogt, R. J. and W. E. Saris. 2005. “Mixed Mode Designs: Finding the Balance Between Nonresponse Bias and Mode Effects.” Journal of Official Statistics 21:367–87.

Weisberg, H. F. 2005. The Total Survey Error Approach: A Guide to the New Science of Survey Research. Chicago, IL: University of Chicago Press.

Weisberg, H. F. 2010. Bias and Causation: Models and Judgment for Valid Comparisons. Hoboken, NJ: Wiley.

