Measurement Effects of Survey Mode on the Equivalence of Attitudinal Rating Scale Questions

Abstract

This study applies ordinal confirmatory factor analysis for multiple groups to assess equivalence of scale, random errors and systematic (nonrandom) errors of attitudinal questions surveyed on rating scales under different survey modes (Face-to-Face [F2F], Telephone, Paper, and Web). Empirical findings from a large-scale experiment are presented. Consistent with theoretical expectations, interviewer- and self-administered surveys measured all assessed questions on systematically different scales, with different systematic bias, and with differing extents of random error. These measurement effects were absent when comparing Paper with Web or F2F with Telephone. It is concluded that modes primarily produce systematic measurement effects that affect multiple items equally. Interviewer- and self-administered modes should only be combined with great care in mixed-mode surveys that focus on attitudinal constructs. Combining Paper with Web or Telephone with F2F are the viable options. Of these, the self-administered combination appears more efficient, because these modes exhibited higher indicator reliabilities (smaller random error) than the interviewer modes.

Keywords

mode effects, measurement effects, mixed-mode surveys, measurement equivalence, confirmatory factor analysis, categorical data modeling

Introduction

Analysts of data generated by different modes of data collection, like Face-to-Face (F2F), Telephone, Paper and Pencil, or online (Web), need to be sure that answers to the same questions asked under different modes are equivalent. This concern has gained increased prominence in the context of cross-sectional or longitudinal mixed-mode surveys, in which two or more survey modes are combined concurrently or sequentially to maximize response rates and to optimize on costs (de Leeuw 2005; Dillman, Smyth, and Christian 2009; Dillman, Phelps, et al. 2009). There are in fact strong theoretical arguments that social and cognitive factors impact the answering processes to very different extents (Tourangeau, Rips, and Rasinski 2000; Bowling 2005; de Leeuw 2008; Dillman, Smyth, et al. 2009:311-29). A sizable body of empirical studies has assessed measurement effects of survey modes, also referred to as mode effects, from mainly two perspectives. The “sampling statistics approach” seeks to assess measurement effects by testing differences in marginal means and, more rarely, variances of separate questions (e.g., Schonlau et al. 2004; Link and Mokdad 2005). The “answering behavior approach” considers differences in indicators of answering behavior, such as acquiescent, extreme, nondifferentiated, socially desirable or “don’t know” (DK) answering (de Leeuw 1992; Holbrook, Green, and Krosnick 2003; Fricker et al. 2005; Christian, Dillman, and Smyth 2008; Greene, Speizer, and Wiitala 2008; Chang and Krosnick 2009).

In this article, we follow a third approach to studying measurement effects of modes on attitudinal questions that defines equivalence as independence of answers to a question from a survey mode conditional on latent true scores (Mellenbergh 1989; Meredith 1993; Millsap 2011). This definition implies that two persons with the same true state on the concept of interest give a particular answer with the same probability when asked under different modes. A model-based approach using ordinal multiple-group confirmatory factor analysis (MCFA) is applied to describe how modes impact this probability differentially, referred to as measurement effects (Jöreskog 1971; Bollen 1989; Skrondal and Rabe-Hesketh 2004; Alwin 2007; Kankaraš, Vermunt, and Moors 2011; Millsap 2011).

A limited number of prior studies have used continuous MCFA approaches to study mode equivalence (de Leeuw, Mellenbergh, and Hox 1996; Buchanan, Johnson, and Goldberg 2005; Cole, Bedeian, and Feild 2006; Deutskens, de Ruyter, and Wetzels 2006; de Beuckelaer and Lievens 2009). One potential problem of this literature is that the ordinal measurement level of the attitudinal questions generally is not taken into account. Continuous MCFA models then are error prone in the detection of measurement effects because ordinal data violate distributional assumptions (Alwin 2007; Kankaraš et al. 2011; Millsap 2011:26-37). More sensible alternative model choices are latent trait models using appropriate link functions (e.g., probit as in ordinal MCFA) or categorical latent traits (Meade and Lautenschlager 2004; Kankaraš et al. 2011; Kim and Yoon 2011). Ordinal MCFA, applied in the present study, can be considered a generalization of polytomous item response theory (IRT) models allowing estimation of error variance, which is not possible in IRT (cf. Statistical Methodology and Assumptions section; Muthén 1984; Muthén and Asparouhov 2002; Millsap and Yun-Tein 2004; Kamata and Bauer 2008; Muthén and Muthén 2010; Millsap 2011:122-24).

Another limitation of prior work is the neglect of selection effects of modes. Practical implementations of mode experiments have shown that sample compositions often are not homogeneous across conditions, even if randomization is used (e.g., Dillman, Phelps et al. 2009). This causes a counterfactual situation (Morgan and Winship 2007), in which it is unknown whether an observed measurement effect is caused by the mode or the selection process (Jäckle, Roberts, and Lynn 2010; Vannieuwenhuyze and Loosveldt 2013). We apply a propensity score adjustment method in the estimation of the ordinal MCFA model to control for the selection problem.

MCFA allows estimating three types of measurement effects. First, modes may cause differences in the scale of a given item that is sensitive to mode by altering the relationship of the true score and the observed answer, that is, in expectation the same respondent would not give the same answer when asked under different modes (Vandenberg and Lance 2000; Millsap 2011:5-7). Second, modes may change the extent of random measurement error of an item sensitive to mode implying differential reliability (precision) and thus attenuated relationship estimates, though the answer probabilities are unbiased (e.g., Fuller 1987; Biemer and Stokes 1991). Third, a previously rather neglected advantage of MCFA is estimation of relative differences in the extent of systematic error across sets of questions, also called nonrandom errors, correlated errors, and method variance (Blalock 1970; Andrews 1984; Gerbing and Anderson 1984; Saris and Andrews 1991; Green and Citrin 1994; Davis 1997; Alwin 2007:41-42). These are person- and method-level sources of systematic bias and systematic variance affecting all indicators equivalently.

The MCFA approach thus allows additional insights over the sampling statistics and the answering behavior approaches. Marginal analyses of means and variances can neither distinguish item-specific scale bias from systematic bias nor can item-specific random error be differentiated from true score variance and nonrandom error variance. The answering behavior approach, furthermore, has described many types of behaviors, but cannot estimate their statistical effects. Answering behaviors are likely causes of systematic error, however. For example, nondifferentiation (Krosnick 1991) or acquiescence (Billiet and McClendon 2000) are causes of systematic bias and variance not accounted for by the true score. MCFA thus establishes a model-based link between the answering behavior and the sampling statistics approaches.

In practice, it is very relevant to know whether measurement effects are item-specific or systematic phenomena. Presence of different extents of systematic error signifies different relationships of true score and observed answers of all items and hence systematically incomparable modes. Such a difference would indicate, for example, that wording and topic of a question are less important in influencing measurement effects. However, if measurement effects only concerned single items, these could be taken into account in design, for example, by changing wording, or in analysis, for example, by allowing for partial nonequivalence or by indicator omission from analyses that require pooling of data across modes.

Our data stem from a large-scale mode experiment based on a probability sample from the general population of the Netherlands implying high external validity (gross n = 8,800). Earlier MCFA studies mainly considered special interest groups.¹ Furthermore, the data allow a comparison of the four major survey modes (F2F, Telephone, Paper, and Web). In prior literature, pairwise comparisons of Web and Paper modes prevailed.

We first state our expectations about measurement effects in the second section, followed by a description of our experimental data in the third section. The fourth section introduces the technical background of the methodology. The fifth section presents results on three scales. The sixth section discusses the findings and concludes.

Expectations About Measurement Effects of Modes

Historically, a prominent cause of measurement effects of modes was rooted in different traditions of questionnaire design. To eliminate this alternative explanation, researchers should apply “unified designs” of questions across modes suggesting, for example, equal wording of questions and labeling of answer scales (Dillman, Smyth et al. 2009:321-29). Any remaining effects of factors that cannot be equalized are usually attributed to mode (cf. Groves et al. 2010:160-62), of which there are two major ones.

First, the social situation during the answer process naturally differs caused by the presence of an interviewer, in Telephone or F2F modes, or its absence, in self-administered Paper and Pencil or Web modes (Tourangeau, Rips, and Rasinski 2000:289-312; Bowling 2005; de Leeuw 2008; Dillman, Smyth et al. 2009:311-14). A well-known consequence of this is socially desirable answering in interviewer-administered surveys. But interviewers can also provide motivation in the answer process and throughout the interview, can probe answers, clarify, and reassure that respondents focus on the interview. These aspects can enhance attention and depth of cognitive processing. Interviewers, however, are in control of the pace of the interview and the order of questions (Holbrook et al. 2003; Bowling 2005). This may not give respondents sufficient time to consider answers thoroughly. Especially from telephone surveys it is known that respondents may feel pressured to answer questions, because pauses are perceived as undesirable (de Leeuw 2008). Conversely, self-administered surveys allow a self-chosen pace and order in an anonymous situation, but lack interactive advantages of motivation and clarification.

The second major difference lies in primarily aural or primarily visual communication of questions and answers (Tourangeau et al. 2000:289-312; de Leeuw 2008; Dillman, Smyth et al. 2009:314-20). In aural-based modes, question and answer categories need to be fully memorized, whereas in visual modes respondents can reread question elements multiple times. These tasks pose very different cognitive demands and burden (Bowling 2005; Fricker et al. 2005; Greene et al. 2008; Heerwegh and Loosveldt 2008). This difference in cognitive stimulus is likely to impact the full answer process.

On the surface, interviewer modes, like Face-to-Face (“F2F” in the following) or Telephone, have very similar measurement properties, because interviewers are present, and they both rely on aural information transmission (supposing no visual elements are used in F2F, as in our study). Self-administered modes, like Web and Paper and Pencil (“Paper” in the following), are similar due to self-administration and reliance on visual information transmission. Consequently, the measurement processes of interviewer- and self-administered modes differ strongly. It can therefore be generally expected to find small or no measurement effects when comparing Telephone with F2F (Hypothesis 1a) and Web with Paper modes (Hypothesis 1b), respectively. Moreover, measurement effects should primarily be present between the two interviewer- and self-administered modes (Hypothesis 2). We posit these hypotheses for questions’ scales as well as random measurement errors.

In empirical studies using MCFA modeling, Hypothesis 1b has been supported in comparisons of Web- and Paper-based surveys, which found these modes to be fully equivalent (Buchanan et al. 2005; Cole et al. 2006; Deutskens et al. 2006; de Beuckelaer and Lievens 2009). Yet, since the populations of these studies were rather specific (employees in national or international businesses), external validity is not fully assured. De Leeuw and colleagues (1996) additionally assessed equivalence with respect to interviewer modes based on a national random digit dialing survey with Telephone, F2F, and Paper modes. They find nonequivalence across all modes, in particular between Paper and the two interviewer modes (consistent with Hypothesis 2), but the Telephone and F2F modes were also not fully equivalent. The study by de Leeuw et al. thus points to a potential challenge to Hypothesis 1a. For an indication about random error differences across modes, we refer to a meta-analysis of multi-trait-multi-method (MTMM) literature by Saris and Gallhofer (2007). The authors report that reliability of measurement differs between interviewer- and self-administered surveys, which is consistent with Hypothesis 2. In particular, reliability was lower in interviewer-administered surveys (cf. Braunsberger, Wybenga, and Gates 2007 for similar results from a web–telephone comparison).

Measurement properties of modes, for example, differential demands of the social situation, motivation, and cognition, are known to impact the occurrence of answering behaviors as well (e.g., Holbrook et al. 2003; Heerwegh and Loosveldt 2008). Studies describing answering behaviors found differences primarily between interviewer- and self-administered modes, which conforms to Hypotheses 1 and 2. Four consistent findings are particularly worth noting. First, if the construct of interest is sensitive, socially desirable answering probably affects all indicators of a scale more strongly in the interviewer modes. Second, aural modes have been reported to yield more extreme or more extreme positive responses independent of question content (de Leeuw 1992; Christian et al. 2008; de Leeuw 2008; Dillman, Smyth et al. 2009:316-20; Dillman, Phelps et al. 2009). Third, acquiescent answering behavior was found more often in Telephone than in Web surveys (Holbrook et al. 2003; Greene et al. 2008; however, cf. Heerwegh and Loosveldt 2011). Fourth, nondifferentiation and “straight lining” answering behaviors were found to differ between interviewer modes and Web surveys (Holbrook et al. 2003; Fricker et al. 2005; Heerwegh and Loosveldt 2008; Greene et al. 2008; Chang and Krosnick 2009). These empirical results suggest that the extent of systematic error might also be structured as posed by Hypotheses 1 and 2, that is, primarily be present between interviewer- and self-administered modes (Hypothesis 3) as sources of a difference in the extent of systematic bias (Hypothesis 3a) and variance (Hypothesis 3b).

On the empirical side, systematic variance differences between modes were found in the meta-analysis of Saris and Gallhofer (2007). Interviewer-administered surveys appear to create higher systematic variance than self-administered surveys giving empirical support to Hypothesis 3b. Heerwegh and Loosveldt (2011) report on a systematic bias between a Telephone and a Paper survey, which is interpreted as social desirability effect, consistent with Hypothesis 3a.

Data: The Dutch Crime Victimization Survey (CVS) Mode Experiment

Data are available from a mode experiment in the Netherlands conducted from April to June 2011 by Statistics Netherlands. The topic and large parts of the questionnaire were adopted from the national CVS, an existing cross-sectional survey conducted on a yearly basis by Statistics Netherlands. The experiment was administered independently from the regular CVS at a different time and with a different sample. A simple random sample of 8,800 persons was drawn from the national address register and each person was randomly assigned to one of the four modes; 8,524 persons were eligible: 2,081 in F2F; 2,062 in Telephone; 2,182 in Paper; and 2,199 in Web. All persons received mailed prenotifications and multiple reminders, where self-administered modes additionally contained either a link to a web survey or a paper questionnaire with a return envelope. In the interviewer modes, contact was attempted by telephone or in person; 4,048 respondents participated. American Association for Public Opinion Research Response Rates 1 were F2F 64.3 percent (1,338), Telephone² 67.4 percent (993), Paper 49.8 percent (1,086), and Web 28.7 percent (631).

Statistical analyses were conducted on three unidimensional scales. Questions, item wording, and answer categories are shown in Table 1. Two scales were based on indicators that are regularly included in the CVS (neighborhood traffic pressure [NTP], and police visibility [PV], both four indicators). These were explored and cross-validated on a different data set, the Web version of the regular CVS from 2010. The third scale (duty to obey the police [DTO], three indicators) was validated in the pretest of the fifth round of the European Social Survey (in F2F). It is normally not included in the CVS.

Table 1.

Overview on Indicators and Scales With “Don’t know” (+Indicator Refusal) Rates (in Percentage).

	F2F	Telephone	Paper	Web
Neighborhood traffic pressure (NTP), early position^a
1. Aggressive behavior in traffic	0.4	0.6	8.3	5.9
2. Traffic noise nuisance	0.0	0.1	5.3	1.9
3. Speeding in traffic	0.4	0.5	4.6	1.7
4. Parking problems	0.3	0.2	5.3	2.7
Police visibility (PV), middle position^b
1. The police offer protection to people in this neighborhood	4.6	3.1	23.2	16.8
2. The police have contact with people from this neighborhood	9.0	6.8	27.2	26.0
3. The police react to problems in this neighborhood	10.2	7.3	30.5	24.7
4. The police do their best in this neighborhood	9.3	5.8	30.5	24.9
Duty to obey the police (DTO), late position^c
1. Support the decisions of the police, also if I disagree	1.8	3.6	2.7	0.0
2. Do what the police say, also if I disagree	1.4	3.6	3.0	0.0
3. Do what the police say, also if I am treated unpleasantly	1.7	4.4	2.9	0.0
Sample size (n)	1,338	993	1,086	631

Note. Scale labels: ^a Question: How often does the following happen in your neighborhood? Answer categories: happens almost never or never (1), happens sometimes (2), happens frequently (3), and Don’t know. ^b Completely disagree (1), disagree (2), neutral (3), agree (4), completely agree (5), and Don’t know. ^c Fully not my duty (1), (2), (3), (4), and fully my duty (5).

Ordinal rating scales contained either three (NTP) or five answer categories (PV, DTO). In the interviewer modes, answer categories including “DK” options were read out once at the outset of a set of questions and repeated upon request. No show cards were used in the F2F mode. In both self-administered modes, indicators were presented in grids with labeled scales. A well-known problem in unified mode designs is the presentation of “DK” categories (Dillman, Smyth, et al. 2009:327). If offered visually with each question, DK categories are more prominent in Web or Paper questionnaires than in Telephone or F2F. This can affect the “visual” scale midpoint (Tourangeau, Couper, and Conrad 2004). Also typically, this leads to more frequent use of DK categories in self-administered modes (de Leeuw, Hox, and Scherpenzeel 2011). But omitting DK fully might provoke false or random answers in Web and Paper. Differential presence of DK answers thus is an alternative explanation to the measurement effects we seek to identify. To control for the impact of DK, the treatment of DK categories was varied across the three scales. In the standard CVS, DK categories are explicitly offered in the two self-administered modes. PV and NTP thus had explicit DK in Web and Paper (Table 1). NTP was selected because it was known from earlier rounds of CVS that DK could be expected at low rates in all modes, despite the visual presence of a DK option in Web and Paper. In the PV scale, higher DK response in Web and Paper was expected, which is the more common case when offering explicit DK. As a contrast, no DK option was offered for the third scale in Web and Paper (DTO), which is normally not done in the CVS. In Paper, only omission could thus lead to item nonresponse, which consequently was low. The online routine required all indicators to be answered. Therefore, DK was fully absent in Web on this scale.

Statistical Methodology and Assumptions

The Ordinal MCFA Model

Ordinal MCFA assumes that the observed random vector of ordinal response variables Y has a latent response variable vector Y* linked to it. Assume as conditional distribution of Y*:

$$P(Y^{*} \mid T, M) \sim \mathrm{MVN}\big(E(Y^{*} \mid T, M), \mathrm{Var}(Y^{*} \mid T, M)\big), \tag{1}$$

linked to the observed ordered categorical indicators Y by means of indicator-specific threshold parameters $v^{M}$. Suppose there are c = 1,…, C categories on each Y, then C − 1 threshold parameters are defined. The bivariate normal density is estimated by polychoric correlations that form the basis for further mean and covariance structure analysis:

$$P(Y_{j} = c, Y_{j'} = c' \mid T, M) = \int_{v_{c}^{M}}^{v_{c + 1}^{M}} \int_{v_{c'}^{M}}^{v_{c' + 1}^{M}} N(Y_{j}^{*}, Y_{j'}^{*} \mid T, M) \, dY_{j}^{*} \, dY_{j'}^{*}. \tag{2}$$
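To make the threshold link concrete, the following minimal sketch computes the univariate analogue of the probability above, P(Y_j = c | T, M), for a single indicator. It assumes a probit link, a zero intercept, and a normally distributed error; the function name, the threshold values, and the loading are illustrative assumptions and are not taken from the study.

```python
from statistics import NormalDist


def category_probabilities(thresholds, loading, true_score, error_sd=1.0):
    """Return P(Y = c | T) for c = 1..C of one ordinal indicator.

    Assumes Y* = loading * T + e with e ~ N(0, error_sd**2) and a zero
    intercept; `thresholds` holds the C - 1 increasing cut points on Y*.
    """
    latent = NormalDist(mu=loading * true_score, sigma=error_sd)
    # Cumulative probabilities at -inf, at each threshold, and at +inf.
    cumulative = [0.0] + [latent.cdf(v) for v in thresholds] + [1.0]
    return [upper - lower for lower, upper in zip(cumulative, cumulative[1:])]


# Same true score, two hypothetical modes that differ only in thresholds.
mode_a = category_probabilities([-0.5, 0.5], loading=1.0, true_score=0.0)
mode_b = category_probabilities([-0.2, 0.8], loading=1.0, true_score=0.0)
print([round(p, 3) for p in mode_a])
print([round(p, 3) for p in mode_b])
```

With equal true scores but shifted thresholds, the two hypothetical modes assign different probabilities to the same answer category, which is the kind of scale nonequivalence the model is meant to detect.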

The conditional expectations and variances of the latent response variables are then linked to a congeneric factor model (Lord and Novick 1968; Jöreskog 1971):

$$E(Y^{*} \mid T, M) = τ^{M} + Λ^{M} T, \tag{3}$$

$$\mathrm{Cov}(Y^{*} \mid T, M) = Θ^{M}, \tag{4}$$

with T an R × 1 vector of common factor scores for latent variables r = 1,…, R, Y* a J × 1 vector of J latent response variables (indicators), $Λ^{M}$ a J × R matrix of $λ_{j | M}$ parameters called factor loadings, $τ^{M}$ a J × 1 vector of intercept parameters, and $Θ^{M}$ a diagonal J × J matrix of error variance parameters. M identifies separate parameter sets over modes. The expectations and covariances of Y* given M, unconditional on T, are given by:

$$E(Y^{*} \mid M) = τ^{M} + Λ^{M} κ, \tag{5}$$

$$\mathrm{Cov}(Y^{*} \mid M) = Λ^{M} Φ {Λ^{M}}' + Θ^{M}. \tag{6}$$

κ is the vector of population means and Φ population variance of T. Note that, since we seek to measure the same true score distribution in all modes, population means and variances do not depend on mode.
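As an illustration of the mean and covariance structure above, the sketch below evaluates the model-implied moments of Y* for one mode in the single-factor case (R = 1). The function name and all parameter values are hypothetical and chosen only to show the arithmetic.

```python
def implied_moments(loadings, intercepts, error_variances, kappa, phi):
    """Model-implied mean vector and covariance matrix of Y* for one mode.

    Single common factor (R = 1): mean_j = tau_j + lambda_j * kappa and
    cov_jk = lambda_j * lambda_k * phi, plus theta_j on the diagonal.
    """
    means = [tau + lam * kappa for tau, lam in zip(intercepts, loadings)]
    covariance = [
        [lam_j * lam_k * phi + (error_variances[j] if j == k else 0.0)
         for k, lam_k in enumerate(loadings)]
        for j, lam_j in enumerate(loadings)
    ]
    return means, covariance


means, covariance = implied_moments(
    loadings=[1.0, 0.8, 0.6],
    intercepts=[0.0, 0.0, 0.0],      # zero, as when thresholds carry the location
    error_variances=[0.5, 0.6, 0.7],
    kappa=0.2,
    phi=2.0,
)
print(means)
for row in covariance:
    print(row)
```

Because the error covariance matrix is diagonal, all off-diagonal covariance among the latent responses is carried by the common factor through the loadings.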

Thresholds, intercepts, and loadings set the scale of each question Y in a given mode. Random errors are explicitly included in the model and can be tested for equivalence.³ In the absence of measurement effects on scale or random error, the parameters are equivalent, which is testable on empirical data. Random error variance is essential for the estimation of indicator reliability (e.g., Alwin 2007), which is expressed as:

$$ρ_{j | M} = \frac{λ_{j | M}^{2} Φ}{λ_{j | M}^{2} Φ + θ_{j | M}}, \tag{7}$$

with $θ_{j | M}$ error variance of indicator j. Obviously, random error equivalence is only identical to reliability equivalence in the presence of loading equivalence.
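The last point can be checked numerically. The following sketch evaluates the reliability expression for two hypothetical modes that share the same error variance but differ in loadings; the function name and the numbers are illustrative assumptions.

```python
def indicator_reliability(loading, factor_variance, error_variance):
    """Indicator reliability: true-score part of the variance over the total."""
    true_part = loading ** 2 * factor_variance
    return true_part / (true_part + error_variance)


# Equal random error variance, unequal loadings (hypothetical values).
equal_error = 0.5
rel_mode_1 = indicator_reliability(1.0, 1.0, equal_error)
rel_mode_2 = indicator_reliability(0.7, 1.0, equal_error)
print(round(rel_mode_1, 3), round(rel_mode_2, 3))
```

Although the error variances are identical, the reliabilities differ because the loadings differ, so a test of error variance equivalence speaks to reliability only once loading equivalence has been established.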

Thresholds and intercepts are not simultaneously identifiable, which is why estimation has to focus on one of the two (Millsap 2011:128-31). We focus on thresholds in the following by constraining intercepts to zero. Models are estimated with weighted least squares mean and variance adjusted (WLSMV; on further estimation details, see Muthén 1984; Muthén, du Toit, and Spisic 1997; Asparouhov 2005; Millsap 2011:131-36). For detailed identification restrictions, see Millsap and Yun-Tein (2004), Millsap (2011:138), and the Data Analysis and Results section. Measurement equivalence of scale and random error can be assessed by constraining parameters equal across modes, as detailed in the Data Analysis and Results section.

Systematic Errors in the Ordinal MCFA Model

In contrast to random error, systematic errors are common sources of variance and bias of a particular method of measurement, like a mode, affecting measurements in equivalent ways. One way to include systematic errors in measurement models is by an additive, mode-dependent random variable S. Following approaches by Saris and Andrews (1991) and Scherpenzeel and Saris (1997), we assume that these are observed as a compound with T, say for person i: $T_{i | M}^{*} = T_{i} + S_{i | M}$ (see also Alwin 2007:41-42). Latent response indicator j can then be described as follows (with $ϵ_{i j | M}$ an error term):

$$y_{i j | M}^{*} = λ_{j | M} (T_{i} + S_{i | M}) + ϵ_{i j | M}. \tag{8}$$

Define mode-specific systematic bias and variance as $E(S | M)$ and $\mathrm{Var}(S | M)$. Statistically, the presence of systematic bias and variance changes the response scale, altering both the threshold (intercept) and the loading structure by monotonic shifts. Therefore, they represent serious threats to equivalence of measurement. Assume $\mathrm{Cov}(T, S) = 0$, then:

$$E(Y_{j}^{*} \mid M) = λ_{j | M} \big(κ + E(S \mid M)\big), \tag{9}$$

$$\mathrm{Var}(Y_{j}^{*} \mid M) = λ_{j | M}^{2} Φ + λ_{j | M}^{2} \mathrm{Var}(S \mid M) + θ_{j | M}. \tag{10}$$

Equivalently, we may introduce a mode-specific systematic intercept with $λ_{j | M} E (S | M) = τ_{j | M}^{s}$ :

$$E(Y_{j}^{*} \mid M) = τ_{j | M}^{s} + λ_{j | M} κ. \tag{11}$$

This reflects that systematic errors can be interpreted as constant shifts of intercepts, weighted by factor loadings. Since intercepts are zero-constrained for identification in our model (see above), the presence of $τ_{j | M}^{s}$ causes a weighted shift of all thresholds $v_{j | M}$ simultaneously. Similarly, systematic variance inflates the scale metric (loadings) of all indicators by a constant. Let parameter $γ_{M} = \mathrm{Var}(S | M) Φ^{-1}$ scale S to the unit of true score variance, then for all j:

$$\mathrm{Var}(Y_{j}^{*} \mid M) = λ_{j | M}^{2} Φ + λ_{j | M}^{2} γ_{M} Φ + θ_{j | M} = (1 + γ_{M}) λ_{j | M}^{2} Φ + θ_{j | M}. \tag{12}$$

Another interpretation for the term $λ_{j | M}^{2} γ_{M} Φ$ is an increase in indicator error covariance. Also note that presence of systematic variance necessarily biases estimates of reliability upward due to overestimation of true score variance (cf. equation 7, also Alwin 2007:42).
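A small numerical sketch may help to see the upward bias. It assumes that the compound variance (1 + γ_M)Φ is mistaken for true score variance when the reliability expression is evaluated; the function names and all values are hypothetical.

```python
def latent_response_variance(loading, phi, gamma, error_variance):
    """Variance of Y* when systematic variance equals gamma * phi."""
    return (1.0 + gamma) * loading ** 2 * phi + error_variance


def apparent_reliability(loading, phi, gamma, error_variance):
    """Reliability computed with the compound variance (1 + gamma) * phi
    standing in for the true score variance phi."""
    compound_part = (1.0 + gamma) * loading ** 2 * phi
    return compound_part / (compound_part + error_variance)


# Hypothetical indicator: loading 0.8, true score variance 1, error variance 0.36.
without_systematic = apparent_reliability(0.8, 1.0, gamma=0.0, error_variance=0.36)
with_systematic = apparent_reliability(0.8, 1.0, gamma=0.5, error_variance=0.36)
total_variance = latent_response_variance(0.8, 1.0, gamma=0.5, error_variance=0.36)
print(round(without_systematic, 3), round(with_systematic, 3), round(total_variance, 3))
```

For any positive γ_M the apparent reliability exceeds the value obtained with Φ alone, since the systematic part is counted as if it were true score variance.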

In single-group situations systematic errors are not identified but estimated as true score compound $T_{M}^{*}$ . However, relative differences can be estimated in multigroup situations using factorial designs. Introduce $κ_{M}^{*}$ and $Φ_{M}^{*}$ as expectation and variance of $T_{M}^{*}$ . Assuming loading equivalence across modes (testable assumption), it must hold for any two modes due to randomization that:

$$κ_{M = 1}^{*} = κ_{M = 2}^{*} \Leftrightarrow E(S \mid M = 1) = E(S \mid M = 2), \tag{13}$$

$$Φ_{M = 1}^{*} = Φ_{M = 2}^{*} \Leftrightarrow \mathrm{Var}(S \mid M = 1) = \mathrm{Var}(S \mid M = 2). \tag{14}$$

The mean and variance difference of the estimated compound factor gives an indication of relative differences in systematic errors, because true score distributions should be balanced by means of randomization. This logic forms the basis of a test of equivalence of systematic errors in the presence of loading equivalence. Note that we consider only relative differences in systematic error, which is what matters for conclusions about equivalence. We cannot draw conclusions about the absence of systematic bias or variance using this approach.
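The identification logic can be illustrated with a small simulation. The sketch below draws compound scores T + S for two hypothetical modes that share the same true score distribution, as randomization would imply, but differ in systematic bias and variance. It assumes normal and mutually independent T and S; all names and values are illustrative and do not reproduce any result of the study.

```python
import random
from statistics import fmean, variance


def simulate_compound_scores(n, kappa, phi, bias, sys_var, seed):
    """Draw n compound scores T + S with T ~ N(kappa, phi), S ~ N(bias, sys_var)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n):
        true_score = rng.gauss(kappa, phi ** 0.5)
        systematic = rng.gauss(bias, sys_var ** 0.5)
        scores.append(true_score + systematic)
    return scores


# Identical true score distribution, different systematic error (hypothetical).
mode_1 = simulate_compound_scores(50_000, kappa=0.0, phi=1.0, bias=0.3, sys_var=0.4, seed=1)
mode_2 = simulate_compound_scores(50_000, kappa=0.0, phi=1.0, bias=0.0, sys_var=0.1, seed=2)

mean_gap = fmean(mode_1) - fmean(mode_2)            # estimates E(S|M=1) - E(S|M=2)
variance_gap = variance(mode_1) - variance(mode_2)  # estimates Var(S|M=1) - Var(S|M=2)
print(round(mean_gap, 2), round(variance_gap, 2))
```

The gaps recover only the differences between the two modes. Adding the same bias or systematic variance to both modes would leave them unchanged, which is why the approach cannot establish the absence of systematic error.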

The term S is used to approximate⁴ the systematic effects of answering behaviors on means and variances of sets of response variables (Billiet and McClendon 2000; Welkenhuysen-Gybels, Billiet, and Cambré 2003; Billiet and Davidov 2008; Morren, Gelissen, and Vermunt 2011; Heerwegh and Loosveldt 2011). For example, if persons, depending on mode, vary in their propensity to agree to sets of indicators (acquiescence), S has variance and nonzero means in the direction of agreeing, introducing systematic bias and variance. A similar argument can be made for socially desirable response behavior. If persons in the population vary in their tendency to provide desirable answers across all indicators in a given model, this introduces systematic variance and a bias in the direction of desirable responses. Moreover, if persons provide extreme answers on all indicators, loadings are scaled upward by a constant, because any true score leads to higher (or lower) responses. As shown above, the presence of S with a variance is equivalent to a shift in loadings. Also, behaviors like nondifferentiation and straight lining have been discussed to cause “correlated errors” of indicators (Gerbing and Anderson 1984; Green and Citrin 1994). As mentioned above, correlated errors in MCFA models are statistically equivalent to the presence of a systematic error term. From this illustration, it is apparent that the mode-dependent term S denotes a “net effect” of the many reasons for systematic error differences across modes but avoids specifying a particular type of behavior as an error source.

S has also been referred to as invalidity effect of the method, invalidating unbiased measurement of the concept of interest (Saris and Andrews 1991; Scherpenzeel and Saris 1997; Mellenbergh 1999). We conceptualize S as compound with T in this tradition. Then the effect of S is mediated by factor loadings (cf. equations 8–10). An alternative conceptualization is to assume a direct effect of S on Y, modeling S as a second factor with unit constrained loadings (e.g., Billiet and McClendon 2000; Bollen and Paxton 2008; Welkenhuysen-Gybels et al. 2003). We will return to this alternative option in the discussion section.

Selection Effects in Mode Experiments

The above considerations were made for a fully randomized experiment. However, full randomization of persons to modes is seldom possible, especially if samples from the general population are concerned, because modes involve differential sampling frame coverage and evoke differential self-selection (e.g., Groves et al. 2010:162-68). Selection effects are an alternative explanation for measurement effects (Jäckle et al. 2010; Vannieuwenhuyze and Loosveldt 2013), comparable to counterfactual situations in quasi-experiments (Morgan and Winship 2007). Two counterfactual situations are possible. First, a selection variable X (i.e., a variable that causes selection into mode conditions) might also be related to the true score of interest. Second, there might be measurement nonequivalence across classes of X. For these reasons, it is necessary to adjust for selection, for example, by conditioning on X. One way to do so is weighting adjustment by the inverse of propensity scores (Rosenbaum and Rubin 1983; Rosenbaum 1987; Kaplan 1999; Morgan and Winship 2007; Guo and Fraser 2010). For more than two modes, it is advisable to weight to a reference population.⁵ Define a propensity score model as $P(R = 1 | X, M) = F(X β + M * X γ)$, where R is the response indicator, M*X denotes all interactions of the mode indicators with X under suitable identification constraints, and $F(\cdot)$ is the cumulative normal (probit) or logistic (logit) link function. From this model, propensity scores $\hat{e}(X, M)$ are estimated as the basis of weights $\hat{w} = \hat{e}(X, M)^{-1}$. This model calibrates the mode-specific response distributions to the population, assuming availability of auxiliary variables X on the sample level. Weighting adjustment has been integrated in WLSMV estimation of ordinal MCFA (Asparouhov 2005).
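The weighting step can be sketched for the simplest case of a single categorical auxiliary variable. With a categorical X and all mode-by-X interactions included, the fitted propensities of a saturated probit or logit model coincide with the observed response rates per cell, so no regression routine is needed here. This is an illustrative simplification rather than the model estimated in the study; the function names, category labels, and counts are hypothetical.

```python
from collections import defaultdict


def inverse_propensity_weights(records):
    """Return weights 1 / e(X, M) per (mode, x_class) cell.

    `records` holds one (mode, x_class, responded) tuple per sampled person.
    The propensity of a cell is its observed response rate.
    """
    sampled = defaultdict(int)
    responded = defaultdict(int)
    for mode, x_class, did_respond in records:
        sampled[(mode, x_class)] += 1
        responded[(mode, x_class)] += int(did_respond)
    weights = {}
    for cell, n_sampled in sampled.items():
        if responded[cell] == 0:
            raise ValueError(f"no respondents in cell {cell}")
        weights[cell] = n_sampled / responded[cell]
    return weights


def make_records(mode, x_class, n_sampled, n_responded):
    """Build toy sample records for one cell (hypothetical data)."""
    return [(mode, x_class, i < n_responded) for i in range(n_sampled)]


records = make_records("Web", "young", 100, 50) + make_records("Web", "old", 100, 20)
weights = inverse_propensity_weights(records)

# Weighted respondent counts per cell reproduce the sampled counts.
weighted_young = 50 * weights[("Web", "young")]
weighted_old = 20 * weights[("Web", "old")]
print(weights, weighted_young, weighted_old)
```

After weighting, the respondents in each cell stand in for the full sample in that cell, which balances the composition of X across modes before the measurement model is fit.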

Data Analysis and Results

Testing Procedure

The testing procedure for mode differences on scales, random errors and systematic errors followed a series of steps, graphically illustrated in Appendix A (the online appendices are available at http://smr.sagepub.com/supplemental). First, so-called configural equivalence models were fit (model 1), which specify only the factor structure with free parameters under minimal identification constraints⁶ (hereafter “MIC”; Millsap and Yun-Tein 2004; Millsap 2011:138). If configural equivalence held, all loadings and thresholds were constrained simultaneously to test scale equivalence across all indicators ( $Λ^{M} = Λ$ , $v^{M} = v$ , model 2). This is similar to a constrained backward strategy (Muthén and Asparouhov 2002; Stark, Chernyshenko, and Drasgow 2006; Yoon and Millsap 2007; Kim and Yoon 2011). An advantage of this strategy is that specifying a particular type of MIC is avoided at first, which is useful, because “wrong” choices (i.e., constraining an unequal loading or threshold equal for identification) can bias parameter estimation. Deteriorating model fit against the configural equivalence model suggests scale nonequivalence of at least one indicator. If fit deteriorated, the location of misfit was determined on either loadings or thresholds. First, all loadings were freed while thresholds were held fixed (model 2a). This required unit constraining a particular reference loading. Which indicator to choose is not trivial, however (Yoon and Millsap 2007; French and Finch 2008). We compared model fits of all possible anchor indicators and chose the model that maximized fit.

An improvement in fit of model 2a against the constrained scale equivalence model would suggest that nonequivalence of loadings causes (part of) the misfit. If loading equivalence holds but scale equivalence does not, nonequivalence is likely located on the thresholds. This was then tested by freeing thresholds while holding loadings fixed (model 2b). As MIC, one threshold per indicator always needed to be constrained equal, plus a second threshold for the anchor indicator, where again we chose the MIC that maximized fit.

If nonequivalence was found on the loadings or the thresholds, it was assessed whether the expected structure of measurement effects according to Hypotheses 1a/1b and 2 held. If there was nonequivalence on thresholds, then this implied testing (model 2b-1):

$$v^{\mathrm{F2F}} = v^{\mathrm{Tel}} \land v^{\mathrm{Paper}} = v^{\mathrm{Web}},$$

against the model which kept all thresholds free (model 2b). If model fit did not deteriorate, the structure predicted by Hypotheses 1a/1b and 2 held. Finally, it was assessed which indicators caused the measurement effect by inspecting parameter estimates of thresholds (or loadings).

Next, random errors were constrained across modes ( $Θ^{M} = Θ$ , model 3). Error variances were always tested in the scale equivalence model and, if the scale equivalence model did not hold, additionally in a parsimonious model with good fit, because it is uncertain how sensitive error variance tests are to misspecified baseline models. If full equivalence of random errors was rejected, assessment of Hypotheses 1 and 2 was conducted as explained for nonequal thresholds.

To test systematic bias and variance equivalence, we used equations 13 and 14, which suggest that, due to randomization, compound factor means and variances are only equal across modes if systematic errors are equal. If means $κ_{M}^{*}$ were equality constrained, any group difference in systematic bias would have caused a differential shift in thresholds (equation 11). If thresholds were also fixed, however, as in the scale equivalence model, a decrease in model fit indicated the presence of a difference in systematic bias (model 4a). When the compound factor variance $Φ_{M}^{*}$ was equality constrained, systematic variance differences manifested in unconstrained loadings or error covariances by a scaling factor (equation 12). Model fit deteriorated if this was not possible because loadings were constrained, as in the scale equivalence model (model 5; error covariances are fixed at zero in all analyses).

Model Fit Evaluation, Oort Adjustment, and Cross-Validation

WLSMV estimation was conducted with the software Mplus 6.1. “DK” answers or refusals are treated as missing completely at random in WLSMV estimation. Change in fit was assessed by adjusted χ² difference tests (using the ‘difftest’ option in Mplus; cf. Asparouhov and Muthén 2006) and the global fit index Root Mean Square Error Of Approximation (RMSEA). A significant χ² test denotes a significant change in fit. Fit indices like RMSEA are still under testing for ordinal confirmatory factor analysis (CFA) models (Millsap 2011:136), but in continuous CFA, RMSEA < .05 is considered a good fit. Furthermore, a simulation by Chen (2007) showed for continuous MCFA that a change larger than .01 in RMSEA indicates a meaningful change in fit. We took RMSEA as a secondary guideline but primarily relied on the χ² test.
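As a rough illustration of the RMSEA criterion, the sketch below computes RMSEA from a model χ², its degrees of freedom, and the sample size. The formula is not stated in this article; the multiple-group form with a √G scaling is an assumption about the software convention, and the function name is hypothetical. Under this assumption, the inputs reported for the NTP scale in Table 2 (N = 4,021, four mode groups) give values within about .001 of the reported RMSEA for models 1 and 2b. The adjusted χ² difference tests cannot be recomputed this way, because under WLSMV they require the ‘difftest’ procedure rather than a simple difference of model χ² values.

```python
import math

def rmsea_multigroup(chi2, df, n_total, n_groups):
    """RMSEA with sqrt(G) scaling (assumed multiple-group convention)."""
    excess = max(chi2 - df, 0.0)
    return math.sqrt(n_groups) * math.sqrt(excess / (df * n_total))

N_NTP, GROUPS = 4021, 4

# Model chi-square and df values taken from the NTP columns of Table 2.
configural = rmsea_multigroup(9.6, 8, N_NTP, GROUPS)        # model 1
fixed_scale = rmsea_multigroup(173.7, 26, N_NTP, GROUPS)    # model 2
free_thresholds = rmsea_multigroup(50.9, 17, N_NTP, GROUPS) # model 2b

# Chen's (2007) guideline: a change larger than .01 is taken as meaningful.
delta = fixed_scale - configural
meaningful = abs(delta) > 0.01
```

Small deviations from the published values can arise from rounding of the reported χ² and from the exact sample-size term used by the software.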

Two additional measures were taken to assure robustness of our results. Testing a less-constrained model against a constrained baseline model with bad fit has been shown to inflate Type I error rates in the detection of nonequivalence (Kim and Yoon 2011). Such testing is unavoidable, however, for example, in tests of loading or threshold nonequivalence when scale equivalence is rejected. Kim and Yoon showed that the so-called Oort (1998) correction normalizes false positive rates in categorical MCFA with WLSMV. Oort’s correction was applied to all affected χ² tests.⁷
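A minimal sketch of the adjustment follows. It assumes the correction takes the form K′ = χ² / (K + df − 1) × K, where K is the unadjusted critical value and χ² and df are those of the more constrained model in the comparison; neither the formula nor that choice is spelled out in this section. With the NTP values for model 5 in Table 2 (χ² = 89.8, df = 34) and the conventional critical value for 3 degrees of freedom at α = .05, this form yields an adjusted critical value of about 17.2, the value reported in the results for that test.

```python
def oort_critical_value(chi2_constrained, df_constrained, k_unadjusted):
    """Oort-adjusted critical value K' = chi2 / (K + df - 1) * K (assumed form)."""
    return chi2_constrained / (k_unadjusted + df_constrained - 1.0) * k_unadjusted

K_05_DF3 = 7.815  # chi-square critical value for alpha = .05 and 3 df

# Model 5 (fixed factor variance), NTP scale, tested against model 3b-1.
k_adj = oort_critical_value(89.8, 34, K_05_DF3)

observed_diff = 16.9                 # adjusted chi-square difference from Table 2
significant = observed_diff > k_adj  # False: the test stays below the adjusted value
```

The adjusted value exceeds the unadjusted 7.815 because the constrained model fits poorly, which is the situation the correction is designed for.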

Additionally, all models were cross-validated using half of the sample for testing and the second half of the sample for retesting. Reported model fit statistics are based on the full sample, but results of difference tests are reported only if cross-validation suggested equal conclusions in both half splits.

Weighting Adjustment

Eight sociodemographic variables were available from the national registers: gender, age (six categories: 15–24, 25–34, 35–44, 45–54, 55–64, and 65 or higher), income (seven categories: no income, up to €30 k, 30–45 k, 45–60 k, 60–100 k, 100 k and above, and missing), civil status (four categories: married or partnership, single, divorced or widowed, and missing), nationality (three categories: Dutch, Western foreigner, and non-Western foreigner), household size (three categories: one person, two, and three or more), urbanity (four categories: strong, moderate, little, and none), and living in one of the three biggest national cities (four categories: Amsterdam, Rotterdam, Utrecht, and other). Response propensities were estimated from a probit model including all categorical predictors and their interactions with three mode condition indicators. The reference set comprised all eligible units, including units without telephone access (also in the Telephone condition). The maximum normalized weight was 1.846, which is not extreme.

NTP and PV Scales

Table 2 provides test sequences for the NTP and PV scales. Consider first results for the NTP scale. Very low RMSEA and insignificant model χ² indicated very good fit of the configural equivalence model (model 1). The fit of the full scale equivalence model strongly deteriorated, however, suggesting that scale parameters were nonequivalent (model 2). Freeing loadings across conditions did not result in a significant increase in fit (model 2a), indicating that the major source of misfit was located on the thresholds. Consequently, we freed all thresholds while holding loadings equal (model 2b). Compared to the scale equivalence model, a highly significant χ² test and negative RMSEA difference indicated a strongly improved fit. Now we tested Hypotheses 1 and 2 simultaneously by imposing “$v^{\mathrm{F2F}} = v^{\mathrm{Tel}} \land v^{\mathrm{Paper}} = v^{\mathrm{Web}}$” on the thresholds (model 2b-1). This model did not fit significantly worse than the full nonequivalence model 2b in terms of RMSEA and χ² value, which suggests that most nonequivalence lies between interviewer- and self-administered modes (Hypothesis 2). From these results, we can conclude, consistent with Hypotheses 1 and 2, that the nonequivalence of scale was located between interviewer- and self-administered modes only and, more particularly, on nonequivalent positions of thresholds.

Table 2.

Equivalence Test Sequences for the NTP and PV Scales.

			Neighborhood Traffic Pressure (NTP) Scale					Police Visibility (PV) Scale
Model	Equivalence Test	Tested Against	RMSEA	RMSEA Diff.	Model χ²	Adj. χ² Diff. Test	Model df (Diff.)	RMSEA	RMSEA Diff.	Model χ²	Adj. χ² Diff. Test	Model df (Diff.)
1	Configural	—	.014	—	9.6 (ns)	—	8	.048	—	25.4^c	—	8
Scale equivalence
2	Fixed scale	1	.074	+.060^a	173.7***	149.8***	26 (18)	.052	+.004	180.2***	156.6***	50 (42)
2a	Free loadings	2	.085	+.011^a	141.5***	52.8 (ns)^b	17 (9)	.058	+.006	171.2***	13.0 (ns)	41 (9)
2b	Free thresholds	2	.045	−.029^a	50.9***	143.9***	17 (9)	.026	−.026^a	28.2*	151.8***	17 (33)
2b-1	Web = Paper ≠ F2F = Tel	2b	.040	−.005	60.8***	9.9 (ns)	23 (6)	.033	+.007	79.4***	51.7^c	39 (22)
Random error equivalence
3a	Fixed error variances	2	.080	+.006	281.0***	121.2**^b	38 (12)	.059	+.007	270.5***	108.4***	65 (15)
3a-1	Web = Paper ≠ F2F = Tel	2	.065	−.009	189.8***	5.8 (ns)	34 (8)	.052	−/+0	206.6***	29.5 (ns)^b	58 (8)
3b	Fixed error variances	2b-1	.056	+.016^a	145.5***	94.6***	35 (12)	.037	+.004	117.4***	42.3*^b	51 (12)
3b-1	Web = Paper ≠ F2F = Tel	2b-1	.034	−.006	67.7***	6.5 (ns)	31 (8)	.036	+.003	106.2***	31.2 (ns)^b	47 (8)
Systematic error equivalence
4a	Fixed factor means	2	.092	+.018^a	276.7***	68.5*^b	29 (3)	.072	+.020^a	316.8***	60.4**^b	53 (3)
4a-1	Web = Paper ≠ F2F = Tel	2	.070	−.004	166.5***	7.0 (ns)^b	28 (2)	.043	−.009	141.7***	3.0 (ns)	52 (2)
4b	Fixed factor means	2b-1	.047	+.007	86.4***	21.3*^b	27 (4)	.043	+.010	115.3***	19.9**^b	42 (3)
4b-1	Web = Paper ≠ F2F = Tel	2b-1	.041	+.001	69.2***	9.4 (ns)^b	26 (3)	.027	−.006	68.5**	3.2 (ns)	41 (2)
5	Fixed factor variance	3b-1	.040	+.006	89.8***	16.9 (ns)^b	34 (3)	.029	−.007	89.7***	4.6 (ns)	50 (3)

Note. NTP Scale: N = 4,021 (27 cases excluded with Don’t know/refusal on all indicators).

PV Scale: N = 3,799 (249 cases excluded with Don’t know/refusal on all indicators).

“≠” denotes free parameters for two modes and “=” denotes fixed parameters for two modes.

^aMeaningful change of Root Mean Square Error Of Approximation (RMSEA) criterion (i.e., >.01). ^bOort adjustment of critical value resulted in a lower significant level or insignificant test. In all other cases, adjustment did not change the level of significance (Oort 1998; Kim and Yoon 2011). ^cEffect/significance did not hold to cross-validation in both split half samples.

*p < .05. **p < .01. ***p < .001.
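The nesting of the test sequence can be checked arithmetically: for each comparison, the reported df difference should equal the absolute difference between the degrees of freedom of the tested model and of the model it is tested against. The sketch below does this for the NTP columns of Table 2; the dictionary layout and function name are illustrative choices, not part of the original analysis.

```python
# model: (model df, model tested against, reported df difference), NTP scale
ntp_models = {
    "1":    (8,  None,   None),
    "2":    (26, "1",    18),
    "2a":   (17, "2",    9),
    "2b":   (17, "2",    9),
    "2b-1": (23, "2b",   6),
    "3a":   (38, "2",    12),
    "3a-1": (34, "2",    8),
    "3b":   (35, "2b-1", 12),
    "3b-1": (31, "2b-1", 8),
    "4a":   (29, "2",    3),
    "4a-1": (28, "2",    2),
    "4b":   (27, "2b-1", 4),
    "4b-1": (26, "2b-1", 3),
    "5":    (34, "3b-1", 3),
}

def df_mismatches(models):
    """Return the models whose reported df difference disagrees with the model dfs."""
    bad = []
    for name, (df, base, reported) in models.items():
        if base is None:
            continue  # the configural model has no comparison model
        if abs(df - models[base][0]) != reported:
            bad.append(name)
    return bad

mismatches = df_mismatches(ntp_models)  # an empty list means the table is consistent
```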

Next, measurement effects on random error were considered by fixing all error variance matrices across modes. Besides the scale equivalence model 2, which had bad fit, error equivalence was also assessed in the threshold nonequivalence model 2b-1, which had improved fit (yielding models 3a and 3b). Results were robust with regard to the type of base model. Both suggested a decrease in fit and thus unequal error variances. Subsequently, we imposed the structure implied by Hypotheses 1 and 2 simultaneously, leaving separate error matrices only for the self- and interviewer-administered conditions (models 3a-1 and 3b-1). Now model fit did not deteriorate at all, supporting Hypotheses 1 and 2 also for error variances (model parsimony even led to lower RMSEA).

The findings so far suggest measurement effects on scale and random errors between interviewer- and self-administered modes. It was therefore relevant to assess which indicators were affected by measurement effects. Consider the parameter estimates of free thresholds and random errors in model 3b-1 shown in Table 3. Surprisingly, threshold estimates for Web/Paper are consistently lower than for F2F/Telephone on all indicators (except the thresholds fixed as MIC, which in Table 3 are those separating categories “sometimes” and “frequently,” plus the first threshold of Indicator 3). Furthermore, consider relative sizes of the item-specific error variances. Here we find again a difference on all indicators, where Web/Paper showed consistently less random error (F2F/Telephone fixed at one as MIC).

Table 3.

Threshold and Error Variance Estimates for the Neighborhood Traffic Pressure Scale (From model 3b-1) With Bootstrapped Standard Errors (10,000 Draws).

	Free Threshold (1) (Never/Sometimes)		Fixed Threshold (2) (Sometimes/Frequently)		Random Error Variance
	F2F/Tel	Paper/Web	F2F/Tel	Paper/Web	F2F/Tel	Paper/Web
Indicator 1	0.157 (.050)	0.032 (.061)	1.672 (.078)	1.672 (.078)	1	0.515 (.076)
Indicator 2	0.739 (.049)	0.190 (.049)	1.475 (.066)	1.475 (.066)	1	0.729 (.082)
Indicator 3	−0.724 (.058)	−0.724 (.058)	0.679 (.055)	0.679 (.055)	1	0.330 (.057)
Indicator 4	0.165 (.029)	−0.103 (.032)	0.689 (.032)	0.689 (.032)	1	0.703 (.091)

Finally, we tested the equivalence of systematic bias and variance. Constraining factor means across modes in the scale equivalence model strongly deteriorated fit, providing evidence for different extents of systematic bias (model 4a). Subsequently, we tested, in line with Hypothesis 3a, change of fit when only constraining Web to Paper and F2F to Telephone, respectively (model 4a-1). Doing so did not cause deterioration of fit, suggesting that systematic bias was equal for these modes but differed between interviewer- and self-administered modes (support of Hypothesis 3a). As we found threshold nonequivalence, we additionally tested the equivalence of factor means in models with free thresholds (yielding models 4b and 4b-1), leading to the same result.

To assess the equivalence of systematic error variance, factor variances were constrained equal based on model 3b-1, which is a parsimonious model with the best fit in terms of RMSEA (model 5). Model fit slightly deteriorated compared to 3b-1 (RMSEA +.006), and the χ² difference test was insignificant, although it was still close to significance (Oort adjusted critical value: 17.2). In the cross-validation, one half-split sample was very far from significance, however. This is too little evidence to conclude that differential systematic variance is present; Hypothesis 3b is therefore rejected. We also assessed factor variance equivalence in the scale equivalence model 2, leading to the same conclusion (not shown).

Subsequently, we assessed whether it was possible to reproduce the results of the NTP scale on the PV scale. This was possible, without exception, despite the fact that this scale used a different number of answer categories (five instead of three), had a later position in the questionnaire and that there were more “DK” answers in the Web and the Paper conditions. Scale equivalence across all modes was rejected (model 2), while again loading equivalence was not the source of nonequivalence (model 2a). Freeing all thresholds improved model fit strongly (model 2b), where again differences were located between interviewer- and self-administered modes only (model 2b-1). Again threshold differences were present on all indicators. The upper part of Figure 1 displays four ordinal thresholds for each indicator (based on model 2b-1). For the exact wording of the four indicators, we refer again to Table 1. In particular, the second threshold separating “disagree” from “neutral” was always lower in Web/Paper. Also random error was consistently smaller in Web/Paper on all except the third indicator, where it was equal (not shown).

Figure 1.

Threshold estimates with bootstrapped 95 percent confidence interval (10,000 draws) for the PV scale (upper part, based on model 3b-1) and with additionally zero-constrained factor means (lower part) illustrating mediated impact of systematic bias (black: F2F/Telephone; grey: Web/Paper).

There was again a difference in systematic bias but not in systematic variance (models 4a, 4b, and 5). We illustrate the impact of the difference in systematic bias on threshold estimates in the lower part of Figure 1. These are estimates based on model 2b-1 with additionally constrained factor means, so that systematic bias is mediated to the thresholds. One can notice an upward shift of all thresholds (cf. equations 9 and 11; compare to upper part of Figure 1). While there was only nonequivalence on one of the thresholds before, systematic bias now causes systematic nonequivalence on all thresholds (except those constrained as MIC). Note that the strength of the impact of systematic bias varied somewhat across indicators. This was due to the fact that systematic bias was mediated by loadings, reflecting that its impact depended on the strength of association between indicator and latent trait.
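The mediation of systematic bias to the thresholds can be illustrated with a small numerical sketch. It assumes a standard ordinal probit measurement model, in which the latent response is the loading times the factor plus a normal error and the observed category is determined by thresholds; equations 9 and 11 are not reproduced in this section, so this form is an assumption. All parameter values and function names below are hypothetical. The sketch shows that a model with a nonzero factor mean and a model with a zero factor mean but thresholds shifted by the loading times that mean imply identical category probabilities.

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def category_probs(thresholds, loading, factor_mean, factor_var, error_var):
    """Category probabilities of one ordinal indicator under a probit model."""
    sd = math.sqrt(loading ** 2 * factor_var + error_var)
    cum = [norm_cdf((t - loading * factor_mean) / sd) for t in thresholds] + [1.0]
    return [cum[0]] + [cum[i] - cum[i - 1] for i in range(1, len(cum))]

# Hypothetical values for a five-category indicator in a self-administered mode.
thresholds = [-1.0, -0.2, 0.6, 1.4]
loading, factor_mean = 0.8, -0.3

# Free factor mean, original thresholds.
probs_free_mean = category_probs(thresholds, loading, factor_mean, 1.0, 1.0)

# Factor mean constrained to zero: the bias is absorbed by the thresholds.
shifted = [t - loading * factor_mean for t in thresholds]
probs_zero_mean = category_probs(shifted, loading, 0.0, 1.0, 1.0)

max_gap = max(abs(a - b) for a, b in zip(probs_free_mean, probs_zero_mean))
```

With a negative factor mean, every shifted threshold lies above the original one, and the size of the shift is proportional to the loading, which is why the impact varies across indicators.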

DTO Scale

The DTO scale differed from the prior two in that no explicit DK categories were offered in Paper and Web. The testing procedure still reproduced nearly all findings from NTP and PV (Table 4; note that the configural equivalence model is just identified and therefore not shown). There was still a difference in systematic bias. Also, error variances were consistently smaller for all indicators in Web/Paper, as for the NTP and PV questions. However, two differences emerged. First, we found no measurement effects on the scales of any of the indicators (no threshold differences), as suggested by the very low RMSEA of model 2, rejecting Hypothesis 2. Second, we found a significant difference in factor variance of the Web condition, while all other variances were equal (model 5-2). This was remarkable, considering differential systematic variance was not found for the NTP and PV scales. In the discussion, we speculate that this is related to the omission of DK in Web.

Table 4.

Equivalence Test Sequence for the Duty to Obey the Police Scale.

Model	Equivalence Test	Tested Against	RMSEA	RMSEA Diff.	Model χ²	Adj. χ² Diff. Test	Model df (Diff.)
Scale equivalence
2	Fixed scale	—	.028	—	53.6^c	—	30
Random error equivalence
3a	Fixed Error Variance	2	.062	+.034^a	186.3***	157.1***	39 (9)
3a-1	Web = Paper ≠ F2F = Tel.	2	.032	+.004	71.7***	20.6^c	36 (6)
Systematic error equivalence
4a	Fixed Factor Means	2	.039	+.011^a	82.8***	16.7*^b	33 (3)
4a-1	Web = Paper ≠ F2F = Tel.	2	.019	−.009	43.5 (ns)	2.4 (ns)	32 (2)
5	Fixed Factor Variances	3a-1	.044	+.012^a	115.6***	22.8*^b	36 (3)
5-1	Web = Paper ≠ Capi = Cati	3a-1	.046	+.014^a	117.7***	20.1*^b	38 (2)
5-2	Paper = Cati = Capi ≠ Web	3a-1	.030	−.002	72.0***	5.2 (ns)	38 (2)

Note. N = 3,972 (76 cases excluded with Don’t know/refusal on all indicators).

“≠” denotes free parameters for two modes and “=” denotes fixed parameters for two modes.

^aMeaningful change of Root Mean Square Error Of Approximation (RMSEA) criterion (i.e., >.01). ^bOort adjustment of critical value resulted in a lower significant level. In all other cases, adjustment did not change the level of significance (Oort 1998; Kim and Yoon 2011). ^cEffect/significance did not hold to cross-validation in both split half samples.

* p < .05. *** p < .001.

Comparison of Factor Means and Indicator Reliability Across Scales

Two key findings across all scales are different extents of systematic bias and lower random error variances in the self-administered modes. Table 5 compares standardized factor mean estimates and reliability estimates across scales. The lower error variance in the self-administered modes manifests in higher indicator reliabilities of most of the questions in Paper/Web than in F2F/Tel. For the DTO scale, it can be seen that higher compound factor variance results in inflated reliabilities (cf. equation 7).
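The relation between error variance, factor variance, and indicator reliability can be sketched as follows. The sketch assumes the common definition of indicator reliability as the share of latent response variance explained by the factor, λ²φ / (λ²φ + θ); equation 7 is not reproduced in this section, so this form and the function name are assumptions. The loading and the factor variances are hypothetical, and the error variance of 0.515 is borrowed from Indicator 1 in Table 3 purely as an example.

```python
def indicator_reliability(loading, factor_var, error_var):
    """Share of latent response variance explained by the factor (assumed form)."""
    explained = loading ** 2 * factor_var
    return explained / (explained + error_var)

LOADING = 1.0  # hypothetical loading, equal across modes

# Interviewer modes: error variance fixed at one as MIC.
rel_interviewer = indicator_reliability(LOADING, 1.0, 1.0)

# Self-administered modes: smaller error variance raises reliability.
rel_self = indicator_reliability(LOADING, 1.0, 0.515)

# A larger (compound) factor variance raises reliability further.
rel_self_more_var = indicator_reliability(LOADING, 1.5, 0.515)
```

The ordering of the three values mirrors the pattern in Table 5: lower error variance raises reliability, and a higher compound factor variance inflates it further without any change in random error.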

Table 5.

Mode Differences in Indicator Reliabilities (From Models 5, NTP and PV, and 5-2, DTO) and Standardized Factor Means (From Models 4a-1 and 4b-1).

	NTP Scale		PV Scale		DTO Scale
	F2F/Tel	Web/Paper	F2F/Tel	Web/Paper	F2F/Tel	Paper	Web
Indicator Rel. 1	0.590 (.029)	0.678 (.035)	0.545 (.019)	0.645 (.018)	0.317 (.016)	0.396 (.019)	0.490 (.028)
Indicator Rel. 2	0.444 (.029)	0.472 (.028)	0.432 (.019)	0.532 (.022)	0.767 (.023)	0.870 (.021)	0.908 (.017)
Indicator Rel. 3	0.577 (.030)	0.764 (.034)	0.660 (.018)	0.663 (.021)	0.638 (.020)	0.724 (.019)	0.794 (.020)
Indicator Rel. 4	0.101 (.015)	0.118 (.017)	0.771 (.018)	0.835 (.019)	—	—	—
Factor Means^a (Model 4a-1)	0	.384 (.050)	0	−.323 (.042)	0	−.159 (.041)	−.127 (.033)
Factor Means^b (Model 4b-1)	0	.192 (.054)	0	−.468 (.059)	—	—	—

Note. DTO = duty to obey the police; NTP = neighborhood traffic pressure; PV = police visibility. ^aModel 4a-1: scale equivalence with free factor means only between F2F/Tel and Paper/Web (cf. Table 2). ^bModel 4b-1: scale nonequivalence between F2F/Tel and Paper/Web (free thresholds) with free factor means only between F2F/Tel and Paper/Web (cf. Table 2).

In models 4a-1 and 4b-1, factor variances were additionally equality constrained for Web/Paper to yield equivalent standardized factor means (in the NTP and PV scales only). In the DTO scale, the factor variances were shown to differ strongly between Paper and Web (model 5-2, Table 4). The standardized means differ due to higher factor variances in Web.

Standardized factor means are shown from models 4a-1 and 4b-1 (NTP and PV only). Means of the interviewer modes were constrained to zero for identification. Positive means of the self-administered modes in the NTP scale indicated the systematic bias difference to the interviewer modes (Table 5). The positive sign suggests that across all categories, it was relatively easier in Web/Paper to answer that a traffic problem persisted more frequently in the neighborhood. For PV and DTO, a negative mean was found for Web/Paper, indicating that it was more difficult to agree to questions about proper PV in the neighborhood and support of police actions (cf. Figure 1). The strength of systematic mean differences varied across scales depending on whether model 4a-1 or 4b-1 is taken as a benchmark. However, the mean difference was smallest for the DTO scale.

Discussion

Survey researchers designing mixed-mode surveys need to know which modes can safely be combined in later analysis. In the present study, ordinal MCFA models were applied to assess measurement effects of modes on the equivalence of scale, random errors, and systematic errors of attitudinal rating scale questions. Consistent with our expectations, we found a divide between interviewer- and self-administered modes and nearly complete parity, when comparing F2F with Telephone and Web with Paper. The chief differences between interviewer- and self-administered modes were represented by threshold biases, systematic biases, and the extent of random error.

It must again be noted that our study was conducted in the context of a large national survey commissioned by Statistics Netherlands. The consistency of our findings across three scales measuring different traits, using different numbers of answer categories and labels, probability sampling from a general population, as well as the statistical power of our analyses, allow some stronger conclusions about measurement effects of modes. We will discuss the implications for survey methodology and statistical modeling of measurement effects separately.

Considerations About Survey Methodology

The first key finding of the present study is that measurement effects were not indicator-specific phenomena of questions “sensitive” to mode but systematically affected all indicators. Foremost, this is caused by unequal extents of systematic bias (Hypothesis 3a confirmed), which by definition affects all thresholds of indicators causing systematic scale nonequivalence. For the NTP and PV scales, we could also show that there are individual fluctuations per indicator on thresholds that might be due to question content (Hypotheses 1 and 2 confirmed). On the PV scale, only the second threshold of each indicator was affected. This suggests that the indicator-specific threshold bias might rather reflect a systematic threshold-specific bias that cannot be absorbed by the factor means. In this case, the indicator-specific effects would not relate to content but are a second symptom of a systematic category bias.

These results suggest in practice that the same respondent answers the same questions differently when asked in an interviewer- or a self-administered mode. The effect was identified across different question topics, formats, and positions in the questionnaire. It therefore appears unlikely that these design elements can be altered to mitigate the difference in systematic bias (factor means). Rather, the effects are probably caused by mode-specific factors that are impossible to balance by questionnaire design (cf. Expectations About Measurement Effects of Modes section). In survey designs that need to combine more than one survey mode in data collection and analysis, our findings suggest that caution is required when combining data from interviewer- and self-administered modes, especially if considerable amounts of attitudinal rating scale questions are to be included. Furthermore, consistent with theoretical expectations, we did not identify any difference in systematic errors or in the item-specific scale parameters between F2F and Telephone on one hand and Paper and Web on the other (with the exception of Web on the DTO scale, addressed below). Therefore, the viable mode combinations in mixed-mode surveys appear to be either the interviewer- or the self-administered modes. If surveys focus on factual questions, however, implications of the current study generally do not apply, because our study focused on attitudinal questions only. Further research therefore needs to examine measurement effects on factual questions.

The cause of the identified difference in systematic biases between interviewer- and self-administered modes might have been stronger socially desirable responding in the interviewer modes, if one interprets less frequent reporting of traffic problems, better evaluation of PV, and stronger duty to obey police actions all as desirable answers.⁸ Reporting neighborhood traffic problems appears to us a topic with least or no social sensitivity, however. Thus, it is surprising to identify a difference in systematic bias as strong as that for evaluations of PV (Table 5), which arguably has higher sensitivity. We therefore argue that further unknown sources of systematic bias might be present that generalize the problem to nonsensitive rating scale questions. Assessing systematic bias across further scales with a priori low sensitivity is an important aspect for further research. It is an advantage of the present statistical method that the conclusions drawn above apply regardless of our knowledge of the true cause of the difference in systematic biases.

The second key finding of the present study was a lower extent of random error and consequently higher reliability of most indicators in the self-administered modes (Hypotheses 1 and 2 confirmed for random error). This finding matches earlier empirical literature on reliability (cf. Expectations About Measurement Effects of Modes section). Notably, the higher reliability is not caused by loading nonequivalence or higher systematic error variance. Only in the Web condition of the DTO scale did the increased factor variance further increase reliability estimates. The less pressured situation during self-administration, own pace, time for thought, and the possibility to reread questions multiple times appear to reduce random impact on measurements. Researchers studying relationships between attitudinal questions can expect less attenuated estimates from Web/Paper questionnaires. Also, self-administered modes can prove more efficient in the estimation of descriptive statistics.
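The attenuation argument can be made concrete with the classical correction-for-attenuation relation, under which an observed correlation equals the true correlation multiplied by the square root of the product of the two reliabilities. This relation is a textbook result that is assumed here rather than derived in the present study, and the true correlation of .5 is hypothetical. The reliabilities are those of PV indicators 1 and 2 in Table 5, used only as example inputs.

```python
import math

def attenuated_correlation(true_corr, rel_x, rel_y):
    """Observed correlation implied by a true correlation and two reliabilities."""
    return true_corr * math.sqrt(rel_x * rel_y)

TRUE_CORR = 0.5  # hypothetical correlation between the error-free measures

# Reliabilities of PV indicators 1 and 2 from Table 5.
r_interviewer = attenuated_correlation(TRUE_CORR, 0.545, 0.432)  # F2F/Tel
r_self = attenuated_correlation(TRUE_CORR, 0.645, 0.532)         # Web/Paper
```

Both observed correlations fall below the hypothetical true value, but the one implied by the Web/Paper reliabilities is attenuated less.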

A further result of this study is that, contrary to systematic bias, no differences in systematic variance were found for the NTP and PV scales, and the F2F, Telephone, and Paper modes of the DTO scale (Hypothesis 3b rejected). If prior findings on systematic answer behavior differences, like acquiescence, extremeness, and nondifferentiation, consistently hold across rating scale questions, these do not have the expected effect on systematic variance. Possibly, they balance each other in unknown ways or are balanced by other factors. Another explanation is the applied propensity score weighting adjustment. Some prior studies did not use this type of adjustment when studying answering behaviors. Also, many of the MTMM studies summarized in Saris and Gallhofer (2007) perhaps did not apply nonresponse adjustment. In our unweighted analyses (not shown), there were more pronounced variance differences between interviewer- and self-administered modes, which, however, fully disappeared after weighting and after taking the further robustness measures. Systematic variance difference might hence be a selection effect. In general, this is good news, since the presence of differential systematic variance would have signified another source of systematic scale nonequivalence.

Results on the third scale, DTO, differed from the previous two, but this scale also entailed a major difference in design: Web/Paper did not present explicit “DK” categories. First, we found increased systematic variance in the Web but not in the Paper condition. This might be related to the fact that all questions now had to be answered in Web due to “forced-choice” administration, while in Paper questions could still be skipped. Respondents might have shown straight lining or nondifferentiation behaviors instead of answering DK in Web, inflating systematic error variance. An alternative explanation for this finding is, however, the scale’s position, which was close to the end of the questionnaire, where Web respondents might have shown stronger effects of response burden. Increased systematic variance of Web now caused nonequivalence of scale of all indicators between Web and Paper, which is certainly problematic in the face of their strict equivalence in the two other scales.

Second, we did not identify an indicator-specific threshold bias on the DTO scale (scale equivalence). Furthermore, the systematic bias difference on the DTO scale was smaller than on the other scales, though not absent. Even though there are alternative explanations for both findings (e.g., topic, category labeling, and position in questionnaire), omission of DK might thus reduce the systematic bias problem to some extent. Conceptually, this finding can be related to the “visual scale midpoint,” which is biased by DK (Tourangeau et al. 2004). This gives a hint that a part of the systematic bias difference might be caused by the visual presentation of scales and presentation of DK.

In practice, omission of DK might therefore be helpful in reducing the systematic bias problem, while it apparently cannot fully solve it. “Forced-choice” administration in Web probably was counterproductive, but the problem could be solved by allowing respondents to skip questions in Web questionnaires without DK option. Assessing the impact of DK on equivalence of systematic bias and variance vis-à-vis alternative explanations that might yield better unified mode designs is an important path for further research.

Considerations About Statistical Methodology

One decisive advantage of MCFA models over marginal analyses of single questions is the possibility to make inferences about the relative sizes of systematic bias and variance. In the “compound parameterization” applied in this study, we assumed, consistent with approaches formulated, for example, by Saris and Andrews (1991), Scherpenzeel and Saris (1997), and Alwin (2007:42), that methods, such as modes, invalidate the true score that is estimated, T*, from the actual true concept of interest, T. An alternative parameterization for systematic errors would be to specify S as a random effect of unobserved heterogeneity affecting the indicators: $y_{i j | M}^{*} = λ_{j | M} T_{i} + S_{i | M} + ϵ_{i j | M}$ (cf. equation 8). This model assumes S_M to be a random effect (or equivalently a factor with unit constrained loading) with a mean and variance (for similar approaches, see Bollen and Paxton 1998; Billiet and McClendon 2000; Welkenhuysen-Gybels et al. 2003). In Mplus, estimation of this model was possible by specifying a second factor with unit constrained loadings impacting all indicators equivalently. Means and variances of T were zero and unit constrained, respectively, and the model assumed scale equivalence. Subsequently, means and variances of S_M could be tested for equality, representing an alternative equality test of systematic errors, where now S_M is not mediated by factor loadings of T*. These analyses are illustrated in Appendix B (the online appendices are available at http://smr.sagepub.com/supplemental). In sum, equality tests about S_M yielded the same conclusions on systematic errors. However, in testing reasons for scale nonequivalence, this model caused estimation or identification problems, especially when freeing loadings of T or when leaving the variance of S_M free in less constrained models. This rendered the approach impractical for the current and other analyses.
One possible explanation is that ordinal MCFA models, in which more than one factor loads on the same indicator, have a structure for which exact identification conditions are unknown (Millsap 2011:130). This parameterization is still an attractive approach to testing systematic error equivalence in the presence of scale equivalence and also to test loading and threshold nonequivalence, provided the estimation problem can be solved. It is both a conceptual and an empirical question, which of the two parameterizations is the “correct” one. Two of the models relying on the “compound” parameterization had better fit than the alternative parameterization and one slightly worse. Hence, we tend to prefer the compound parameterization on empirical grounds. Conceptually, the question of “where” modes systematically affect measurement, that is, on indicator level or on true score level, is a crucial one, but beyond this discussion.

Regardless of the particular parameterization, S can crucially affect the equivalence of questions before indicator-specific loading or threshold nonequivalence even needs to be considered. Arguably, in light of our findings, indicator-specific nonequivalence might be less relevant if the extent of systematic bias (or variance) already differs. If a test of systematic bias or variance in the scale equivalence model indicates nonequivalence, the analysis could stop at that point and conclude that all indicators in the given set of questions are scale nonequivalent. Methods that explicitly model systematic bias or systematic variance, such as MTMM (Saris and Gallhofer 2007), might thus be viable alternative approaches to studying mode equivalence.

We used ordinal MCFA in our analyses. Alternative categorical measurement models are offered by IRT. It is noteworthy that IRT models would assume error variance equality (see Footnote 3). As we clearly found error variance differences, we prefer the more flexible MCFA model.
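
The contrast with IRT can be sketched with a short derivation. The symbols $\tau_{jk|M}$ (threshold k of indicator j in mode M) and $\sigma_{j|M}$ (standard deviation of the random error) are notation assumed here for illustration, and systematic error is left aside for simplicity. With $y_{ij|M}^{*} = \lambda_{j|M} T_{i} + \epsilon_{ij|M}$ and normally distributed errors, the probability of answering in category k or higher is

$$
P(y_{ij} \geq k \mid T_{i}, M) = P(\epsilon_{ij|M} > \tau_{jk|M} - \lambda_{j|M} T_{i}) = \Phi\left(\frac{\lambda_{j|M} T_{i} - \tau_{jk|M}}{\sigma_{j|M}}\right) = \Phi\left(a_{j|M}\,(T_{i} - b_{jk|M})\right),
$$

$$
a_{j|M} = \frac{\lambda_{j|M}}{\sigma_{j|M}}, \qquad b_{jk|M} = \frac{\tau_{jk|M}}{\lambda_{j|M}}.
$$

The right-hand side has the form of a normal-ogive graded response model. Its discrimination parameter $a_{j|M}$ absorbs the error standard deviation, so a difference in $\sigma_{j|M}$ across modes cannot be told apart from a difference in $\lambda_{j|M}$ unless error variances are parameterized separately, which a multigroup ordinal CFA allows once loadings and thresholds are held equal across groups.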

Although propensity score weighting on sociodemographic variables was used to adjust for selection bias in all analyses, there remains a risk that some bias was not removed, because important confounders might not have been observed. Experimental mode comparisons that attempt causal inference by adjusting for background characteristics always encounter this potential threat to validity. Unfortunately, the assumption that bias is fully removed by the propensity score adjustment typically cannot be tested. The clear consistency of effects conforming to theoretical expectations across scales with three different topics is, however, an indication in favor of the validity of our results. This reassures us that the observed effects are indeed mainly caused by measurement differences between modes.

Whereas adjustment on propensity scores is an established technique to balance selection bias (e.g., Morgan and Winship 2007), the use of instrumental variables was recently suggested for estimating measurement effects on the means of survey variables (Vannieuwenhuyze, Loosveldt, and Molenberghs 2010; Vannieuwenhuyze and Loosveldt 2013). The instrumental variable method offers great potential for coping with confounding more effectively than statistical approaches that use only background characteristics while possibly omitting important unobserved confounding variables. The technique requires, however, a data collection design that is quite different from the one applied in the present study (in particular, a single mode comparison sample administered in parallel to a sequential mixed-mode sample). Furthermore, the technique, which was originally suggested to estimate measurement effects on population means, would first need to be extended for use in more complex variance–covariance models, such as CFA. This is an important path for further research.

The propensity weights were treated as fixed in the analysis, which might lead to underestimated variances if the weights lack precision. Since the maximum normalized weight was small, we believe this threat not to be large. Resampling techniques could offer a way to account for the variance of the weights. However, to date, there is no standard procedure available to combine the results of nested hypothesis tests across many sample draws.
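
As a rough illustration of the resampling idea, the following Python sketch re-estimates the propensity model within every bootstrap draw, so that the uncertainty of the weights enters the standard error. It is a minimal sketch under stated assumptions and not the procedure used in this study: the data are simulated, the function names are hypothetical, inverse-propensity weights are assumed, and the statistic is a simple weighted difference in means rather than a nested test in an ordinal MCFA, so the combination problem noted above is left untouched.

```python
import numpy as np

def fit_logit(X, z, n_iter=25):
    """Logistic regression by Newton-Raphson; X must contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (z - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta

def weighted_mode_difference(X, z, y):
    """Inverse-propensity-weighted difference in mean y between modes z=1 and z=0."""
    p = 1.0 / (1.0 + np.exp(-X @ fit_logit(X, z)))
    w = np.where(z == 1, 1.0 / p, 1.0 / (1.0 - p))
    m1 = np.average(y[z == 1], weights=w[z == 1])
    m0 = np.average(y[z == 0], weights=w[z == 0])
    return m1 - m0

def bootstrap_se(X, z, y, n_boot=200, seed=0):
    """Bootstrap standard error; weights are re-estimated in every resample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample respondents with replacement
        draws[b] = weighted_mode_difference(X[idx], z[idx], y[idx])
    return draws.std(ddof=1)

# Simulated toy data: one background variable drives both mode and outcome
rng = np.random.default_rng(42)
n = 2000
age = rng.standard_normal(n)
X = np.column_stack([np.ones(n), age])
z = (rng.random(n) < 1.0 / (1.0 + np.exp(-0.8 * age))).astype(int)
y = 0.5 * age + 0.3 * z + rng.standard_normal(n)  # assumed mode effect of 0.3

est = weighted_mode_difference(X, z, y)
se = bootstrap_se(X, z, y)
print("weighted mode difference:", round(est, 3), "bootstrap SE:", round(se, 3))
```

Treating the weights as fixed would correspond to computing them once on the full sample and resampling only the outcome. Comparing the two standard errors would give an impression of how much variance the fixed-weight analysis ignores.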

Finally, the Oort correction and cross-validations were applied in all analyses. In some cases, these robustness measures changed conclusions compared to the unadjusted statistics. These nonrobust findings might turn out to have substance, for example, if this study were replicated with larger samples, but the effects would most likely still be small, because the current sample is already quite large. In any case, we advise applying robustness measures in similar analyses to guard against overfitting and inflated false-positive detection rates.

Footnotes

Acknowledgment

The authors would like to thank Meike Morren and three anonymous reviewers for their most valuable comments on earlier versions of this article.

Authors’ Note

The questionnaire lab and the fieldwork department at Statistics Netherlands provided indispensable support in the execution of data collection.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by Statistics Netherlands under the project “Mode Effects in Social Surveys” (in Dutch: Mode Effecten in Persoon Statistieken or MEPS). The PhD of Thomas Klausch at Utrecht University is funded by Statistics Netherlands.

References

Alwin, Duane F. 2007. Margins of Error. Hoboken, NJ: John Wiley.

Andrews, Frank M. 1984. “Construct Validity and Error Components of Survey Measures: A Structural Modeling Approach.” Public Opinion Quarterly 48(2):409–42.

Asparouhov, Tihomir. 2005. “Sampling Weights in Latent Variable Modeling.” Structural Equation Modeling: A Multidisciplinary Journal 12:411–34.

Asparouhov, Tihomir and Bengt Muthén. 2006. “Robust Chi Square Difference Testing with Mean and Variance Adjusted Test Statistics.” Retrieved June 12, 2012 (http://www.statmodel.com/download/webnotes/webnote10.pdf).

Biemer, Paul and Lynne Stokes. 1991. “Approaches to the Modeling of Measurement Errors.” Pp. 487–517 in Measurement Errors in Surveys, Wiley Series in Probability and Statistics, edited by Paul P. Biemer, Robert M. Groves, Lars E. Lyberg, Nancy A. Mathiowetz, and Seymour Sudman. Hoboken, NJ: John Wiley.

Billiet, Jaak B. and Eldad Davidov. 2008. “Testing the Stability of an Acquiescence Style Factor Behind Two Interrelated Substantive Variables in a Panel Design.” Sociological Methods & Research 36:542–62.

Billiet, Jaak B. and McKee J. McClendon. 2000. “Modeling Acquiescence in Measurement Models for Two Balanced Sets of Items.” Structural Equation Modeling: A Multidisciplinary Journal 7:608–28.

Blalock, H. M. 1970. “A Causal Approach to Nonrandom Measurement Errors.” The American Political Science Review 64:1099–111.

Bollen, Kenneth A. 1989. Structural Equations with Latent Variables. New York: John Wiley.

Bollen, Kenneth A. and Pamela Paxton. 1998. “Detection and Determinants of Bias in Subjective Measures.” American Sociological Review 63:465–78.

Bowling, Ann. 2005. “Mode of Questionnaire Administration can have Serious Effects on Data Quality.” Journal of Public Health 27:281–91.

Braunsberger, Karin, Hans Wybenga, and Roger Gates. 2007. “A Comparison of Reliability between Telephone and Web-Based Surveys.” Journal of Business Research 60:758–64.

Buchanan, Tom, John A. Johnson, and Lewis R. Goldberg. 2005. “Implementing a Five-Factor Personality Inventory for Use on the Internet.” European Journal of Psychological Assessment 21:115–27.

Chang, Linchiat and Jon A. Krosnick. 2009. “National Surveys Via RDD Telephone Interviewing Versus the Internet.” Public Opinion Quarterly 73:641–78.

Chen, Fang Fang. 2007. “Sensitivity of Goodness of Fit Indexes to Lack of Measurement Invariance.” Structural Equation Modeling: A Multidisciplinary Journal 14:464–504.

Christian, Leah Melani, Don A. Dillman, and Jolene D. Smyth. 2008. “The Effects of Mode and Format on Answers to Scalar Questions in Telephone and Web Surveys.” Pp. 250–75 in Advances in Telephone Survey Methodology, edited by James Lepkowski, Clyde Tucker, Michael Brick, Edith D. de Leeuw, Lilli Japec, Paul J. Lavrakas, Michael W. Link, and Roberta L. Sangster. New York: John Wiley.

Cole, Michael S., Arthur G. Bedeian, and Hubert S. Feild. 2006. “The Measurement Equivalence of Web-Based and Paper-and-Pencil Measures of Transformational Leadership.” Organizational Research Methods 9:339–68.

Davis, Darren W. 1997. “Nonrandom Measurement Error and Race of Interviewer Effects among African Americans.” The Public Opinion Quarterly 61:183–207.

de Beuckelaer, Alain and Filip Lievens. 2009. “Measurement Equivalence of Paper-and-Pencil and Internet Organisational Surveys: A Large Scale Examination in 16 Countries.” Applied Psychology 58:336–61.

de Leeuw, Edith D. 1992. Data Quality in Mail, Telephone, and Face to Face Surveys. Amsterdam, the Netherlands: TT-Publicaties.

de Leeuw, Edith D. 2005. “To Mix or Not to Mix Data Collection Modes in Surveys.” Journal of Official Statistics 21:233–55.

de Leeuw, Edith D. 2008. “Choosing the Method of Data Collection.” Pp. 113–35 in International Handbook of Survey Methodology, edited by Edith D. de Leeuw, Joop J. Hox, and Don A. Dillman. New York: Taylor & Francis.

de Leeuw, Edith D., Joop J. Hox, and Annette Scherpenzeel. 2011. “Emulating Interviewers in an Online Survey: Experimental Manipulation of ‘Do-Not-Know’ over the Phone and on the Web.” Pp. 6305–14 in JSM Proceedings, Survey Research Methods Section, edited by American Statistical Association. Alexandria, VA: American Statistical Association.

de Leeuw, Edith D., Gideon J. Mellenbergh, and Joop J. Hox. 1996. “The Influence of Data Collection Method on Structural Models.” Sociological Methods & Research 24:443–72.

Deutskens, Elisabeth, de Ruyter, and Martin Wetzels. 2006. “An Assessment of Equivalence between Online and Mail Surveys in Service Research.” Journal of Service Research 8:346–55.

Dillman, Don A., Glenn Phelps, Robert Tortora, Karen Swift, Julie Kohrell, and Jodi Berck. 2009. “Response Rate and Measurement Differences in Mixed-Mode Surveys using Mail, Telephone, Interactive Voice Response (IVR) and the Internet.” Social Science Research 38:1–18.

Dillman, Don A., Jolene D. Smyth, and Leah Melani Christian. 2009. Internet, Mail, and Mixed-Mode Surveys: The Tailored Design Method. Hoboken, NJ: John Wiley.

French, Brian F. and Holmes Finch. 2008. “Multigroup Confirmatory Factor Analysis: Locating the Invariant Referent Sets.” Structural Equation Modeling: A Multidisciplinary Journal 15:96–113.

Fricker, Scott, Mirta Galesic, Roger Tourangeau, and Ting Yan. 2005. “An Experimental Comparison of Web and Telephone Surveys.” Public Opinion Quarterly 69:370–92.

Fuller, Wayne. 1987. Measurement Error Models. New York: John Wiley.

Gerbing, David W. and James C. Anderson. 1984. “On the Meaning of Within-Factor Correlated Measurement Errors.” Journal of Consumer Research 11:572–80.

Green, Donald Philip and Jack Citrin. 1994. “Measurement Error and the Structure of Attitudes: Are Positive and Negative Judgments Opposites?” American Journal of Political Science 38:256–81.

Greene, Jessica, Howard Speizer, and Wyndy Wiitala. 2008. “Telephone and Web: Mixed-Mode Challenge.” Health Services Research 43:230–48.

Groves, Robert M., Floyd J. Fowler Jr., Mick P. Couper, James M. Lepkowski, Eleanor Singer, and Roger Tourangeau. 2010. Survey Methodology. 2nd ed. Hoboken, NJ: John Wiley.

Guo, Shenyang and Mark W. Fraser. 2010. Propensity Score Analysis. Thousand Oaks, CA: Sage.

Heerwegh, Dirk and Geert Loosveldt. 2008. “Face-to-Face versus Web Surveying in a High-Internet-Coverage Population.” Public Opinion Quarterly 72:836–46.

Heerwegh, Dirk and Geert Loosveldt. 2011. “Assessing Mode Effects in a National Crime Victimization Survey using Structural Equation Models: Social Desirability Bias and Acquiescence.” Journal of Official Statistics 27:49–63.

Holbrook, Allyson L., Melanie C. Green, and Jon A. Krosnick. 2003. “Telephone versus Face-to-Face Interviewing of National Probability Samples with Long Questionnaires: Comparisons of Respondent Satisficing and Social Desirability Response Bias.” Public Opinion Quarterly 67:79–125.

Jäckle, Annette, Caroline Roberts, and Peter Lynn. 2010. “Assessing the Effect of Data Collection Mode on Measurement.” International Statistical Review 78:3–20.

Jöreskog, Karl Gustav. 1971. “Simultaneous Factor Analysis in Several Populations.” Psychometrika 36:409–26.

Kamata, Akihito and Daniel J. Bauer. 2008. “A Note on the Relation between Factor Analytic and Item Response Theory Models.” Structural Equation Modeling: A Multidisciplinary Journal 15:136–53.

Kankaraš, Miloš, Jeroen K. Vermunt, and Guy Moors. 2011. “Measurement Equivalence of Ordinal Items: A Comparison of Factor Analytic, Item Response Theory, and Latent Class Approaches.” Sociological Methods & Research 40:279–310.

Kaplan, David. 1999. “An Extension of the Propensity Score Adjustment Method for the Analysis of Group Differences in MIMIC Models.” Multivariate Behavioral Research 34:467–92.

Kim, Eun Sook and Myeongsun Yoon. 2011. “Testing Measurement Invariance: A Comparison of Multiple-Group Categorical CFA and IRT.” Structural Equation Modeling: A Multidisciplinary Journal 18:212–28.

Krosnick, Jon A. 1991. “Response Strategies for Coping with the Cognitive Demands of Attitude Measures in Surveys.” Applied Cognitive Psychology 5:213–36.

Link, Michael W. and Ali H. Mokdad. 2005. “Alternative Modes for Health Surveillance Surveys: An Experiment with Web, Mail, and Telephone.” Epidemiology 16:701–04.

Lord, Frederic M. and Melvin R. Novick. 1968. Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.

Meade, Adam W. and Gary J. Lautenschlager. 2004. “A Comparison of Item Response Theory and Confirmatory Factor Analytic Methodologies for Establishing Measurement Equivalence/Invariance.” Organizational Research Methods 7:361–88.

Mellenbergh, Gideon J. 1989. “Item Bias and Item Response Theory.” International Journal of Educational Research 13:127–43.

Mellenbergh, Gideon J. 1999. “Measurement Models.” Pp. 168–87 in Research Methodology in the Social, Behavioral and Life Sciences, edited by Hermann Adèr and Gideon Mellenbergh. London, UK: Sage.

Meredith, William. 1993. “Measurement Invariance, Factor Analysis and Factorial Invariance.” Psychometrika 58:525–43.

Millsap, Roger E. 2011. Statistical Approaches to Measurement Invariance. New York: Routledge.

Millsap, Roger E. and Jenn Yun-Tein. 2004. “Assessing Factorial Invariance in Ordered-Categorical Measures.” Multivariate Behavioral Research 39:479–515.

Morgan, Stephen L. and Christopher Winship. 2007. Counterfactuals and Causal Inference. Cambridge, MA: Cambridge University Press.

Morren, Meike, John P. T. M. Gelissen, and Jeroen K. Vermunt. 2011. “Dealing with Extreme Response Style in Cross-Cultural Research: A Restricted Latent Class Factor Analysis Approach.” Sociological Methodology 41:13–47.

Muthén, Bengt. 1984. “A General Structural Equation Model with Dichotomous, Ordered Categorical, and Continuous Latent Variable Indicators.” Psychometrika 49:115–32.

Muthén, Bengt O. and Linda Muthén. 2010. “IRT in Mplus.” Retrieved June 12, 2012 (http://www.statmodel.com/download/MplusIRT2.pdf).

Muthén, Bengt O. and Tihomir Asparouhov. 2002. “Latent Variable Analysis with Categorical Outcomes: Multiple-Group and Growth Modeling in Mplus.” Retrieved July 12, 2012 (http://www.statmodel.com/download/webnotes/CatMGLong.pdf).

Muthén, Bengt O., S. H. C. du Toit, and Spisic. 1997. “Robust Inference Using Weighted Least Squares and Quadratic Estimating Equations in Latent Variable Modeling with Categorical and Continuous Outcomes.” Retrieved July 12, 2012 (http://www.gseis.ucla.edu/faculty/muthen/articles/Article_075.pdf).

Oort, Frans J. 1998. “Simulation Study of Item Bias Detection with Restricted Factor Analysis.” Structural Equation Modeling: A Multidisciplinary Journal 5:107–24.

Rosenbaum, Paul R. 1987. “Model-Based Direct Adjustment.” Journal of the American Statistical Association 82:387–94.

Rosenbaum, Paul R. and Donald B. Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70:41–55.

Saris, Willem E. and Frank M. Andrews. 1991. “Evaluation of Measurement Instruments Using a Structural Equation Modeling Approach.” Pp. 487–517 in Measurement Errors in Surveys, Wiley Series in Probability and Statistics, edited by Paul P. Biemer, Robert M. Groves, Lars E. Lyberg, Nancy A. Mathiowetz, and Seymour Sudman. Hoboken, NJ: John Wiley.

Saris, Willem E. and Irmtraud Gallhofer. 2007. “Estimation of the Effects of Measurement Characteristics on the Quality of Survey Questions.” Survey Research Methods 1:29–43.

Scherpenzeel, Annette C. and Willem E. Saris. 1997. “The Validity and Reliability of Survey Questions: A Meta-Analysis of MTMM Studies.” Sociological Methods & Research 25:341–83.

Schonlau, Matthias, Kinga Zapert, Lisa Payne Simon, Katherine Haynes Sanstad, Sue M. Marcus, John Adams, Mark Spranca, Hongjun Kan, Rachel Turner, and Sandra H. Berry. 2004. “A Comparison between Responses from a Propensity-Weighted Web Survey and an Identical RDD Survey.” Social Science Computer Review 22:128–38.

Skrondal, Anders and Sophia Rabe-Hesketh. 2004. Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models. Boca Raton, FL: Chapman & Hall/CRC.

Stark, Stephen, Oleksandr S. Chernyshenko, and Fritz Drasgow. 2006. “Detecting Differential Item Functioning with Confirmatory Factor Analysis and Item Response Theory: Toward a Unified Strategy.” Journal of Applied Psychology 91:1292–306.

Tourangeau, Roger, Lance J. Rips, and Kenneth Rasinski. 2000. The Psychology of Survey Response. Cambridge, MA: Cambridge University Press.

Tourangeau, Roger, Mick P. Couper, and Frederick Conrad. 2004. “Spacing, Position, and Order: Interpretive Heuristics for Visual Features of Survey Questions.” Public Opinion Quarterly 68:368–93.

Vannieuwenhuyze, Jorre T. A. and Geert Loosveldt. 2013. “Evaluating Relative Mode Effects in Mixed-Mode Surveys: Three Methods to Disentangle Selection and Measurement Effects.” Sociological Methods & Research 42:82–104.

Vannieuwenhuyze, Jorre T. A., Geert Loosveldt, and Geert Molenberghs. 2010. “A Method for Evaluating Mode Effects in Mixed-mode Surveys.” Public Opinion Quarterly 74:1027–45.

Vandenberg, Robert J. and Charles E. Lance. 2000. “A Review and Synthesis of the Measurement Invariance Literature: Suggestions, Practices, and Recommendations for Organizational Research.” Organizational Research Methods 3(1):4–70.

Welkenhuysen-Gybels, Jerry, Jaak Billiet, and Bart Cambré. 2003. “Adjustment for Acquiescence in the Assessment of the Construct Equivalence of Likert-Type Score Items.” Journal of Cross-Cultural Psychology 34:702–22.

Yoon, Myeongsun and Roger E. Millsap. 2007. “Detecting Violations of Factorial Invariance Using Data-Based Specification Searches: A Monte Carlo Study.” Structural Equation Modeling: A Multidisciplinary Journal 14:435–63.