An Evaluation of Experimental Designs for Constructing Vignette Sets in Factorial Surveys

Abstract

Factorial surveys use a population of vignettes to elicit respondents’ attitudes or beliefs about different hypothetical scenarios. However, the vignette population is frequently too large to be assessed by each respondent. Experimental designs such as randomized block confounded factorial (RBCF) designs, D-optimal designs, or random sampling designs can be used to construct small subsets of vignettes. In a simulation study, we compare the three vignette designs with respect to their biases in effect estimates and show how the biases arise from the designs’ confounding structure, nonorthogonality, and unbalancedness. We particularly focus on the designs’ sensitivity to context effects and misspecifications of the analytic model. We argue that RBCF designs and D-optimal designs are preferable to random sampling designs because they offer a stronger protection against undesirable confounding, context effects, and model misspecifications. We also discuss strategies for dealing with context and order effects since none of the basic vignette designs can satisfactorily handle them.

Keywords

experimental design, factorial survey, vignette, context effect, confounding, confounded factorial design, D-optimal design, random sampling design

In sociological research, researchers frequently use factorial surveys (also called vignette experiments) to measure social judgments. For example, several studies tried to measure fair income gaps between employees by systematically varying factors like gender, education, or occupation in vignette descriptions (e.g., Alves and Rossi 1978; Auspurg and Hinz 2015; Jasso 2006; Jasso and Webster 1997; Steiner, Atzmüller, and Su 2016). Since the overall number of possible vignettes is frequently too large to be assessed by each respondent, researchers create smaller subsets of vignettes. However, the use of vignette sets has two immediate implications: First, some of the main, two-way, or higher order vignette effects will necessarily be confounded and thus result in potentially biased effect estimates. Second, since each vignette set creates its own specific context, vignette assessments and effect estimates might be biased due to context effects. Since confounding and context effects depend on the choice of a specific vignette design and its correct analysis, it is important that researchers are aware of the biases that may arise from different designs. In this article, we evaluate the three most popular vignette designs: the confounded factorial design, D-optimal design, and random sampling design (Atzmüller and Steiner 2010; Auspurg and Hinz 2015; Steiner and Atzmüller 2006). In order to guide researchers in choosing an appropriate design for a given research question, we relate the design features—confounding, orthogonality, and balancedness—to the resulting biases.

In an illustrative simulation study, we compare the three vignette designs with respect to potential biases induced by confounding, context effects, and misspecifications of the analytic model. Highlighting differences in the vignette designs is of practical importance because the easily implementable and frequently used random sampling designs are much more prone to undesirable biases than the other two designs. Due to the availability and accessibility of (stochastic) search algorithms, D-optimal designs are replacing random sampling designs. However, the confounding structure of D-optimal designs crucially depends on the choice of an appropriate design-generating model, the set size, and the design suggested by the algorithm, which might not necessarily represent the optimal solution. This drawback is avoided by confounded factorial designs. They allow researchers to deliberately choose a design with a clear and acceptable confounding structure. But with a large number of factors and factor levels, adequate confounded factorial designs might not exist or might be hard to construct. In this article, we will argue that simpler but stronger designs, like the confounded factorial designs, are less prone to biases and thus more appropriate for confirmatory purposes, while complex designs created via D-optimal or random sampling designs are well suited for exploratory purposes.

Our simulation study also demonstrates that none of the three designs can satisfactorily handle all types of context effects, that is, effects that occur due to the context created by a specific set of vignettes or the order of their presentation. Since context effects frequently cannot be ruled out in practice, we will investigate the designs’ capabilities to deal with such effects and then discuss design strategies for avoiding or mitigating context effects.

This article is organized as follows. We first use an applied example to introduce basic vignette concepts and discuss context and set effects. In the subsequent section, we briefly explain confounding, orthogonality, and balancedness. Then, we outline the key features of the three vignette designs discussed in this article. The section on the simulation study gives a detailed description of the implemented designs and the data-generating processes. The Results section then presents bias plots for each design and data-generating scenario. In the Conclusions section, we summarize our findings and discuss strategies for dealing with context effects.

Vignette Population, Vignette Sets, Context Effects, and Set Effects

Generating the Vignette Population

The first step in designing a vignette experiment requires researchers to choose factors and factor levels that are factorially combined to produce the entire vignette population. As an example, consider the vignette experiment implemented by Steiner et al. (2016) who used descriptions of virtual employees as vignettes to investigate respondents’ perception about the actual and fair income. Each vignette represents a virtual employee characterized by the following four factors (Table 1; in our discussion, we omit employee’s gender which was implemented as a between-subjects factor):

Table 1.

Vignette Factors and Factor Levels.

Factor	Number of Levels	Factor Levels
Industry (I)	3 levels	Construction (I1) / health and care (I2) / business-related services (I3)
Educational degree (E)	3 levels	Apprenticeship training (E1) / high school (E2) / college (E3)
Occupational experience (Y)	3 levels	5 years (Y1) / 20 years (Y2) / 35 years (Y3)
Parental leave (L)	3 levels	0 months (L1) / 3 months (L2) / 24 months (L3)

industry of employee’s occupation (I), with three levels: construction, health and care, and business-related services;

highest educational level attained (E), with three levels: apprenticeship training, high school, and college;

occupational experience in years (Y), with three levels: 5, 20, and 35 years; and

parental leave in months (L), with three levels: 0, 3, and 24 months.

Given the four factors with three factor levels each, the full factorial combination results in a vignette population of 3⁴ = 81 vignettes. From a statistical point of view, each respondent would ideally assess the incomes of all 81 virtual employees. In such a full factorial design, all main and interaction effects (up to the four-way interactions) would be estimable without any confounding. However, judging 81 vignettes is not only tiresome and may result in unreliable measurements, but it is also unnecessary because higher order effects like three- or four-way interaction effects are rarely of interest. Thus, it makes sense to forgo the estimation of higher order effects and to reduce the burden on respondents by creating small subsets of vignettes.
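The factorial combination described above can be sketched in a few lines of Python. This is only an illustration: the dictionary name FACTORS and the level labels are our own shorthand for the entries of Table 1, not part of the original study materials.

```python
from itertools import product

# Factor levels as listed in Table 1 (labels are illustrative shorthand).
FACTORS = {
    "I": ["construction", "health and care", "business-related services"],
    "E": ["apprenticeship training", "high school", "college"],
    "Y": ["5 years", "20 years", "35 years"],
    "L": ["0 months", "3 months", "24 months"],
}

# Full factorial combination: every level of each factor is crossed
# with every level of all other factors.
population = [dict(zip(FACTORS, combo)) for combo in product(*FACTORS.values())]

print(len(population))   # 81, that is, 3**4
print(population[0])
```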

Constructing Vignette Sets

The main idea of constructing vignette sets is that each respondent judges only a few vignettes, but that the entire population of vignettes is still fully exploited across the sample of respondents.¹ Steiner et al. (2016) chose a set size of nine vignettes because a respondent can comfortably compare and judge nine vignettes without getting tired or frustrated about the repetitive task. Although the construction of vignette sets usually increases the reliability of vignette measurements, it necessarily results in a confounding of main and interaction effects with each other or with the set effects.

Context Effects

Context effects need careful consideration when constructing sets and analyzing vignette data. First, it is important to acknowledge that all the factors and factor levels together (i.e., the entire vignette population) create their own context that potentially limits the generalizability of results. However, here we are not concerned with the context created by the entire vignette population but with the context created by each subset of vignettes. We can distinguish between two context types: set-specific contexts and order-specific contexts. Set-specific contexts are generated by the specific combination of vignettes in a set and may strongly affect the assessment of all vignettes in a set. Set-specific context effects usually occur when all vignettes of a set are presented simultaneously, for instance, when a respondent receives the vignettes as a shuffled deck of cards with the request to first read and sort the vignettes before rating them. Since the contexts vary across sets, the corresponding context effects typically vary from set to set as well.

Set-specific context effects can be caused by a single, maybe unusual vignette or by a combination of certain vignettes. Consider, for example, the simultaneous presentation of a set of nine vignettes where seven vignettes refer to employees with a college or higher educational degree in the construction industry and two vignettes refer to employees with only an apprenticeship training degree in health and care. Then, the income assessments of the two employees in health and care may strongly depend on the context created by the well-educated employees in the construction industry. Income assessments of the very same two employees might have been systematically lower (or higher) if they had been in a more heterogeneous, less construction-dominated set of vignettes. Similarly, a single vignette with an unusual combination of factor levels can strongly affect the assessment of all other vignettes in the set as well.

The second context type refers to order-specific contexts that are created by sequentially presenting the vignettes of a set in a certain order. That is, a respondent assesses one vignette after the other without seeing any of the subsequent vignettes. Since the presentation order sequentially builds up a specific context, each vignette assessment may depend on the preceding vignettes. As before, order effects can be triggered by the appearance of a single vignette or by the sequence of multiple vignettes.

Both types of context effects increase the variance in the outcome variable, and more importantly, they may also bias main and interaction effects. Bias is only an issue when context effects occur systematically, that is, when a set-specific or order-specific context exerts the same or a similar effect on all respondents or groups of respondents. If context effects randomly vary across respondents, bias is not an issue because the random effects average out.

Biases due to context effects are well-known in survey research and psychology. Counterbalancing or randomization strategies are frequently used to avoid or mitigate some of the systematic effects (Auspurg and Hinz 2015; Gravetter and Forzano 2015; Jasso 2006; Yellen and Cella 1995). However, with vignette designs, a completely counterbalanced design is not feasible for two reasons: First, even with a few vignettes per set, the large number of possible order permutations makes counterbalancing infeasible. Second, the vast number of possible assignments of vignettes to sets (even when holding the confounding structure constant) does not allow for a complete counterbalancing. Randomization strategies like randomly sampling and ordering vignettes are regularly used instead of counterbalancing, but they cannot satisfactorily deal with all types of set- and order-specific context effects (as we will show). However, recording and modeling the presentation order of vignettes can help in dealing with order effects and the autocorrelation of measurements (Jasso 2006).

In addition to set- and order-specific context effects, similar effects can occur with respect to the order of factors presented in a vignette or the order of rating questions associated with the vignettes. This article does not deal with such context effects, but for details on possible effects due to the ordering of factors within a vignette, see Auspurg and Jäckle’s (2015) study.

Set Effects

It is important to distinguish between the actual context effects that occur when respondents judge a set of vignettes and the set effects in the statistical analysis. The set effects in the analysis are the fixed or random effects of set indicators (i.e., indicator variables that show which respondents obtained set 1, set 2, etc.). The effects associated with set indicators are frequently referred to as set effects. They represent the between-sets differences in the average vignette assessments (Atzmüller and Steiner 2010; Cox 1958; Kirk 1995).

Whether set indicators succeed in (partially) removing context effects depends on how they were induced. Set indicators successfully take care of set-specific context effects if the context affects all vignettes of a set to the same additive extent (referred to as constant context effect). But if the context affects all or some vignettes of a set in a different way (heterogeneous context effect), then set indicators might capture none or only part of the heterogeneous context effects. Set indicators are also unable to account for order-specific context effects because order effects are heterogeneous by definition. In the following sections, we use the term “set effects” especially when we discuss issues about confounding and modeling.
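The difference between constant and heterogeneous context effects can be illustrated with a toy calculation. The sketch below uses made-up ratings for a single set of three vignettes and treats centering at the set mean as a stand-in for what a fixed set indicator absorbs. The numbers and the function name centered are hypothetical and serve only as an illustration.

```python
# Hypothetical ratings of three vignettes in one set, without any context effect.
baseline = [2000.0, 2600.0, 3200.0]

def centered(ratings):
    """Subtract the set mean, which is the part a fixed set indicator absorbs."""
    mean = sum(ratings) / len(ratings)
    return [r - mean for r in ratings]

# Constant context effect: every vignette in the set is shifted by the same amount.
constant = [r + 300.0 for r in baseline]

# Heterogeneous context effect: only the first vignette is shifted.
heterogeneous = [baseline[0] + 300.0, baseline[1], baseline[2]]

print(centered(baseline))        # [-600.0, 0.0, 600.0]
print(centered(constant))        # [-600.0, 0.0, 600.0], the shift is absorbed
print(centered(heterogeneous))   # [-400.0, -100.0, 500.0], within-set contrasts change
```

After centering, the constant shift leaves the within-set contrasts untouched, whereas the heterogeneous shift distorts them and would therefore carry over into the effect estimates.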

Although the literature on experimental designs is very clear about the necessity to consider set effects in the analysis (Cox 1958; Kirk 1995; Steiner and Atzmüller 2006; Wu and Hamada 2009), they are rarely modeled in applied vignette studies, maybe because many methodological key publications do not explicitly include set indicators in the analytic models (e.g., Auspurg and Hinz 2015; Hox, Kreft, and Hermkens 1991; Jasso 2006). Auspurg and Hinz (2015:91) mention the possibility of including set indicators but argue that “one normally does not need to expect strong deck [set] effects.” But irrespective of the presence of context-specific set effects, a design with repeated measures of unique sets requires a modeling of fixed or random set effects (just like the repeated measurement of subjects requires corresponding random or fixed effects). Ignoring set effects results in potentially biased estimates and in incorrect standard errors and significance tests (Kirk 1995; Steiner et al. 2016).

Experimental Designs for Constructing Vignette Sets

An ideal design for creating vignette sets should guarantee that the effects of interest can be estimated without bias and with maximum efficiency (Cox 1958; Kirk 1995; Montgomery 1984; Searle 1987; Wu and Hamada 2009). Unbiasedness and efficiency require a design that is unconfounded with respect to the effects of interest, orthogonal, and balanced. In this section, we first briefly explain confounding, orthogonality, and balancedness and then highlight the design features of the three main vignette designs.

Confounding, Orthogonality, and Balancedness

A design confounds the effects of two predictors if the effects cannot be estimated separately due to perfect collinearity, that is, the parameter estimate of the first predictor fully absorbs the effect of the second predictor (or vice versa). The effect estimate of the first predictor would be unbiased only if the effect of the second predictor is 0 (which is an untestable assumption). An effect is unconfounded if the design does not confound the effect under investigation with any other effects. Confounding necessarily occurs if the vignette population is partitioned into mutually exclusive sets or if one or more unique vignettes of the population do not get assessed because of dropping implausible vignettes, exploiting only a fraction of the entire vignette population, or because a random sampling design failed to sample all vignettes.

In practice, a design that does not confound the effects of major interest, typically main and two-way interaction effects, is preferable to a design that confounds these effects with other effects. Researchers should avoid a confounding of main effects (e.g., education) with other main effects (e.g., occupational experience) or with two-way interaction effects (e.g., Industry×Occupational Experience) because the confounded effects cannot be disentangled and thus cannot be meaningfully interpreted (unless some of the confounded effects can be assumed to be 0).

Two effects are orthogonal to each other when the cross products of their predictors sum to 0. If all pairs of unconfounded effects are orthogonal, the design is called orthogonal as well. Orthogonality between two effects implies that their predictors are uncorrelated. This implies that the effects of an orthogonal design can be estimated and interpreted independently of each other (Wu and Hamada 2009). Thus, effect estimates are insensitive to model (mis)specifications, that is, effect estimates do not depend on the inclusion of other main and interaction effects in the analytic model.

A design is called balanced when each factor-level combination (i.e., vignette) is measured equally often. If a design is balanced and orthogonal, effects are estimated with maximum efficiency. Although efficiency is an important aspect in designing vignette surveys, we focus in this article only on bias issues since establishing valid effect estimates is our primary concern here (see Dülmer 2016 for a comparative assessment of different designs’ efficiency).
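The cross-product definition of orthogonality can be checked numerically. The following sketch assumes effect coding with two predictors per three-level factor and factor levels coded 0, 1, 2. The helper names are ours and purely illustrative.

```python
from itertools import product

# Effect coding of a three-level factor: two predictors per factor.
EFFECT_CODE = {0: (1, 0), 1: (0, 1), 2: (-1, -1)}

# Full factorial of four three-level factors (levels coded 0, 1, 2).
population = list(product(range(3), repeat=4))

def predictors(vignette):
    """Return the eight effect-coded main-effect predictors of one vignette."""
    return [code for level in vignette for code in EFFECT_CODE[level]]

def cross_product_sum(design, a, b):
    """Sum of the cross products of predictor columns a and b over a design."""
    rows = [predictors(v) for v in design]
    return sum(row[a] * row[b] for row in rows)

# Columns 0-1 belong to the first factor, columns 2-3 to the second, and so on.
# In the balanced full factorial, predictors of different factors are orthogonal.
print(cross_product_sum(population, 0, 2))   # 0

# An arbitrary unbalanced subset (here simply the first ten vignettes)
# no longer satisfies the condition.
subset = population[:10]
print(cross_product_sum(subset, 0, 2))       # 9
```

Note that the two effect-coded predictors of the same factor are not orthogonal to each other. The check above concerns predictors of different factors only.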

Experimental Vignette Designs

An important issue in implementing a valid vignette experiment is the proper construction of vignette sets such that the resulting design (a) does not confound important main and interaction effects with each other or with set effects, (b) allows for a modeling of set effects, and (c) is insensitive to model misspecifications. Here, we discuss three main strategies for constructing vignette sets: (1) confounded factorial designs, (2) D-optimal designs, and (3) random sampling designs. These strategies differ in their potential to control the confounding of main, interaction, and set effects; their ability to model set effects; their sensitivity to model misspecifications; and their flexibility and ease of implementation.

Confounded factorial design

A confounded factorial design partitions the vignette population into several vignette sets of equal size such that only higher order interaction effects are confounded with set effects. Partitioning the vignette population guarantees that the sets are mutually exclusive and exhaust the entire vignette population. Vignette sets are then randomly assigned to respondents, so that each set is judged by the same number of respondents. Such designs are frequently called randomized block confounded factorial designs and abbreviated as RBCF-p^k, where k is the number of factors and p is the number of levels for each factor (Kirk 1995). As the notation indicates, RBCF designs preferably use factors with the same number of factor levels. For a given set size, RBCF-p^k designs can be either looked up from publications or constructed using modular arithmetic (Kirk 1995; Montgomery 1984; Wu and Hamada 2009).

The main advantage of RBCF designs is that the researcher can choose a design that avoids a confounding of main effects and important two-way interaction effects. In our example with the employee vignettes, one can implement an RBCF-3⁴ design with a set size of nine vignettes that only confounds three-way interaction effects with set effects, while main effects and all other interaction effects remain completely unconfounded. Since three-way interaction effects and set effects are not separable, estimates of three-way interaction effects are not meaningfully interpretable unless all set effects are 0 or negligibly small (but this is rarely known in practice). Since vignette sets are randomly assigned to respondents, set effects can be modeled via set indicators. The inclusion of set indicators typically results in more efficient effect estimates because the set indicators eliminate the variance between sets (Fisher 1926). However, due to the inclusion of set indicators, the confounded interaction terms are perfectly collinear with the set effects and thus need to be dropped from the analysis (which is automatically done by most statistical software packages).
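To make the modular-arithmetic construction concrete, the sketch below partitions the 81 vignettes into nine sets of nine using two defining contrasts modulo 3. The particular contrasts (I + E + Y and I + 2E + 2L) are an assumption on our part: they are one textbook-style choice that confounds only components of three-way interactions with sets, and they are not necessarily the design plan used in the study discussed here.

```python
from collections import defaultdict
from itertools import combinations, product

# Factor levels coded 0, 1, 2 in the order I, E, Y, L.
population = list(product(range(3), repeat=4))

def set_id(vignette):
    """Assign a vignette to one of nine sets via two defining contrasts mod 3.

    Assumed contrasts (illustrative choice): I + E + Y and I + 2E + 2L.
    """
    i, e, y, l = vignette
    return ((i + e + y) % 3, (i + 2 * e + 2 * l) % 3)

sets = defaultdict(list)
for v in population:
    sets[set_id(v)].append(v)

# Nine mutually exclusive sets of nine vignettes exhaust the population.
print(len(sets), {len(block) for block in sets.values()})   # 9 {9}

# Within every set, each pair of factors shows all nine level combinations
# exactly once, so main and two-way effects are not confounded with sets.
for block in sets.values():
    for a, b in combinations(range(4), 2):
        assert len({(v[a], v[b]) for v in block}) == 9
```

All nonzero combinations of the two contrasts involve exactly three factors, which is why only three-way interaction components are aliased with the set indicators in this sketch.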

Other advantages are the design’s orthogonality and balancedness that guarantee efficient parameter estimates and an insensitivity to model misspecifications (i.e., the omission of higher order terms does not affect the parameter estimates of lower order terms in the model). The major disadvantage of RBCF designs is the lack of flexibility, that is, the number of factors should be small, and the factors should have the same number of factor levels. Although unequal numbers of factor levels are possible via partial confounding, here we only discuss standard RBCF designs with a complete confounding.

D-optimal design

D-optimal designs pursue the same goal as RBCF designs, that is, a partitioning of the vignette population into equally sized vignette sets. But unlike RBCF designs, the partitioning is generated by a search algorithm that tries to maximize an optimality criterion. Software packages like the optimal experimental designs program or JMP from Statistical Analysis System (SAS; Kuhfeld 1997; Kuhfeld, Tobias, and Garratt 1994; SAS Institute Inc. 2012) or the R function optBlock() from the AlgDesign package (Wheeler 2014) can be used to search for an optimal design. The advantage of an algorithmic approach is that it can easily handle designs with many factors and different numbers of factor levels. However, searching for an optimal partitioning requires the researcher to first specify a design-generating model that captures the effects of main interest to the researcher. For instance, if we are only interested in main effects and some selected two-way interaction effects, but believe that all remaining two-way and higher order effects are 0 or negligibly small, the design-generating model needs to include the main effects and two-way interaction effects of interest. For a predefined set size, the algorithm then uses the design matrix $X$ of the design-generating model (i.e., the predictor matrix of the specified main and interaction effects) and searches for a partitioning of the vignette population such that the determinant of $X'X$ is maximized (the “D” in D-optimality reflects the determinant criterion). Maximizing the determinant of $X'X$ is equivalent to minimizing the volume of the joint confidence region of vignette effects, that is, all effects captured by the design matrix $X$ can be jointly estimated with maximum efficiency.²
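The determinant criterion itself is easy to compute for a given candidate set. The sketch below evaluates det(X'X) for two hypothetical sets of nine vignettes under an assumed main-effects generating model with effect coding. It only illustrates the criterion for a single set. It is not the blocked search performed by optBlock() or similar software, and all function names are ours.

```python
from fractions import Fraction
from itertools import product

EFFECT_CODE = {0: (1, 0), 1: (0, 1), 2: (-1, -1)}

def model_row(vignette):
    """Intercept plus effect-coded main effects (the assumed generating model)."""
    return [1] + [c for level in vignette for c in EFFECT_CODE[level]]

def determinant(matrix):
    """Exact determinant via Gaussian elimination on fractions."""
    m = [[Fraction(x) for x in row] for row in matrix]
    det = Fraction(1)
    for col in range(len(m)):
        pivot = next((r for r in range(col, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, len(m)):
            factor = m[r][col] / m[col][col]
            m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return det

def d_criterion(design):
    """det(X'X) for the design matrix X built from model_row."""
    x = [model_row(v) for v in design]
    k = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(k)] for i in range(k)]
    return determinant(xtx)

population = list(product(range(3), repeat=4))
# Candidate A: nine vignettes forming a balanced, orthogonal set.
balanced = [v for v in population
            if (v[0] + v[1] + v[2]) % 3 == 0
            and (v[0] + 2 * v[1] + 2 * v[3]) % 3 == 0]
# Candidate B: nine vignettes that share the same level of the first two factors.
lopsided = population[:9]

print(d_criterion(balanced))   # 4782969
print(d_criterion(lopsided))   # 0
```

The balanced candidate attains a large determinant, while the lopsided candidate yields a singular X'X because two factors do not vary within the set, so their effects cannot be estimated from it at all.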

Since the estimators’ efficiency is maximized when a design is balanced and orthogonal, D-optimal designs closely resemble RBCF designs with respect to balance and orthogonality. For simple designs with a few factors and factor levels, D-optimal designs may even have a similar or identical confounding structure as RBCF designs such that the effects of interest remain free of any confounding. However, for more complex designs with a larger number of factors and factor levels, some or all effects of interest might need to be confounded with other effects.

The confounding structure of D-optimal designs strongly depends on the design-generating model. A design-generating model that only includes main effects may result in a D-optimal design where higher order interactions are confounded with main effects. This is not surprising because the D-optimization criterion aims at maximizing efficiency rather than minimizing confounding. Omitting two-way or higher order interactions from the design-generating model signals to the algorithm that their effects are 0 and thus may be confounded with the main effects. However, increasing the complexity of the design-generating model (i.e., including two-way and higher order interaction terms) does not necessarily result in a more desirable confounding structure because some of the two-way or higher order interaction effects might need to be confounded, particularly when set sizes are small. Since the search algorithms do not explore all possible set assignments, it is advisable to generate multiple designs for a given set size and design-generating model and to choose the one with the least problematic confounding.

The search algorithms are flexible enough to accommodate designs where the vignette population cannot be partitioned into equally sized sets or where more vignette sets are generated than required for covering the entire vignette population. Despite the algorithms’ flexibility, one should aim for nearly orthogonal and balanced designs with an acceptable confounding structure and repeated measurement per set (to model set effects). Since D-optimal designs maximize the degree of balance and orthogonality, misspecifications of the analytical model will only have a minor effect, provided they do not trigger confounding bias (see, e.g., the Simulation Study section).

Random sampling design

The strategy of randomly sampling respondent-specific vignette sets was suggested by Rossi (1951; Jasso 2006) and has been used in many applications (Rossi and Nock 1982; Rossi, Sampson, et al. 1974; Rossi, Waite, et al. 1974; Taylor 2006). Instead of partitioning the vignette population into mutually exclusive sets, a vignette set is randomly drawn from the vignette population for each respondent. Each respondent-specific set is drawn without replacement (to avoid identical vignettes in a single set), but each set is always drawn from the entire population (i.e., after replacing the vignettes of previously drawn sets). Although variations of this random sampling strategy are possible, we only discuss the very basic design here.³ Random sampling designs have been very popular in actual research practice because they are easy to implement, particularly for huge vignette populations generated from a large number of factors with different numbers of factor levels. Random sampling designs are considered as viable alternatives to RBCF and D-optimal designs whenever large respondent samples are available such that the design’s large sample properties are approximately met: As the sample size approaches infinity, the random sampling design is orthogonal and completely unconfounded (Jasso 2006).

However, with finite respondent samples, the drawbacks of random sampling designs can be considerable because random sampling may result in an uncontrolled confounding of main and interaction effects with each other. An uncontrolled confounding almost surely occurs when the respondent sample is small in comparison to the vignette population such that the sampled vignettes fail to cover the entire vignette population. Which effects are confounded depends on the unsampled vignettes, but the confounding structure can be checked once the vignette sets have been drawn. Another disadvantage is the impossibility to account for set effects (unless multiple measurements are available for each set), which frequently leads to bias in the effect estimates. Moreover, parameter estimates are more sensitive to model misspecifications and less efficient because random sampling generally results in a nonorthogonal and unbalanced design.
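The basic random sampling design and the subsequent coverage check can be sketched as follows. The function name draw_sets, the seed, and the coding of factor levels as 0, 1, 2 are illustrative assumptions, not part of any published implementation.

```python
import random
from itertools import product

population = list(product(range(3), repeat=4))   # 81 vignettes, levels coded 0-2

def draw_sets(n_respondents, set_size=9, seed=0):
    """Draw one vignette set per respondent.

    Within a set, vignettes are sampled without replacement. Every set is
    drawn from the full population. The seed is an arbitrary choice.
    """
    rng = random.Random(seed)
    return [rng.sample(population, set_size) for _ in range(n_respondents)]

sets = draw_sets(n_respondents=54)

# Coverage check once the sets have been drawn: unsampled vignettes indicate
# that some effects are confounded in this particular realization.
sampled = {v for s in sets for v in s}
unsampled = [v for v in population if v not in sampled]
print(len(sampled), "of", len(population), "vignettes sampled")
print("unsampled:", unsampled)
```

Because almost every respondent receives a unique set, a realization like this one offers few or no repeated measurements per set, which is why set effects cannot be modeled in the basic design.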

Simulation Study

In our simulation study, we compare RBCF, D-optimal, and random sampling designs with respect to the bias in effect estimates caused by confounding, set- and order-specific context effects, and model misspecifications. The simulation (i) illustrates how the general principles in designing a vignette experiment—confounding, orthogonality, and balancedness—translate into potential biases in effect estimates and (ii) demonstrates how different types of context effects may invalidate the designs. Linking design characteristics to the resulting biases allows researchers to critically evaluate the properties of much more complex designs than discussed in this section.

For our simulation setup, we follow the vignette design of Steiner et al. (2016) with the four factors industry (I), highest educational level attained (E), occupational experience in years (Y), and parental leave in months (L). The vignette population consists of 3⁴ = 81 vignettes because all four factors have three levels (Table 1). We use a set size of nine vignettes per set for all designs. Thus, for the RBCF-3⁴ and D-optimal designs, the vignette population is partitioned into nine sets. For the random design, the number of sets corresponds to the number of simulated respondents because each respondent receives his or her own randomly sampled set. In general, we ran the simulations with a sample of 900 respondents, but for the random sampling design, we also present small sample results for 54 respondents.

Our simulation study varies three factors (Table 2): (1) the design for constructing the vignette sets, (2) the presence of set- and order-specific context effects in the data-generating model, and (3) the specification of the analytic model. Overall, we use five vignette designs: one confounded factorial design (RBCF-3⁴), two D-optimal designs, and two random sampling designs. The first of the two D-optimal designs is based on a design-generating model that includes all main and two-way interaction effects (DO-M2), while the second one also includes all three-way interaction effects (DO-M3). Thus, the DO-M2 design assumes that all three-way or higher order effects are irrelevant and negligibly small, while the DO-M3 design also aims at estimating three-way interaction effects with maximum efficiency.

Table 2.

Simulation Factors and Levels.

Simulation Factors	Levels	(Abbreviations) / Explanations
Vignette designs	RBCF-3⁴ design	(RBCF-3⁴) / confounded factorial design
	D-optimal designs	(DO-M2) / with main effects and two-way interactions
	D-optimal designs	(DO-M3) / with main effects, two-way and three-way interactions
	Random sampling designs	(RS-900) / with sample size of 900 respondents
	Random sampling designs	(RS-54) / with sample size of 54 respondents
Context effects	No context effect	(C0) / neither set- nor order-specific context effects
	Constant context effects	(C1) / all vignettes within a set are affected to the same extent
	Heterogeneous context effects	(C2) / vignettes within a set are differentially affected
	Order effects	(C3) / vignettes are affected by the preceding vignettes of a set
Analytic models	Main effects model	(A1) / only main effects are included
	Two-way interaction effects model	(A2) / main effects and selected two-way interactions
	Three-way interaction effects model	(A3) / main effects and selected two- and three-way interactions

Note. RBCF = randomized block confounded factorial.

We employ two random sampling designs, one with 900 respondents (RS-900) and one with 54 respondents (RS-54). The RS-900 design randomly samples nine vignettes for each of the 900 respondents and thus is directly comparable to the RBCF and D-optimal designs. Given the relatively small vignette population of 81 vignettes, a respondent sample of 900 (with 9 × 900 = 8,100 vignette measurements) could be large enough to sufficiently approximate the large sample properties of the random sampling design. To investigate the properties of small samples, we use the RS-54 design. With only 54 respondents, it is likely that the 54 randomly drawn sets do not fully exhaust the vignette population. The probability that at least one of the vignettes never gets sampled is approximately 81×(72/81)⁵⁴ ≈ 0.14. For both the RS-900 and RS-54 design, we hold the design matrix of the predictors of the randomly assigned vignettes (X) constant across all iterations of our simulation. This implies that we evaluate the properties of a single realized vignette assignment rather than the long-term (asymptotic) properties across repeated vignette studies where new vignette sets are randomly drawn for each study.
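The coverage probability for the RS-54 design can be worked out numerically. The expression 81×(72/81)⁵⁴ is the expected number of vignettes that are never sampled and thus a first-order (upper bound) approximation. The sketch below compares it with an inclusion-exclusion calculation, assuming that the 54 sets are drawn independently and each set is a simple random sample of nine vignettes without replacement. The variable and function names are ours.

```python
from math import comb

N, SET_SIZE, RESPONDENTS = 81, 9, 54

# Probability that a given vignette is missing from one randomly drawn set.
p_miss_one_set = (N - SET_SIZE) / N   # 72/81

# First-order approximation: expected number of never-sampled vignettes,
# which is an upper bound on the probability of incomplete coverage.
bound = N * p_miss_one_set ** RESPONDENTS

def p_all_missing(k):
    """Probability that k given vignettes are absent from every one of the sets."""
    return (comb(N - k, SET_SIZE) / comb(N, SET_SIZE)) ** RESPONDENTS

# Inclusion-exclusion over the number k of vignettes that are never sampled.
exact = sum((-1) ** (k + 1) * comb(N, k) * p_all_missing(k)
            for k in range(1, N - SET_SIZE + 1))

print(round(bound, 2))   # 0.14
print(round(exact, 3))
```

Under these assumptions the inclusion-exclusion value is slightly smaller than the first-order figure, so the approximation is adequate for the purpose of the argument.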

To demonstrate the long-term properties for the RS-54 design, that is, what happens if we repeat the very same study with new samples of respondents and new samples of vignette sets over and over again, we randomly draw vignette sets in each iteration. Consequently, the design matrix X changes across iterations. From a practical point of view, the evaluation of a single realized RS-900 and RS-54 design is more important than the evaluation of the known long-term properties because research practice requires us to draw conclusions from a single realization of the random sampling design.

In comparing the vignette designs, we are particularly interested in how the designs are affected by the presence of context effects. We simulate four scenarios: (C0) no context effects: neither set- nor order-specific context effects are present; (C1) constant context effects: set-specific context effects are induced by four (possibly atypical) vignettes that affect the measurement of all vignettes in the set to the same additive extent; (C2) heterogeneous context effects: set-specific context effects are induced by four vignettes that only affect the assessment of a specific group of vignettes in the set; (C3) order effects: order-specific context effects are induced by four vignettes that affect the assessment of subsequent vignettes.

Finally, in analyzing the simulated vignette data, we use three analytic models, the correctly specified model and two misspecified models. The correctly specified model (A3) is identical to the data-generating model that includes main, two-way, and three-way interaction effects. The two misspecified models refer to a simple main effects model (A1) and a model that includes two-way interaction effects in addition to the main effects (A2). In the following subsections, we describe the simulation design in more detail.

Generating the Vignette Designs

RBCF-3⁴ design

We generated the RBCF-3⁴ design with nine sets according to the design plan published in Montgomery (1984:316). Except for the deliberately confounded three-way interaction effects, the effect-coded RBCF-3⁴ design is orthogonal and balanced. The design completely confounds the effects of the 4 three-way interactions I×E×Y, I×E×L, I×Y×L, and E×Y×L with the set effects. Table 3 shows the confounding structure of the effect-coded parameter estimates (we used the alias() function in R to compute the confounding structure). The 8 three-way interaction parameters listed in the columns are confounded with the parameters of the set effect but also with the remaining parameters of the confounded three-way interactions (exemplary interpretations of the confounding are given in the footnote to the table). Importantly, the parameters of all main and two-way interaction effects and the four-way interaction effects remain completely unconfounded. Across factors, the effect-coded predictors are also uncorrelated (orthogonal) such that the parameter estimates do not depend on model specification.
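To illustrate how a 3⁴ factorial can be partitioned into nine sets such that only three-way interactions are confounded with the sets, the following Python sketch uses two defining contrasts modulo 3. The specific contrasts (I+E+Y and I+2E+L) are our assumption for illustration; they follow the standard textbook construction and are not necessarily the plan taken from Montgomery (1984). The check confirms that every pair of factor levels occurs exactly once per set, which is what keeps main and two-way interaction effects orthogonal to the set effects.

```python
from itertools import product, combinations
from collections import Counter

LEVELS = range(3)  # three levels per factor, coded 0, 1, 2

def block_of(i, e, y, l):
    """Assign a vignette to one of nine sets via two defining contrasts.

    Assumed contrasts (mod 3): I+E+Y and I+2E+L. Their generalized
    interactions are also three-factor components, so only three-way
    interactions end up confounded with the sets.
    """
    return ((i + e + y) % 3, (i + 2 * e + l) % 3)

# Partition the full 3^4 factorial (81 vignettes) into sets.
sets = {}
for vignette in product(LEVELS, repeat=4):
    sets.setdefault(block_of(*vignette), []).append(vignette)

# Check: within every set, each pair of levels of any two factors
# occurs exactly once, so main and two-way effects are orthogonal to sets.
pairwise_balanced = all(
    set(Counter((v[a], v[b]) for v in members).values()) == {1}
    and len(members) == 9
    for members in sets.values()
    for a, b in combinations(range(4), 2)
)

print(len(sets), pairwise_balanced)  # prints 9 True
```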

Table 3.

Confounding Structure of the RBCF-3⁴ Design.

Effects	I1×E2×Y2	I2×E2×Y2	I1×E2×L2	I2×E2×L2	I1×Y2×L2	I2×Y2×L2	E1×Y2×L2	E2×Y2×L2
Set 1	1	0	1	0	1	0	0	1
Set 2	0	1	0	1	0	1	0	1
Set 3	1	0	−1	−1	0	1	1	0
Set 4	0	1	1	0	−1	−1	1	0
Set 5	−1	−1	0	1	1	0	1	0
Set 6	0	1	−1	−1	1	0	−1	−1
Set 7	1	0	0	1	−1	−1	−1	−1
Set 8	−1	−1	1	0	0	1	−1	−1
I1×E1×Y1	−1	0	0	0	0	0	0	0
I2×E1×Y1	0	−1	0	0	0	0	0	0
I1×E2×Y1	0	−1	0	0	0	0	0	0
I2×E2×Y1	1	1	0	0	0	0	0	0
I1×E1×Y2	1	1	0	0	0	0	0	0
I2×E1×Y2	−1	0	0	0	0	0	0	0
I1×E1×L1	0	0	−1	0	0	0	0	0
I2×E1×L1	0	0	0	−1	0	0	0	0
I1×E2×L1	0	0	1	1	0	0	0	0
I2×E2×L1	0	0	−1	0	0	0	0	0
I1×E1×L2	0	0	0	−1	0	0	0	0
I2×E1×L2	0	0	1	1	0	0	0	0
I1×Y1×L1	0	0	0	0	−1	0	0	0
I2×Y1×L1	0	0	0	0	0	−1	0	0
I1×Y2×L1	0	0	0	0	0	−1	0	0
I2×Y2×L1	0	0	0	0	1	1	0	0
I1×Y1×L2	0	0	0	0	1	1	0	0
I2×Y1×L2	0	0	0	0	−1	0	0	0
E1×Y1×L1	0	0	0	0	0	0	0	−1
E2×Y1×L1	0	0	0	0	0	0	1	1
E1×Y2×L1	0	0	0	0	0	0	1	1
E2×Y2×L1	0	0	0	0	0	0	−1	0
E1×Y1×L2	0	0	0	0	0	0	1	1
E2×Y1×L2	0	0	0	0	0	0	−1	0

Note. The effects in the rows are linearly dependent on the columns. For example, the effect of set 1 is confounded with 1 times the effect of I1×E2×Y2 plus 1 times the effect of I1×E2×L2 plus 1 times the effect of I1×Y2×L2 plus 1 times the effect of E2×Y2×L2. Industry: I1 construction, I2 health and care; educational degree: E1 apprenticeship training, E2 high school; occupational experience: Y1 5 years, Y2 20 years; parental leave: L1 0 months, L2 3 months. RBCF = randomized block confounded factorial.

D-optimal designs

We generated the two D-optimal designs (DO-M2 and DO-M3) with nine sets and nine vignettes per set using the R function optBlock() from the AlgDesign package (Wheeler 2014). Since we used effect coding for the main and interaction effects, the algorithm maximizes the efficiency of the effect-coded estimators in the design matrix X. For each design, we ran the optBlock() function several times and then deliberately chose a partitioning with a confounding structure that is typically produced by the algorithm. In deliberately choosing a partitioning, we avoided designs with an incidentally bad confounding structure. Both D-optimal designs are balanced and orthogonal.⁴ Due to the designs’ orthogonality, all across-factor correlations of the effect-coded predictors are 0.

Table 4 shows the confounding structure of the DO-M2 design, which is very similar to that of the RBCF-3⁴ design. Three-way interactions (I×E×Y, I×E×L, I×Y×L, and E×Y×L) are confounded with the set effects but also with each other (e.g., the effects of I×E×Y are also confounded with the effects of I×E×L). The confounding structure is more complex and stronger than for the RBCF-3⁴ design (the confounding multipliers range from −2 to 2). All other effects remain unconfounded. Thus, the DO-M2 and RBCF-3⁴ designs should perform equally well with respect to the estimation of main and two-way interaction effects.

Table 4.

Confounding Structure of the D-optimal Design Generated with Main and All Two-way Interaction Effects.

Effects	I1×E2×L2	I2×E2×L2	I1×Y2×L2	I2×Y2×L2	E1×Y1×L2	E2×Y1×L2	E1×Y2×L2	E2×Y2×L2
Set 1	0	0	1	0	−1	1	−2	−1
Set 2	−2	−1	0	1	2	0	1	0
Set 3	2	1	−1	−1	−1	−1	−2	−2
Set 4	1	2	0	1	−1	0	−2	0
Set 5	−1	−1	0	1	1	0	1	0
Set 6	−1	−2	−1	−1	−1	−1	1	1
Set 7	1	−1	0	1	−1	0	1	0
Set 8	0	0	1	0	2	1	1	2
I1×E1×Y1	1	0	0	0	0	−1	−1	0
I2×E1×Y1	−1	−1	0	0	1	1	1	1
I1×E2×Y1	0	1	0	0	1	1	−1	0
I2×E2×Y1	1	0	0	0	−2	−1	−1	−1
I1×E1×Y2	−1	−1	0	0	1	1	1	1
I2×E1×Y2	0	1	0	0	0	0	−1	−1
I1×E2×Y2	1	0	0	0	−2	−1	−1	−1
I2×E2×Y2	−1	−1	0	0	1	0	2	1
I1×E1×L1	0	−1	0	0	−1	−1	0	−1
I2×E1×L1	1	1	0	0	0	1	−1	0
I1×E2×L1	1	1	0	0	0	0	0	0
I2×E2×L1	−1	0	0	0	0	0	0	0
I1×E1×L2	1	1	0	0	0	1	−1	0
I2×E1×L2	−1	0	0	0	1	0	1	1
I1×Y1×L1	0	0	1	1	0	0	0	0
I2×Y1×L1	0	0	−1	0	0	0	0	0
I1×Y2×L1	0	0	0	−1	0	0	0	0
I2×Y2×L1	0	0	1	1	0	0	0	0
I1×Y1×L2	0	0	0	−1	0	0	0	0
I2×Y1×L2	0	0	1	1	0	0	0	0
E2×Y2×L1	0	0	0	0	0	0	−1	0
E1×Y1×L1	0	0	0	0	0	0	0	−1
E2×Y1×L1	0	0	0	0	1	0	1	0
E1×Y2×L1	0	0	0	0	1	0	1	0
E2×Y2×L1	0	0	0	0	0	1	0	1

Note. The effects in the rows are linearly dependent on the columns. Industry: I1 construction, I2 health and care; educational degree: E1 apprenticeship training, E2 high school; occupational experience: Y1 5 years, Y2 20 years; parental leave: L1 0 months, L2 3 months.

Table 5 reveals that the confounding structure of the DO-M3 design significantly differs from the DO-M2 confounding. The 8 four-way interactions are completely confounded with the set effects, main effects, and all other interaction effects (note that the table only shows the confounding of a subset of the interaction effects). The confounding structure of the DO-M3 design is not surprising because by including all main, two- and three-way interactions in the design-generating model, the algorithm has no choice other than to confound the excluded four-way interactions with all the other effects. Since all effects are confounded with the four-way interaction parameters, model specification becomes an issue if one does not want to estimate the fully saturated model (with all possible two- to four-way interactions). For instance, if the analytic model includes only the main effects but no interaction effects, then all two-way and higher order interaction effects will be indirectly confounded with the main effects, resulting in biased main effect estimates. This is so because the main effects are confounded with the four-way interaction effects, which are themselves confounded with the omitted two-way and higher-order interaction effects.

Table 5.

Confounding Structure of the D-optimal Design Generated with Main, All Two- and Three-way Interaction Effects and of the Random Sampling Design with Sample Size 54.

	DO-M3								RS-54
	I1×E1×Y1×L2	I2×E1×Y1×L2	I1×E2×Y1×L2	I2×E2×Y1×L2	I1×E1×Y2×L2	I2×E1×Y2×L2	I1×E2×Y2×L2	I2×E2×Y2×L2	I2×E2×Y2×L2
Set 1	2.2	6.3	3.2	7.2	2.6	−1.4	10.5	6.3	—
Set 2	−1.2	9.5	3.4	8.7	2.2	0.1	13.5	9.0	—
Set 3	−1.4	−15.9	−4.3	−13.1	−5.3	−0.9	−16.7	−11.1	—
Set 4	−2.5	−2.7	−3.0	−3.1	−1.5	−0.3	−5.4	−3.2	—
Set 5	0.9	2.0	0.3	−0.4	−0.2	0.0	0.3	−1.9	—
Set 6	−0.1	−4.3	−1.1	−3.8	0.0	0.4	−7.1	−4.6	—
Set 7	−0.2	−4.3	−0.8	−1.7	−1.0	1.0	−5.5	−2.0	—
Set 8	1.1	3.0	1.5	2.7	0.0	0.1	2.9	3.6	—
I1	0.3	2.5	0.9	2.4	0.7	0.2	3.9	2.6	−0.1
I2	−0.5	−0.7	−0.9	−1.8	−0.2	0.0	−1.7	−1.9	0.3
E1	0.0	−0.9	−0.3	−0.8	−0.3	0.0	−1.1	−0.7	−0.1
E2	−0.2	−2.2	−0.8	−1.9	−0.7	0.0	−2.7	−1.6	0.3
Y1	0.1	0.2	0.1	0.3	0.2	−0.1	0.6	0.2	0.3
Y2	−0.5	−1.6	−0.6	−1.4	−0.5	0.0	−1.8	−1.1	−0.1
L1	0.1	−0.3	−0.1	−0.3	−0.1	0.0	−0.5	−0.4	−0.1
L2	0.0	−1.0	−0.2	−0.7	−0.2	−0.1	−0.8	−0.5	0.3
I1×E1	−0.3	−1.1	−0.2	−0.6	−0.3	−0.1	−0.8	−0.3	0.1
I2×E1	−0.2	−0.7	−0.1	−0.4	−0.3	−0.1	−0.7	−0.4	−0.3
I1×E2	0.1	1.8	0.6	1.4	0.4	0.2	1.9	1.3	−0.3
I2×E2	0.3	0.2	0.1	0.1	0.2	0.0	0.3	0.0	0.5
⋮									⋮
I1×E2×Y2×L1	−0.5	2.1	0.4	1.2	0.6	0.1	1.5	0.6	−0.3
I2×E2×Y2×L1	−0.9	−2.8	−0.5	−1.5	−1.2	0.3	−2.6	−0.6	0.5

Note. The confounding of two-, three-, and four-way interactions is shown for a few selected effects only. The effects in the rows are linearly dependent on the columns. The confounding coefficients are rounded to one decimal place. Industry: I1 construction, I2 health and care; educational degree: E1 apprenticeship training, E2 high school; occupational experience: Y1 5 years, Y2 20 years; parental leave: L1 0 months, L2 3 months. DO-M3 = D-optimal design generated with main effects and two-way and three-way interactions; RS-54 = random sampling design with a sample size of 54 respondents.

Random sampling designs

For the two random sampling designs, we randomly sampled for each respondent nine vignettes without replacement. For the RS-900 design with 900 respondents, we got 900 different vignette sets. The resulting RS-900 design is neither orthogonal nor balanced. The nonorthogonality is reflected by the dependence structure of the effect-coded predictors. The cross-factor correlations of the effect-coded predictors of all main and interaction effects range from −.05 to .04 with 60 percent of the correlations falling between −.01 and .01. Despite the nonorthogonality, the main and interaction effects are not confounded with each other because all vignettes in the vignette population are covered by the sampled vignette sets. However, the frequency of the sampled vignettes across sets varies between 78 and 120 (a balanced design would have exactly 100 measurements per vignette). Although the main and interaction effects are not confounded among themselves, any bias due to set-specific context effects cannot be removed because, with a single measurement per set, set indicators cannot be included in the analytic model. In addition, its nonorthogonality will produce bias when the analytic model is misspecified.

For the RS-54 design, we only sampled vignette sets for 54 respondents. The resulting design consists of 54 unique sets and is nonorthogonal, unbalanced, and confounded. The confounding results from the fact that the deliberately chosen design failed to cover the entire vignette population. Due to a single omitted vignette, the four-way interaction I2×E2×Y2×L2 is confounded with all other main and interaction effects (with confounding multipliers between −0.3 and 0.5; see Table 5). Note that such confounding structures also result when researchers drop one or more implausible vignettes from RBCF or D-optimal designs. The RS-54 design is also unbalanced because the vignette frequencies range from 0 to 11 (in a balanced design, we would have six measurements per vignette). The cross-factor correlations of the effect-coded predictors are between −.19 and .19, with only 20 percent of the correlations falling between −.01 and .01. In comparison to RS-900, the smaller respondent sample of the RS-54 design resulted in confounded effects and in a less balanced and less orthogonal design.
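The balance and orthogonality diagnostics reported for the two random sampling designs can be sketched as follows. This is an illustrative Python sketch under our own assumptions (function names, seed, and the choice of a single pair of predictors); it will not reproduce the exact frequencies and correlations above because those depend on the particular realized design.

```python
import random
from collections import Counter
from itertools import product

VIGNETTES = list(product(range(3), repeat=4))  # 81 vignettes, levels 0-2

def draw_design(n_respondents, set_size=9, seed=1):
    """Draw one vignette set per respondent, without replacement within a set."""
    rng = random.Random(seed)
    return [rng.sample(VIGNETTES, set_size) for _ in range(n_respondents)]

def vignette_frequencies(design):
    """Count how often each of the 81 vignettes was sampled (0 if never)."""
    counts = Counter(v for vignette_set in design for v in vignette_set)
    return [counts.get(v, 0) for v in VIGNETTES]

def first_effect_code(level):
    """First effect-coded predictor of a three-level factor: 1, 0, -1."""
    return (1, 0, -1)[level]

def correlation(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

for n in (900, 54):
    design = draw_design(n)
    freqs = vignette_frequencies(design)
    rows = [v for vignette_set in design for v in vignette_set]
    # Cross-factor correlation between the first industry and first
    # education predictor; exactly 0 in an orthogonal, balanced design.
    r = correlation([first_effect_code(v[0]) for v in rows],
                    [first_effect_code(v[1]) for v in rows])
    print(n, min(freqs), max(freqs), round(r, 3))
```

A minimum frequency of 0 in the smaller design would indicate that at least one vignette was never sampled, which is the situation that produces the confounding described for RS-54.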

The investigation of the long-term properties of the random sampling design builds on the RS-54 design, but in each iteration, new vignette sets are sampled such that the design matrix X varies across iterations. With 10,000 iterations and 54 respondents per iteration, we obtain 10,000 × 54 = 540,000 randomly drawn vignette sets, which is still only a small fraction of all possible sets, but it is good enough to approximate the properties in the long-term limit. Thus, across all iterations of the simulation, we obtain nearly perfect balance and orthogonality. Averaged across the 10,000 iterations, the vignette frequencies range from 5.94 to 6.05 measurements per vignette, and all cross-factor correlations are between −.0015 and .0017. However, systematic context effects will still cause bias because they do not average out by randomly sampling vignettes (and with only one measurement per set, it is impossible to model set indicators when analyzing the data).

Data-generating Model

We generated the outcome data for each design separately. In order to account for the two-level structure of vignette data—vignettes are nested within respondents—we simulated our data according to a random intercept model with main and selected two- and three-way interaction effects:

$$\log(Y_{ij}) = β_{0j} + I'_{ij} β_{1} + E'_{ij} β_{2} + Y'_{ij} β_{3} + L'_{ij} β_{4} + (I \times E)'_{ij} β_{5} + (I \times Y)'_{ij} β_{6} + (I \times L)'_{ij} β_{7} + (E \times Y)'_{ij} β_{8} + (I \times E \times Y)'_{ij} β_{9} + r_{ij},$$

$$β_{0j} = γ_{00} + γ_{01} \mathit{V1}_{j} + γ_{02} \mathit{V2}_{j} + γ_{03} \mathit{V3}_{j} + γ_{04} \mathit{V4}_{j} + γ_{05} \mathit{rsex}_{j} + u_{0j},$$

where $\log(Y_{ij})$ in the level-1 equation is the logarithm of the income that respondent j assigns to vignette i. The vectors $I_{ij}$, $E_{ij}$, $Y_{ij}$, and $L_{ij}$ represent the effect-coded predictors of the vignette factors industry (I), education (E), occupational experience (Y), and parental leave (L), respectively. $β_{0j}$ is the random intercept for respondent j, and $β_{1}$ to $β_{9}$ are the coefficient vectors of the predictors and their interactions. The independent and identically distributed error terms $r_{ij}$ were drawn from a normal distribution with zero mean and variance $σ_{r}^{2}$, $NID(0, σ_{r}^{2})$.

The level-2 equation generates the random intercepts that linearly depend on set-specific context effects induced by four selected vignettes (V1_j–V4_j), a binomially distributed respondent sex (rsex_j), and an independent and normally distributed error term, $u_{0j} \sim NID(0, τ_{0}^{2})$. V1_j–V4_j are the context effect indicators of four vignettes. In order to mimic the potential complexity of context effects in reality, we deliberately chose four vignettes for generating set- and order-specific context effects (Table 6). For the aim of our simulation study, the particular choice of vignettes is not of importance. We would have obtained similar results with more or fewer context-affecting vignettes or a different selection of vignettes. The value of an indicator variable is equal to 1 if the set for respondent j contains the specific vignette; otherwise it is 0. For the data-generating scenario where no set effects are present (scenario C0), we set the corresponding vignette parameters to 0, γ₀₁ = γ₀₂ = γ₀₃ = γ₀₄ = 0. For the constant context effects scenario (C1), the effects for the first two vignettes are positive (γ₀₁ = .05, γ₀₂ = .1), that is, if a set contains one or both of the two vignettes, V1 or V2, the respondent rates all vignettes in the set higher as compared to a situation where neither of the two vignettes is contained in the set. The context effects caused by vignettes V3 and V4 are negative (γ₀₃ = −.15, γ₀₄ = −.1); thus their presence results in lower ratings of all vignettes in the set. Since the four effects are additive, the context effects vary between −.25 and .15 across sets (but within sets they are constant).
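The stated range of the additive context effects under scenario C1 can be verified by enumerating all patterns of presence and absence of the four vignettes in a set. The following Python sketch does this; the names are ours and the parameter values are those given above.

```python
from itertools import product

# Context-effect parameters of scenario C1 for vignettes V1-V4.
GAMMAS = (0.05, 0.10, -0.15, -0.10)

# Every possible pattern of presence (1) or absence (0) of V1-V4 in a set.
shifts = [
    sum(g * present for g, present in zip(GAMMAS, pattern))
    for pattern in product((0, 1), repeat=4)
]

print(round(min(shifts), 2), round(max(shifts), 2))  # prints -0.25 0.15
```

The minimum arises when a set contains V3 and V4 but neither V1 nor V2, and the maximum when it contains V1 and V2 but neither V3 nor V4.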

Table 6.

Four Vignettes that Generate the Context Effects.

Vignette No.	Industry	Educational Level	Occupational Experience (Years)
V1	Construction	Apprenticeship	35
V2	Business	Apprenticeship	20
V3	Construction	College	35
V4	Business	College	20

For the scenario of heterogeneous context effects (C2), where only some vignette measurements are affected by the presence of any of the four atypical vignettes, the context effect is generated as a cross-level interaction between the four vignette indicators V1–V4 and an indicator variable I that is equal to 1 if a vignette belongs to the health and care industry (otherwise 0): $(β_{10} \mathit{V1}_{ij} + β_{11} \mathit{V2}_{ij} + β_{12} \mathit{V3}_{ij} + β_{13} \mathit{V4}_{ij}) \times I_{ij}$. In order to generate the order-specific context effects (C3), we first randomized the order of the vignettes for each respondent (mimicking a randomization strategy to mitigate order effects). Then, the order effects in the data-generating model are given by $β_{10} \mathit{V1}_{ij} + β_{11} \mathit{V2}_{ij} + β_{12} \mathit{V3}_{ij} + β_{13} \mathit{V4}_{ij}$, but now the vignette indicators V1–V4 indicate whether vignette i has been presented after vignettes V1–V4, respectively. The sizes of the context effects ($β_{10}, β_{11}, β_{12}, β_{13}$) are identical to the gammas described above. The only difference is that the context effects (C2 and C3) are now generated in the level-1 equation.
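The construction of the order-specific indicators for scenario C3 can be sketched in Python as follows. The function name and the vignette labels are hypothetical, and we assume that "presented after" refers to any later position in the sequence rather than only the immediately following vignette; the text does not settle this detail.

```python
import random

def order_indicators(presented_order, context_vignette):
    """Return 1 for vignettes shown after `context_vignette`, else 0.

    Assumption: "after" means any later position in the sequence,
    not only the immediately following vignette.
    """
    if context_vignette not in presented_order:
        return {v: 0 for v in presented_order}
    position = presented_order.index(context_vignette)
    return {v: int(k > position) for k, v in enumerate(presented_order)}

# Hypothetical set of nine vignette labels; "V1" is a context-inducing vignette.
vignette_set = ["V1", "a", "b", "c", "d", "e", "f", "g", "h"]
rng = random.Random(7)
order = vignette_set[:]
rng.shuffle(order)            # randomize the presentation order per respondent

indicators = order_indicators(order, "V1")
print(order)
print(indicators)
```

Because the indicator depends on the realized order, randomizing the order changes which vignettes are affected from respondent to respondent but does not remove the effect itself.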

All effect and variance parameters for generating the data were held constant across all five designs and are given in Table 7. With the exception of the context-effect parameters, we chose parameter values based on the study by Steiner et al. (2016). Since they found significant three-way interactions, we also included the corresponding three-way interaction effects in our data-generating model. Although three-way or higher order interaction effects are frequently not significant in practice and do not play an important role in many subject matter theories, they might nonetheless be present and occasionally be large. For instance, vignettes with extreme or unusual factor-level combinations can produce strong higher order interaction effects. The context effects are of similar magnitude as the largest main effects of the vignette factors. We deliberately chose large set effects in order to highlight their implications for the different designs. Although the parameter settings do not vary across designs, the generated data sets differ systematically across vignette designs because each design has a different allocation of vignettes to sets, which is captured by the respondent-specific vectors $I_{i j}$ , $E_{i j}$ , $Y_{i j}$ , and $L_{i j}$ (for i = 1,…, 9) and their interactions.

Table 7.

Parameter Values of the Data-generating Model.

Variable (Predictors)	Parameter Value
Level-1 variables
Industry (I1, I2)	$β_{1}$ = (0.045, 0.027)
Education (E1, E2)	$β_{2}$ = (−0.116, −0.049)
Occupational experience (Y1, Y2)	$β_{3}$ = (−0.138, 0.030)
Parental leave (L1, L2)	$β_{4}$ = (−0.002, 0.002)
Industry×Education (I1×E1, I2×E1, I1×E2, I2×E2)	$β_{5}$ = (−0.037, 0.032, 0.009, −0.019)
Industry×Occupational experience (I1×Y1, I2×Y1, I1×Y2, I2×Y2)	$β_{6}$ = (−0.030, 0.035, 0.019, −0.004)
Industry×Parental leave (I1×L1, I2×L1, I1×L2, I2×L2)	$β_{7}$ = (−0.006, −0.002, 0.032, 0.000)
Occupational experience×Education (Y1×E1, Y2×E1, Y1×E2, Y2×E2)	$β_{8}$ = (0.000, −0.001, −0.026, 0.007)
Industry×Occupational experience×Education (I1×Y1×E1, I2×Y1×E1, I1×Y2×E1, I2×Y2×E1, I1×Y1×E2, I2×Y1×E2, I1×Y2×E2, I2×Y2×E2)	$β_{9}$ = (0.002, 0.005, −0.004, 0.014, −0.008, 0.017, −0.014, 0.000, 0.053)
Level-1 error variance	$σ_{r}^{2}$ = 0.053
Level-2 variables
Intercept	$γ_{00}$ = 7.702
Context effect indicator V1	$γ_{01}$ = (0, 0.05)
Context effect indicator V2	$γ_{02}$ = (0, 0.1)
Context effect indicator V3	$γ_{03}$ = (0, −0.15)
Context effect indicator V4	$γ_{04}$ = (0, −0.1)
Respondent gender	$γ_{05}$ = −0.024
Level-2 error variance	$τ_{0}^{2}$ = 0.04

Note. Industry: I1 construction, I2 health and care; educational degree: E1 apprenticeship training, E2 high school; occupational experience: Y1 5 years, Y2 20 years; parental leave: L1 0 months, L2 3 months.
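To make the two-level data-generating process more concrete, the following Python sketch simulates log incomes for a single respondent. It is a deliberately simplified illustration: only the main effects, the intercept, the respondent sex effect, and the two variance components from Table 7 are used, all interaction terms are omitted, and the function and argument names are ours. Factor levels are coded 0, 1, 2, with the third level receiving the negative sum of the two listed coefficients, as implied by effect coding.

```python
import random

# Main-effect parameters from Table 7 (effect coding, three levels each).
BETA = {"I": (0.045, 0.027), "E": (-0.116, -0.049),
        "Y": (-0.138, 0.030), "L": (-0.002, 0.002)}
GAMMA_00, GAMMA_RSEX = 7.702, -0.024
SIGMA2_R, TAU2_0 = 0.053, 0.04

def main_effect(factor, level):
    """Effect-coded contribution; level 2 gets minus the sum of the others."""
    b1, b2 = BETA[factor]
    return (b1, b2, -(b1 + b2))[level]

def simulate_respondent(vignettes, rsex, context_shift=0.0, rng=random,
                        sigma2_r=SIGMA2_R, tau2_0=TAU2_0):
    """Simulate log incomes for one respondent's vignette set.

    Simplification (not the full model): interaction terms are omitted.
    `context_shift` stands for the summed set-specific context effects.
    """
    # Level 2: random intercept with context shift, sex effect, and u_0j.
    beta_0j = (GAMMA_00 + context_shift + GAMMA_RSEX * rsex
               + rng.gauss(0.0, tau2_0 ** 0.5))
    # Level 1: add main effects and the vignette-level error r_ij.
    return [
        beta_0j
        + sum(main_effect(f, lev) for f, lev in zip("IEYL", v))
        + rng.gauss(0.0, sigma2_r ** 0.5)
        for v in vignettes
    ]

rng = random.Random(42)
example_set = [(0, 0, 0, 0), (1, 2, 0, 1), (2, 2, 2, 2)]
print([round(y, 3) for y in simulate_respondent(example_set, rsex=1, rng=rng)])
```

Setting both variances to 0 returns the model-implied expected log income of each vignette, which is a convenient check on the coding of the factor levels.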

Analytic Models

In order to assess the vignette designs’ robustness against model misspecifications, we estimate for each data set three different random intercept models: a main effects model (A1), a two-way interaction effects model (A2), and a three-way interaction effects model (A3), the last of which corresponds to our data-generating model. The model equations are given by:

$$\log(Y_{ij}) = β_{0j} + I'_{ij} β_{1} + E'_{ij} β_{2} + Y'_{ij} β_{3} + L'_{ij} β_{4} + r_{ij},$$

$$β_{0j} = γ_{00} + \mathit{set}'_{j} γ_{01} + γ_{02} \mathit{rsex}_{j} + u_{0j},$$

(A1)

$$\log(Y_{ij}) = β_{0j} + I'_{ij} β_{1} + E'_{ij} β_{2} + Y'_{ij} β_{3} + L'_{ij} β_{4} + (I \times E)'_{ij} β_{5} + (I \times Y)'_{ij} β_{6} + (I \times L)'_{ij} β_{7} + (E \times Y)'_{ij} β_{8} + r_{ij},$$

$$β_{0j} = γ_{00} + \mathit{set}'_{j} γ_{01} + γ_{02} \mathit{rsex}_{j} + u_{0j},$$

(A2)

$$\log(Y_{ij}) = β_{0j} + I'_{ij} β_{1} + E'_{ij} β_{2} + Y'_{ij} β_{3} + L'_{ij} β_{4} + (I \times E)'_{ij} β_{5} + (I \times Y)'_{ij} β_{6} + (I \times L)'_{ij} β_{7} + (E \times Y)'_{ij} β_{8} + (I \times E \times Y)'_{ij} β_{9} + r_{ij},$$

$$β_{0j} = γ_{00} + \mathit{set}'_{j} γ_{01} + γ_{02} \mathit{rsex}_{j} + u_{0j}.$$

(A3)

The vector $\mathit{set}_{j}$ consists of eight effect-coded set indicators, and it models potential set effects in the RBCF and D-optimal designs.⁵ The analyses for the random sampling designs cannot include any set indicators because each set is only measured once. As for the data-generating model, all predictors are effect coded.⁶

Evaluation of Vignette Designs

For each of the 10,000 iterations in our simulation, we estimated the three analytic models for all five designs (RBCF, DO-M2, DO-M3, RS-900, and RS-54) and the RS-54 long-term evaluation. For each design, we computed the bias in the effect-coded parameter estimates as the average difference between the estimated and the true data-generating parameter values. In order to reflect the simulation uncertainties in the biases (due to the finite number of iterations), we also computed 95 percent simulation confidence intervals for the biases (by using the parameters’ standard errors divided by the square root of the number of iterations). If a simulation interval does not contain 0, it is most likely due to systematic bias rather than sampling uncertainty in our finite simulation.
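The bias computation and its simulation interval can be sketched in Python as follows. The function name and the example estimates are hypothetical, and we assume that the simulation standard error is obtained from the standard deviation of the estimates across iterations divided by the square root of the number of iterations; the exact computation used in the study may differ.

```python
from statistics import mean, stdev

def simulation_bias(estimates, true_value, z=1.96):
    """Bias of an estimator across iterations with a 95% simulation interval.

    Assumption: the simulation standard error is the standard deviation
    of the estimates across iterations divided by sqrt(number of iterations).
    """
    n_iter = len(estimates)
    bias = mean(estimates) - true_value
    sim_se = stdev(estimates) / n_iter ** 0.5
    return bias, (bias - z * sim_se, bias + z * sim_se)

# Hypothetical estimates of one parameter from five iterations.
estimates = [0.047, 0.044, 0.046, 0.043, 0.045]
bias, (lower, upper) = simulation_bias(estimates, true_value=0.045)
covers_zero = lower <= 0.0 <= upper
print(round(bias, 4), covers_zero)
```

With 10,000 iterations, the interval becomes narrow enough that even small systematic deviations from 0 are flagged, which is why a few flagged estimates can still be practically negligible.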

Simulation Results

Figures 1–6 show the 95 percent simulation intervals of the biases for the estimated main and two-way interaction effects and also for the level-2 covariate of respondent’s sex (rsex). Since we are interested in designs that are capable of estimating main and two-way interaction effects, we do not present results for the three- or four-way interaction effects (many of them are confounded anyway). Each figure contains six panels representing the results for the RBCF, DO-M2, DO-M3, RS-900, and RS-54 designs and the RS-54 long-term evaluation. The small triangles next to the x-axis indicate a significant positive (△) or negative bias (▽), that is, when the 95 percent simulation interval fails to cover 0. Extreme biases with simulation intervals outside the plotting range of (−0.004, 0.004) are not shown, though the direction of the bias is still indicated by the triangles.

Figure 1.

Bias in main and two-way interaction effects when no context effects are present (C0) and the analytic model is correctly specified (A3). Biases are represented as a 95 percent simulation interval. If the interval does not include 0, the triangles just above the x-axis indicate a significant positive (△) or negative bias (▽). Biases outside the range (−.004, .004) are not shown, but the triangles still indicate the direction of the bias. The main effects show the predictors of industry (I1, I2), educational degree (E1, E2), occupational experience (Y1, Y2), and parental leave (L1, L2) sequentially. The two-way interactions sequentially show the predictors I1×E1, I1×E2, I2×E1, I2×E2, I1×Y1, I2×Y1, I1×Y2, I2×Y2, I1×L1, I2×L1, I1×L2, I2×L2, Y1×L1, Y2×L1, Y1×L2, and Y2×L2.

Figure 2.

Bias in main and two-way interaction effects when no context effects are present (C0) and the analytic model is misspecified (A2). Biases are represented as a 95 percent simulation interval. If the interval does not include 0, the triangles just above the x-axis indicate a significant positive (△) or negative bias (▽). Biases outside the range (−.004, .004) are not shown, but the triangles still indicate the direction of the bias. The main effects show the predictors of industry (I1, I2), educational degree (E1, E2), occupational experience (Y1, Y2), and parental leave (L1, L2) sequentially. The two-way interactions sequentially show the predictors I1×E1, I1×E2, I2×E1, I2×E2, I1×Y1, I2×Y1, I1×Y2, I2×Y2, I1×L1, I2×L1, I1×L2, I2×L2, Y1×L1, Y2×L1, Y1×L2, and Y2×L2.

Figure 3.

Bias in main and two-way interaction effects when no context effects are present (C0) and the analytic model is misspecified (A1). Biases are represented as a 95 percent simulation interval. If the interval does not include 0, the triangles just above the x-axis indicate a significant positive (△) or negative bias (▽). Biases outside the range (−.004, .004) are not shown, but the triangles still indicate the direction of the bias. The main effects show the predictors of industry (I1, I2), educational degree (E1, E2), occupational experience (Y1, Y2), and parental leave (L1, L2) sequentially.

Figure 4.

Bias in main and two-way interaction effects when constant context effects are present (C1) and the analytic model is correctly specified (A3). Biases are represented as a 95 percent simulation interval. If the interval does not include 0, the triangles just above the x-axis indicate a significant positive (△) or negative bias (▽). Biases outside the range (−.004, .004) are not shown, but the triangles still indicate the direction of the bias. The main effects show the predictors of industry (I1, I2), educational degree (E1, E2), occupational experience (Y1, Y2), and parental leave (L1, L2) sequentially. The two-way interactions sequentially show the predictors I1×E1, I1×E2, I2×E1, I2×E2, I1×Y1, I2×Y1, I1×Y2, I2×Y2, I1×L1, I2×L1, I1×L2, I2×L2, Y1×L1, Y2×L1, Y1×L2, and Y2×L2.

Figure 5.

Bias in main and two-way interaction effects when heterogeneous context effects are present (C2) and the analytic model is correctly specified (A3). Biases are represented as a 95 percent simulation interval. If the interval does not include 0, the triangles just above the x-axis indicate a significant positive (△) or negative bias (▽). Biases outside the range (−.004, .004) are not shown, but the triangles still indicate the direction of the bias. The main effects show the predictors of industry (I1, I2), educational degree (E1, E2), occupational experience (Y1, Y2), and parental leave (L1, L2) sequentially. The two-way interactions sequentially show the predictors I1×E1, I1×E2, I2×E1, I2×E2, I1×Y1, I2×Y1, I1×Y2, I2×Y2, I1×L1, I2×L1, I1×L2, I2×L2, Y1×L1, Y2×L1, Y1×L2, and Y2×L2.

Figure 6.

Bias in main and two-way interaction effects when order effects are present (C3) and the analytic model is correctly specified (A3). Biases are represented as a 95 percent simulation interval. If the interval does not include 0, the triangles just above the x-axis indicate a significant positive (△) or negative bias (▽). Biases outside the range (−.004, .004) are not shown, but the triangles still indicate the direction of the bias. The main effects show the predictors of industry (I1, I2), educational degree (E1, E2), occupational experience (Y1, Y2), and parental leave (L1, L2) sequentially. The two-way interactions sequentially show the predictors I1×E1, I1×E2, I2×E1, I2×E2, I1×Y1, I2×Y1, I1×Y2, I2×Y2, I1×L1, I2×L1, I1×L2, I2×L2, Y1×L1, Y2×L1, Y1×L2, and Y2×L2.

Biases Due to Model Misspecifications

Figures 1–3 show the biases for the three analytic models A3, A2, and A1 when no context effects are present (C0). If the analytic model is correctly specified with three-way interactions (A3), all designs produce unbiased effect estimates (Figure 1). The few significant deviations from the zero bias line (indicated by the triangles for the DO-M3 and RS-54 plots) are minor and likely due to chance. The bias intervals for the RS-54 design and the RS-54 long-term evaluation are wider than for the other designs because the smaller sample sizes result in less precision. It is worth noting that the RS-54 design does not result in any bias even though the design confounds four-way interactions with all the other effects (see Table 5). The confounding is not an issue here because the four-way interactions were set to 0 in the data-generating model (but with nonzero effects, we would obtain biased effect estimates).

Figure 2 shows the biases for the misspecified analytic model A2 (omitted three-way interactions). All effect estimates from the RBCF and DO-M2 designs are still unbiased because both designs are orthogonal and confound neither main nor two-way interaction effects. However, the DO-M3 design results in biased effect estimates despite its orthogonality because all main and two-way interaction effects are indirectly confounded with the omitted three-way interactions (via the confounded four-way interaction effects; Table 5). The estimates of the random sampling design (RS-900) are also slightly biased due to the design’s nonorthogonality. Since the predictors of the main and two-way interactions are correlated with the omitted three-way interactions, we obtain classical omitted variable bias (Angrist and Pischke 2009; Steiner and Kim 2016). The same holds for the RS-54 design, but biases are, in general, larger because the correlation among predictors is larger. However, the omitted variable bias vanishes in the RS-54 long-term evaluation because the design is orthogonal in the long-term limit.
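The mechanism of omitted variable bias can be written out for the ordinary least squares case. This is a simplified sketch under our own notation: $X_1$ collects the predictors included in the analytic model (including any set indicators), $X_2$ collects the omitted three-way interaction predictors, and the random intercept structure of the models above is ignored.

```latex
\begin{aligned}
y &= X_1 \beta_1 + X_2 \beta_2 + \varepsilon, \\
\hat{\beta}_1 &= (X_1' X_1)^{-1} X_1' y, \\
E[\hat{\beta}_1 \mid X_1, X_2] &= \beta_1 + (X_1' X_1)^{-1} X_1' X_2 \, \beta_2 .
\end{aligned}
```

The second term on the right-hand side is the bias. It vanishes if the omitted effects are 0 ($\beta_2 = 0$) or if the included and omitted predictors are orthogonal ($X_1' X_2 = 0$), and it grows with the strength of the dependence between them, which is consistent with the larger biases observed for the RS-54 design.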

When the analytic model does not even include two-way interaction effects (A1), Figure 3 shows that we obtain similar results as before but with significantly larger omitted variable biases in the main effects. Again, the DO-M3 and RS designs are very sensitive to model misspecification. The sensitivity of D-optimal designs to model misspecifications clearly depends on the design-generating model and the resulting confounding structure. The level-2 covariate (respondent’s sex) is estimated without bias in both misspecified models.

Biases Due to Context Effects

Figures 4–6 show, for each design, the biases in parameter estimates when different types of context effects are present (C1, C2, C3) but the analytic model is correctly specified (A3). Figure 4 contains the results for the scenario with constant context effects (C1) when all vignettes of a set are simultaneously presented. The RBCF and both D-optimal designs produce unbiased estimates because the set indicators in the analytic model (A3) control for the constant context effects and are not confounded with the main and two-way interaction effects. Since set indicators cannot be modeled with random sampling designs (because each set is only measured once), the context effects bias not only the main and two-way interaction effects but also the estimate of the level-2 covariate (respondent’s sex) in the RS-900 and RS-54 designs. Importantly, even in the long-term evaluation or with extremely large samples, some estimates remain biased.

Figure 5 displays the biases for heterogeneous context effects (C2), that is, the set-specific context differentially affects the vignettes within a set. In this case, all our investigated vignette designs produce biased effect estimates. Since the context effect is produced by an interaction with the industry factor (health and care), for both the RBCF and DO-M2 designs, all industry-related main effects (first two effects in the plot) and two-way interaction effects (last four effects) are considerably biased, but the estimates of all other main and two-way interaction effects remain unbiased. This does not hold for the DO-M3, RS-900, and RS-54 designs. For the DO-M3 design, biases occur due to the indirect confounding of set effects with all main and interaction effects. In addition, the industry-related effect estimates are biased because the set indicators cannot fully capture the heterogeneous context effects. For the RS-900 design, the uncontrollable set effects and the design’s nonorthogonality are responsible for the bias. The same holds for the RS-54 design, but its confounding structure leads to additional bias. Even in the long term, biases remain, particularly in the industry-related estimates.

Figure 6 shows the results for order effects when vignettes are presented sequentially (C3). When systematic context effects arise because of the vignettes’ ordering, the estimates of all designs are biased. The set indicators that control for the set effects no longer capture the order-specific context effects. These biases occur despite randomizing the order of vignettes for each respondent and in each iteration of the simulation. Without randomizing the order, the biases would be even larger (not shown here). The biases seem to be of comparable magnitude for all designs with 900 respondents. For the RS-54 design, more extreme biases are possible due to the smaller sample size. For the long-term evaluation (or with large sample sizes), biases tend to become smaller, but they do not vanish.

The estimate of the level-2 covariate (respondent’s sex) is affected by context effects as well. For constant and heterogeneous context effects (Figures 4 and 5), the estimate is biased in random sampling designs (RS-900 and RS-54). In the presence of order effects (Figure 6), it is biased in almost all designs. Also note that all biases in Figures 4–6 are exclusively due to the context effect because the analytic model (A3) has been correctly specified for all designs. With misspecified models, the omitted variable biases shown in Figures 2 and 3 would cause additional biases in parameter estimates.

Conclusions

To guide researchers in choosing an appropriate vignette design, we compared the confounded factorial design, D-optimal design, and random sampling design with respect to potential biases induced by confounding, context effects, and misspecifications of the analytic model. We showed that the designs’ performance is directly linked to their design features—confounding, orthogonality, balancedness, and the ability to model set effects. It is crucial to always check the confounding structure before the design is implemented in the field, since confounded effects cannot be interpreted without making strong assumptions about the absence of other effects. If the design is not orthogonal, bias due to model misspecification becomes an issue. Finally, balancedness and orthogonality guarantee an efficient estimation of parameters. Our simulations demonstrated that carefully chosen confounded factorial or D-optimal designs avoid undesirable confounding, allow for a modeling of set effects, are less sensitive to model misspecifications, and guarantee the maximum efficiency.

Experimental Vignette Designs

Our comparison of the three major strategies for constructing vignette sets revealed that the confounded factorial designs (RBCF designs) produce the least biased effect estimates. Table 8 lists the key features of the three main designs. These features hold in general and are not restricted to the specific design discussed in this article. RBCF designs guarantee unbiased effect estimates of all unconfounded effects even when constant context effects are present and when the analytic model is misspecified. These desirable properties are due to the controlled confounding of higher order interaction effects with set effects, the orthogonality, and the balancedness of RBCF designs. However, RBCF designs are typically restricted to simple designs with a few factors and ideally the same number of factor levels. For more complex designs that involve large vignette populations, generated from a large number of factors (i.e., five or more factors) with unequal numbers of factor levels (i.e., 2–10 or more levels), adequate RBCF designs might not exist or be challenging to construct. In such situations, D-optimal designs are a viable alternative because slightly nonorthogonal designs with an acceptable confounding structure might still exist.

Table 8. Comparison of the RBCF Design, D-optimal Design, and Random Sampling Design.

Design Feature	RBCF^a	D-Optimal	Random Sampling
Main and interaction effects are confounded with each other	No	Design dependent^b	Yes^c
Interaction effects are confounded with set effects	Yes	Yes	Yes
Orthogonal^d	Yes	Design dependent^b	No
Balanced	Yes	Design dependent^b	No
Sensitivity to model misspecification	No	Design dependent^b	Yes
Can deal with homogeneous context effects	Yes	Yes	No
Can deal with heterogeneous context effects	Partially	Partially	No
Can deal with systematic order effects	No	No	No
Design flexibility (large number of factors and factor levels)	Low	High	High

Note. RBCF = randomized block confounded factorial.

^aRBCF designs that allow for a complete confounding (only possible for designs of low complexity).
^bWhether D-optimal designs share RBCF’s desirable design features depends on the design’s complexity, that is, the number of factors and factor levels, the set size, and the design-generating model chosen by the researcher.
^cIf the sampled vignettes exhaust the entire vignette population, no confounding of main and interaction effects would occur.
^dOrthogonality of all unconfounded effects (across factors).

The performance of D-optimal designs can be almost identical to that of RBCF designs if, for a given design-generating model, the search algorithm finds a design that does not confound any effects of interest, in particular, the main and two-way interaction effects. D-optimal designs have the advantage that they are easy to generate and much more flexible than RBCF designs. Search algorithms in SAS or R, for instance, can be used to find an appropriate design. However, for complex designs with relatively small set sizes (i.e., fewer than 20 or 30 vignettes per set), it might not be possible to prevent main and two-way interaction effects from being confounded with higher order effects. In addition to the direct confounding of main or two-way interaction effects, an incorrect model specification (e.g., omission of higher order terms) may trigger additional confounding bias. Moreover, complex D-optimal designs are only approximately orthogonal and not necessarily balanced, but this is usually of minor concern. In any case, researchers need to carefully select a design-generating model and check the confounding structure of the resulting design, not least because the algorithms rarely find the design with the least confounding in a single run. As exemplified with our DO-M3 design, aiming at an efficient estimation of three-way interactions produced an undesirable confounding structure and sensitivity to model misspecification.

Random sampling designs show the poorest performance. They produce unbiased effect estimates only if (i) main and interaction effects are not confounded with each other (i.e., each vignette is measured at least once) or the confounded effects are 0, (ii) systematic context effects are absent, and (iii) the analytic model is correctly specified. If these conditions are not met, the random sampling design fails to deliver unbiased estimates. Although they are easy to implement, researchers should be aware of the uncontrolled confounding, nonorthogonality, and the inability to deal with any sort of context effects. Since a meaningful interpretation of effect estimates rests on stronger but untestable assumptions, standard random sampling designs have a lower validity than RBCF and D-optimal designs and should be avoided.

In practice, the choice of a specific experimental design is mostly driven by the substantive research question, the corresponding number of factors and factor levels, and the maximum set size possible. The research question determines which effects need to be estimable without confounding and with sufficient power. With only a few factors and two or three levels per factor, RBCF designs should be considered first. For designs with more factors and an unequal number of factor levels, software packages can be used to search for a D-optimal design with a confounding structure that does not compromise the research question. If some of the effects of major interest are confounded, increasing the set size or reducing the number of factors or factor levels might solve the confounding issue. Which of the two strategies actually works in practice depends on how many vignettes a respondent can reasonably assess, and whether dropping factors or factor levels would severely compromise the research question.

However, if subject matter theory or the research question does not allow for a reduction in factors and factor levels, then a confounding of higher order effects with main or two-way interaction effects might be acceptable for an exploratory analysis that aims at investigating the factors’ relative importance rather than accurately estimating main and interaction effects in a confirmatory analysis. If the respondent sample is large (i.e., such that each set of the basic design could be measured multiple times), an alternative strategy is to double or triple the number of sets without reducing the set size. That is, each vignette appears in two or three different sets instead of a single set only. For an RBCF design, this means that we replicate the design two or three times but always with a different confounding contrast such that each single design confounds different effects. For a D-optimal design, we only need to increase the number of sets in the search algorithm while holding the set size constant. Such a strategy might allow for an unconfounded estimation but at the cost of decreased efficiency and power.
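The idea of replicating a confounded factorial design with a different confounding contrast can be sketched on a toy 2^3 population split into two sets per replicate. This is only an illustration under simplifying assumptions (two-level factors, one confounded contrast per replicate); it is not the RBCF-3^4 design used in the simulation, and all function names are hypothetical.

```python
from itertools import product

# Toy 2^3 vignette population with effect-coded (+1/-1) factors A, B, C.
population = list(product([-1, 1], repeat=3))

def contrast(vignette, factors):
    # Value of the interaction contrast for the given factor indices.
    value = 1
    for i in factors:
        value *= vignette[i]
    return value

def split_into_sets(vignettes, confounded):
    # Assign each vignette to a set according to the sign of the
    # contrast that is deliberately confounded with sets.
    sets = {-1: [], 1: []}
    for v in vignettes:
        sets[contrast(v, confounded)].append(v)
    return list(sets.values())

def is_confounded_with_sets(sets, effect):
    # An effect is fully confounded with sets if its contrast
    # is constant within every set.
    return all(len({contrast(v, effect) for v in s}) == 1 for s in sets)

ABC, AB = (0, 1, 2), (0, 1)
replicate_1 = split_into_sets(population, ABC)  # confounds A x B x C with sets
replicate_2 = split_into_sets(population, AB)   # confounds A x B with sets

for name, rep in (("replicate 1", replicate_1), ("replicate 2", replicate_2)):
    print(name,
          "ABC confounded:", is_confounded_with_sets(rep, ABC),
          "AB confounded:", is_confounded_with_sets(rep, AB))
```

Each vignette appears in one set of each replicate, so in two sets overall. The three-way interaction is lost in the first replicate but varies within sets in the second, and vice versa for the A x B interaction, which is why every effect remains estimable from at least one replicate.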

To avoid bias due to model misspecifications, particularly in confirmatory analyses, researchers should always estimate the fully saturated model that includes all set effects and all estimable main and interaction effects even if they are confounded with other effects; if the saturated model contains a large number of predictors, one should at least compare the effect estimates of the more parsimonious model to those from the saturated model. If the effect-coded estimates significantly differ, the higher order terms should not be dropped.

In any case, researchers searching for an optimal vignette design must be aware of the trade-off between the design’s complexity and its validity, given a fixed set size and respondent sample: The higher the complexity of the research question and vignette design, the lower the design’s validity. Although large vignette designs can be very helpful in screening the relative importance of large sets of factors, they are rarely an ideal choice for confirmatory analyses because the unbiasedness of effect estimates rests on strong but untestable assumptions: All effects that are confounded with the main and interaction effects of interest must be 0.

Context Effects

None of the three designs considered in this article can adequately deal with all types of context effects. This is a serious issue for practice because context effects likely occur whenever a respondent has to assess multiple vignettes. If the set-specific context affects all vignette measurements to the same additive extent (constant context effect), then the inclusion of set indicators allows us to remove the context effect, provided each set is measured multiple times. However, set indicators fail to handle heterogeneous context effects and order effects. Randomization strategies, like using a larger number of randomly drawn sets (i.e., a random sampling design) and randomly ordering the vignettes within a set, cannot deal with such systematic context or order effects either.

Different strategies can be used to prevent or control for set- and order-specific context effects. First, both types of context effects can be avoided with between-subjects designs where each respondent only receives a single vignette. But with large vignette populations, between-subjects designs might not be a viable alternative mostly because of the lack of power.

Second, order effects can be avoided or at least mitigated if all vignettes of a set are simultaneously presented to a respondent. If a respondent has the chance to screen all vignettes before assessing them, order effects will be less likely. It might also be helpful to conduct pilot tests with different assignments of vignettes to sets in order to detect potential context effects. If there are indications of strong context effects, a redesign of the vignette experiment (e.g., avoiding or removing extreme factor levels) may be able to avoid biased effect estimates.

Third, if vignettes are presented sequentially, researchers may use a counterbalancing strategy to overcome biases induced by order effects. Although a complete counterbalancing is rarely feasible due to the relatively large number of vignettes in each set, it is possible to partially counterbalance the potential order effects. For instance, within each set, one could apply a Latin square design to (groups of) vignettes (Bradley 1958; Reese 1997). Such designs are clearly very demanding and require multiple measurements for each variation of a single set and a corresponding modeling when analyzing the data.
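A partial counterbalancing scheme of this kind can be sketched as follows. The function below uses a commonly described balanced Latin square construction for an even number of items; it is offered as one possible illustration and is not claimed to be the specific procedure of the cited references. The function name and the idea of treating each item as a vignette (or a group of vignettes) within a set are assumptions made for the example.

```python
def balanced_latin_square(n):
    """Return n presentation orders of n items (labelled 0..n-1).

    Each item occupies every position exactly once, and, for even n,
    each item immediately precedes every other item exactly once.
    """
    if n % 2 != 0:
        raise ValueError("this construction requires an even number of items")
    # First order alternates between low and high labels: 0, 1, n-1, 2, n-2, ...
    first, low, high = [0], 1, n - 1
    while len(first) < n:
        first.append(low)
        low += 1
        if len(first) < n:
            first.append(high)
            high -= 1
    # Remaining orders shift every label cyclically by 1, 2, ..., n-1.
    return [[(item + shift) % n for item in first] for shift in range(n)]

# Example: four (groups of) vignettes yield four presentation orders,
# each of which would be assigned to a different subgroup of respondents.
for order in balanced_latin_square(4):
    print(order)
```

With n items the scheme needs only n orders instead of all n! permutations, which is what makes it a partial rather than a complete counterbalancing. The analysis would then have to include the order variant as an additional design factor.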

Finally, removing implausible or illogical vignettes from the vignette population is frequently also recommended as a strategy to prevent potential context effects and invalid assessments (e.g., Auspurg and Hinz 2015; Jasso 2006). While this strategy undoubtedly increases the validity of vignette assessments, it biases main and interaction effects with regard to the complete vignette population because the resulting nonestimable interaction effects will necessarily be confounded with other estimable effects; and with regard to the restricted population (i.e., without the dropped vignettes), the interpretation of main and interaction effects is complicated and frequently meaningless with respect to the research question. The estimable effects are approximately unbiased and directly interpretable only if the confounded interaction effects are negligibly small. In order to avoid overly strong assumptions in interpreting the estimated effects, researchers are advised to design a vignette experiment with factor levels that do not result in implausible or illogical vignettes. If extreme factor levels are nonetheless of major interest, then multiple but smaller vignette designs that do not result in implausible or illogical vignettes may be able to circumvent the issue.
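Why dropping vignettes confounds interaction effects with other effects can be seen in a minimal sketch. It assumes a toy 2 x 2 vignette population with effect-coded factors, a single dropped combination, and hypothetical noise-free effect sizes; none of these values come from the simulation study.

```python
from itertools import product

# Toy 2 x 2 vignette population with effect-coded (+1/-1) factors A and B.
population = list(product([-1, 1], repeat=2))
# Suppose the combination (A=+1, B=+1) is judged implausible and dropped.
restricted = [v for v in population if v != (1, 1)]

# In the restricted population the interaction column A*B is an exact linear
# combination of the intercept and the two main-effect columns.
aliased = all(a * b == -1 - a - b for a, b in restricted)

def outcome(a, b, beta_a=1.0, beta_b=0.5, beta_ab=0.25):
    # Hypothetical noise-free model defined on the complete population.
    return beta_a * a + beta_b * b + beta_ab * a * b

# With three cells and three parameters (intercept, A, B), the main-effects
# model fits exactly; its slopes are half the differences between cells.
y_base = outcome(-1, -1)
estimate_a = (outcome(1, -1) - y_base) / 2   # equals beta_a - beta_ab
estimate_b = (outcome(-1, 1) - y_base) / 2   # equals beta_b - beta_ab

print(aliased, estimate_a, estimate_b)
```

Both main-effect estimates are shifted by the interaction effect, so they match the effects defined on the complete population only if the interaction is zero. This corresponds to the condition stated above that the confounded interaction effects must be negligibly small.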

Using a simulation study, we evaluated the three basic vignette designs from a theoretical point of view. But the choice of a specific strategy for constructing vignette sets and dealing with context effects also depends on other factors that are specific to a study, like the cognitive skills of respondents, the mode of vignette presentation, or the researcher’s belief about the presence of context effects. Thus, besides the theoretical evaluation, we would need more empirical evidence from split ballot experiments, particularly with respect to the strategies for handling different types of context effects. Moreover, external validity evaluations can also help us in learning what works in practice (examples of external validity evaluations can be found in Eifler [2010] and Steiner et al. [2016]).

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Alves W. M., Rossi P. H. 1978. “Who Should Get What? Fairness Judgments of the Distribution of Earnings.” American Journal of Sociology 84:541–64. doi:10.1086/226826

Angrist J. D., Pischke J. S. 2009. Mostly Harmless Econometrics: An Empiricist’s Companion. Vol. 1. Princeton, NJ: Princeton University Press.

Atkinson, Donev, Tobias. 2007. Optimum Experimental Designs, with SAS. Vol. 34. Oxford, UK: Oxford University Press.

Atzmüller, Steiner P. M. 2010. “Experimental Vignette Studies in Survey Research.” Methodology: European Journal of Research Methods for the Behavioral and Social Sciences 6:128–38.

Auspurg, Hinz. 2015. Factorial Survey Experiments. Thousand Oaks, CA: Sage.

Auspurg, Jäckle. 2015. “First Equals Most Important? Order Effects in Vignette-based Measurement.” Sociological Methods & Research. doi:10.1177/0049124115591016.

Bradley J. V. 1958. “Complete Counterbalancing of Immediate Sequential Effects in a Latin Square Design.” Journal of the American Statistical Association 53:525–28.

Cox D. R. 1958. Planning of Experiments. New York: Wiley.

Dülmer. 2007. “Experimental Plans in Factorial Surveys: Random or Quota Design?” Sociological Methods & Research 35:382–409.

Dülmer. 2016. “The Factorial Survey Design Selection and Its Impact on Reliability and Internal Validity.” Sociological Methods & Research 45:304–47.

Eifler. 2010. “Validity of a Factorial Survey Approach to the Analysis of Criminal Behavior.” Methodology 6:139–46. doi:10.1027/1614-2241/a000015.

Fisher R. A. 1926. “The Arrangement of Field Experiments.” Journal of the Ministry of Agriculture of Great Britain 33:503–13.

Gravetter F. J., Forzano L.-A. B. T. 2015. Research Methods for the Behavioral Sciences. 5th ed. Belmont, CA: Wadsworth.

Hox J. J., Kreft I. G., Hermkens P. L. 1991. “The Analysis of Factorial Surveys.” Sociological Methods & Research 19:493–510.

Jasso. 2006. “Factorial Survey Methods for Studying Beliefs and Judgments.” Sociological Methods & Research 34:334–423.

Jasso, Webster. 1997. “Double Standards in Just Earnings for Male and Female Workers.” Social Psychology Quarterly 60:66–78.

Kirk R. E. 1995. Experimental Design: Procedures for the Behavioral Sciences. 3rd ed. Pacific Grove, CA: Brooks/Cole.

Kuhfeld W. F. 1997. “Efficient Experimental Designs Using Computerized Searches.” Research Paper Series, SAS Institute, Inc. Sequim, WA: Sawtooth Software Conference Proceedings.

Kuhfeld W. F., Tobias R. D., Garratt. 1994. “Efficient Experimental Design with Marketing Research Applications.” Journal of Marketing Research 31:545–57.

Montgomery D. C. 1984. Design and Analysis of Experiments. New York: Wiley.

Reese H. W. 1997. “Counterbalancing and Other Uses of Repeated-measures Latin-square Designs: Analyses and Interpretations.” Journal of Experimental Child Psychology 64:137–58.

Rossi P. H. 1951. “The Application of Latent Structure Analysis to the Study of Social Stratification.” PhD dissertation, Columbia University, New York.

Rossi P. H., Nock S. L., eds. 1982. Measuring Social Judgments: The Factorial Survey Approach. Beverly Hills, CA: Sage.

Rossi P. H., Sampson W. A., Bose C. E., Jasso, Passel. 1974. “Measuring Household Social Standing.” Social Science Research 3:169–90.

Rossi P. H., Waite, Bose C. E., Berk R. E. 1974. “The Seriousness of Crimes: Normative Structures and Individual Differences.” American Sociological Review 39:224–37.

SAS Institute Inc. 2012. Using JMP 10. Cary, NC: SAS Institute.

Schunck, Abendroth, Diewald, Melzer S. M., Pausch. 2013. “What Do Women and Men Want? Investigating and Measuring Preference Heterogeneity for Life Outcomes Using a Factorial Survey.” SFB 882 Working Paper Series, 20, DFG Research Center (SFB) 882 from Heterogeneities to Inequalities, Bielefeld, Germany.

Searle S. R. 1987. Linear Models for Unbalanced Data. New York: Wiley.

Steiner P. M., Atzmüller. 2006. “Experimentelle Vignettendesigns in faktoriellen Surveys [Experimental Vignette Designs in Factorial Surveys].” Kölner Zeitschrift für Soziologie und Sozialpsychologie 58:117–46.

Steiner P. M., Atzmüller. 2016. “Designing Valid and Reliable Vignette Experiments for Survey Research: A Case Study on the Fair Gender Income Gap.” Journal of Methods and Measurement in the Social Sciences 7:52–94.

Steiner P. M., Kim. 2016. “The Mechanics of Omitted Variable Bias: Bias Amplification and Cancellation of Offsetting Biases.” Journal of Causal Inference 4. doi:10.1515/jci-2016-0009.

Taylor B. J. 2006. “Factorial Surveys: Using Vignettes to Study Professional Judgement.” British Journal of Social Work 36:1187–207.

Wheeler. 2014. “AlgDesign: Algorithmic Experimental Design. R Package Version 1.1-7.2.” Retrieved October 15, 2014 (http://CRAN.R-project.org/package=AlgDesign).

C. J., Hamada M. S. 2009. Experiments: Planning, Analysis, and Optimization. Vol. 552. Hoboken, NJ: John Wiley.

Yellen S. B., Cella D. F. 1995. “Someone to Live for: Social Well-being, Parenthood Status, and Decision-making in Oncology.” Journal of Clinical Oncology 13:1255–64.