Abstract
In recent decades, factorial survey experiments (FSEs) have become increasingly widespread and successful for analyzing attitudes and behavioral intentions. FSEs measure the ratings of multidimensional treatments embedded in textual scenarios, which are called vignettes. Analyses of FSEs often assume that these ratings are interval scaled. Past research indicates that this assumption is problematic. Therefore, this article develops a design for interval scaling tests in FSEs and a method for analyzing FSEs that is sensitive to the scaling level of ratings. An exemplary application of scaling sensitive factorial survey analysis, compared with standard methods, yields effect sizes that are about a sixth larger with regard to the treatments used in the FSE and about a third larger with respect to influences of between-respondent differences. The new method also enables the evaluation of interval scaled in contrast to noninterval scaled rating behavior.
Keywords
In recent decades, factorial survey experiments (FSEs, Jasso and Rossi 1977; Rossi and Anderson 1982) have become an increasingly widespread and successful method for measuring and analyzing attitudes, judgments, beliefs, opinions, preferences, and behavioral intentions (Wallander 2009). An FSE is a type of survey experiment consisting of typically textual scenarios (called vignettes) that combine several treatments (called dimensions) with controlled varying doses (called levels). The vignettes are used as stimuli which are singly evaluated by respondents (Auspurg and Hinz 2015). The review of Wallander (2009) shows that in 71 percent of the FSEs, the responses to these evaluation tasks were analyzed with methods that assume an observed interval or log-interval scale, such as linear regression or analysis of variance; that is, it was assumed that not only the ranking of the vignette evaluations but also the distance between these rankings is informative. With regard to the response instruments used for the evaluation tasks, 46 percent of the reviewed FSEs used ordered and 15 percent unordered category rating instruments (Wallander 2009). 1 Hence, while interval scales are not necessarily implied by the response instruments commonly used in FSEs, the methods of analysis typically implemented require the assumption of an interval scale. Such methods of analysis are also most widely discussed in the literature on analyzing FSEs (Hox, Kreft, and Hermkens 1991; Jasso 2006). Past research demonstrates that the match between responses and the interval scaling assumption varies between individuals and study settings (Wegener 1982). Thus, there is an interval scaled and a noninterval scaled evaluation mode in empirical applications.
These results indicate that assuming interval scaled vignette ratings, in general, is problematic since for some cases, the distance between the ratings is not informative for the analyses. Therefore, it is plausible to expect that utilizing information on the scaling level of vignette ratings improves the effect sizes, that is, standardized mean differences, resulting from factorial survey analyses and, furthermore, that factors influencing vignette ratings can vary between an interval and a noninterval scaled response mode. Based on these expectations, the following article adds to the methodological research on FSEs and survey response quality by answering two related questions: First, how is the interval scaling assumption testable in FSEs? Second, how can one analyze FSEs in a way which is sensitive to the scaling level of vignette ratings, that is, with a method which does not mistakenly treat all vignette evaluations in an FSE as interval scaled but also does not disregard the truly interval scaled information measured in an FSE?
This article is structured as follows: First, I introduce methods for testing the interval scaling assumption based on category ratings (Orth 1982; Orth and Wegener 1983), discuss expectations regarding this assumption in FSEs, and introduce a design implementing such a test in an FSE. Then, I develop a method for scaling sensitive factorial survey analysis (SSFSA) using the previously introduced design and generalized multilevel structural equation modeling (GSEM; Rabe-Hesketh, Skrondal, and Pickles 2004). Afterward, I present an exemplary application of the previously proposed methods. In the application, I investigate the influences of tertiary students’ educational motivations on their preferences regarding internships. The final section concludes and discusses the implications of this study for future research using FSEs.
Testing the Interval Scaling Assumption in FSEs
Methods for Testing the Interval Scaling Assumption
Measurements are interval scaled if not only their ranking but also their distance on a scale is meaningful. “Commonly,…category ratings are supposed to lead to interval scales…based on the assumption that respondents judge according to perceived stimulus differences but there is no direct approach to testing whether these differences allow the construction of an interval scale” (Orth and Wegener 1983, p. 418f). To address this problem, tests of the interval scaling assumption use a rank order of subjective paired stimuli differences to be measured empirically in addition to the subjective single stimulus category ratings (Krantz et al. 1971; Wegener 1983). Thus, in the case of FSEs, not only category ratings C of single vignettes are measured (e.g., vignettes called a, b, c, and d) but also ratings D of differences between pairings of these single vignettes (e.g., vignette pairings called ab, ac, bc). In general, techniques using such an additional set of observations for interval scaling tests are called conjoint measurement. 2
If the rank orders of the single vignette ratings C and paired vignette ratings D satisfy the two conditions of an algebraic difference structure, that is, the quadruple and the compatibility condition, then the category ratings form an interval scale (Krantz 1972; Krantz et al. 1971; Orth 1982).
First, the quadruple condition implies weak monotonicity and states that,
The relation
Second, the compatibility condition postulates:
Here,
If conditions (1) and (2) hold for all stimuli, then the ratings C and D form an algebraic difference structure which implies that these measurements are unique up to positive linear transformations and, thus, are interval scaled. Specifically, the interval scaling property states that the calculated numerical differences
u > 0 implies that the rated differences between the vignette pairings D are predictive of the distances
Given respondents’ additional ratings of differences between vignette pairings D, both conditions (1) and (2), as well as property (3), are empirically testable on the respondent level i. The quadruple condition (1) is typically assessed by calculating the proportion of violations qi ∈ ℝ[0; 1] over the set of all vignette pairings. If the quadruple condition (1) is satisfied, then qi = 0. The compatibility condition (2) is evaluated by calculating the Kendall’s (1938)
Expectations Regarding Interval Scaled Response Behavior in Factorial Surveys
In social scientific research applications such as FSEs, it is likely that the measurement of qi
,

Stylized cumulative distribution function of ri
or
Past research supports these expectations regarding qi
,
I expect a lower level of compliance with the interval scaling assumption for FSEs in comparison to the studies cited above for two reasons. First, compared to the one-dimensional evaluation task of line sizes and occupational prestige in previous applications, FSEs are multi-dimensional, more complex evaluation tasks. Second, in comparison to these previous studies, FSEs are mostly conducted without noteworthy prior training for the rating tasks and often also in field settings without interviewer assistance, for example, as an online study (Weinberg, Freese, and McElhatten 2014). In the following subsection, I outline a design to test these expectations regarding the interval scaling assumption in an FSE.
Implementing an Interval Scaling Test in a Factorial Survey
In principle, an interval scaling test for an FSE can be set up by constructing vignette pairings based on the vignettes used as stimuli for the experiment and having respondents rate the differences between these vignette pairings in addition to the single vignettes. Practically, the large number of vignette pairings per respondent needed for an assessment of the quadruple condition (1) makes testing it barely feasible for most survey designs. 5 Formally, given a set of vignettes per respondent of size A, the related set of vignette pairings B is constituted by the lower triangle of the square matrix given by the Cartesian product of the vignette set with itself. Hence, the size of B is (A² − A)/2. For example, B consists of 45 vignette pairings if A contains 10 vignettes. Moreover, the set of vignette pairings B contains many repetitions of the same single vignettes, which can annoy or irritate respondents.
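The growth of the pairing set can be checked with a minimal sketch. The function name `n_pairings` and the vignette labels are hypothetical and only serve the illustration.

```python
from itertools import combinations


def n_pairings(a: int) -> int:
    """Number of unordered vignette pairings for a set of `a` vignettes: (a^2 - a) / 2."""
    return (a * a - a) // 2


# Hypothetical labels for a set of 10 vignettes.
vignettes = [f"v{k}" for k in range(1, 11)]

# Enumerating the lower triangle is the same as listing all unordered pairs.
pairings = list(combinations(vignettes, 2))

print(len(pairings), n_pairings(10))
```

With 16 vignettes per respondent, the same formula already gives 120 pairings, which illustrates why a full test of the quadruple condition quickly becomes impractical.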
The structure of the interval scaling test described above indicates a possible solution. The quadruple condition (1) is a necessary precondition for the compatibility condition (2) and the interval scaling property (3) to hold (Orth 1982). Therefore, compared to the interval scaling assumption, which states that both conditions (1) and (2) are satisfied, presuming consistency with only the quadruple condition (1) is a weaker postulate. Assuming consistency with the quadruple condition (1) drastically reduces the workload of an interval scaling test for respondents. For a test of (2) and (3) under this assumption, a subset of paired vignette difference ratings which is large enough to calculate
In contrast to other FSE designs, the PDR design provides information regarding the degree to which vignette ratings match the interval scaling assumption on the respondent level. The results of the other applications described above suggest that this assumption will be violated in FSEs. To address this problem, I develop a method to analyze FSE data that incorporates information about the scaling level of vignette ratings in the next section. Utilizing such information facilitates more robust and differentiated analyses of FSEs.
A Model for Scaling Sensitive Factorial Survey Analysis
Standard and Ordinal Factorial Survey Analysis
Nowadays, it is routine practice to analyze FSE data in which respondents rate several vignettes yv using linear random intercept models, that is, a linear regression model with a normally distributed random parameter ε i ∼ N(0, σ²) and cluster-robust standard errors on the respondent level (Auspurg and Hinz 2015; Hox et al. 1991; Jasso 2006). 6 Therefore, I call this approach standard factorial survey analysis (SFSA). The cluster-robust standard errors account for the within-respondent dependence between ratings in statistical tests. This dependence is induced by the hierarchical data structure in which vignettes are nested within respondents, that is, the cluster-robust standard errors address the violation of the assumption that the vignette ratings are independently and identically distributed (iid). The respondent-level random component controls for normally distributed unobserved heterogeneity in the mean vignette ratings between respondents. 7 Accounting for such unobserved heterogeneity in FSEs is important if not only the effects of the experimental treatments implemented in the vignettes on the ratings but also the effects of differences between respondents (or of interactions between such differences and the treatments) on the vignette ratings are the focus of the analyses. A model for SFSA is given by:
Here, α is a fixed overall intercept parameter, Xv,i is a matrix of exogenous variables on the vignette level v and the respondent level i, β is a vector of the respective fixed parameters, and ε v ∼ F(0, σ²) is a random error component with a fixed variance parameter on the vignette level.
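Collecting the terms defined above, a linear random intercept model of this kind can be written out as follows. This is a sketch assembled from the stated definitions (overall intercept α, exogenous variables with parameters β, respondent-level component ε i, and vignette-level error ε v); the exact indexing of the outcome is an assumption.

```latex
y_{v} = \alpha + X_{v,i}\,\beta + \varepsilon_{i} + \varepsilon_{v}
```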
A central assumption of SFSA is that vignette ratings are interval scaled. Violations of this assumption call into question the linear additive structure of equation (4) and, hence, the use of SFSA. A way to handle FSE data which violate the interval scaling assumption, that is, data including a relevant share of respondents with
Here, α c is a vector of fixed parameters for the c − 1 thresholds separating the c ordered outcome categories of yv, y* are the values of yv on the logit scale estimated by the model, γ is a vector of the exogenous variables’ fixed parameters, and ζ v ∼ SL(0, π²/3) is a standard logistic distributed random error component on the vignette level. OFSA renders the interval scaling assumption unnecessary for analyzing FSE data and it can be implemented as a robustness check for SFSA even if information on the scaling level of the vignette ratings is unavailable (Auspurg and Hinz 2015). But with knowledge about the degree to which the interval scaling assumption holds, for example, through the calculation of ri and
Scaling Sensitive Factorial Survey Analysis
Given indicators of the interval scaling level of vignette ratings like ri
or
To address this problem, I extend the afore-described conditional modeling strategy by conceptualizing a weighting scheme for FSE data including information on the interval scaling level of vignette ratings. For clarity, I restrict the following part of the presentation to analysis using ri as an indicator of the interval scaling level of ratings, but an identical setup using
Given equation (6), a GSEM (Rabe-Hesketh et al. 2004) based on equations (4) and (5) can be set up. In such a GSEM, a weight of wr that increases with the interval scaling level of the vignette ratings, indicated by ri , is assigned to equation (4) and a weight of 1 − wr that decreases with ri is assigned to equation (5). Formally, such a GSEM is specified as:
with [·] denoting a function assigning the respective weight to yv , y*, and Xv , i .
A GSEM based on equations (7) and (8) facilitates the simultaneous estimation of an SFSA and an OFSA in a way that makes gradual use of the distance information contained in the vignette ratings dependent on the interval scaling level, indicated by ri . Therefore, I call such a GSEM a model for scaling sensitive factorial survey analysis (SSFSA). GSEMs are typically estimated using a maximum likelihood algorithm. For the maximum likelihood estimation, ε v and ζ v are assumed to be normally distributed.
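One plausible reading of this weighting is that each vignette rating contributes to the likelihood of the linear part with weight wr and to the likelihood of the ordinal part with weight 1 − wr. The sketch below illustrates that reading for a single rating. It is a simplification under stated assumptions: respondent-level random components are omitted, the weight is taken as given, and all function names are hypothetical. It is not the estimation routine used in the article.

```python
import math


def normal_logpdf(y, mu, sigma):
    """Log density of N(mu, sigma^2) at y (linear, interval scaled part)."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2.0 * sigma ** 2)


def logistic_cdf(x):
    """Numerically stable standard logistic CDF."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def ordered_logit_logprob(category, eta, thresholds):
    """Log probability of `category` (0 .. len(thresholds)) in an ordered logit model."""
    upper = 1.0 if category == len(thresholds) else logistic_cdf(thresholds[category] - eta)
    lower = 0.0 if category == 0 else logistic_cdf(thresholds[category - 1] - eta)
    return math.log(upper - lower)


def weighted_loglik(y, category, mu, sigma, eta, thresholds, w):
    """Contribution of one rating: weight w on the linear part, 1 - w on the ordinal part."""
    return (w * normal_logpdf(y, mu, sigma)
            + (1.0 - w) * ordered_logit_logprob(category, eta, thresholds))
```

Setting `w = 1.0` leaves only the linear contribution and `w = 0.0` leaves only the ordinal one, which mirrors the two borderline cases of the weighting scheme.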
Importantly, over equations (7) and (8), the weight assigned to each vignette rating by equation (6) given ri ≥ r min sums to 1, that is, for each observation weight (7) + weight (8) = wr + (1 − wr ) = 1 holds. Thus, the sample size used for an SSFSA is the same as in the corresponding SFSA or OFSA conditional on ri ≥ r min. Additionally, specifying r min = 0 gives a simple formulation for the weighting scheme: If r min = 0, then equation (6) results in wr = ri for respondents with ri ≥ 0. Moreover, if instead of defining wr based on equation (6), wr is set to 1, then equations (7) and (8) reduce to equation (4), and if wr is set to 0, then equations (7) and (8) reduce to equation (5). In consequence, SSFSA is a generalization containing the models typically used to analyze FSE data, SFSA or OFSA, as borderline cases.
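The properties of the weighting scheme listed above can be illustrated with a short sketch. The linear rescaling used here is an assumption: it is chosen because it reproduces the stated special case wr = ri for r min = 0, and the article's equation (6) may take a different form for other values of r min. The function name is hypothetical.

```python
def interval_weight(r_i, r_min=0.0):
    """Weight w_r for the linear model part; None if the respondent is excluded.

    Assumed form (not confirmed by the source): linear rescaling of r_i from
    [r_min, 1] to [0, 1]. Requires r_min < 1 and r_i <= 1.
    """
    if r_i < r_min:
        return None  # respondents below the cutoff are not used
    return (r_i - r_min) / (1.0 - r_min)


for r in (-0.2, 0.0, 0.35, 0.9):
    w = interval_weight(r)
    if w is None:
        print(r, "excluded")
    else:
        # Weights of the linear part and the ordinal part sum to 1 per rating.
        print(r, w, 1.0 - w)
```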
Implementation of an SSFSA
A central advantage of SSFSA is its ability to model comparisons of parameters determining the vignette evaluation process between an interval scaled and a noninterval scaled rating mode, that is, between β in equation (7) and γ in equation (8). This can be achieved by rescaling γ in equation (8) based on the ratio of σ, the residual standard deviation of ε v estimated in equation (7), and the residual standard deviation of ζ v in equation (8), which is fixed at π/√3. This ratio defines a scaling parameter λ = σ/(π/√3) = σ√3/π mapping the scale of equation (7) on the scale of equation (8). Using λ, z tests of hypotheses such as
can be conducted. Similar methods for testing parameter differences are used in comparing the results of different discrete choice experiments or combining stated and reveal preference data on choice processes (Louviere, Hensher, and Swait 2000; Swait and Louviere 1993).
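The rescaling step can be sketched in a few lines. The computation of λ follows the definition given above. The z statistic, in contrast, is a simplified Wald-type illustration that treats λ as known and the two estimates as uncorrelated; both are assumptions made for this sketch and may differ from the article's test. All numbers and function names are hypothetical.

```python
import math

# Residual standard deviation of the standard logistic distribution.
LOGISTIC_SD = math.pi / math.sqrt(3.0)


def scaling_parameter(sigma):
    """lambda = sigma / (pi / sqrt(3)) = sigma * sqrt(3) / pi."""
    return sigma / LOGISTIC_SD


def z_difference(beta, se_beta, gamma, se_gamma, lam):
    """Simplified z statistic for beta - lambda * gamma.

    Assumes lambda is fixed and the two estimates are independent.
    """
    diff = beta - lam * gamma
    se = math.sqrt(se_beta ** 2 + (lam * se_gamma) ** 2)
    return diff / se


lam = scaling_parameter(100.0)           # hypothetical sigma
print(round(lam, 2))
print(z_difference(120.0, 15.0, 2.0, 0.3, lam))
```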
In consequence, equation (9) can be used to test possible restrictions regarding the parameters to be estimated in an SSFSA. Given that equation (9) is rejected for certain parameters, related constraints can be implemented by specifying β = λγ (Louviere et al. 2000) and estimating the GSEM based on equations (7) and (8) again including these constraints. To differentiate the two steps of this approach, I call the first estimation without parameter restrictions unconstrained SSFSA and the second estimation with parameter restrictions constrained SSFSA. To account for the uncertainty in the first step estimate of λ, I use a plausible value estimator for the second estimation step (Mislevy 1991). 8
Furthermore, differences in the respondent-level random error parameters, ε i and ζ i , between the interval and the noninterval scaled rating mode can also be tested using equation (9). Therefore, the random error parameters are specified as ε i = βei and ζ i = γei with ei ∼ N(0, 1) in the estimation of the GSEM based on equations (7) and (8). Afterward, equation (9) is calculated for the β and γ parameters corresponding to ei . Moreover, the scaling parameter λ can be used to compare the threshold parameters α c of the logit scale estimated by equation (8) with the respective fixed values implied by the interval scale in the estimation of equation (7). Such a threshold parameter is given by calculating α + λα c .
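The threshold mapping α + λα c can be illustrated as follows. The input values are hypothetical and chosen only to show the arithmetic; the function name is likewise an assumption.

```python
def thresholds_on_interval_scale(alpha, lam, alpha_c):
    """Map logit-scale thresholds alpha_c to the interval rating scale via alpha + lam * alpha_c."""
    return [alpha + lam * a for a in alpha_c]


# Hypothetical intercept, scaling parameter, and logit-scale thresholds.
mapped = thresholds_on_interval_scale(alpha=50.0, lam=200.0, alpha_c=[-2.0, 0.0, 1.5])
print(mapped)
```

The mapped values can then be set against the category boundaries implied by the interval rating scale to see where the two scales diverge.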
Finally, since SSFSA is based on GSEM, extensions using random slopes or mapping additional hierarchical data structures can be implemented. Furthermore, different specifications for the link functions of yv and related error components within the overdispersed exponential family of probability distributions are possible. For example, yv in equation (8) can be implemented as an ordered probit regression or yv in equation (7) can be specified as a tobit regression (Tobin 1958). Thus, SSFSA is also compatible with other link functions used to address censoring in FSE data (Auspurg and Gundert 2015; Groß and Lang 2018).
In the following section, I present an exemplary application of an SSFSA based on a PDR design. I show the implementation of an SSFSA step-by-step and demonstrate that SSFSA results in improved inferences, that is, increased effect sizes, compared to SFSA.
Application: Students’ Internship Preferences and Educational Motivations
Design and Data Used
The substantive interest of the FSE presented in the following section is tertiary students’ preferences regarding internships, that is, work placements with a strong practical training aspect conducted while enrolled in tertiary education. Since the FSE was set up with the primary aim to test the PDR design and the model for SSFSA, a simple experimental setup was implemented: a full factorial setup with two 4-level and two 2-level dimensions. 9 In a full factorial setup, all treatment factors (i.e., levels of dimensions) are covered by the measured stimuli (i.e., vignettes). Therefore, all main effects of the treatments, as well as all effects of interactions between them on the vignette ratings, are identified and the number of vignettes which cover all possible combinations of dimensions and levels (called vignette population or universe) is equal to the number of vignettes measured (called vignette sample). For the case of two 4-level and two 2-level dimensions, these are 4² × 2² = 64 vignettes. This vignette sample was randomly split into four groups (called decks) of 16 vignettes each for respondents to rate. To construct the additional paired vignettes for the PDR design, all single vignettes were used only once, resulting in four different decks of eight vignette pairings each. This strategy avoided additional repetitions of vignettes which could annoy or irritate respondents. Specifically, the vignette pairs were created by constrained random matching without replacement of the single vignettes in each deck. 10
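The construction of the vignette universe, decks, and pairings can be sketched as follows. The sketch uses plain random matching without replacement; the constraints applied in the actual matching are not modeled, and the level codes, seed, and function name are assumptions made for illustration.

```python
import random
from itertools import product


def build_design(seed=0):
    """Return the vignette universe, four decks of 16, and eight pairings per deck."""
    rng = random.Random(seed)

    # Two 4-level and two 2-level dimensions: 4 * 4 * 2 * 2 = 64 vignettes.
    levels = [range(4), range(4), range(2), range(2)]
    universe = list(product(*levels))

    # Random split of the full factorial into four decks of 16 vignettes.
    shuffled = universe[:]
    rng.shuffle(shuffled)
    decks = [shuffled[k * 16:(k + 1) * 16] for k in range(4)]

    # Within each deck, match vignettes at random so that each is used exactly once.
    pair_decks = []
    for deck in decks:
        order = deck[:]
        rng.shuffle(order)
        pair_decks.append([(order[j], order[j + 1]) for j in range(0, 16, 2)])

    return universe, decks, pair_decks


universe, decks, pair_decks = build_design()
print(len(universe), [len(d) for d in decks], [len(p) for p in pair_decks])
```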
Study participants were randomly assigned to one of the decks and instructed to evaluate the single vignettes and also the vignette pairs for the PDR design in comparison to each other. The choice of the response instruments for the evaluations was guided by the research interest to test the interval scaling assumption in the context of an FSE. First, I picked the instrument most often used in FSEs: an 11-category rating instrument with range ℕ[−5, 5]. Eleven is the mode of the number of categories for rating instruments used in FSEs (Wallander 2009). Second, I used a 21-category rating instrument with range ℕ[−10, 10], which is the maximum number of categories used for rating instruments in FSEs so far (Wallander 2009). The idea motivating the use of rating instruments with more categories is to enable respondents to make more fine-grained evaluations, potentially also facilitating a better match of these evaluations with the interval scaling assumption. I wanted to take this idea even further and hence, third, I implemented a response instrument allowing very fine-grained evaluations: a slider rating instrument with range ℕ[−1,000, 1,000]. All three instruments were bipolar and had a neutral midpoint at zero between the extremes of “very unattractive” and “very attractive” since this is a category rating design which is suitable to map a continuous preference which can also involve dislike of alternatives (i.e., internship vignettes). In such cases, using a unipolar response instrument mapping categories of “attractiveness” can result in skewed rating distributions because respondents cannot express a part of their preference continuum. Moreover, bipolar rating instruments are more readily applicable to the vignette pairings used in the PDR design. Here, participants have to calculate differences in “attractiveness” between two internships, a task which is easier if clearly preferring one of the two alternatives sets the poles of the scale.
Respondents were randomly assigned to one of the three different response instruments. In addition, the ordering of the single vignette and paired vignette rating tasks was exchanged for a third of the respondents assigned to a 21-category or a slider rating instrument. Overall, this combination of an FSE with a split-ballot experiment resulted in five comparison groups (see Table 1). Moreover, the presentation order of the vignettes and vignette pairs, as well as the text order of the dimensions within a vignette, was randomized between respondents. Figure A1 in the Online Supplemental Material shows an example of the single vignettes and Figure A2 in the Online Supplemental Material is an example of the vignette pairings used for the PDR design. In addition, survey participants had to answer several psychological short scales as well as some questions regarding their studies, internship experiences, and social background.
Overview of the Experimental Design and Related Sample Statistics.
Source: Own calculations.
Between April 15 and 26, 2015, 1,920 randomly selected tertiary students from a commercial sampling pool of about 10,000 participants covering different German tertiary institutions and fields of study were successively invited to participate in the study. About every other day—overall five times—384 invitations were posted. Each student was only invited once and no reminders were sent. The gross sample was restricted to tertiary students aged 35 or younger. Three hundred and fifty-three students started answering the questionnaire resulting in a response rate of about 20 percent. Forty-one of these starters were screened out for speeding while filling out the questionnaire. Furthermore, 49 of the remaining respondents dropped out before finishing the survey, thus resulting in a sample of 263 complete cases and a completion rate of about 85 percent, excluding the speeders. Thirty-eight of the complete cases were excluded from further analyses due to straight lining. 11 The median response time for the 225 remaining respondents was about 27 min. The sample statistics described in Table A1 in the Online Supplemental Material (Statistisches Bundesamt, 2014) show that this net sample is comparable to the population of tertiary students in Germany with respect to the type of tertiary institution attended, fields and years of study, and age. 12
Table 1 gives an overview of the experimental design and the sample used for this study differentiated by instruments and rating task order. The maximum absolute correlations between the vignette dimensions (max. |r dim|) are .02 in the overall sample and .09 in the sample split by the different design specifications. The respective mean absolute correlations (mean |r dim|) are .01 in both cases. These correlations indicate that the randomization of the experimental conditions worked out correctly. To enhance comparability, the ratings of the instruments with 11 and 21 categories were rescaled to the range of the slider instrument (ℕ[−1,000, 1,000]). Figure A3 in the Online Supplemental Material shows the related vignette rating distributions differentiated by design conditions. The response range is covered well in all conditions and the distributions are only weakly negatively skewed. The boundary values of the response range are used in about 5–9 percent of the ratings in all design conditions. 13 The mean of the vignette ratings yv is 86. In contrast, the mean yv of 16 in the design condition with the slider rating instrument and the single vignette rating task first is significantly lower (z-value −3.24***). This difference persists in multivariate analyses controlling for other design factors, treatments of the FSE, and heterogeneity between respondents. It would be interesting to test whether this design effect can be replicated in other FSEs.
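The rescaling of the category ratings to the slider range can be sketched as a linear transformation that keeps the neutral midpoint at zero. That the rescaling was linear is an assumption of this sketch, and the function name is hypothetical.

```python
def rescale_to_slider(rating, half_range, slider_half_range=1000):
    """Linearly map a rating from [-half_range, half_range] to the slider range."""
    return rating * slider_half_range / half_range


# 11-category instrument (range -5..5) and 21-category instrument (range -10..10).
print([rescale_to_slider(x, 5) for x in (-5, -1, 0, 3, 5)])
print([rescale_to_slider(x, 10) for x in (-10, -1, 0, 7, 10)])
```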
Operationalization and Methods Used
In the next subsection, I present the results of the interval scaling test based on PDR design using the respondent-level indicators ri
and
In all the models, I implement dichotomous indicators for the within respondent effects, that is, the treatment levels of the FSE. 15 On the between respondent level, I use z-standardized sum indexes of external, internal, and altruistic motivations for tertiary studies based on two-item instruments each (Simeaner, Ramm, and Kolbert-Ramm 2014) as further explanatory variables. 16 In addition, I incorporate different factors related to the interval scaling level of vignette ratings based on previous research (Lang 2016): An indicator of being a first-generation migrant and z-standardized sum indexes for the psychological constructs need for cognition, conscientiousness, and expressiveness (Rammstedt and Beierlein 2014) based on a four-, a two-, and a five-item instrument, respectively. Figure A4 in the Online Supplemental Material shows descriptive statistics for the variables used. Finally, I include indicators controlling the described design splits, vignette decks, and the random ordering of the vignettes as well as the vignette texts.
Results of the Interval Scaling Test
Now, I summarize the results of the interval scaling test for the FSE on tertiary students’ internship preferences described above. Figure 2 displays the CDFs of the indicators for the interval scaling level ri
and

Cumulative distribution functions of the interval scaling level indicated by ri
or
The results of this interval scaling test point out the potential utility of analyzing this FSE with an SSFSA. Relatedly, 24 respondents in the sample have an ri < 0 and 25 respondents a
Results of the SSFSA
In the following subsection, I describe the implementation of an SSFSA for the FSE on tertiary students’ internship preferences step-by-step and present related results. The results of the different analyses are summarized in Tables 2 and 3. The first column of Table 2 shows the results of the SFSA (equation (4)). First, we look at the effects of the experimental factors on the vignette evaluations. The indicators for a match of the internship with the content of the respondents’ tertiary studies and a hiring option subsequent to the internship, as well as all three payment related indicators, increase the ratings, are highly significant, and reach z-values > 10. The indicator for a nonprofit foundation as an employer also has a significant positive effect. Regarding the influences of between-respondent differences on the vignette ratings, I find significantly higher ratings with z-values > 2 for students with a more altruistic motivation. In addition, I obtain comparable results given vignettes with additional internship payment for students with more external motivation and given vignettes indicating a match between the studies and the internship for students with more internal motivation. Furthermore, the ratings are significantly lower for respondents with more need for cognition. In addition, the highly significant respondent-level intercept points to relevant unobserved heterogeneity between respondents with regard to the vignette ratings. The residual variance of 135,306 implies an explained variance of 64 percent for the SFSA, indicating an overall good fit of the model with this FSE data.
Analyses of Tertiary Students’ Internship Preferences.
Source: Own calculations.
Note: SFSA = standard factorial survey analysis; SSFSA = scaling sensitive factorial survey analysis. a Control variables: instrument, task order, deck, vignette order, and text order indicators; b Reference category: no payment, no hiring option, no match with studies, employer is a private large enterprise, no first-generation migrant, and all other indicators are fixed at their mean level rescaled to zero.
*P(Z > |z|) < .10.
**P(Z > |z|) < .05.
***P(Z > |z|) < .01.
The first column of Table 3 displays the results of an OFSA (equation (5)) based on the same data. This model implies no interval scaling assumption with respect to the vignette ratings. The estimated effects of the experimental treatments are quite similar to the SFSA and the related z-values are consistently slightly lower. Regarding the influences of between respondent differences, I do not find this general similarity. In contrast to the SFSA, the effect of students’ external motivation on the vignette evaluations is significantly negative and the positive effect of students’ altruistic motivation is only weakly significant. The other effects of between respondent differences mentioned above are robust across both specifications. Overall, the comparison between SFSA and OFSA demonstrates that the same conclusions regarding the influences of the experimental treatments can be drawn without the interval scaling assumption, but uncertainty exists regarding the main effects of students’ motivations on the vignette ratings.
Analyses of Tertiary Students’ Internship Preferences.
Source: Own calculations.
Note: OFSA = ordinal factorial survey analysis; SFSA = standard factorial survey analysis; SSFSA = scaling sensitive factorial survey analysis. a Control variables: instrument, task order, deck, vignette order, and text order indicators; b Reference category: no payment, no hiring option, no match with studies, employer is a private large enterprise, no first-generation migrant, and all other indicators are fixed at their mean level rescaled to zero.
*P(Z > |z|) < .10.
**P(Z > |z|) < .05.
***P(Z > |z|) < .01.
Next, the second columns of Tables 2 and 3 show the results of SFSAs restricted to ri ≥ 0 and
Now, I come to the SSFSA. Columns 3–5 of Table 2 show the results of an unconstrained SSFSA based on ri ≥ 0. Column 3 displays the results of the linear regression based on equation (7), column 4 those of the ordered logit regression based on equation (8), and column 5 those of tests of parameter differences between the two model parts based on equation (9). With two exceptions, the coefficients are quite similar across both model parts of this SSFSA. The exceptions are the influence of students’ altruistic motivation on the vignette evaluations and its interaction with vignettes including a high additional internship payment, which are both only significant in the linear regression part. The first effect is also found in the restricted SFSA based on ri ≥ 0 in column 2 of Table 2, while the second is not. For the latter, the nonsignificant result in the ordered logit model part of the SSFSA is consistent with the OFSA in column 1 of Table 3. The overall stability across both parts of this SSFSA is confirmed by the tests of parameter differences. They do not indicate significant discrepancies between the coefficients. The scaling parameter λ mapping the parameters of the linear regression part on the ordinal logit model part is 197, that is, a change of 197 units on the interval rating scale of equation (7) equals a change of one unit on the logit scale estimated by equation (8). Columns 3–5 of Table 3 display the results of the related unconstrained SSFSA based on
Given the similarity of parameters across model parts for both unconstrained SSFSA specifications, I fitted SSFSAs with equality constraints on all common parameters using a plausible value estimator. The results of these constrained SSFSAs are shown in the sixth columns of Tables 2 and 3. Related calculations of effect size changes in comparison to the SFSA (equation (4)) and the respective restricted SFSA specifications are displayed in the seventh columns. Regarding the effects of the experimental treatments on the vignette ratings, the conclusions drawn are comparable to the SFSA and restricted SFSAs for both constrained specifications. But the effect sizes of the significant coefficients in the constrained SSFSAs are consistently larger. For the constrained SSFSA based on r_i ≥ 0, the z-values increase on average by 18 percent compared to the SFSA and by 23 percent compared to the restricted SFSA. For the constrained SSFSA based on
Finally, Figure 3 displays the estimated threshold parameters α_c in the ordinal logit part of the constrained SSFSA specifications in comparison to the respective values of the interval vignette rating scale. The results are quite similar for both constrained SSFSA specifications. The threshold parameters α_c are substantially larger than implied by the interval scale in the lower part of the scale’s range and a bit smaller than assumed in the upper part. Thus, the noninterval scaled vignette evaluations are less dispersed than the interval scaled vignette ratings. In particular, the strongly negative noninterval scaled evaluations are less negative than implied by the interval scale. Hence, in this application, violations of the interval scaling assumption in an SFSA mainly result in a misfit with the implied scale in the lower part of the range.

Threshold parameters α_c for constrained scaling sensitive factorial survey analyses based on r_i and
Discussion and Conclusion
In this article, I introduced an FSE design that enables testing the interval scaling assumption for vignette ratings, the PDR design, and a new model for FSE data, SSFSA. The PDR design conjointly measures single vignette ratings and evaluations of vignette pairings constructed using the same set of vignettes. The strength of the positive correlation between the single and the paired vignette ratings indicates the fit of the responses with the interval scaling assumption. SSFSA uses these correlations as indicators for interval scaled response behavior. This new method has two advantages: First, it does not generally assume interval scaled vignette ratings and, second, it utilizes the respondent-level information on the match of the ratings with the interval scaling assumption to gain efficiency. I tested the PDR design and the SSFSA using an FSE on the influence of tertiary students’ motivations on their internship preferences. Regarding the interval scaling test, the results clearly demonstrate that the interval scaling assumption for the vignette ratings is violated in this application. Previous findings and theories about survey response behavior suggest that this outcome has to be expected for other FSEs, too. With respect to the SSFSA based on the PDR design, the results show that this method addresses this problem and, thus, facilitates more robust analyses of FSEs. The SSFSA yields on average 14–18 percent larger treatment effects than standard methods, depending on the specification used. For the influences of between respondent differences on the vignette ratings, the gain in effect sizes of on average 37 percent is even larger. Moreover, based on the SSFSA, I compare the interval rating scale with the threshold parameters between the response categories estimated for the noninterval scaled vignette evaluations. Here, the results indicate that these evaluations mainly deviate from the interval rating scale in the lower part of the scale range.
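The logic of the respondent-level interval scaling indicator can be illustrated with a minimal sketch. It assumes, for illustration only, that the indicator is a Pearson correlation between the rating differences implied by a respondent's single vignette ratings and the ratings this respondent gave to the corresponding vignette pairs; the exact construction of the indicators is defined in the method section of the article and may differ. The function names, the data layout, and the example values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long lists (NaN if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def interval_fit(single, paired):
    """Correlate implied and observed pair evaluations for one respondent.

    single: dict mapping vignette id -> single vignette rating
    paired: dict mapping (id_a, id_b) -> rating of vignette a relative to b
    (Assumed layout; not taken from the article.)
    """
    implied = [single[a] - single[b] for (a, b) in paired]
    observed = list(paired.values())
    return pearson(implied, observed)


# Hypothetical respondent whose pair ratings equal the rating differences.
single_ratings = {"v1": 2, "v2": 5, "v3": 9}
consistent_pairs = {("v2", "v1"): 3, ("v3", "v1"): 7, ("v3", "v2"): 4}
# Hypothetical respondent whose pair ratings only loosely track the differences.
loose_pairs = {("v2", "v1"): 1, ("v3", "v1"): 2, ("v3", "v2"): 3}

r_consistent = interval_fit(single_ratings, consistent_pairs)
r_loose = interval_fit(single_ratings, loose_pairs)
```

Under these assumptions, a respondent whose pair evaluations reproduce the distances between the single ratings obtains a correlation close to one, whereas weaker agreement yields a lower value. A restriction such as r_i ≥ 0 would then keep only respondents with a nonnegative indicator.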
In summary, the study shows that the efficiency gains achieved by SSFSA and differences in conclusions drawn based on SSFSA are larger for the effects of between respondent differences than for the influences of treatments. While replications have to demonstrate that the stability of treatment effects is not due to the comparatively simple FSE design used or the respondent population focused on in this study (see the following paragraphs), the results so far indicate that one has to be less concerned about violations of the interval scaling assumption in FSEs focusing solely on the treatment effects. In these applications, the additional data collection effort required by a PDR design to implement an SSFSA might not be justified. But Wallander’s (2009) review shows that 82 percent of the FSEs also analyzed influences of between respondent differences, and additionally, 30 percent estimated interaction effects of treatments with between respondent differences. In this majority of cases, a diligent assessment of the interval scaling assumption based on a PDR design could be insightful. If such a design is considered too costly (e.g., due to the burden for respondents incurred by evaluating the paired vignettes), it can alternatively be implemented in a pretest or a subsample. Such an approach confines PDR designs to applications where they are necessary and at the same time advances the understanding of interval scaled response behavior. Specifically, future research could address the following limitations of this study.
The PDR design implemented in this study is restricted to an interval scaling test based on category rating instruments. This focus is based on common practice since magnitude estimation instruments are used less often in FSEs (Auspurg and Hinz 2015). Widening this focus and applying the methods developed here to an FSE design using magnitude estimation instruments, especially in comparison to category rating instruments, would be a useful extension of this research. Not only the interval scaling test of Orth and Wegener (1983) but also the model for SSFSA developed in this article is readily transferable to FSEs using magnitude estimation instruments. Research in this direction could help to solve unsettled issues with respect to survey response behavior, at least for FSEs (Birnbaum 1980; Cools, Hofmans, and Theuns 2006; Hardin and Birnbaum 1990; Jasso and Wegener 1997; Liebig, Sauer, and Friedhoff 2015; Schaeffer and Bradburn 1989): To what degree are difference, ratio, and noninterval scaled rating behavior characteristics of respondents rather than consequences of study design features? And hence, to what extent can certain response modes be stimulated by instruction? Moreover, the development of a PDR design overcoming the practical hurdles to testing the quadruple condition assumption in FSEs would be a valuable extension of the current study.
Furthermore, the application in this study uses a comparatively simple FSE, a full factorial with a vignette universe consisting of 64 vignettes. The full factorial setup excludes vignette sampling variability from the experimental design and thus enables this study to focus the analyses on testing the interval scaling assumption and evaluating the SSFSA. Also, the respondents used for this research were recruited using a sampling pool of tertiary students for online studies. Given this combination of a simple design with a comparatively trained and educated sample of respondents, this application probably indicates fewer violations of the interval scaling assumption than typical FSEs. In a next step, it would be useful to test this expectation. To do so, it would be necessary to implement a PDR design for an FSE with a vignette universe of the size used in typical applications, consisting of several thousand vignettes, which necessitates the sampling of vignettes. Here, deliberate methods which facilitate the efficient construction of a vignette sample according to experimental design criteria, like D-efficient sampling, should be implemented (Dülmer 2007; Kuhfeld, Tobias, and Garratt 1994).
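To make the idea of D-efficient vignette sampling concrete, the following minimal sketch computes the D-efficiency of a vignette sample as 100 · |X'X|^(1/p) / N for an effects-coded main effects design matrix X with N rows and p columns. It assumes a purely hypothetical universe of six two-level dimensions (2^6 = 64 vignettes), which is not the dimension structure of the FSE analyzed here, and it replaces a proper optimization algorithm with a naive search over random candidate samples. All names are illustrative.

```python
import random
from itertools import product


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # treated as singular
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def d_efficiency(design):
    """D-efficiency (0 to 100) of a main effects design coded -1/+1."""
    x = [[1.0] + [float(v) for v in row] for row in design]  # add intercept
    n, p = len(x), len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(p)] for i in range(p)]
    det = determinant(xtx)
    if det <= 0:
        return 0.0
    return 100.0 * det ** (1.0 / p) / n


random.seed(1)
# Hypothetical universe: six two-level dimensions, effects coded.
universe = list(product((-1, 1), repeat=6))
full_eff = d_efficiency(universe)  # the full factorial is orthogonal

# Naive search: keep the best of 200 random samples of 16 vignettes.
best_sample, best_eff = None, -1.0
for _ in range(200):
    candidate = random.sample(universe, 16)
    eff = d_efficiency(candidate)
    if eff > best_eff:
        best_sample, best_eff = candidate, eff
```

In this coding, the full factorial reaches the maximum efficiency of 100, and sampled subsets can be compared by how close they come to that value. Dedicated software for D-efficient designs uses exchange algorithms and also handles multilevel dimensions and interaction terms, which this sketch does not.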
In addition, it would be informative to choose an application for this extension which could be applied to a general population sample. Using such a sample, it would be especially interesting to assess which kinds of between respondent differences lead to different effect sizes and conclusions in an SSFSA compared to other methods of analysis. Are these only the mainly psychological factors focused on in this study, which partly also determine interval scaled response behavior? Or are dimensions of social stratification like age, education, and gender or cultural context involved? Methodological research on rating scales not explicitly addressing the scaling level of responses suggests that there are group differences in response styles by cultural contexts (e.g., Chen, Lee, and Stevenson 1995). For the many FSEs that focus on differences between groups of respondents, knowledge about how the scaling level of responses affects the analysis of such group differences would be pertinent for assessing the relevance of complying with the interval scaling assumption in their applications.
Relatedly, in this study, the different SSFSA specifications indicate no or only a few weakly significant differences between the coefficients of factors affecting vignette ratings. But research on combined analyses of different discrete choice experiments suggests that this is not common (Louviere et al. 2000). Without replications, this result could also be a consequence of the specific FSE or sample of respondents used (see previous paragraph). A more general expectation regarding the results of an SSFSA is to find more substantially relevant determinants of vignette evaluations in an interval scaled compared to a noninterval scaled response mode, since the violations of the interval scaling assumption in the latter hint at a cognitive overload regarding the vignette rating task. Therefore, I call this expectation the overload hypothesis. Extensions of the current research like those described in the last paragraph could also test this hypothesis. Since constraints can be applied separately to each coefficient, the SSFSA proposed in this article is suitable for handling applications in which some parameters differ between an interval and a noninterval scaled response mode.
Moreover, the method of SSFSA can in principle be implemented using any indicator of the responses’ interval scaling level. For the interval scaling test used in this study, two respondent-level indicators based on correlation coefficients are plausible: first, r_i, which indicates the fit of the distances between the measurements in the two sets of ratings used for the test, and second,
Finally, the GSEM-based scaling sensitive analysis methods introduced in this study help to deal with noninterval scaled responses not only in FSEs but also in other applications using interval response scales. The only precondition is that information on the interval scaling level of responses is available. Therefore, I hope that this research draws the attention of social scientists to the potential gains of accounting more explicitly for the scaling level of survey and test responses in their analyses.
Supplemental Material
Supplemental Material, Online_Supplement_SMR799382 for Scaling Sensitive Factorial Survey Analysis by Volker Lang in Sociological Methods & Research
Footnotes
Acknowledgments
The author would like to thank Martin Groß, Stefan Liebig, Meike Han, Knut Petzold, and Marc Schwenzer, as well as the participants and organizers of the “Colloquium on Empirical Social Research” at the University of Konstanz in February 2016 and the participants and organizers of the “Colloquium on Macro-sociology” at the University of Tübingen in January 2016 for their helpful comments on draft versions of this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Notes
References
