Abstract
Effects of rating scale forms on cross-sectional reliability and measurement equivalence were investigated. A randomized experimental design was implemented, varying category labels and number of categories. The participants were 800 students at two German universities. In contrast to previous research, a reliability assessment method was used that relies on the congeneric measurement model. The experimental manipulation had differential effects on the reliability scores and measurement equivalence. Attitude strength seems to be a relevant moderator variable that influences measurement equivalence. Overall, the results show that measurement quality is influenced by rating scale forms. Results are discussed in terms of their implications for the measurement of latent variables.
Introduction
Multi-item instruments have often been used in surveys to address concepts of interest, assuming that the items (observed variables) provide an appropriate representation of the latent construct. Data collection with such an instrument is referred to as the measurement of latent (i.e., not directly observed) variables by means of a number of observed reactions to the items. Such latent variable measurement is based on a measurement theory, for instance, the Classical Test Theory (Lord and Novick 1968), which allows for an evaluation of measurement quality in terms of reliability. Reliability is defined as the ratio of true variance to observed variance (Lord and Novick 1968). In this article, we address cross-sectional reliability, which relies on measurements that use more than two items and are carried out at one point in time (henceforth referred to as reliability).
An important prerequisite for measuring a latent variable and evaluating reliability is one-dimensionality, meaning that observed variables measure only one latent variable (Graham 2006; Raykov and Marcoulides 2011). The estimation of the true score of the latent variable and reliability evaluation also rely on assumptions, which are referred to as measurement model assumptions (Raykov and Marcoulides 2011). Three kinds of measurement models have been established, namely, the parallel model (PM), the tau equivalent model (TSEM), and the congeneric measurement model (CMM). A key assumption of the PM is that the items measure the same true score with the same level of measurement precision, which is rarely the case (e.g., Raykov and Marcoulides 2011). Therefore, we do not consider the PM in this article. The TSEM assumes a uniform linear relationship between the observed variables and the latent variable. The least restrictive measurement model is the CMM, in which the items’ linear relationships with the latent variable are allowed to differ. Testing the assumptions of measurement models is crucial when evaluating reliability. For example, Cronbach’s α (Cronbach 1951), which is a commonly used reliability measure, is based on the TSEM (Lord and Novick 1968). Because the assumptions of the TSEM have often been violated in social science measurements, reliability coefficients based on the CMM have been recommended as an alternative (Schweizer 2011).
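As a minimal illustration of the coefficient mentioned above, the following Python sketch computes Cronbach’s α from raw item scores with the standard formula α = k/(k − 1) · (1 − sum of item variances / variance of the sum score). The function name and the toy data are hypothetical and are not taken from the study; the coefficient is an accurate reliability estimate only when the TSEM assumptions hold.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of items.

    `items` holds one list of scores per item; all lists have the same
    length (one entry per respondent).
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    n = len(items[0])
    # Sum score of each respondent across all items.
    totals = [sum(item[i] for item in items) for i in range(n)]
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical toy data: two items answered by four respondents.
toy_items = [[1, 2, 3, 4], [2, 1, 4, 3]]
alpha = cronbach_alpha(toy_items)  # approximately 0.75 for these data
```

When the item loadings differ (the CMM case), α computed this way tends to underestimate reliability, which is why a CMM-based coefficient is preferable in that situation.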
Reliability can be affected by the choice of rating scales used with the items (e.g., Krosnick and Fabrigar 1997). A rating scale presents a measurement continuum that extends from one extreme to the other (e.g., strongly agree–strongly disagree). Rating scales can affect reliability because of differences in the interpretation of categories (Andrews 1984) or difficulties in administration, which could lead to satisficing behavior, such as superficial information processing (Krosnick and Alwin 1987). The number of categories and the labeling of a rating scale are basic cues that respondents use to understand the measurement continuum (Parducci 1983).
With respect to labeling, it is possible to differentiate between rating scales with verbal labels for each category (ALL form) and those with verbal labels for the end categories only (END form). Previous research has reported mixed results regarding the effects of verbal labeling on reliability: Whereas some studies found that the ALL form increases reliability (Alwin and Krosnick 1991; Menold et al. 2014), others did not (e.g., see a meta-analysis by Churchill and Peter 1984). The END form has also often been used with numeric labels (ENDN form). Numeric labels can increase satisficing because numbers are rarely used for self-description in everyday communication (Krosnick and Fabrigar 1997).
With respect to the number of categories, rating scales with five to seven categories often maximize reliability in the case of attitude measurement (e.g., Maitland 2009). Newer simulation studies found an increase in reliability when scales included five categories, compared to scales with three or four categories, whereas reliability coefficients remained stable for scales with five and seven categories (Lozano, García-Cueto, and Muñiz 2008; Parker, Vannest, and Davis 2013). However, Leung (2011) did not find differences in Cronbach’s α for scales with 4–11 categories. Weng (2004) analyzed the effects of the number of categories (varying between three and nine categories) and verbal labeling and did not find that reliability was affected. However, few studies address both numbers of categories and labeling, and the choice between five and seven categories is still under discussion in the literature (Toepoel and Dillman 2011).
Furthermore, there is a need for research on how rating scale forms affect reliability, because Cronbach’s α has been used in previous studies, although the assumptions of TSEM, upon which Cronbach’s α is based, have not been addressed (Leung 2011; Preston and Colman 2000; Weng 2004). The conclusions of these studies may, therefore, have been erroneous. Our first research question addresses the effect of rating scales on cross-sectional reliability: How do five versus seven categories, using ALL or END forms in both cases, affect reliability? In contrast to previous research, we choose a reliability evaluation method based on the results of testing measurement model assumptions.
We also focus on measurement equivalence between different rating scales. When measurements that use different rating scales have to be compared, for example, between different waves of a survey, measurement equivalence is an important concern. Measurement equivalence can be assessed with the multigroup confirmatory factor analysis (MGCFA; e.g., Byrne 2011; Davidov, Schmidt, and Schwartz 2008). One can conclude that the same latent dimension is measured only if the factor loadings (metric invariance) and intercepts (scalar invariance) are equivalent (e.g., Byrne 2011; Davidov et al. 2008). Nonequivalent factor loadings between different rating scales may reflect variations in the understanding of an item’s meaning (e.g., Bollen 1989; Steinmetz 2013). Differences in items’ intercepts can be interpreted as different or biased reactions to the items, which are independent of the item content (Steinmetz 2013). Only a few studies address the measurement equivalence of different rating scales. Krebs and Hoffmeyer-Zlotnik (2010) found that rating scale orientation affected factor loadings, whereas item intercepts were not affected. However, we do not know whether varying the number of categories and the category labeling affects measurement equivalence. Accordingly, with our second research question, we investigate whether the measurements of one construct obtained with the same items but with different rating scales are equivalent with respect to (1) factor loadings and (2) items’ intercepts.
Little is known about the situational context of measurements. In particular, attitudes can differ in strength, a characteristic that may be a relevant factor when comparing the effects of different rating scales. While strong attitudes are highly pronounced and have significant impacts on cognition and behavior, weak attitudes do not (Bassili and Krosnick 2000). In addition, weak attitudes are more difficult to report, because respondents may be less motivated to report them (since these would be associated with a minor interest in the topic). Thus, the effect of rating scales may be more highly pronounced in the case of weak attitudes (Krosnick and Schuman 1988). Therefore, our third research question is: How does attitude strength (AS) affect measurement results, when different rating scales are used? Specifically, can the consideration of AS as an explanatory variable improve measurement equivalence between different rating scales?
Method
Participants
The data (N = 800) were collected at two German universities in 2011 and 2012. The study was a paper-and-pencil self-administered survey. Approximately 46 percent of respondents were male, and the respondents’ average age was 23 years (SD = 4.39 years). Thirty-two percent of the participants studied social sciences, 27 percent studied economics, 8 percent studied psychology, and 6 percent studied medicine; the remaining students studied other disciplines. There were no significant differences between the experimental groups with respect to gender (χ2(4, N = 800) = .95; p > .10), age (F(4, 783) = 1.49; p > .10), or discipline of study (four experimental groups: χ2(45, N = 690) = 56.3; p > .10; five experimental groups: χ2(16, N = 500) = 19.79; p > .10) (see Experimental Design).
Concepts
Two concepts were analyzed. The first concept—opinions about the European Union (EU)—represents an attitude toward an issue, whereas the second concept—studying effort—refers to attitudes toward oneself. Two types of concepts were used to examine whether the results could be generalized across a variety of concepts.
To measure opinions about the EU, six items (Table 1) were used that were part of a battery from the German Longitudinal Election Study (GLES; Rattinger et al. 2011) and that had been shown, via Principal Component Analysis (PCA), to cover one dimension (Menold et al. 2014). These items address negative aspects of the EU’s impact on German life and the economy and the preference for retaining state sovereignty within the EU. In the GLES, the items were used with a five-category agree–disagree ENDN rating scale.
Table 1. Items Used in the Study.
Note: EU = European Union. Source for the EU items: Rattinger et al. (2011), Menold et al. (2014). Source for the studying effort items: Wild and Schiefele (1994). Authors’ translation.
Studying effort was addressed by using eight items (Table 1) that also originated from a battery (students’ learning strategies; Wild and Schiefele 1994). Originally, a fully verbalized five-category frequency rating scale with numeric labels was used; according to the authors’ (Wild and Schiefele 1994) PCA, the eight items represent one dimension (Cronbach’s α = .74).
Two items (cf. Bassili and Krosnick 2000) were used to measure AS: (1) How important is the topic (studying effort vs. European politics) for a respondent (importance)? (2) How certain was a respondent when answering the items related to the topic (certainty)? These questions were placed after the items for each concept. Both items were rated on a five-category ALL rating scale, ranging from not at all important/not sure at all to very important/very sure. The two items were aggregated into a multiplicative index variable because low values of importance are assumed not to be compensated for by high values of certainty, and vice versa (cf., e.g., Trautwein et al. 2012).
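The difference between a multiplicative and an additive index can be sketched as follows. The numeric coding of the two ratings as 1 to 5 and the function name are assumptions made for illustration; the article does not report the coding.

```python
def attitude_strength_index(importance, certainty):
    """Multiplicative index of two ratings coded 1-5 (assumed coding)."""
    for rating in (importance, certainty):
        if not 1 <= rating <= 5:
            raise ValueError("ratings are expected on a 1-5 scale")
    return importance * certainty


# Two hypothetical respondents with the same additive score (1 + 5 = 3 + 3 = 6):
unbalanced = attitude_strength_index(1, 5)  # low importance, high certainty
balanced = attitude_strength_index(3, 3)    # moderate on both ratings
```

With the product, the unbalanced respondent receives a lower index value (5) than the balanced one (9), so a low value on one rating is not offset by a high value on the other.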
Experimental Design
A randomized, experimental, between-subjects (split-ballot) design was used. This design is regarded as more accurate with respect to causal inferences about the treatment manipulation than a within-design approach, such as a multitrait-multimethod design (Krosnick 2011). A systematic method with random start was applied for the randomization.
We used four experimental groups: a five-category fully verbalized rating scale (5ALL); a five-category end-points-only verbalized (5END) rating scale; and the same forms with seven categories (7ALL and 7END; Figure 1). We used verbal labels developed by Rohrmann (1978) and asked respondents to evaluate the degree to which an item applied. In addition, a fifth experimental group with the five-category ENDN (5ENDN) form from the GLES was used with some of the data (Figure 1). This allowed for a comparison of the previously described forms with this 5ENDN form, which has been used with the EU items in a large population survey. For the 5ENDN form, we retained the agreement dimension from GLES (“do not agree at all” to “fully agree”). However, because such “agree” scales are associated with acquiescence (e.g., Billiet and McClendon 2000), which we strove to avoid, we did not use the agreement dimension for the other four experimental groups.

Figure 1. Rating scales in experimental groups. Note: Authors’ translation of German category labels.
Data Analysis
Data were first z-standardized for each treatment group separately, using the SPSS 20 software, to obtain comparability between the groups with differing numbers of rating scale categories (Krosnick 2011). Thus, the resulting variable values were z-distributed, ranging from −3 to +3, with a mean of 0 and a standard deviation of 1, regardless of the number of categories. To retain better comparability with the original variables, integers were used. To avoid effects of outliers, z-values with n ≤ 8 were summarized with the neighboring values (e.g., 3 with 2 or −3 with −2) in each experimental condition.
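The preprocessing described above can be sketched in Python as follows. The study itself used SPSS 20; the function names, the use of the sample standard deviation, the rounding rule (Python's built-in `round`, which rounds exact halves to the nearest even integer), and the restriction of the merging step to the outermost categories are assumptions made for this illustration.

```python
from collections import Counter
from statistics import mean, stdev


def z_standardize_to_integers(scores):
    """Z-standardize the scores of one treatment group and round to integers."""
    m, s = mean(scores), stdev(scores)  # sample SD (n - 1), an assumption
    return [round((x - m) / s) for x in scores]


def merge_sparse_extremes(values, max_sparse_count=8):
    """Fold sparsely populated extreme categories into their inner neighbor.

    An outermost category with at most `max_sparse_count` cases is merged
    with the neighboring value closer to zero (e.g., 3 with 2, -3 with -2).
    """
    merged = list(values)
    while True:
        counts = Counter(merged)
        lowest, highest = min(counts), max(counts)
        if lowest < 0 and counts[lowest] <= max_sparse_count:
            merged = [lowest + 1 if v == lowest else v for v in merged]
        elif highest > 0 and counts[highest] <= max_sparse_count:
            merged = [highest - 1 if v == highest else v for v in merged]
        else:
            return merged


# Hypothetical group with sparse extreme categories (-3 and 3).
group = [-3] * 2 + [-2] * 10 + [0] * 20 + [2] * 10 + [3] * 3
cleaned = merge_sparse_extremes(group)
```

Both steps would be applied within each experimental condition separately, so that groups with five and seven categories end up on a common integer metric.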
Tests of measurement model assumptions and reliability scores were analyzed for each experimental group, using confirmatory factor analyses (CFAs) as described by Raykov and Marcoulides (2011). While the assumptions of the CMM can be confirmed if the items have at least significant loadings (>.3) on the corresponding factor, the TSEM additionally requires an equality of factor loadings (e.g., Graham 2006; Raykov and Marcoulides 2011). Depending on the results of the measurement model test, a specific reliability method was then chosen. Whereas Cronbach’s α requires that assumptions of the TSEM must be met, McDonald’s Ω (McDonald 1999) relies on the CMM (e.g., Raykov and Marcoulides 2011).
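To make the CMM-based coefficient concrete, the following Python sketch computes McDonald’s Ω from the parameter estimates of a one-factor CFA as (sum of loadings)² / [(sum of loadings)² + sum of error variances + 2 · sum of error covariances]. It assumes that the factor variance is fixed to 1; the function name and the example values are hypothetical and do not reproduce any estimate from the study.

```python
def mcdonald_omega(loadings, error_variances, error_covariances=()):
    """McDonald's omega for a one-factor model with factor variance 1.

    Error covariances, if any, are counted as part of the error variance
    of the sum score (each covariance enters twice).
    """
    if len(loadings) != len(error_variances):
        raise ValueError("one error variance per loading is required")
    common = sum(loadings) ** 2
    error = sum(error_variances) + 2 * sum(error_covariances)
    return common / (common + error)


# Hypothetical standardized solution: three items with unequal loadings.
loadings = [0.8, 0.6, 0.5]
error_variances = [1 - l ** 2 for l in loadings]
omega = mcdonald_omega(loadings, error_variances)
omega_with_ec = mcdonald_omega(loadings, error_variances, error_covariances=[0.1])
```

A positive error covariance enlarges the error part of the denominator and therefore lowers Ω relative to the same model without it.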
Subsequently, various hierarchical MGCFA models were compared to address measurement equivalence among different rating scales. The first (baseline) model assumes the same one-dimensional structure of measurements in all groups (configural invariance), the second restricts the factor loadings to being equal (metric invariance), and the third model restricts the items’ intercepts to being equal across groups (scalar invariance; e.g., Byrne 2011). A second set of MGCFA models was tested to address the impact of AS on measurement equivalence by including AS as an observed covariate variable in the models. Linear regression paths of AS on observed variables were thereby included. Nonnested models with and without AS were compared by using the Bayesian Information Criterion (BIC) (Raftery 1995); lower BIC values represent models with a better fit to the data, and differences higher than ΔBIC = 10 were considered to be statistically significant (Raftery 1995).
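The BIC decision rule described above can be written as a small helper. The function name, the labels it returns, and the example values are hypothetical; only the rule itself (lower BIC is better, differences above 10 count as significant) is taken from the text.

```python
def compare_bic(bic_without_as, bic_with_as, threshold=10.0):
    """Compare two nonnested models by BIC (lower values indicate better fit)."""
    delta = bic_without_as - bic_with_as
    if delta > threshold:
        return "model with AS preferred"
    if delta < -threshold:
        return "model without AS preferred"
    return "no clear preference"


# Hypothetical BIC values, not taken from Table 4.
decision = compare_bic(bic_without_as=12450.0, bic_with_as=12390.0)
```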
Because normality was violated in all items and for all experimental groups (demonstrated by Kolmogorov–Smirnov statistics, with p < .001), the maximum likelihood parameter estimator that is robust to violations of normality (MLR; Muthén and Muthén 2010) was used in CFAs and MGCFAs. MLR has also been recommended for rating scales with more than four categories (Muthén and Muthén 2010; Raykov and Marcoulides 2011). Additional analyses for categorical data (weighted least squares [WLSMV] estimator; mixture factor analysis; Muthén and Muthén 2010) yielded results that were comparable with those obtained with the MLR. The model fit of CFAs and MGCFAs was evaluated using the chi-square (χ2) test, the root mean square error of approximation (RMSEA), and the comparative fit index (CFI; Beauducel and Wittmann 2005). The CFI should be .95 or higher (Hu and Bentler 1999), while an RMSEA of .08 or less indicates an acceptable fit (Raykov 1998). With respect to the comparison of nested MGCFA models, a significant change in χ2 or changes of |ΔCFI| ≥ .01 and |ΔRMSEA| ≥ .015 indicate reasonable differences (Byrne 2011; Chen 2007).
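The cutoff criteria can be expressed as two simple checks. This is a sketch of one literal reading of the criteria (a significant χ2 change, or both the CFI and the RMSEA change exceeding their thresholds); the function and parameter names are hypothetical.

```python
def acceptable_fit(cfi, rmsea):
    """Single-model check: CFI of .95 or higher and RMSEA of .08 or less."""
    return cfi >= 0.95 and rmsea <= 0.08


def nested_models_differ(cfi_base, cfi_restricted, rmsea_base,
                         rmsea_restricted, chi2_change_significant):
    """Nested-model check: significant chi-square change, or both
    |delta CFI| >= .01 and |delta RMSEA| >= .015."""
    cfi_changed = abs(cfi_base - cfi_restricted) >= 0.01
    rmsea_changed = abs(rmsea_base - rmsea_restricted) >= 0.015
    return chi2_change_significant or (cfi_changed and rmsea_changed)


# A model with CFI = .88 and RMSEA = .07 fails the CFI criterion.
fit_ok = acceptable_fit(cfi=0.88, rmsea=0.07)

# Hypothetical comparison of a baseline and a more restricted model.
differ = nested_models_differ(0.96, 0.93, 0.05, 0.07,
                              chi2_change_significant=False)
```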
Results Related to the EU Opinion Concept
Tests of Measurement Models
The goodness-of-fit statistics of the CFA for the six EU items, in which no equality restrictions for single items within a condition were modeled, are presented in Table 2 (initial models). Acceptable RMSEA and CFI values were observed in the 5END and 5ENDN groups, but the factor loadings of two items (EU3 and EU5) were not significant. In the other treatment groups, no goodness-of-fit statistics reached the cutoff values, and nonsignificant standardized factor loadings or loadings much lower than λ = .30 were obtained. A relatively tenable result was found for the 7ALL group, in which only one item (EU3) failed to have a significant loading, and goodness-of-fit statistics were near the cutoff criteria. In total, the assumptions of the CMM—all EU items have at least a linear relationship with the latent variable—were not supported by the data.
Table 2. Measurement Model Test (CFA) Results for the EU Items.
Note: CFA = confirmatory factor analysis; CI = confidence interval; EU = European Union; RMSEA = root mean square error of approximation; df = degrees of freedom. Experimental groups: 5ALL (five categories, fully verbalized), 5END (five categories, only end categories verbalized), 7ALL (seven categories, fully verbalized), 7END (seven categories, only end categories verbalized), and 5ENDN (five categories, end categories verbalized, numeric labels for each category). Differences in n (number of cases) because of data missing.
*p < .05.
**p < .01.
***p < .001.
To reach the CMM level, three items had to be deleted in the 5ALL, 5END, 7END, and 5ENDN groups. The three remaining items (EU1, EU4, and EU6) in these groups describe negative effects of the EU on German life and the economy; thus, the concept to be measured is much narrower. Use of only three items leaves df equal to 0, so that the CFA model fit cannot be evaluated. Nevertheless, in the 5ALL, 5END, 7END, and 5ENDN conditions, the items had sufficient factor loadings and no correlated error terms (Table 2, re-specified models). In the 7ALL condition, one item had to be deleted (EU3, on a state quitting the EU). The CMM can be assumed for five items in the 7ALL group, as shown by the tenable goodness-of-fit of the re-specified model.
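The statement that three items leave df equal to 0 follows from counting observed moments and free parameters in a one-factor model. The sketch below assumes the factor variance is fixed to 1 for identification, so that each item contributes one loading and one error variance; the function name is hypothetical.

```python
def one_factor_cfa_df(n_items, n_error_covariances=0):
    """Degrees of freedom of a one-factor CFA on the covariance structure."""
    # Distinct variances and covariances among the observed items.
    moments = n_items * (n_items + 1) // 2
    # One loading and one error variance per item, plus any error covariances.
    free_parameters = 2 * n_items + n_error_covariances
    return moments - free_parameters


df_three_items = one_factor_cfa_df(3)  # 6 moments - 6 parameters = 0
df_six_items = one_factor_cfa_df(6)    # 21 moments - 12 parameters = 9
```

With six items, each group contributes 9 degrees of freedom, so an unrestricted five-group model has 5 × 9 = 45, which is consistent with the χ2(45) reported for the configural model of the six EU items.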
The TSEM assumptions (factor loadings are equal within one rating scale group) could not be confirmed, either for the initial or for the re-specified models, with the exception of the re-specified model in the 7END group. The detailed results of this test are available from the corresponding author on request.
Reliability
We calculated McDonald’s Ω because, overall, the assumptions of the TSEM were not met and thus Cronbach’s α was inappropriate. Table 3 shows the reliability coefficients for both the initial six items and the re-specified models. Notably, the inclusion of some error covariances improved the goodness-of-fit of the initial models in the 5ALL, 7ALL, and 7END groups (see note in Table 3); thus, the error covariances were considered as a part of the error variance when obtaining reliability (cf. Raykov and Marcoulides 2011). With respect to the initial models, an acceptable reliability coefficient could be found only in the 7ALL group. In all other groups, reliability coefficients were unacceptably low (i.e., with a reliability of ∼.50; the measurement results were inflated by a nonsystematic error of up to 50 percent). The deletion of items for the re-specified models, in which the CMM measurement level could be assumed, led to a remarkable increase in reliability in the 5ALL and 5ENDN conditions. As a result, for the re-specified models, acceptable reliability coefficients with highly overlapping confidence intervals were found for the 5ALL, 7ALL, and 5ENDN groups.
Table 3. Reliability (r; McDonald’s Ω) with Standard Errors (SE) in Rating Scale Groups.
Note: CI = confidence interval; EU = European Union. Experimental groups: 5ALL (five categories, fully verbalized), 5END (five categories, only end categories verbalized), 7ALL (seven categories fully verbalized), 7END (seven categories, only end categories verbalized), and 5ENDN (five categories, end categories verbalized, numeric labels for each category). Initial model EU items: Two error covariances (ECs) are considered in the 5ALL group and one is considered in the 7ALL and 7END groups, to reach a tenable model fit. Initial model studying effort: four ECs are included in the 5ALL group, one EC in the 5END group, three ECs in the 7ALL group, four ECs in the 7END group, and five ECs in the 5ENDN group.
Measurement Equivalence between Rating Scales
With the help of MGCFAs, we compared the goodness-of-fit statistics between the unrestricted model (configural invariance) and models in which either the factor loadings (metric invariance) or the item intercepts were restricted to being equal (scalar invariance) among the groups. The configural model for the six items (initial model) did not yield a reasonable goodness-of-fit (χ2(45) = 83.75, p < .001; RMSEA = .07, 95 percent confidence interval [CI] = [.05, .10]; CFI = .88). To improve the model fit, two error covariances were allowed for (between EU2 and EU3 and between EU4 and EU6, Table 1). The error covariances were restricted to being equal in different rating scale groups to avoid an effect on the comparison of the MGCFA models. With these modifications, a tenable goodness-of-fit could be reached for the initial model (see Table 4: Configural). Restricting factor loadings to being equal significantly increased χ2 and decreased the value of CFI (Table 4, Metric), so that equality of loadings among the groups could not be confirmed. Restricting the item intercepts to being equal led to a very large and significant decrease in model fit for all model-fit statistics (Table 4: EU initial model, Scalar), so that the equality of items’ intercepts between different rating scale groups was not given.
Table 4. Measurement Equivalence: Comparison of MGCFA Models.
Note: AS = attitude strength; BIC = Bayesian information criterion; CFI = comparative fit index; EU = European Union; MGCFA = multigroup confirmatory factor analysis; RMSEA = root mean square error of approximation. Configural is baseline; initial models with correlated error covariances (equal across groups): EU2 with EU3 and EU4 with EU6. Studying effort: 1 with 2; 2 with 3; 3 with 4; 4 with 8; and 6 with 8. We ran the same set of models (without AS), to compare experimental groups with only five and only seven categories. We obtained a lack of metric and scalar invariance for five category groups for both concepts that we addressed. For seven category groups, metric and scalar invariance were given for the model with only three EU items, but not for the models with all EU items. For the studying effort, the MGCFA analysis revealed a lack of metric and scalar invariance in the case of five categories, which was valid for both eight and four items. In the case of seven categories, metric and scalar invariance were given for eight items, whereas metric invariance was violated for four items. Therefore, violations of invariance could also be found in the case of comparisons with only five or only seven categories. Detailed results of this analysis are available from the corresponding author on request.
*p < .05.
**p < .01.
***p < .001.
However, in the case of the six items, the CMM assumptions could not be regarded as fulfilled. Significant factor loadings and an absence of correlated error terms were found with an MGCFA using the items EU1, EU4, and EU6. Restricting the factor loadings to being equal in these three items did not significantly change the goodness-of-fit statistics (Table 4, EU, three items, metric), which demonstrates that the factor loadings did not significantly differ between the five rating scale groups. However, restricting the items’ intercepts to being equal remarkably worsened the model fit for all goodness-of-fit statistics (Table 4, EU, three items, scalar). Thus, rating scales had an impact on the items’ intercepts, which was demonstrated for models with six and three items.
The nonnested MGCFA models with and without AS were compared in terms of BIC; the results are shown in Table 4. It can be seen that, for the configural and metric models, BIC is higher with AS than without AS. However, BIC is much lower for scalar models with AS than without AS; this holds true for both the initial and the three-item models. Therefore, AS moderated the effect of rating scales on intercepts.
Results Related to the Measurement of Studying Effort
Tests of Measurement Models
The goodness-of-fit statistics of the CFAs for the eight studying effort items, in which no equality restrictions on factor loadings were modeled within a rating scale group, are presented in Table 5 (initial model). In all groups, standardized factor loadings were significant and equal to or higher than λ = .30. Nevertheless, a relatively reasonable model fit with respect to RMSEA and CFI was found only for the 5END group (with the CFI near the cutoff value).
Table 5. Measurement Model Test (CFA) Results for Studying Effort.
Note: CFA = confirmatory factor analysis; CFI = comparative fit index; CI = confidence interval; RMSEA = root mean square error of approximation. Experimental groups: 5ALL (five categories, fully verbalized), 5END (five categories, only end categories verbalized), 7ALL (seven categories, fully verbalized), 7END (seven categories, only end categories verbalized), 5ENDN (five categories, end categories verbalized, numeric labels for each category). Differences in n (number of cases) because of data missing.
*p < .05.
**p < .01.
***p < .001.
According to modification indices (MIs), the goodness-of-fit can be improved when error covariances between items are included. However, error covariances may mean that one-dimensionality is violated (e.g., Raykov and Marcoulides 2011). Therefore, with re-specifications, we excluded items that led to error covariances. One or two items were deleted in the 5ALL, 5END, and 7ALL groups, while three or even four items had to be excluded in the 7END and 5ENDN groups. Following deletion of items, we obtained a tenable model fit with respect to χ2 and CFI statistics in each rating scale group (Table 5). We conclude that the CMM level could be reached, in terms of the deletion of certain items; this was associated with differing amounts of information loss in the various rating scale groups.
Restricting the factor loadings to being equal within each rating scale group demonstrated that TSEM cannot be assumed in the initial and re-specified models. Detailed results are available on request.
Reliability
For the studying effort items, McDonald’s Ω was also the appropriate measure for reliability evaluation. To reach a tenable model fit in the initial model, correlated error terms were considered (note in Table 3). The results are presented in Table 3 (initial). A significantly higher reliability was found in the 5END group, compared to the other groups. The reliability also tended to be higher in the 7ALL group than in the 5ENDN group. Using a reduced number of items for which a CMM level is given (Table 3, re-specified) increased reliability in the 5ALL, 7ALL, and 7END groups but not in the 5ENDN group. Overall, reliability tended to be higher in the 5END and 7ALL groups than in the other groups.
Measurement Equivalence between Rating Scales
A one-factor MGCFA model of eight studying effort items reached inadequate goodness-of-fit: χ2 (100) = 309.33 (p < .001), CFI = .85, RMSEA = .12 (90 percent CI = [.10, .13]). An improvement of goodness-of-fit could be achieved by including numerous error covariances (see note in Table 4) that were modeled to be equal in all rating scale groups (Table 4). Restricting factor loadings to being equal among the groups remarkably increased χ2 and decreased CFI (Table 4, initial model, metric). Restricting the items’ intercepts to being equal led to a very large and significant decrease of model fit for all goodness-of-fit statistics (see Table 4, initial model, scalar), so that neither the equality of items’ loadings nor that of intercepts can be assumed for the eight studying effort items.
We found a four-item model (with items 2, 5, 6, and 7; see Table 1) in which no correlated error terms were observable (CMM). The tenable goodness-of-fit statistics for the corresponding configural model are presented in Table 4 (four items). With these four items, the equivalence of factor loadings and items’ intercepts could not be confirmed, as can be seen in Table 4. According to BIC, models with AS suited the data better than those without AS (Table 4).
Discussion
With reference to the first research question, the forms of rating scales were found to influence reliability for both concepts. Where a CMM measurement level could be assumed, reliability was somewhat higher in the case of the EU items when either verbal or numerical labels were used for all categories than for the forms with verbal labels only for the end categories. In the case of studying effort, the 5END and the 7ALL forms tended to have higher reliability than the other forms. The results are only partly in line with other studies, which reported that the ALL form increases reliability (e.g., Krosnick and Fabrigar 1997; Menold et al. 2014). In contrast to previous research, we used a reliability evaluation method that did not require the TSEM level. For both concepts, obtaining CMM was associated with less information loss for the 7ALL rating scale than for most other scales. The 7ALL form also reached an acceptable level of reliability for both concepts. The 7ALL scale seems to decrease heterogeneity and thereby to increase the measurement consistency of items. An explanation for this result could be that fully verbally labeled response categories improve the clarity of measurement dimension. Furthermore, seven categories allow for a higher differentiation of responses than do five categories.
Regarding the second research question, we found, for both concepts, that measurement equivalence between different rating scales was violated with respect to metric and scalar equivalence. Therefore, measurement results obtained with different rating scales are not comparable. This is consistent with the results of Krebs and Hoffmeyer-Zlotnik (2010), who addressed rating scale orientation. Respondents seemed to understand and handle varying rating scales differently; therefore, it cannot be assumed that the same latent variable is measured with the same items when different rating scales are used. We therefore advise researchers to be cautious when using nonuniform rating scales for the measurement of a construct. When rating scales are changed from wave to wave in a survey, it is important to check measurement equivalence between these waves.
With regard to the third research question, AS was a significant explanatory variable, capable of increasing measurement equivalence between different rating scales. However, even including this variable did not lead to the achievement of metric or scalar invariance. Further research could address additional explanatory variables, for example, the need for cognition (Cacioppo and Petty 1982), or other personal variables (e.g., Rammstedt and John 2007) that may affect respondents’ accuracy or motivation when responding to survey questions.
One limitation of this study is that a student sample was used. Generalization of the results to other respondent groups must therefore be undertaken with caution. In addition, numeric labels were used for only one rating scale, to enable a comparison with original survey data containing the EU items and, thus, the number of categories was not varied in this case. Because numeric labels are widely used in surveys, more research is needed to test their impact on reliability and measurement equivalence.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
