Abstract
This study describes a structural equation modeling (SEM) approach to reliability for tests with items having different numbers of ordered categories. A simulation study compares this reliability coefficient with coefficient alpha and the population reliability for tests having items with different numbers of ordered categories, one-factor and bifactor structures, and different skewness distributions of test scores. Results indicated that the proposed reliability coefficient was close to the population reliability in most conditions. An empirical example illustrates the performance of the different coefficients for a test with items having two or three ordered categories.
Items scored with ordered categories are common in social and behavioral sciences (Bollen, 1989; Finney & DiStefano, 2006). Different approaches to estimation of reliability have been proposed for tests with such items. Coefficient alpha (Cronbach, 1951) is probably the most commonly used to obtain an estimate of test reliability and is taken as a lower bound on the internal consistency of the test (Green, Lissitz, & Mulaik, 1977; Novick & Lewis, 1967; Sijtsma, 2009). It is based on the following assumptions: (a) errors in observed item scores are not correlated and (b) items in a test are essentially
Another approach for estimating reliability is through structural equation modeling (SEM; Bentler, 2009; Bollen, 1989; Green & Yang, 2009; Miller, 1995; Raykov, 1997; Raykov & Shrout, 2002). The SEM approach introduces a factorial structure of the test into the estimation of reliability. It has been shown to be especially useful for estimating reliability for tests in which items are grouped into clusters, as in testlets (Cho & Kim, 2015; Green & Yang, 2015; Raykov & Shrout, 2002; Yang & Green, 2010). Such clusters could be items sharing the same passage in a reading test (e.g., DeMars, 2006; Li, Bolt, & Fu, 2006; Rijmen, 2010), items requiring similar reasoning methods in a mathematics test (e.g., Cronbach & Shavelson, 2004; DeMars, 2006), or survey items measuring a specific aspect of personality (e.g., Hyland, Boduszek, Dhingra, Shevlin, & Egan, 2014; McAbee, Oswald, & Connelly, 2014). In such cases, the tests are assumed to have a single general factor (e.g., reading ability, math ability, or personality) and group factors that reflect the clusters.
With a standard linear SEM approach, item scores may be represented using a confirmatory factor analysis (CFA) model (e.g., Raykov & Shrout, 2002; Yang & Green, 2011), and reliability is calculated as the ratio of true score variance to observed score variance, where the true score variance is estimated using the CFA model. This reliability is also referred to as coefficient omega (McDonald, 1985, 1999). Green and Yang (2009) proposed a nonlinear SEM reliability coefficient for a test with ordinal categorical items. For such a test, if the structure of the test is well specified, the nonlinear SEM approach has been found to estimate reliability more accurately than the linear SEM approach, because the linear approach treats categorical scores as continuous (Green & Yang, 2009).
Green and Yang (2009) describe a test consisting of items with the same number of categories. However, in reality, tests often consist of items with different numbers of categories. For example, the PARCC (Partnership for Assessment of Readiness for College and Careers; 2016) assessment for mathematics and the Smarter Balanced Assessment (Smarter Balanced Assessment Consortium, 2016) include items that are scored with rubrics that have from two to seven points and two to four points, respectively. Likewise, a test such as the Early Development Instrument (Janus & Offord, 2007) for measuring children’s school readiness also has items scored with four to eight points.
The present study was designed to examine the effects on the reliability of test scores when a test consists of items having uneven numbers of categories. In this study, the authors focus on total test scores, that is, scores that are computed as sums of points over all items. Below, the study by Green and Yang (2009) is extended to apply the nonlinear SEM reliability to tests with items having either the same or different numbers of ordered response categories. A numerical calculation is provided to derive the extended nonlinear SEM reliability, and a simulation study is conducted to evaluate its performance. The resulting nonlinear reliability coefficient is compared with coefficient alpha and the population reliability, both for tests with items having different numbers of categories and for tests in which all items have the same number of categories. In addition, an example using real data with different numbers of ordered categories is provided to illustrate the application of the different reliabilities.
Reliability in Classical Test Theory (CTT)
Suppose there are
Let
where
Coefficient alpha (Cronbach, 1951) is often used to estimate reliability
where
Reliability Based on SEM
In the framework of SEM, item scores can be represented as follows (Green & Yang, 2009; Raykov & Shrout, 2002):
where
In this case,
When the observed data are ordinal categorical, fitting linear SEM models such as Equation 4 using the MLE method is not desirable as categorical data violate the assumption of the MLE method, and the MLE provides inflated chi-square estimates and attenuated factor loadings (Bollen, 1989). To address this problem, the authors consider the observed categorical scores
where
To estimate the reliability for the nonlinear measurement model, the correlation between two parallel tests is used:
where
where
The numerator of
Suppose item
where
where
The denominator of
because
The nonlinear reliability coefficient using the SEM framework is expressed as
This can be seen as an extension of Equation 21 in Green and Yang (2009). From Equation 14, we can estimate the internal consistency reliability of a test consisting of items with different numbers of ordered response categories by fitting the data using a nonlinear SEM model and replacing parameters in Equation 14 with corresponding estimates.
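The definition of reliability as the correlation between two parallel tests can also be checked by brute force. The following is a minimal Monte Carlo sketch, not the closed-form coefficient derived above: it assumes a one-factor model with unit-variance underlying variables, hypothetical loadings of 0.6 on 12 items, and the symmetric thresholds reported later in the simulation design. The function name `simulate_parallel_sum_scores` is illustrative only.

```python
import numpy as np


def simulate_parallel_sum_scores(loadings, thresholds, n, rng):
    """Return sum scores on two parallel forms of an ordinal test.

    Both forms share the same factor scores and item parameters but have
    independent errors, which is what makes them parallel.
    """
    loadings = np.asarray(loadings, dtype=float)
    factor = rng.standard_normal(n)            # common factor, variance 1
    error_sd = np.sqrt(1.0 - loadings ** 2)    # underlying variables have variance 1
    sums = []
    for _ in range(2):
        errors = rng.standard_normal((n, loadings.size)) * error_sd
        y_star = factor[:, None] * loadings + errors
        # searchsorted maps each continuous value to an ordered category 0, 1, ...
        items = np.column_stack([
            np.searchsorted(np.asarray(t, dtype=float), y_star[:, j])
            for j, t in enumerate(thresholds)
        ])
        sums.append(items.sum(axis=1))
    return sums[0], sums[1]


rng = np.random.default_rng(2024)
two = [0.0]                                # two ordered categories
five = [-1.645, -0.643, 0.643, 1.645]      # five ordered categories
# Assumed layout: every third item has five categories, the rest have two.
thresholds = [five if (j + 1) % 3 == 0 else two for j in range(12)]
loadings = [0.6] * 12                      # hypothetical loadings

x1, x2 = simulate_parallel_sum_scores(loadings, thresholds, 100_000, rng)
reliability = np.corrcoef(x1, x2)[0, 1]
print(f"Monte Carlo parallel-forms reliability: {reliability:.3f}")
```

With a large number of simulated examinees, this correlation approximates the population reliability of the sum score under the assumed model, which gives a simple reference value against which a closed-form estimate could be compared.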
Simulation Study
A simulation study is presented in this section to investigate the performance of
Simulation Study Design
Factor structure
Four factor structures were simulated using one-factor or bifactor models. The four structures are summarized in Table 1. Models 1 and 2 are based on a one-factor model and Models 3 and 4 are based on a bifactor model with three group factors.
Table 1. Factor Structures Considered in the Simulation Study.
For Models 3 and 4, the correlations between latent factors were all set to 0. For all models, errors were assumed independent. Factor loadings
Types of tests and numbers of ordered categories
Two types of tests were simulated: One type of test consisted of items with the same number of categories, and the other type of test consisted of items with uneven numbers of categories. Tests with either two or five ordered categories (labeled as conditions C2 and C5 in the simulation, respectively) were simulated for the first type. For the second type, a combination of two- and five-category items (labeled as condition C25) was generated to compare with reliability estimates from the first type of tests. In the second type of tests, every third item was simulated to have five ordered categories, and the rest were simulated to have two ordered categories.
Distribution of underlying continuous variables
The distribution of the underlying continuous variables
where
where
Distribution of thresholds
Three sets of thresholds were used to generate the categorical data: (a) normal, (b) moderate skew, and (c) mixed skew. For the normal condition, the threshold sets {0} and {–1.645, –0.643, 0.643, 1.645} were used to transform the underlying continuous variables into items with two and five response categories, respectively. Similarly, for the moderate skew condition, {0.7} and {–0.050, 0.772, 1.341, 1.881} were used to generate ordered variables with two and five categories, respectively. In the mixed skew condition, every third item response was generated with the negative moderate skew thresholds and the rest were generated with the positive moderate skew thresholds.
For the five-response category tests, the thresholds used to generate the normal and moderate skew distributions were taken from Muthén and Kaplan (1985). The thresholds for the two-response categories were determined to have the same skewness as the corresponding five-response category condition. The skewness values of data sets (Doane & Seward, 2011) from the normal and moderate skew conditions were about 0 and 1.2, respectively.
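The reported skewness values can be reproduced approximately from the thresholds alone. The sketch below assumes that each underlying continuous variable is standard normal and that categories are scored 0, 1, 2, and so on; it computes population moments rather than the sample skewness statistic cited above, so small differences from values computed on generated data sets are to be expected. The helper names are illustrative.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def category_probabilities(thresholds):
    """Probability of each ordered category for a standard normal variable."""
    cuts = [0.0] + [normal_cdf(t) for t in thresholds] + [1.0]
    return [hi - lo for lo, hi in zip(cuts[:-1], cuts[1:])]


def categorical_skewness(thresholds):
    """Population skewness of the item score (categories scored 0, 1, 2, ...)."""
    probs = category_probabilities(thresholds)
    scores = range(len(probs))
    mean = sum(p * s for p, s in zip(probs, scores))
    var = sum(p * (s - mean) ** 2 for p, s in zip(probs, scores))
    third = sum(p * (s - mean) ** 3 for p, s in zip(probs, scores))
    return third / var ** 1.5


# Threshold sets as listed in the simulation design.
THRESHOLDS = {
    "normal, two categories": [0.0],
    "normal, five categories": [-1.645, -0.643, 0.643, 1.645],
    "moderate skew, two categories": [0.7],
    "moderate skew, five categories": [-0.050, 0.772, 1.341, 1.881],
}

skewness = {name: categorical_skewness(t) for name, t in THRESHOLDS.items()}
for name, value in skewness.items():
    print(f"{name}: skewness = {value:.2f}")
```

Under these assumptions, the symmetric threshold sets give a skewness of zero and both moderate skew sets give a value near 1.2, in line with the matching of skewness across the two- and five-category conditions described above.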
Data Generation
In total, there were 54 conditions: 18 conditions with a one-factor model (18 = 2 factor structures × 3 distributions of thresholds × 3 sets of item response categories) and 36 conditions with a bifactor model (36 = 2 factor structures × 2 group factor loadings × 3 distributions of thresholds × 3 sets of item response categories). For each condition, the authors first generated population data consisting of 100,000 observations. From these 100,000 observations, the authors randomly sampled 500 observations without replacement and repeated this sampling 100 times. As sample size was not a focus of this study, a sample size of 500 was chosen to avoid estimation problems due to small samples. In addition, to calculate the population reliability,
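The design arithmetic and the resampling scheme described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the group factor loading labels (`Gr-a`, `Gr-b`), the stand-in population (12 two-category items with hypothetical loadings of 0.6), and the function names are all placeholders. Coefficient alpha is computed with its standard formula as an example of a per-replication statistic.

```python
import itertools

import numpy as np

THRESHOLD_SETS = ["normal", "moderate skew", "mixed skew"]
CATEGORY_SETS = ["C2", "C5", "C25"]

# Enumerate the simulation conditions described in the text.
one_factor = list(itertools.product(["Model 1", "Model 2"],
                                    THRESHOLD_SETS, CATEGORY_SETS))
bifactor = list(itertools.product(["Model 3", "Model 4"],
                                  ["Gr-a", "Gr-b"],  # placeholder labels
                                  THRESHOLD_SETS, CATEGORY_SETS))
n_conditions = len(one_factor) + len(bifactor)


def coefficient_alpha(items):
    """Coefficient alpha for an (examinees x items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_var_sum / total_var)


def replicate_alphas(population, n_reps=100, sample_size=500, seed=0):
    """Draw n_reps samples without replacement and compute alpha for each."""
    rng = np.random.default_rng(seed)
    alphas = []
    for _ in range(n_reps):
        rows = rng.choice(population.shape[0], size=sample_size, replace=False)
        alphas.append(coefficient_alpha(population[rows]))
    return np.array(alphas)


# Stand-in population: 100,000 examinees, 12 two-category items,
# one-factor model with hypothetical loadings of 0.6 and threshold 0.
rng = np.random.default_rng(1)
loading = 0.6
factor = rng.standard_normal(100_000)
errors = rng.standard_normal((100_000, 12)) * np.sqrt(1.0 - loading ** 2)
population = ((loading * factor[:, None] + errors) > 0.0).astype(int)

alphas = replicate_alphas(population)
print(n_conditions, round(alphas.mean(), 3), round(alphas.std(ddof=1), 3))
```

Sampling is without replacement within each replication and independent across replications, so the same population record can appear in more than one of the 100 samples.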
Data Analysis
A one-factor model and a bifactor model were fit to the corresponding generated data. For each condition,
Simulation Study Results
The authors first examined whether the models converged. Some estimation issues occurred for the bifactor models (Models 3 and 4), especially when the number of response categories was two and the data had mixed skewness. These issues included a residual covariance matrix that was not positive definite and negative residual variances. To make meaningful comparisons, the data sets that had no estimation issues were used to calculate
Table 2 presents the means of reliability coefficients for the one-factor models (i.e., Models 1 and 2). As presented in Table 2, the C2 condition had the lowest
Table 2. The Means of Reliability Coefficients and Their Standard Deviations for Data From the One-Factor Models.
Note. M and Gr refer to Model and Group factor loading, respectively; C refers to the condition for the number of response categories; and
Coefficient
Table 3 presents the means of reliability coefficients for the bifactor models (Models 3 and 4). The performance of
Table 3. The Means of Reliability Coefficients and Their Standard Deviations for Data From the Bifactor Models.
Note. M and Gr refer to Model and Group factor loading, respectively; C refers to the condition for the number of response categories; and
The one-factor models and the bifactor models fit the corresponding data well under each condition: Across all replications, the comparative fit index (CFI) values were all higher than .98, and the root mean square error of approximation (RMSEA) values were all smaller than .06, with mean values for each condition ranging from .01 to .02.
Conclusions From Simulation Study
The values of
Illustration With Empirical Data
In this section, the authors illustrate the performance of the nonlinear SEM reliability and compare it with that of coefficient alpha on a set of real data. The data were from a National Science Foundation (NSF)-funded project focusing on teaching science inquiry practices for students in Grades 6 to 10.
Test
The test consisted of 27 constructed response items designed to measure students’ use of academic language and understanding of science inquiry practices. Seventeen of the 27 items were scored with a 2-point rubric; the remaining 10 items were scored with a 3-point rubric. Ten items had skewness larger than 0.7, and two items had skewness less than –0.7. Descriptive statistics for each item are provided in Table C.2 in Online Appendix C.
Sample
The sample consisted of 906 students across Grades 6 to 10. Grade information was not available for 11 students. Of the 895 students with grade information, 260 (29.05%) were in the sixth grade, 209 (23.35%) were in the seventh grade, 222 (24.80%) were in the eighth grade, 80 (8.94%) were in the ninth grade, and 124 (13.85%) were in the 10th grade.
Results
A one-factor model and a bifactor model were fit to the data. The one-factor solution did not provide good model fit:
The bifactor model with five group factors was next fit to the data. This solution fit the data better:
Coefficient alpha for these data was .92, which was smaller than the nonlinear SEM reliability under the bifactor model solution. The uneven number of response categories might be one of the reasons for having the lower value for coefficient alpha than the nonlinear SEM reliability estimate. Factor loading estimates of the bifactor model varied across the 27 items (ranging from 0.18 to 0.79), which clearly suggests violation of
Based on the results from this study, the nonlinear SEM approach can provide an alternative reliability estimation method in that it is close to the population reliability when items are not
Discussion
In this study, a generalized nonlinear SEM reliability coefficient was designed to provide internal consistency reliability estimates for tests with items scored with equal or unequal numbers of ordered categories. A simulation study evaluated the performance of this coefficient compared to coefficient alpha and population reliability for observed sum scores. Results indicated that the nonlinear SEM reliability estimates and the population reliability values were close across all the conditions. The nonlinear SEM reliability coefficient and the population reliability coefficient always had the lowest values under the two-response-category condition and the highest values under the five-response-category condition. The two reliability coefficients for items with mixed numbers of response categories fell in between. This pattern can be expected, as more response categories carry more information. Furthermore, a test with mixed numbers of categories can be expected to carry more information than a test in which all items have the smallest of those numbers of categories and less information than a test in which all items have the largest. The nonlinear SEM reliability successfully captured this tendency by estimating reliability close to the population reliability. Results of this study suggest that the nonlinear SEM reliability is a flexible reliability estimate and can provide an accurate estimate of the population reliability when the data consist of items with either the same or different numbers of ordered categories and have either a unidimensional structure or additional factors due to clusters of items.
In general, coefficient alpha values were close to the corresponding population reliability coefficient for simulation conditions under the one-factor model and for equal numbers of categories (i.e., conditions C2 and C5). In addition, continuous scores that underlay the observed categorical scores were essentially
First, the simulation study in this article was conducted under the conditions of correctly specified factor models. When the factor model is misspecified, as was evident in the empirical study from the poor model fit indices for the unidimensional model, the nonlinear SEM reliability may fail to accurately estimate reliability. This is because misspecification can lead to incorrect parameter estimates, resulting in incorrect nonlinear reliability estimates. Second, the simulation study was conducted under the assumption of multivariate normality of the latent variables. Regarding this assumption, Yang and Green (2015) found that the nonlinear SEM reliability was robust to modest violations of the normality assumption when all items had the same number of response categories. Further study would be useful on the impact of model misspecification and of violations of the normality assumption for the underlying latent variables on the nonlinear SEM reliability.
Supplemental Material
Online_appendix_1 – Supplemental material for Reliability for Tests With Items Having Different Numbers of Ordered Categories
Supplemental material, Online_appendix_1 for Reliability for Tests With Items Having Different Numbers of Ordered Categories by Seohyun Kim, Zhenqiu Lu and Allan S. Cohen in Applied Psychological Measurement
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported in part by the National Science Foundation (NSF) grant at the University of Georgia under Grant No. 1316398.
Supplemental Material
Supplemental material is available for this article online.
References
