A classic topic in the fields of psychometrics and measurement has been the impact of the number of scale categories on test score reliability. This study builds on previous research by further articulating the relationship between item response theory (IRT) and classical test theory (CTT). Equations are presented for comparing the reliability and precision of scores within the CTT and IRT frameworks. This study presented new results pertaining to the relative precision (i.e., the test score conditional standard error of measurement for a given trait value) of CTT and IRT, and the new results shed light on the conditions where total scores and IRT estimates are more or less precisely measured. The relative reliability of CTT and IRT scores is examined as a function of item characteristics (e.g., locations, category thresholds, and discriminations) and subject characteristics (e.g., the skewness and kurtosis of the latent distribution). CTT total scores were more reliable when the latent distribution was mismatched with category thresholds, but the discrepancy between CTT and IRT declined as the number of scale categories increased. This article also considered the appropriateness of linear approximations of polytomous items and presented circumstances where linear approximations are viable. A linear approximation may be appropriate for items with two response options depending on the item discrimination and the match between the item location and latent distribution. However, linear approximations are biased whenever items are located in the tails of the latent distribution and the bias is larger for more discriminating items.
Methodological developments have bridged the concept of reliability between the item response theory (IRT) and classical test theory (CTT) frameworks and discussed concepts that have traditionally been reserved for IRT (e.g., item information functions [IIFs]) within the context of CTT. The purpose of this article is to understand the circumstances where researchers should prefer estimating true scores (i.e., θ) with test scores derived from IRT (i.e., $\hat{\theta}$) versus CTT (i.e., x). This article compares the precision and reliability of $\hat{\theta}$ and x, and equations are presented for investigating how item and subject characteristics affect the reliability of $\hat{\theta}$ and x. Mellenbergh (1996) noted that reliability is a population-dependent quantity that is affected by characteristics of latent distributions, whereas the conditional standard error of measurement (CSEM) quantifies error variance (i.e., precision or the inverse of information) for a specific θ value. New equations are presented that compare the relative precision of $\hat{\theta}$ and x. The results show that the IRT and CTT CSEMs are roughly mirror images of each other across values of θ. That is, $\hat{\theta}$ is measured more precisely in portions of the θ continuum that include relatively more category thresholds, whereas x tends to be measured more precisely for θ values that are further from category thresholds.
It is important to articulate the contributions of this article to existing research. First, methodological advances concerning dichotomous items established a link between the IRT and CTT frameworks by articulating the reliability of total scores and percentile ranks with corresponding IRT item parameters, such as item difficulty, discrimination, and guessing parameters (Bechger, Maris, Verstralen, & Beguin, 2003; Dimitrov, 2003; Kolen, Zeng, & Hanson, 1996; May & Nicewander, 1994). Additional research has studied the impact of IRT item parameters on the reliability of gain scores (May & Jackson, 2005), the lower and upper bounds of the IRT reliability coefficient (Kim & Feldt, 2010), the reliability of scale scores and performance assessments using polytomous IRT (Wang, Kolen, & Harris, 2000), and the reliability of subscores using unidimensional (Haberman, 2008) and multidimensional IRT (Haberman & Sinharay, 2010). Researchers have also found that IRT scores provide more accurate estimates of interaction effects within the contexts of analysis of variance (Embretson, 1996) and multiple regression (Kang & Waller, 2005; Morse, Johanson, & Griffeth, 2012). This article offers new information about the connection between the concepts of reliability and precision in CTT and IRT. Specifically, no study has provided theoretical results about the relative precision of $\hat{\theta}$ and x for a given θ value. The new derivations provide a theoretical rationale for circumstances when CTT scores include relatively more or less measurement error than IRT scores. Moreover, no research study has analytically studied the interactive effect of item characteristics and latent distribution shape on the relative reliability of $\hat{\theta}$ and x. This article uses the probability density function implied by Fleishman’s (1978) power transformation (PT) method to study the impact of nonnormal latent distributions on the reliability of total scores.
Second, additional research has extended the concept of item and test information to CTT under the assumptions that observed measurements are continuous, rather than polytomous, and θ is linearly related to x (Ferrando, 2002, 2009; McDonald, 1982; Mellenbergh, 1996). Ferrando notes that a linear model provides a good approximation when item discrimination indices are relatively small in value and coarsely measured items have five or more response categories. Moreover, Ferrando and Mellenbergh showed that, for the linear model, the CTT IIF is horizontal and unrelated to θ. However, whenever the relationship between latent and observed total scores is nonlinear (which occurs when items are polytomous), the CTT IIFs are no longer unrelated to θ. In fact, this article shows that the conditional standard error of x given θ is a downward-facing function where scores near category thresholds have the least amount of precision. The accuracy of a linear relationship is also explored, and the results in this article examine the effect of test characteristics (e.g., item locations and discrimination) and subject characteristics (e.g., latent distribution shape) on the appropriateness of linear approximations of polytomous items.
Third, previous Monte Carlo studies that have studied the relative performance of CTT and IRT estimates are limited by the combination of parameter values and type of reliability studied. For instance, Greer et al. (2006) studied the impact of skewness on coefficient alpha with the constraint that item variances were equal, which may not occur frequently in practice. The results in this article can be used to study any combination of IRT parameter values and latent distribution shape and provide more general results than previous Monte Carlo simulations. In addition, Wang et al. (2000) presented equations for computing the reliability of scale scores from performance assessments using the generalized partial credit model. However, unlike Wang et al., this study presents equations for evaluating how the number of items, number of scale categories, and the shape of the latent distribution affect the reliability of $\hat{\theta}$ and x. Furthermore, R code (R Development Core Team, 2010) is available at http://publish.illinois.edu/sculpepper/, and researchers can use the R code to compute the reliability of $\hat{\theta}$ and x for different item characteristics and latent distributions. Consequently, this study presents new results and provides applied researchers with guidance for reliably scoring tests in different situations.
This article includes five sections. The first section presents equations for the reliability and precision of scores within the CTT and IRT paradigms and includes new results about the CSEM for CTT. The second section compares IRT and CTT in terms of CSEMs to provide a general understanding of the circumstances in which researchers should prefer CTT versus IRT scores. The third section compares the relative reliability of $\hat{\theta}$ and x for different item characteristics (e.g., item locations and number of response categories) and subject distribution characteristics (e.g., the skewness and kurtosis of θ), and the fourth section examines how item and subject characteristics affect the appropriateness of a linear approximation of polytomous items. The last section discusses the results and provides recommendations and concluding remarks.
Equations for the Reliability of CTT and IRT Estimates of θ
Let θ represent a latent variable and ui be an observed polytomous response, where i indexes items ($i = 1, \ldots, I$). The observed polytomous response for item i can be expressed as a function of a true score $\tau_i(\theta)$ and random error $e_i$, such that $u_i = \tau_i(\theta) + e_i$. That is, the observed ui equals an item true score (May & Nicewander, 1994), which is a nonlinear function of θ, plus an error, ei. Let j index category thresholds ($j = 1, \ldots, J$), so that J+1 is the corresponding number of categories for item i. That is, j is used to index categories as well (e.g., J+1 = 4 implies that ui has four categories). This article assumes that researchers code ui using integers from 1 to J+1. Several polytomous models exist to describe the relationship between θ and the chance that ui equals one of J+1 categories, which include the graded response (Molenaar, Dolan, & De Boeck, 2012; Muraki, 1990; Samejima, 1969), partial credit (Masters, 1982; Muraki, 1992, 1993), and rating scale (Andrich, 1978a, 1978b) models. The derivations in this article use Muraki’s (1990) modified graded response model,
$$P^*_{ij}(\theta) = \frac{\exp[\alpha_i(\theta - b_i - c_j)]}{1 + \exp[\alpha_i(\theta - b_i - c_j)]}, \tag{1}$$
where bi and $\alpha_i$ are the item difficulty and discrimination parameters, respectively, and cj is the jth threshold, which is equal for all I items. Equation 1 represents the chance that $u_i > j$ given θ, and the probability that item i equals the jth category is
$$P_{ij}(\theta) = P^*_{i,j-1}(\theta) - P^*_{ij}(\theta), \tag{2}$$
with the boundary conventions $P^*_{i0}(\theta) = 1$ and $P^*_{i,J+1}(\theta) = 0$.
Note that Muraki’s model was chosen because the form of Equation 1 is easier to manipulate analytically and the derivations of expressions for derivatives are less cumbersome. Muraki’s model assumes that the J cj are constant across items and are equally spaced. The goal of this article is not to estimate abilities or item parameters with Muraki’s model, and these assumptions can be relaxed to evaluate the impact of unequal item thresholds on the reliability and precision of x and $\hat{\theta}$. In fact, the new expressions are applicable for any polytomous model, and Muraki’s model is only used for computational examples. Furthermore, the aforementioned models tend to yield scores that are highly correlated (Embretson & Reise, 2000), so we should not anticipate that the results would change significantly if the partial credit or rating scale models were used.
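The category probabilities described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's R code: it assumes the cumulative curve is a logistic function of α(θ − b − c_j), so that thresholds sit at b + c_j on the θ scale, and the function names are hypothetical.

```python
import math

def cumulative_probs(theta, alpha, b, c):
    """Chance of responding above each threshold (assumed logistic form)."""
    return [1.0 / (1.0 + math.exp(-alpha * (theta - b - cj))) for cj in c]

def category_probs(theta, alpha, b, c):
    """Probabilities of categories 1..J+1 as differences of adjacent cumulative curves."""
    # Boundary conventions: everyone is above "threshold 0", no one is above "threshold J+1".
    p_star = [1.0] + cumulative_probs(theta, alpha, b, c) + [0.0]
    return [p_star[j] - p_star[j + 1] for j in range(len(c) + 1)]

# Illustrative item: four categories, thresholds 1.6 units below, at, and above b = 0.
probs = category_probs(theta=0.0, alpha=1.5, b=0.0, c=[-1.6, 0.0, 1.6])
```

Because the thresholds are symmetric around the item location, the category probabilities at θ = b are symmetric as well, which is a convenient sanity check.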
The derivations below require the specification of a distribution for θ. In this article, θ is assumed to follow a Fleishman PT distribution with density $f(\theta)$, whose parameters determine the mean μ, variance $\sigma^2$, skewness, and kurtosis of θ (the PT density is discussed in greater detail in the Appendix). Note that any univariate distribution could be chosen for θ. One advantage of using Fleishman’s PT distribution is that it is flexible enough to explore how changes in the mean, variance, skewness, and kurtosis affect the relative reliability of IRT and CTT θ estimates. However, as noted by an anonymous reviewer, researchers often set μ = 0 and $\sigma^2 = 1$ to estimate item discriminations (i.e., the scale indeterminacy problem). Accordingly, the results in this article also use μ = 0 and $\sigma^2 = 1$ to understand how manipulating item discriminations affects reliability. Fleishman’s PT distribution does not encompass the universe of all univariate distributions, so future researchers can modify the associated R code to examine the reliability of CTT and IRT estimates when θ follows other distributions. Furthermore, the discussion below provides an argument as to how reliability within the CTT and IRT frameworks is dependent on the match between the shape of $f(\theta)$ and the topography of the conditional variance of x and $\hat{\theta}$.
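To make the role of the distribution's shape concrete, the following sketch numerically computes the mean, variance, skewness, and excess kurtosis of a variable built with Fleishman's power transformation, Y = −c + bZ + cZ² + dZ³ with Z standard normal. The coefficient values are hypothetical and are not the ones used in the article; the function name and grid settings are likewise assumptions.

```python
import math

def pt_moments(b, c, d, half_width=10.0, n=4001):
    """Mean, variance, skewness, and excess kurtosis of Y = -c + b*Z + c*Z**2 + d*Z**3."""
    step = 2.0 * half_width / (n - 1)
    raw = [0.0] * 5  # raw[p] accumulates E[Y**p] by the trapezoid rule
    for k in range(n):
        z = -half_width + k * step
        w = step * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        if k in (0, n - 1):
            w *= 0.5
        y = -c + b * z + c * z * z + d * z ** 3
        for p in range(5):
            raw[p] += w * y ** p
    m1, m2, m3, m4 = (raw[p] / raw[0] for p in range(1, 5))
    var = m2 - m1 ** 2
    third = m3 - 3.0 * m1 * m2 + 2.0 * m1 ** 3
    fourth = m4 - 4.0 * m1 * m3 + 6.0 * m1 ** 2 * m2 - 3.0 * m1 ** 4
    return m1, var, third / var ** 1.5, fourth / var ** 2 - 3.0

normal_case = pt_moments(1.0, 0.0, 0.0)   # reduces to a standard normal
skewed_case = pt_moments(0.9, 0.2, 0.0)   # hypothetical coefficients with c > 0
```

Setting c = d = 0 recovers the normal case, and a positive quadratic coefficient pushes the skewness above zero, which is the mechanism the PT method uses to generate nonnormal latent distributions.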
Reliability and Precision Within IRT Framework
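One of the strengths of IRT over CTT relates to the well-known measure of precision for estimated trait scores, $\hat{\theta}$. Reliability is specific to a group and is a function of the unconditional standard error of measurement (SEM). CTT has traditionally calculated the SEM for a group of scores, whereas the CSEM is a measure of precision that corresponds to a specific trait level within the IRT framework. The CSEM of $\hat{\theta}$ is related to the test information function (TIF), which is derived using the concept of Fisher’s information to measure the amount of information that a single observation provides about θ. In fact, the inverse of a TIF indicates the variance of $\hat{\theta}$ for a given θ. Previous research (Muraki, 1993; Samejima, 1994) discussed TIFs for polytomous IRT models and noted that the IIF for item i is
$$I_i(\theta) = \sum_{j=1}^{J+1} \left\{ \frac{[P'_{ij}(\theta)]^2}{P_{ij}(\theta)} - P''_{ij}(\theta) \right\}, \tag{3}$$
where $P_{ij}(\theta)$ is the probability that item i equals category j and primes denote derivatives with respect to θ.
For Muraki’s (1990) modified graded response model, $P_{ij}(\theta)$ is a function of the cumulative probability $P^*_{ij}(\theta)$, which is a logistic function. The first two derivatives of $P^*_{ij}(\theta)$ are as follows:
$$P^{*\prime}_{ij}(\theta) = \alpha_i P^*_{ij}(\theta)[1 - P^*_{ij}(\theta)], \qquad P^{*\prime\prime}_{ij}(\theta) = \alpha_i^2 P^*_{ij}(\theta)[1 - P^*_{ij}(\theta)][1 - 2P^*_{ij}(\theta)]. \tag{4}$$
Accordingly, $I_i(\theta)$ can be computed using the first and second derivatives of the item category probabilities, $P_{ij}(\theta)$, which are as follows:
$$P'_{ij}(\theta) = P^{*\prime}_{i,j-1}(\theta) - P^{*\prime}_{ij}(\theta), \qquad P''_{ij}(\theta) = P^{*\prime\prime}_{i,j-1}(\theta) - P^{*\prime\prime}_{ij}(\theta). \tag{5}$$
The TIF for a test of polytomous items is the sum of the respective IIFs:
$$I(\theta) = \sum_{i=1}^{I} I_i(\theta). \tag{6}$$
The relationship between true and observed IRT estimates can be written as $\hat{\theta} = \theta + \varepsilon$. If θ and ε are independent, the variance of $\hat{\theta}$ is the sum of the true and error variances, $\sigma^2_{\hat{\theta}} = \sigma^2 + \sigma^2_{\varepsilon}$. The conditional variance of $\hat{\theta}$ given θ is defined as $\sigma^2_{\hat{\theta}\mid\theta} = 1/I(\theta)$. The expected conditional variance of $\hat{\theta}$ for a specific distribution of θ, $f(\theta)$, is as follows:
$$E\left(\sigma^2_{\hat{\theta}\mid\theta}\right) = \int \frac{1}{I(\theta)}\, f(\theta)\, d\theta. \tag{7}$$
The expected reliability of $\hat{\theta}$ is the well-known ratio of true to observed variance,
$$\rho_{\hat{\theta}\hat{\theta}} = \frac{\sigma^2}{\sigma^2 + E\left(\sigma^2_{\hat{\theta}\mid\theta}\right)}, \tag{8}$$
where $\sigma^2$ is the variance of θ specified in $f(\theta)$. Clearly, $\rho_{\hat{\theta}\hat{\theta}}$ is dependent on the characteristics of the test (i.e., $I(\theta)$) and the distribution of latent scores (i.e., $f(\theta)$).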
Reliability and Precision Within CTT Framework
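Let x be the total score, or sum, of the I items ui, such that $x = \sum_{i=1}^{I} u_i$. For a subject with a given θ, ui equals one of J+1 categories, each with probability $P_{ij}(\theta)$, so the item true score $\tau_i(\theta) = E(u_i \mid \theta)$ is
$$\tau_i(\theta) = E(u_i \mid \theta) = \sum_{j=1}^{J+1} j\, P_{ij}(\theta). \tag{9}$$
That is, $\tau_i(\theta)$ is a weighted average of the category scores (i.e., j = 1 to J+1), where the weights are the chances that subjects with a specific θ have an observed score ui = j. The previous section discussed an expression for the expected conditional variance of $\hat{\theta}$ as a function of the inverse TIF and distribution of θ. This section analogously derives an expression for $\sigma^2_{x\mid\theta}$, which is the expected variance of x for a given value of θ. An equation for the reliability of x is presented as a function of the expected value of $\sigma^2_{x\mid\theta}$ and the variance of the expected true scores, $\sigma^2_{\tau}$. This subsection proceeds by first deriving an expression for the conditional error variance of x, $\sigma^2_{x\mid\theta}$, and then identifies an equation for the true score variance, $\sigma^2_{\tau}$.
Consider item i where the error is $e_i = u_i - \tau_i(\theta)$. Note that ui is a coarse measure of θ in that ui is ordinal and equals one of the J+1 values, whereas θ is measured on an interval scale. One immediate observation is that $E(e_i \mid \theta) = 0$. Recall that ui is a polytomous item, so $e_i = j - \tau_i(\theta)$ with probability $P_{ij}(\theta)$ for j = 1 to J+1. The conditional expectation of the error within item i is
$$E(e_i \mid \theta) = \sum_{j=1}^{J+1} [j - \tau_i(\theta)]\, P_{ij}(\theta) = \tau_i(\theta) - \tau_i(\theta) = 0. \tag{10}$$
The variability of observed ui around expected values is determined by the error variance or precision; that is, $\sigma^2_{e_i\mid\theta} = \mathrm{Var}(e_i \mid \theta)$, which is defined as
$$\sigma^2_{e_i\mid\theta} = \sum_{j=1}^{J+1} [j - \tau_i(\theta)]^2 P_{ij}(\theta) = \sum_{j=1}^{J+1} j^2 P_{ij}(\theta) - [\tau_i(\theta)]^2. \tag{11}$$
Equation 11 is new to the literature and the following sections compare the properties of $\sigma^2_{e_i\mid\theta}$ with the inverse item information, $1/I_i(\theta)$.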
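$\sigma^2_{e_i\mid\theta}$ is the conditional variance for a single item and an expression is needed for the conditional variance of x. An important observation is that errors within two polytomous items, say, ui and uh (let categories for uh be indexed by k), are independent whenever the items are locally independent, which assumes that $P(u_h = k, u_i = j \mid \theta) = P_{hk}(\theta) P_{ij}(\theta)$. Specifically, the covariance between eh and ei conditioned on θ is
$$\mathrm{Cov}(e_h, e_i \mid \theta) = \sum_{k=1}^{K+1} \sum_{j=1}^{J+1} [k - \tau_h(\theta)][j - \tau_i(\theta)]\, P_{hk}(\theta) P_{ij}(\theta) = E(e_h \mid \theta)\, E(e_i \mid \theta) = 0, \tag{12}$$
where K+1 denotes the number of categories for uh.
The finding that errors within items are independent, if local independence is assumed, is particularly important, because the conditional error variance of the total score, x, given θ is simply the sum of the conditional item variances, $\sigma^2_{x\mid\theta} = \sum_{i=1}^{I} \sigma^2_{e_i\mid\theta}$.
Recall that $x = \sum_{i=1}^{I} u_i$. Equation 9 implies that the expected value of x given θ is
$$\tau(\theta) = E(x \mid \theta) = \sum_{i=1}^{I} \sum_{j=1}^{J+1} j\, P_{ij}(\theta). \tag{13}$$
The conditional variance of x given θ is the sum of conditional variances for the I items ui, under the assumption of local independence (see Equation 12), which implies that $\sigma^2_{x\mid\theta} = \sum_{i=1}^{I} \sigma^2_{e_i\mid\theta}$. The expected conditional variance for a specific distribution of θ is
$$E\left(\sigma^2_{x\mid\theta}\right) = \int \sum_{i=1}^{I} \sigma^2_{e_i\mid\theta}\, f(\theta)\, d\theta. \tag{14}$$
Recall that in IRT the expected variance of maximum likelihood estimates is the expected value of $1/I(\theta)$ across the distribution of θ. Similarly, the expected conditional variance of x is found by replacing $1/I(\theta)$ with $\sigma^2_{x\mid\theta}$. Consequently, researchers can compare the CSEMs (i.e., $\sigma_{\hat{\theta}\mid\theta}$ and $\sigma_{x\mid\theta}$) to understand which values of θ are associated with relatively more precision within the CTT and IRT frameworks.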
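Recall that $\tau(\theta) = E(x \mid \theta)$ is the expected total score for subjects with a given θ and the variance of $\tau(\theta)$ across subjects provides a measure of the amount of true score variance. First, note that the unconditional mean of x is
$$\mu_x = \int \tau(\theta)\, f(\theta)\, d\theta. \tag{15}$$
The variance of $\tau(\theta)$ across θ is
$$\sigma^2_{\tau} = \int [\tau(\theta) - \mu_x]^2 f(\theta)\, d\theta. \tag{16}$$
The reliability of x is the ratio of true to observed variance:
$$\rho_{xx} = \frac{\sigma^2_{\tau}}{\sigma^2_{\tau} + E\left(\sigma^2_{x\mid\theta}\right)}. \tag{17}$$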
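The total-score reliability can be approximated with the same kind of numerical integration used for IRT. The sketch below assumes a standard normal θ, a logistic cumulative curve with thresholds at b + c_j, and an illustrative three-item test; the names are hypothetical and the grid settings are arbitrary choices rather than the author's.

```python
import math

def item_mean_and_variance(theta, alpha, b, c):
    """Conditional mean and variance of an item scored 1..J+1 at a given theta."""
    p_star = [1.0] + [1.0 / (1.0 + math.exp(-alpha * (theta - b - cj))) for cj in c] + [0.0]
    probs = [p_star[j] - p_star[j + 1] for j in range(len(c) + 1)]
    mean = sum((j + 1) * p for j, p in enumerate(probs))
    second = sum((j + 1) ** 2 * p for j, p in enumerate(probs))
    return mean, second - mean * mean

def ctt_reliability(alphas, bs, c, half_width=6.0, n=1201):
    """True score variance over true score variance plus expected error variance."""
    step = 2.0 * half_width / (n - 1)
    m1 = m2 = err = 0.0
    for k in range(n):
        theta = -half_width + k * step
        w = step * math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)
        if k in (0, n - 1):
            w *= 0.5
        tau = cond_var = 0.0
        for alpha, b in zip(alphas, bs):
            mean, var = item_mean_and_variance(theta, alpha, b, c)
            tau += mean          # expected total score given theta
            cond_var += var      # local independence: item variances add
        m1 += w * tau
        m2 += w * tau * tau
        err += w * cond_var
    true_var = m2 - m1 * m1
    return true_var / (true_var + err)

rho_xx = ctt_reliability([0.5, 1.5, 2.5], [-1.0, 0.0, 1.0], [-1.6, 0.0, 1.6])
```

The structure mirrors the IRT calculation: only the conditional variance being averaged changes, which is what makes the two reliabilities directly comparable.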
Factors That Affect the Precision of CTT and IRT Scores
The previous section derived expressions for the CSEMs for x and $\hat{\theta}$ (i.e., $\sigma_{x\mid\theta}$ and $\sigma_{\hat{\theta}\mid\theta}$). The characteristics of polytomous IRT CSEMs are well understood. For example, $\sigma_{\hat{\theta}\mid\theta}$ tends to be smaller at points along the latent continuum where category thresholds are located and declines in regions where more discriminating items are located. In contrast, the expression for $\sigma_{x\mid\theta}$ is new and the purpose of this section is to compare $\sigma_{x\mid\theta}$ with $\sigma_{\hat{\theta}\mid\theta}$ to provide researchers with a conceptual understanding of the testing situations where x may be preferred to $\hat{\theta}$ in terms of measurement precision.
As an example, consider a test consisting of three items that each have four response categories. Let α and b be three-dimensional vectors of item discriminations and locations, such that α = (0.5, 1.5, 2.5) and b = (−1, 0, 1). To simplify this example, let the category thresholds be equally spaced with values of (−1.6, 0, 1.6) units below and above each item location bi (i.e., the thresholds for the first item are located at −2.6, −1.0, and 0.6 on the θ scale).
Figure 1 presents the CTT conditional item variances, $\sigma^2_{e_i\mid\theta}$ (see Equation 11), and the inverse item information functions, $1/I_i(\theta)$ (see Equation 3), for the three items. The IRT conditional variances exhibit expected behavior. That is, $1/I_i(\theta)$ is lowest at the category thresholds and $1/I_3(\theta)$ is generally smaller than the other two items because the third item is more discriminating. One additional nuance is that $1/I_3(\theta)$ has local minima at the category thresholds, whereas Items 1 and 2 have inverse information functions that appear smoother. Stated differently, in IRT, items with larger discriminations tend to have IIFs with more topography in regions near category thresholds.
Figure 1. Conditional variances of CTT and IRT scored items for a hypothetical three-item test with four response options and equally spaced thresholds, cj = (−1.6, 0, 1.6).
As expected, $\sigma^2_{e_i\mid\theta}$ is smaller for more discriminating items; however, the general behavior of $\sigma^2_{e_i\mid\theta}$ is different from $1/I_i(\theta)$. Namely, $\sigma^2_{e_i\mid\theta}$ tends to be a downward-facing function where measurement error is largest at category thresholds. For example, $\sigma^2_{e_3\mid\theta}$ tends to be the smallest of the three items, but has maxima near the category thresholds, which differs from IRT where measurements are more precise at category thresholds. Under a CTT framework, $\sigma^2_{e_i\mid\theta}$ is smallest for either more extreme θ or for θ values that lie between category thresholds. In short, $\sigma^2_{e_i\mid\theta}$ is roughly the mirror image of $1/I_i(\theta)$, because $\sigma^2_{e_i\mid\theta}$ tends to be lower in segments of the θ continuum where $1/I_i(\theta)$ is larger and vice versa.
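The mirror-image pattern can be checked numerically for the most discriminating item in the example (discrimination 2.5, location 1, thresholds at −0.6, 1.0, and 2.6). The sketch below evaluates the CTT conditional variance and the inverse item information at a threshold (θ = 1.0) and midway between two thresholds (θ = 0.2). It assumes a logistic cumulative curve with thresholds at b + c_j, and the function name is hypothetical.

```python
import math

def item_curves(theta, alpha, b, c):
    """Return (CTT conditional variance, inverse IRT item information) at theta."""
    p_star = [1.0] + [1.0 / (1.0 + math.exp(-alpha * (theta - b - cj))) for cj in c] + [0.0]
    d_star = [alpha * p * (1.0 - p) for p in p_star]
    n_cat = len(c) + 1
    probs = [p_star[j] - p_star[j + 1] for j in range(n_cat)]
    slopes = [d_star[j] - d_star[j + 1] for j in range(n_cat)]
    mean = sum((j + 1) * p for j, p in enumerate(probs))
    ctt_var = sum((j + 1 - mean) ** 2 * p for j, p in enumerate(probs))
    info = sum(s * s / p for p, s in zip(probs, slopes) if p > 0.0)
    return ctt_var, 1.0 / info

alpha, b, c = 2.5, 1.0, [-1.6, 0.0, 1.6]
ctt_at_threshold, irt_at_threshold = item_curves(1.0, alpha, b, c)
ctt_between, irt_between = item_curves(0.2, alpha, b, c)
```

Under these assumptions the CTT variance is larger at the threshold than between thresholds, while the inverse information moves in the opposite direction.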
Figure 2 includes the same items discussed in Figure 1, with the exception that the thresholds are no longer equally spaced. Figure 2 shows that $\sigma^2_{e_i\mid\theta}$ and $1/I_i(\theta)$ respond inversely to unequal item thresholds. For instance, $\sigma^2_{e_i\mid\theta}$ tends to increase in portions of the latent continuum that include more item thresholds, whereas $1/I_i(\theta)$ is smaller in segments of the latent continuum where there are more item thresholds and increases wherever there are fewer thresholds.
Figure 2. Conditional variances of CTT and IRT scored items for a hypothetical three-item test with four response options and unequally spaced thresholds, cj = (−1.6, 0, 1.0).
Figure 3 demonstrates CTT and IRT test CSEMs along with error bars around conditional expected values, $E(x \mid \theta)$ and $E(\hat{\theta} \mid \theta)$. The top row of Figure 3 illustrates the CTT and IRT CSEMs for x and $\hat{\theta}$, respectively, for the hypothetical three-item test. Note that the vertical lines in the top row of panels in Figure 3 indicate category thresholds for Items 1, 2, and 3. Figure 3 shows that $\sigma_{x\mid\theta}$ is larger for θ values that are near category thresholds, whereas $\sigma_{\hat{\theta}\mid\theta}$ is smaller at points on the latent continuum that have more thresholds and more discriminating items. For the three-item test, $\sigma_{\hat{\theta}\mid\theta}$ appears more responsive to item discriminations than $\sigma_{x\mid\theta}$, as indicated by the fact that the slope of $\sigma_{\hat{\theta}\mid\theta}$ is steeper than that of $\sigma_{x\mid\theta}$ in the θ range measured by Item 1.
Figure 3. Test score CTT and IRT conditional standard error of measurement and expected value plots with ±2 error curves for a hypothetical three-item test with four response categories.
The second row of panels in Figure 3 plots $E(x \mid \theta)$ and $E(\hat{\theta} \mid \theta)$ as well as ±2 times the CSEMs. As discussed previously, x is a more accurate indicator of θ for more extreme values on the θ continuum, whereas $\hat{\theta}$ is a better indicator in the range where items and categories are located. The overall reliability of x and $\hat{\theta}$ is dependent on the shape and location of the latent distribution. For example, $\hat{\theta}$ will be more reliable than x if most subjects lie in the middle range of the latent continuum, where x is less precisely measured relative to $\hat{\theta}$. In fact, the differences in $\sigma_{x\mid\theta}$ and $\sigma_{\hat{\theta}\mid\theta}$ contribute to $\hat{\theta}$ being significantly more reliable than x (i.e., 0.65 vs. 0.49). Certainly, ρxx and $\rho_{\hat{\theta}\hat{\theta}}$ will change depending on the shape and location of the θ distribution.
The Reliability of x and $\hat{\theta}$ for Item and Subject Characteristics
Figure 3 included results for a simple example to demonstrate the theoretical differences between $\sigma_{x\mid\theta}$ and $\sigma_{\hat{\theta}\mid\theta}$, which are useful pieces of information for developing tests and instruments. That is, $\sigma_{x\mid\theta}$ and $\sigma_{\hat{\theta}\mid\theta}$ provide applied researchers with an understanding of ranges of θ values that are best measured with x or $\hat{\theta}$. Moreover, $\sigma_{x\mid\theta}$ and $\sigma_{\hat{\theta}\mid\theta}$ offer researchers information about which measurement framework is most beneficial for various item characteristics and subject populations.
Equations 8 and 17 were used to compare the reliability of x and $\hat{\theta}$ as a function of the number of scale categories, purpose of measurement (i.e., item locations dispersed along the continuum or clustered in a given region), and θ distribution shape. More specifically, Figure 4 includes $\rho_{\hat{\theta}\hat{\theta}}$ and ρxx across scale categories (i.e., 2–10 response options) for three types of item locations and four types of distributions for θ. Item discriminations and test length were not manipulated and were fixed at 1.25 and 10, respectively. It is well known that increasing either item discrimination or test length increases reliability, and these parameters were held constant to focus on the other parameters.
Figure 4. Reliability of CTT and IRT scores of a 10-item test for different item locations, number of response categories, and subject latent distribution shapes.
To simplify the discussion, the three scenarios assume that items have equally spaced category thresholds. Specifically, the item category thresholds (i.e., the J cj) were equally spaced between −2.0 and 2.0 on the continuum. Let c be the vector of category thresholds, defined as $c = -2 + 4\mathbf{J}/(J+1)$, where $\mathbf{J}$ is a vector with elements equal to the integers from 1 to J. For example, the threshold is zero for items with two scale categories (i.e., J+1 = 2) and items with four scale categories have three thresholds at −1, 0, and 1. Whereas the following discussion assumes the thresholds are equally spaced, researchers can input any set of category thresholds into the R code, which is available at the author’s website.
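The threshold rule described above is easy to express in code. The short sketch below is an illustration only; the function name and default bounds are assumptions chosen to match the two worked cases in the text.

```python
def equally_spaced_thresholds(num_categories, lower=-2.0, upper=2.0):
    """Thresholds that split (lower, upper) into num_categories equal segments."""
    num_thresholds = num_categories - 1  # J thresholds for J+1 categories
    width = upper - lower
    return [lower + width * j / (num_thresholds + 1) for j in range(1, num_thresholds + 1)]

two_category = equally_spaced_thresholds(2)
four_category = equally_spaced_thresholds(4)
```

Passing a different list of thresholds to downstream calculations is all that is needed to study unequally spaced categories.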
Let b be a vector of item locations and I be a vector that includes integers from 1 to I. The three item location scenarios represent situations where researchers would be interested in measuring θ values in a narrow range in the lower or upper tails or measuring θ values across the latent continuum. The item locations for the three scenarios represent the following uniform distributions: (i.e., ), (i.e., ), and (i.e., ).
As noted, four subject distributions were examined to evaluate how the density of the population at various points on the latent continuum affected $\rho_{\hat{\theta}\hat{\theta}}$ and ρxx. Specifically, the distributions were negatively skewed (skewness = −1.5, kurtosis = 4.0), normal (skewness = 0, kurtosis = 0), symmetric and peaked (skewness = 0, kurtosis = 4.0), and positively skewed (skewness = 1.5, kurtosis = 4.0).
Figure 4 includes 12 panels corresponding to the three item locations and four subject distribution types. The middle row of panels in Figure 4 demonstrates that CTT and IRT yield similar reliabilities in situations where tests consist of items that are evenly placed along the latent continuum. Given that $\sigma_{x\mid\theta}$ is smaller in areas with fewer category thresholds, one explanation for the slight advantage of IRT is that CTT CSEMs decline outside of the (−2, 2) range of the items where there are fewer subjects in the skewed and symmetric distributions. More precisely, Figure 3 showed that $\sigma_{\hat{\theta}\mid\theta}$ is significantly larger relative to $\sigma_{x\mid\theta}$ in portions of the latent continuum that have fewer category thresholds. In contrast, $\sigma_{x\mid\theta}$ is relatively smoother across θ values and tends to decline in segments where there are fewer category thresholds. Furthermore, in the case where items are located across the latent continuum, $\hat{\theta}$ is understandably more reliable than x when the latent distribution is more peaked (i.e., positive kurtosis). The middle row of Figure 4 also shows the expected positive relationship that increasing scale categories has on reliability for both $\hat{\theta}$ and x. However, as found in previous research (e.g., see a review of relevant literature in Chafouleas et al., 2009, and Weng, 2004), reliability increases at a decreasing rate with each additional response category and does not increase significantly in value beyond four or five response categories.
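The comparison across numbers of response categories can be sketched as follows for the case where items are spread across the continuum. The setup is an assumption made for illustration: 10 items with discrimination 1.25, item locations evenly spaced from −2 to 2, thresholds equally spaced between −2 and 2, a standard normal θ, and a logistic cumulative curve with thresholds at b + c_j. The function names are hypothetical and the skewed and peaked distributions are not covered.

```python
import math

def thresholds(n_cat):
    """Equally spaced thresholds between -2 and 2 for n_cat categories."""
    return [-2.0 + 4.0 * j / n_cat for j in range(1, n_cat)]

def item_summary(theta, alpha, b, c):
    """Conditional mean, conditional variance, and information for one item."""
    p_star = [1.0] + [1.0 / (1.0 + math.exp(-alpha * (theta - b - cj))) for cj in c] + [0.0]
    d_star = [alpha * p * (1.0 - p) for p in p_star]
    probs = [p_star[j] - p_star[j + 1] for j in range(len(c) + 1)]
    slopes = [d_star[j] - d_star[j + 1] for j in range(len(c) + 1)]
    mean = sum((j + 1) * p for j, p in enumerate(probs))
    var = sum((j + 1) ** 2 * p for j, p in enumerate(probs)) - mean * mean
    info = sum(s * s / p for p, s in zip(probs, slopes) if p > 0.0)
    return mean, var, info

def reliabilities(n_cat, alphas, bs, half_width=6.0, n=601):
    """Return (total-score reliability, IRT-estimate reliability) under N(0, 1)."""
    c = thresholds(n_cat)
    step = 2.0 * half_width / (n - 1)
    m1 = m2 = e_ctt = e_irt = 0.0
    for k in range(n):
        theta = -half_width + k * step
        w = step * math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)
        if k in (0, n - 1):
            w *= 0.5
        parts = [item_summary(theta, a, b, c) for a, b in zip(alphas, bs)]
        tau = sum(p[0] for p in parts)
        m1 += w * tau
        m2 += w * tau * tau
        e_ctt += w * sum(p[1] for p in parts)
        e_irt += w / sum(p[2] for p in parts)
    true_var = m2 - m1 * m1
    return true_var / (true_var + e_ctt), 1.0 / (1.0 + e_irt)

bs = [-2.0 + 4.0 * i / 9 for i in range(10)]
alphas = [1.25] * 10
table = {n_cat: reliabilities(n_cat, alphas, bs) for n_cat in range(2, 11)}
```

Each entry of `table` holds the pair of reliabilities for one number of categories, so the diminishing returns from adding categories can be read off directly.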
In theory, x is expected to be more reliable than $\hat{\theta}$ when the latent distribution is mismatched with item locations. The top and bottom panels of Figure 4 demonstrate the theoretical results discussed in this article concerning the effect that subject and item characteristics have on CTT and IRT reliability. For instance, CTT is superior to IRT whenever the items are located at the extremes of the latent distribution. Furthermore, the difference between ρxx and $\rho_{\hat{\theta}\hat{\theta}}$ is largest for two-category items and approaches zero as the number of categories increases. As discussed earlier, IRT is a tool for measuring specific θ values more precisely. In fact, the difference between ρxx and $\rho_{\hat{\theta}\hat{\theta}}$ is smallest when the majority of the latent distribution overlaps with item locations, which occurs when items are located in the lower portion of the latent continuum and the latent distribution is positively skewed (e.g., skewness = 1.5 and kurtosis = 4.0) or when items are in the upper tail and the distribution is negatively skewed (i.e., skewness = −1.5 and kurtosis = 4.0).
In short, the results in this section provide new information about the influence of item and subject characteristics on the relative reliability of x and $\hat{\theta}$. In fact, the findings reflect the expected results given the nature of $\sigma_{x\mid\theta}$ and $\sigma_{\hat{\theta}\mid\theta}$. That is, total scores were more reliable whenever items were mismatched with the location of the latent θ distribution. The following section addresses another important concern related to the accuracy of linear approximations of polytomous items.
Appropriateness of a Linear Approximation of Polytomous Items
One approach for modeling polytomous items is to use a linear approximation of the relationship between θ and ui (e.g., a common factor model, Culpepper, 2012b). As noted in Equation 13, the relationship between θ and ui is nonlinear. Ferrando (2009) and Mellenbergh (1996) noted that a linear relationship tends to provide a reasonable approximation when item discrimination indices are relatively low and there are five or more response categories. The purpose of this section is to explore the appropriateness of linear approximations of polytomous items in greater detail. This section includes two subsections. The first subsection describes a measure for quantifying the appropriateness of a linear approximation, whereas the second subsection presents findings about the accuracy of a linear approximation for different item characteristics (e.g., item locations, number of response categories, and item discriminations) and subject characteristics (e.g., the skewness and kurtosis of the latent distribution).
A Measure for the Appropriateness of a Linear Approximation
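Let $\lambda_i(\theta)$ be the loading relating latent θ to observed ui for a given θ. The relationship between θ and ui for a specific value of θ is the first derivative of Equation 9 with respect to θ:
$$\lambda_i(\theta) = \frac{\partial E(u_i \mid \theta)}{\partial \theta} = \sum_{j=1}^{J+1} j\, P'_{ij}(\theta).$$
The relationship between ui and θ will be larger for some values of θ and near zero at the extremes of the θ continuum (i.e., portions of the latent continuum where $E(u_i \mid \theta)$ plateaus). A linear approximation is dependent on the characteristics of the examinee population. The expected loading for a given population of subjects can be found by averaging over the distribution of θ:
$$\bar{\lambda}_i = \int \lambda_i(\theta)\, f(\theta)\, d\theta.$$
The function that constrains the relationship between θ and ui to be linear for a given population of examinees is
$$\tilde{u}_i(\theta) = \mu_{u_i} + \bar{\lambda}_i (\theta - \mu),$$
where $\tilde{u}_i(\theta)$, $\mu_{u_i}$, and $\bar{\lambda}_i$ denote the linearly approximated item score, the unconditional mean of ui, and the expected loading, respectively, and together define the linear approximation of the relationship between θ and ui. Also, note that $\tilde{u}_i(\theta)$ has the same form as the least squares projection where $\tilde{u}_i(\theta)$ passes through the centroid for θ and ui (i.e., $(\mu, \mu_{u_i})$) and the relationship is the expected slope for a given population of examinees, $\bar{\lambda}_i$.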
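Let $d_i = u_i - \tilde{u}_i(\theta)$ be the error when predicting the observed item ui with a linear approximation, where $\tilde{u}_i(\theta)$ denotes the linearly approximated item score. Unlike ei, the conditional expectation of di is not zero for every θ:
$$E(d_i \mid \theta) = E(u_i \mid \theta) - \tilde{u}_i(\theta) \neq 0.$$
That is, di will be biased for certain values of θ but is unbiased on average (i.e., $E(d_i) = 0$) across the θ range for a given population. The bias in di does not translate into larger conditional variances for ui. In fact, because $\tilde{u}_i(\theta)$ is constant for a given θ,
$$\mathrm{Var}(d_i \mid \theta) = \mathrm{Var}(u_i \mid \theta) = \sigma^2_{e_i\mid\theta}.$$
Furthermore, the bias in di does not introduce dependence among residuals. Let $d_h$ be a residual when linearly approximating uh. The covariance between errors is
$$\mathrm{Cov}(d_h, d_i \mid \theta) = \mathrm{Cov}(u_h, u_i \mid \theta) = \mathrm{Cov}(e_h, e_i \mid \theta) = 0,$$
where the last equality was obtained by recalling that the assumption of local independence implies $\mathrm{Cov}(e_h, e_i \mid \theta) = 0$, as shown in Equation 12.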
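Ferrando (2009) described a measure of the appropriateness of a linear model, which is based on restricting observed scores to fall within feasible ranges (e.g., 1 and J+1 for a polytomous item with J+1 categories). More precisely, requiring the linearly approximated score $\mu_{u_i} + \bar{\lambda}_i(\theta - \mu)$ to lie between 1 and J+1, where $\mu_{u_i}$ is the unconditional mean of ui and $\bar{\lambda}_i > 0$ is the expected loading, implies that a linear approximation is appropriate for θ scores within the following range:
$$\mu + \frac{1 - \mu_{u_i}}{\bar{\lambda}_i} \le \theta \le \mu + \frac{(J+1) - \mu_{u_i}}{\bar{\lambda}_i}.$$
Following Ferrando, the floor and ceiling indices for the appropriateness of a linear approximation of ui are $\theta_{F_i} = \mu + (1 - \mu_{u_i})/\bar{\lambda}_i$ and $\theta_{C_i} = \mu + (J + 1 - \mu_{u_i})/\bar{\lambda}_i$, respectively. Ferrando’s measure of appropriateness is the proportion of examinees with θ scores in the viable range:
$$\pi_i = \int_{\theta_{F_i}}^{\theta_{C_i}} f(\theta)\, d\theta.$$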
Consider the examples presented in Figure 5, which compares the expected item score $E(u_i \mid \theta)$ with its linear approximation for three items with bi = (−2, 0, 2) and the following parameter values held constant: a common discrimination $\alpha_i$ for all i, four response categories (J+1 = 4), category thresholds of c = (−1.6, 0, 1.6), and a standard normal latent distribution. The middle panel in Figure 5 includes the case where bi = 0. Figure 5 shows that $E(u_i \mid \theta)$ and the linear approximation deviate at the ends of the latent continuum. However, given that the distribution of θ is standard normal, nearly all of the subjects (i.e., 98.9%) have approximated scores between 1 and 4. The top and bottom rows of Figure 5 include examples of items that are located at −2 and 2, respectively. In fact, when items are located at the extremes of the θ continuum, the proportion of viable scores is 0.876, which implies that 12.4% of subjects have approximated scores outside the viable bounds. The examples in Figure 5 demonstrate that item location and the shape of the latent distribution affect the appropriateness of linear approximations. For instance, Figure 5 provides insight that the proportion of viable scores is larger in value if items are located near the mean of a standard normal distribution.
Figure 5. Hypothetical expected item scores and linear approximations for items with different locations and four response options.
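The proportion of viable scores can be approximated numerically for the kind of example shown in Figure 5. The sketch below assumes a standard normal θ, a logistic cumulative curve with thresholds at b + c_j, and a discrimination of 2.0. The discrimination used in the article's example is not recoverable from the text here, so that value is purely an assumption, as are the function names.

```python
import math

def normal_cdf(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def expected_score_and_slope(theta, alpha, b, c):
    """Expected item score (coded 1..J+1) and its derivative in theta."""
    p_star = [1.0 / (1.0 + math.exp(-alpha * (theta - b - cj))) for cj in c]
    score = 1.0 + sum(p_star)  # equals the probability-weighted category score
    slope = sum(alpha * p * (1.0 - p) for p in p_star)
    return score, slope

def viable_proportion(alpha, b, c, half_width=8.0, n=3201):
    """Share of a N(0, 1) population whose linearly approximated score stays in [1, J+1]."""
    step = 2.0 * half_width / (n - 1)
    mean_u = mean_slope = 0.0
    for k in range(n):
        theta = -half_width + k * step
        w = step * math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)
        if k in (0, n - 1):
            w *= 0.5
        score, slope = expected_score_and_slope(theta, alpha, b, c)
        mean_u += w * score
        mean_slope += w * slope
    floor = (1.0 - mean_u) / mean_slope             # latent mean is zero
    ceiling = (len(c) + 1 - mean_u) / mean_slope
    return normal_cdf(ceiling) - normal_cdf(floor)

c = [-1.6, 0.0, 1.6]
centered = viable_proportion(alpha=2.0, b=0.0, c=c)
upper_tail = viable_proportion(alpha=2.0, b=2.0, c=c)
lower_tail = viable_proportion(alpha=2.0, b=-2.0, c=c)
```

With symmetric thresholds and a symmetric latent distribution, items located at −2 and 2 give the same proportion, and both fall below the proportion for the centered item.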
Impact of Item and Subject Characteristics on the Appropriateness of a Linear Approximation
The previous subsection discussed as a measure of the appropriateness of a linear model. The purpose of this subsection is to present more general findings about the role of item and subject characteristics on the accuracy of a linear approximation. Specifically, this subsection discusses the appropriateness of a linear approximation of ui for 12 scenarios as defined by three response category scenarios ( = 2-10) and four θ distribution scenarios (e.g., the same distributions examined in Figure 4) with the addition that the item discrimination indices are compared for = 0.5, 1.0, 1.5, and 2 and category thresholds are defined as discussed for the scenarios in Figure 4.
Figures 6 and 7 plot against item locations (i.e., bi) and provide evidence about circumstances where a linear approximation is appropriate. Note that the rows of Figures 6 and 7 correspond to the number of response categories (i.e., J+1), whereas columns relate to distribution shape. Comparing rows demonstrates that increases as the number of response options increases; however, the accuracy of a linear approximation does not improve significantly beyond four response categories. For instance, a linear approximation is appropriate and for items with five or more response categories that are located within one standard deviation of the mean. Furthermore, larger item discriminations have a negative effect on , and the number of response categories and item discrimination have an interactive effect, where declines more as increases when there are fewer response categories. Moreover, Figure 6 illustrates that a linear approximation is appropriate for as few as two response options when = 0.5 or whenever = 1.0 and items are located in the middle of the distribution.
Figure 6. The proportion of viable scores from a linear approximation of polytomous items with two to four response categories by item location and discrimination, and subject latent distribution shapes.
Figure 7. The proportion of viable scores from a linear approximation of polytomous items with five to seven response categories by item location and discrimination, and subject latent distribution shapes.
Comparing columns provides an indication of the effect of latent distribution shape and item location on . Specifically, the relationship between bi and reflects the shape of the latent distribution, and is smallest when bi is located in the tails of the latent distribution. For instance, the curves appear more bell shaped when the latent distribution is normal as opposed to peaked (i.e., ). In fact, is smaller when , ceteris paribus. The relationship between item location and is cubic when the latent distribution is skewed, and is smaller in skewed distributions in segments where there is less density in the latent distribution (e.g., the right tail if and the left tail if ).
Figure 6 also demonstrates the effect of item discrimination on . Ferrando (2009) noted that a linear approximation is most appropriate when is smaller. In fact, Ferrando’s recommendations are supported given that a linear approximation is expected to be appropriate for almost all subjects and item locations when = 0.50. Figures 6 and 7 also show conditions where linear approximations are appropriate even when items are more discriminating (e.g., ). For example, is larger when and items are located near the middle of the latent distribution. declines as increases for all of the scenarios included in Figure 6 albeit at different rates. In short, the results in Figures 6 and 7 imply that linear approximations are best for items that measure θ near the central portion of the latent distribution. Furthermore, is least accurate for highly discriminating items, and a linear approximation is inaccurate even if there are more than four response categories when items are located in the tails of the latent distribution.
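The qualitative pattern in this subsection can be illustrated with a deliberately simplified sketch. The Python fragment below assumes a dichotomous two-parameter logistic item and replaces the expected item score with its tangent line at θ = b, namely 0.5 + (a/4)(θ − b), which stays inside [0, 1] only for θ between b − 2/a and b + 2/a. This tangent-line simplification and the function name are assumptions made for illustration; the approximation studied in the article follows Ferrando (2009), so the numbers produced here will not match Figures 6 and 7.

```python
from statistics import NormalDist


def tangent_viable_proportion(a, b, latent=NormalDist(0.0, 1.0)):
    """Share of examinees for whom the tangent-line score stays within [0, 1].

    Assumes a 2PL item with discrimination a and location b (hypothetical
    simplification, not the article's polytomous model).
    """
    if a <= 0:
        raise ValueError("discrimination a must be positive")
    half_width = 2.0 / a  # the tangent line reaches 0 and 1 at b -/+ 2/a
    return latent.cdf(b + half_width) - latent.cdf(b - half_width)


# Rows: discrimination values; columns: item locations -2, 0, and 2.
for a in (0.5, 1.0, 1.5, 2.0):
    row = [round(tangent_viable_proportion(a, b), 3) for b in (-2.0, 0.0, 2.0)]
    print(a, row)
```

Even under this crude simplification, the viable proportion falls as the item becomes more discriminating and as the item location moves into the tails of a standard normal latent distribution.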
Discussion
The findings in this study offer guidance to applied researchers interested in the construction and analysis of polytomous items. In short, this article presented new theoretical results concerning item and subject characteristics that affect the relative precision and reliability of x and . This section summarizes the findings for psychometricians and applied researchers and offers concluding remarks.
This article studied the reliability of two alternative test scoring approaches: a CTT total score versus an IRT estimate. The majority of applied researchers in education and psychology are familiar with x, whereas fewer have knowledge about and IRT. Consequently, it is important to offer applied researchers information about the relative merits of x and . This article offered additional insights into fundamental differences between x and . One salient factor studied in this article was the interactive effect that item locations and θ distribution shape had on ρxx and . Suppose the purpose of testing is to accept high-scoring examinees into an institution as is the case with examinations for actuarial sciences where the latent distribution is either normal or positively skewed. For instance, the first actuarial exam includes items clustered in the upper portion of the θ continuum to identify those students who are most competent in the foundations of calculus and probability theory. The scores for examinees who are near the passing cutoff are measured more precisely than those who are in the middle or lower portions of the θ distribution as indicated by smaller values for . In contrast, the results in this article show that is larger near the cutoff score and would be relatively smaller for θ values in the lower and middle portions of the examinee distribution. Certainly, decision makers would prefer IRT scoring versus a total score in this instance, because the purpose of measurement is to evaluate whether test takers exceed a minimum proficiency level. However, secondary users of the actuarial test score data should prefer total scores, because, as indicated in Figure 4, x tends to be more reliable for a population of subjects than if items are difficult and located in the tails of the test-taker population distribution. 
For instance, some secondary users may want to gather validity evidence and correlate examinee scores with other indicators, such as undergraduate or graduate grade point average (Culpepper, 2010; Culpepper & Davenport, 2009), other aptitude tests, or job performance (Aguinis, Culpepper, & Pierce, 2010). In these instances, the validity coefficient associated with would be smaller than the coefficient for x because .
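The link between score reliability and validity coefficients follows from the classical attenuation relation, in which an observed correlation equals the true-score correlation multiplied by the square root of the product of the two reliabilities. The Python sketch below applies that relation; the function name and all numeric values are hypothetical and are not estimates from the article.

```python
import math


def attenuated_validity(true_validity, score_reliability, criterion_reliability=1.0):
    """Observed validity coefficient under the classical attenuation relation."""
    for value in (score_reliability, criterion_reliability):
        if not 0.0 <= value <= 1.0:
            raise ValueError("reliabilities must lie in [0, 1]")
    return true_validity * math.sqrt(score_reliability * criterion_reliability)


# Hypothetical scenario: same true validity, two scorings with different reliability.
r_total = attenuated_validity(0.50, 0.85)  # more reliable score
r_irt = attenuated_validity(0.50, 0.75)    # less reliable score

print(round(r_total, 3), round(r_irt, 3))
```

Whichever score is less reliable in a given population yields the smaller observed validity coefficient, holding the true relationship constant.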
In addition, the results provide evidence that CTT and IRT scoring methods perform similarly when the goal of testing is to measure a construct across values of a latent continuum. For example, federal and state testing programs measure what students know in relation to given standards. The results in Figure 4 imply that ρxx and are similar and testing programs could report total scores, which may be easier to explain to certain stakeholders (e.g., teachers, parents, and students).
The results in this article also offer recommendations for the construction of scales that include polytomous items, such as educational or employment performance assessments, behavioral ratings, or affective measurements. In any of these cases, researchers may prefer x to if the purpose of instrument development is to conduct correlational research rather than to measure specific trait levels. However, x should probably only be preferred to when researchers fit simple models rather than more complicated interactive models (Embretson, 1996; Kang & Waller, 2005; Morse et al., 2012).
In addition, the findings in this article provide information about the optimal number of item response options. Figure 4 examined the reliability of total scores and IRT estimates across a range of parameter values for the number of scale categories, item locations, and the shape of the latent distribution. The findings in Figure 4 imply that using more than five or six scale categories does not significantly improve the reliability of x or regardless of the shape of the latent distribution or location of items. However, it is important to note that adding a scale category had a larger effect on than on ρxx.
Another relevant finding for applied researchers (and methodologists) relates to the appropriateness of a linear approximation of polytomous items. The results in this article confirm arguments in previous research (Ferrando, 2009) that linear approximations are more accurate for less discriminating items and items with more response categories. Additional findings suggest that linear approximations of polytomous items seem appropriate for items that measure trait levels in denser segments of the latent distribution, and linear approximations were least appropriate for items located in the tails of the θ distribution. In contrast to previous research, the results provided new evidence that researchers need to consider item locations, in addition to item discriminations and the number of response categories, when employing linear approximations.
Last, another contribution of this study is the availability of the associated R code. More specifically, researchers can use the R code when designing instruments in an effort to evaluate conditions where total scores and IRT scores are more or less reliable and precise. Furthermore, the R code has pedagogical value, as well, for computational applications of the theoretical results.
There are several directions for future research to build on this study. First, this article offers recommendations for researchers who are interested in using x as a measure of θ in applied research. Specifically, designing a reliable x requires the inclusion of items located at the boundaries of the θ range of interest. For instance, if the measurement goal is to distinguish high versus low scorers on some trait, the results pertaining to CTT CSEMs indicate that the items should be located in the middle of the distribution, because low and high scorers will then be measured more precisely. Likewise, items should be located around a cut score (and not at the cut score) if the purpose of measurement is to make inferences about whether examinees exceed or fall below some minimum proficiency level. The behavior of CSEMs under CTT is counterintuitive, because, in contrast, item design under the IRT framework dictates that developers should write items that are specific to certain θ levels and measurement purposes. Additional research is needed to understand differences in optimal test assembly (H. Chang & Ying, 2009) within the IRT and CTT frameworks.
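The contrast between the two frameworks can be illustrated with dichotomous items, which is a simplification of the polytomous case treated in this article. Under local independence, the conditional variance of a number-correct score is the sum of P(1 − P) across items, whereas the two-parameter logistic test information is the sum of a²P(1 − P). The Python sketch below computes both quantities; the item parameters are hypothetical, the total-score CSEM is expressed on the raw-score metric, and the article's own CSEM equations are not reproduced here.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def total_score_csem(theta, items):
    """Conditional SD of the number-correct score (raw-score metric)."""
    variance = 0.0
    for a, b in items:
        p = p_correct(theta, a, b)
        variance += p * (1.0 - p)
    return math.sqrt(variance)


def irt_csem(theta, items):
    """Inverse square root of the 2PL test information (theta metric)."""
    information = 0.0
    for a, b in items:
        p = p_correct(theta, a, b)
        information += a ** 2 * p * (1.0 - p)
    return 1.0 / math.sqrt(information)


# Hypothetical test: ten items with discrimination 1.5, all located at theta = 0.
items = [(1.5, 0.0)] * 10
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(total_score_csem(theta, items), 3), round(irt_csem(theta, items), 3))
```

With every item located at the middle of the distribution, the total-score CSEM is largest at θ = 0 and smaller for low and high scorers, while the IRT CSEM shows the opposite pattern. Because the two CSEMs are on different metrics, only the shapes of the curves across θ should be compared, not their magnitudes.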
Second, there could be benefits in revisiting some topics in modern IRT within the context of CTT. For instance, new insights may be available from reexamining, within CTT, topics such as computerized adaptive testing (H. Chang & Ying, 1996) or equating techniques (Kolen & Brennan, 2004), which could lead to new methodologies, refinements of existing approaches, or other unanticipated discoveries.
Third, researchers could extend the results in this article to understand the impact of using x and . Specifically, researchers use total scores as dependent variables and predictors in every subdiscipline of psychology and education. Despite the widespread use of total scores, few methodological studies have examined the impact of using total scores on the power and Type I error rates of tests that researchers employ (Embretson, 1996; Kang & Waller, 2005; Morse et al., 2012). Furthermore, with the exception of Embretson (1996), previous studies utilized Monte Carlo techniques that are limited by the parameter values studied. Consequently, future analytic explorations could provide additional insights into the effect of using total scores, and future research should accordingly examine the effect that the number of scale categories, the shape of the latent distribution, and IRT parameters have on the performance of commonly used statistical tests (Culpepper, 2012a; Culpepper & Aguinis, 2011).
Fourth, this study examined the theoretical reliability of x and using a polytomous IRT model. As one anonymous reviewer noted, this article addressed reliability from a mathematical perspective and does not consider factors related to subjects’ cognitive decision making. For example, this article did not address issues related to category labels; however, previous research identified a causal effect of scale labels, category position, and rating scale intensity and length on certain observed item characteristics (Dunham & Davison, 1991; Lam & Stevens, 1994; Murphy & Constans, 1987). For instance, existing evidence suggests that scale labels can affect the observed item means, but there is less evidence that manipulating category labels alters observed item variances (L. Chang, 1997; Dunham & Davison, 1991). Moreover, positive-packed scales tend to affect item means (Dunham & Davison, 1991; Lam & Kolic, 2008), and semantic compatibility of category labels improves reliability (Lam & Kolic, 2008). Most of the previous literature on category labels used CTT or generalizability theory, and additional empirical research is needed to understand how category labeling decisions affect polytomous IRT item parameters (e.g., item locations, discriminations, and category thresholds) and consequently alter the reliability of x and . Future research may identify relationships between category labels, cognitive decision making, and IRT parameters, and the results in this article provide mathematical arguments for describing how subsequent changes in IRT parameters affect test score reliability.
In conclusion, this study presented new results concerning the relative reliability and precision of total scores and IRT scores. The derivations in this article offer the most extensive analysis of the reliability of total scores by linking parameters of polytomous IRT models with CTT. In addition, new equations were discussed that described the CTT CSEM to provide new conceptual understanding of differences in the precision of scores estimated within the CTT and IRT frameworks.
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
References
Adelson, J. L., & McCoach, D. B. (2010). Measuring the mathematical attitudes of elementary students: The effects of a 4-point or 5-point Likert-type scale. Educational and Psychological Measurement, 70, 796-807.
Aguinis, H., Culpepper, S. A., & Pierce, C. A. (2010). Revival of test bias research in preemployment testing. Journal of Applied Psychology, 95, 648-680.
Aguinis, H., Pierce, C. A., & Culpepper, S. A. (2009). Scale coarseness as a methodological artifact: Correcting correlation coefficients attenuated from using coarse scales. Organizational Research Methods, 12, 623-652.
Andrich, D. (1978a). Application of a psychometric model to ordered categories which are scored with successive integers. Applied Psychological Measurement, 2, 581-594.
Andrich, D. (1978b). A rating formulation for ordered response categories. Psychometrika, 43, 561-573.
Bandalos, D. L., & Enders, C. K. (1996). The effects of nonnormality and number of response categories on reliability. Applied Measurement in Education, 9, 151-160.
Bechger, T. M., Maris, G., Verstralen, H. H., & Beguin, A. A. (2003). Using classical test theory in combination with item response theory. Applied Psychological Measurement, 27, 319-334.
Bendig, A. W. (1954). Reliability and the number of rating scale categories. Journal of Applied Psychology, 38, 38-40.
Chafouleas, S., Christ, T., & Riley-Tillman, T. (2009). Generalizability of scaling gradient on direct behavior ratings. Educational and Psychological Measurement, 69, 157-173.
Chang, H., & Ying, Z. (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement, 20, 213-229.
Chang, H., & Ying, Z. (2009). Nonlinear sequential designs for logistic item response theory models with applications to computerized adaptive tests. Annals of Statistics, 37, 1466-1488.
Chang, L. (1994). A psychometric evaluation of 4-point and 6-point Likert-type scales in relation to reliability and validity. Applied Psychological Measurement, 18, 205-215.
Chang, L. (1997). Dependability of anchoring labels of Likert-type scales. Educational and Psychological Measurement, 57, 800-807.
Cicchetti, D. V., Showalter, D., & Tyrer, P. J. (1985). The effect of the number of rating scale categories on levels of interrater reliability: A Monte Carlo investigation. Applied Psychological Measurement, 9, 31-36.
Culpepper, S. A. (2010). Studying individual differences in predictability with gamma regression and nonlinear multilevel models. Multivariate Behavioral Research, 45, 153-185.
Culpepper, S. A. (2012a). Evaluating EIV, OLS, and SEM estimators of group slope differences in the presence of measurement error: The single indicator case. Applied Psychological Measurement, 36, 349-374.
Culpepper, S. A. (2012b). Using the criterion-predictor factor model to compute the probability of detecting prediction bias with ordinary least squares regression. Psychometrika, 77, 561-580.
Culpepper, S. A., & Aguinis, H. (2011). Analysis of covariance (ANCOVA) with fallible covariates. Psychological Methods, 16, 166-178.
Culpepper, S. A., & Davenport, E. C. (2009). Assessing differential prediction of college grades by race/ethnicity with a multilevel model. Journal of Educational Measurement, 46, 220-242.
Dimitrov, D. M. (2003). Marginal true-score measures and reliability for binary items as a function of their IRT parameters. Applied Psychological Measurement, 27, 440-458.
Dunham, T. C., & Davison, M. L. (1991). Effects of scale anchors on student ratings of instructors. Applied Measurement in Education, 4, 23-35.
Embretson, S. E. (1996). Item response theory models and spurious interaction effects in factorial ANOVA designs. Applied Psychological Measurement, 20, 201-212.
Embretson, S. E., & Reise, S. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.
Enders, C. K., & Bandalos, D. L. (1999). The effects of heterogeneous item distributions on reliability. Applied Measurement in Education, 12, 133-150.
Ferrando, P. J. (2002). Theoretical and empirical comparisons between two models for continuous item responses. Multivariate Behavioral Research, 37, 521-542.
Ferrando, P. J. (2009). Difficulty, discrimination, and information indices in the linear factor analysis model for continuous item responses. Applied Psychological Measurement, 33, 9-24.
Fleishman, A. I. (1978). A method for simulating non-normal distributions. Psychometrika, 43, 521-532.
Greer, T., Dunlap, W. P., Hunter, S. T., & Berman, M. E. (2006). Skew and internal consistency. Journal of Applied Psychology, 91, 1351-1358.
Haberman, S. J. (2008). When can subscores have value? Journal of Educational and Behavioral Statistics, 33, 204-229.
Haberman, S. J., & Sinharay, S. (2010). Reporting of subscores using multidimensional item response theory. Psychometrika, 75, 209-227.
Headrick, T. C. (2002). Fast fifth-order polynomial transforms for generating univariate and multivariate non-normal distributions. Computational Statistics & Data Analysis, 40, 685-711.
Headrick, T. C., & Kowalchuk, R. K. (2007). The power method transformation: Its probability density function, distribution function, and its further use for fitting data. Journal of Statistical Computation and Simulation, 77, 229-249.
Headrick, T. C., & Sawilowsky, S. S. (1999). Simulating correlated multivariate nonnormal distributions: Extending the Fleishman power method. Psychometrika, 64, 25-35.
Jenkins, G. D., & Taber, T. D. (1977). A Monte Carlo study of factors affecting three indices of composite scale reliability. Journal of Applied Psychology, 62, 392-398.
Kim, S., & Feldt, L. S. (2010). The estimation of the IRT reliability coefficient and its lower and upper bounds, with comparisons to CTT reliability statistics. Asia Pacific Education Review, 11, 179-188.
Kolen, M. J., & Brennan, R. (2004). Test equating, scaling, and linking. New York, NY: Springer.
Kolen, M. J., Zeng, L., & Hanson, B. A. (1996). Conditional standard errors of measurement for scale scores using IRT. Journal of Educational Measurement, 33, 129-140.
Komorita, S. S., & Graham, W. K. (1965). Number of scale points and the reliability of scales. Educational and Psychological Measurement, 25, 987-995.
Krieg, E. F. (1999). Biases induced by coarse measurement scales. Educational and Psychological Measurement, 59, 749-766.
Lam, T. C., & Kolic, M. (2008). Effects of semantic incompatibility on rating response. Applied Psychological Measurement, 32, 248-260.
Lam, T. C., & Stevens, J. J. (1994). Effects of content polarization, item wording, and rating scale width on rating response. Applied Measurement in Education, 7, 141-158.
Lissitz, R. W., & Green, S. B. (1975). Effect of the number of scale points on reliability: A Monte Carlo approach. Journal of Applied Psychology, 60, 10-13.
Lyhagen, J. (2008). A method to generate multivariate data with the desired moments. Communications in Statistics-Simulation and Computation, 37, 2063-2075.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
Matell, M. S., & Jacoby, J. (1971). Is there an optimal number of alternatives for Likert scale items? Study I: Reliability and validity. Educational and Psychological Measurement, 31, 657-674.
May, K., & Jackson, T. S. (2005). IRT item parameters and the reliability and validity of pretest, posttest, and gain scores. International Journal of Testing, 5, 63-73.
May, K., & Nicewander, W. A. (1994). Reliability and information functions for percentile ranks. Journal of Educational Measurement, 31, 313-325.
McDonald, R. P. (1982). Linear versus nonlinear models in item response theory. Applied Psychological Measurement, 6, 379-396.
Mellenbergh, G. J. (1996). Measurement precision in test score and item response models. Psychological Methods, 1, 293-299.
Molenaar, D., Dolan, C., & De Boeck, P. (2012). The heteroscedastic graded response model with a skewed latent trait: Testing statistical and substantive hypotheses related to skewed item category functions. Psychometrika, 77, 455-478.
Morse, B. J., Johanson, G. A., & Griffeth, R. W. (2012). Using the graded response model to control spurious interactions in moderated multiple regression. Applied Psychological Measurement, 36, 122-146.
Muraki, E. (1990). Fitting a polytomous item response model to Likert-type data. Applied Psychological Measurement, 14, 59-71.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159-176.
Muraki, E. (1993). Information functions of the generalized partial credit model. Applied Psychological Measurement, 17, 351-363.
Murphy, K., & Constans, J. (1987). Behavioral anchors as a source of bias in rating. Journal of Applied Psychology, 72, 573-577.
R Development Core Team. (2010). R: A language and environment for statistical computing [Computer software]. Vienna, Austria: R Foundation for Statistical Computing.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometrika Monograph No. 17). Richmond, VA: Psychometric Society.
Samejima, F. (1994). Estimation of reliability coefficients using the test information function and its modifications. Applied Psychological Measurement, 18, 229-244.
Symonds, P. M. (1924). On the loss of reliability in ratings due to coarseness of the scale. Journal of Experimental Psychology, 7, 456-461.
Tadikamalla, P. R. (1980). On simulating nonnormal distributions. Psychometrika, 45, 273-279.
Wang, T., Kolen, M. J., & Harris, D. J. (2000). Psychometric properties of scale scores and performance levels for performance assessments using polytomous IRT. Journal of Educational Measurement, 37, 141-162.
Weng, L. (2004). Impact of the number of response categories and anchor labels on coefficient alpha and test-retest reliability. Educational and Psychological Measurement, 64, 956-972.