Abstract
Recent simulation research has demonstrated that using simple raw scores to operationalize a latent construct can result in inflated Type I error rates for the interaction term of a moderated statistical model when the interaction (or lack thereof) is proposed at the latent variable level. Rescaling the scores using an appropriate item response theory (IRT) model can mitigate this effect under similar conditions. However, this work has thus far been limited to dichotomous data. The purpose of this study was to extend this investigation to multicategory (polytomous) data using the graded response model (GRM). Consistent with previous studies, inflated Type I error rates were observed under some conditions when polytomous number-correct scores were used, and were mitigated when the data were rescaled with the GRM. These results support the proposition that IRT-derived scores are more robust to spurious interaction effects in moderated statistical models than simple raw scores under certain conditions.
Operationalizing a latent construct such as an attitude or ability is a common practice in psychological research. Stine (1989) described this process as the creation of a mathematical structure (scores) that represents the empirical structure (construct) of interest. Typically, researchers will use simple raw scores (e.g., either as a sum or a mean) from a scale or test as the mathematical structure for a latent construct. However, much debate regarding the properties of such scores has ensued since S. S. Stevens’s classic publication of the nominal, ordinal, interval, and ratio scales of measurement (Stevens, 1946). Although it is beyond the scope of this article to enter the scale of measurement fray, an often agreed-on position is that simple raw scores for latent constructs do not exceed an ordinal scale of measurement. An ordinal scale gives such scores only limited mathematical properties and permissible transformations, which fall short of those necessary for the appropriate application of parametric statistical models. Nonparametric, or distribution-free, statistics have been proposed as a solution for the scale of measurement problem. However, many researchers are reluctant to use nonparametric techniques because they are often associated with a loss of information pertaining to the nature of the variables (Gardner, 1975). McNemar (1969) articulated this point by saying, “Consequently, in using a non-parametric method as a short-cut, we are throwing away dollars in order to save pennies” (p. 432).
Assuming that simple raw scores are limited to the ordinal scale of measurement and researchers typically prefer parametric models to their nonparametric analogues, the empirical question regarding the robustness of various parametric statistical models to scale violations arises. Davison and Sharma (1988) and Maxwell and Delaney (1985) demonstrated through mathematical derivations that there is little cause for concern when comparing mean group differences in the independent samples t test when the assumptions of normality and homogeneity of variance are met. However, Davison and Sharma (1990) subsequently demonstrated that scaling-induced spurious interaction effects could occur with ordinal-level observed scores in multiple regression analyses. These findings suggest that scaling may become a problem when a multiplicative interaction term is introduced into a parametric statistical model.
Scaling and Item Response Theory (IRT)
An alternative solution to the scale of measurement issue for parametric statistics is to rescale the raw data into an interval-level metric, and a variety of methods for this rescaling have been proposed (see Embretson, 2006; Granberg-Rademacker, 2010; Harwell & Gatti, 2001). A potential method for producing scores with near interval-level scaling properties is the application of IRT models to convert number-correct scores into estimated theta scores, the IRT-derived estimates of an individual’s ability or latent construct standing. Conceptually, the attractiveness of this method rests with the invariance property in IRT scaling, and such scores may provide a more appropriate metric for use in parametric statistical analyses. Reise, Ainsworth, and Haviland (2005) stated that
Trait-level estimates in IRT are superior to raw total scores because (a) they are optimal scalings of individual differences (i.e., no scaling can be more precise or reliable) and (b) latent-trait scales have relatively better (i.e., closer to interval) scaling properties. (p. 98, italics in original)
In addition, Reise and Haviland (2005) gave an elegant treatment of this condition by demonstrating that the log-odds of endorsing an item and the theta scale form a linearly increasing relationship. Specifically, the rate of change on the theta scale is preserved (for all levels of theta) in relation to the log-odds of item endorsement.
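The linearity described above can be made concrete with a short derivation. The two-parameter logistic model is used here purely as an illustration; the symbols a (discrimination) and b (location) are assumptions of this sketch, and the original exposition by Reise and Haviland may be framed differently.

```latex
P(\theta) = \frac{\exp[a(\theta - b)]}{1 + \exp[a(\theta - b)]}
\quad\Longrightarrow\quad
\ln\frac{P(\theta)}{1 - P(\theta)} = a(\theta - b)
```

Under this form, a one-unit increase in θ raises the log-odds of endorsement by the same amount, a, at every level of θ, which is the sense in which the rate of change is preserved along the scale.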
Empirical Evidence of IRT Scaling
In a simulation testing the effect of scaling and test difficulty on interaction effects in factorial analysis of variance (ANOVA), Embretson (1996) demonstrated that Type I and Type II errors for the interaction term could be exacerbated when simple raw scores are used under nonoptimal psychometric conditions. Such errors occurred primarily due to the ordinal-level scaling limitations of simple raw scores, and the ceiling and floor effects imposed when an assessment is either too easy or too difficult for a group of individuals—a condition known as assessment inappropriateness (see Figure 1). Embretson fitted the one-parameter logistic (Rasch) model to the data and was able to mitigate the null hypothesis errors using the estimated theta scores rather than the simple raw scores. These results illuminated the usefulness of IRT scaling for dependent variables in factorial models, especially under suboptimal psychometric conditions. Embretson argued that researchers are often unaware when these conditions are present and can benefit from using appropriately fitted IRT models to generate scores that are more appropriate for use with parametric analyses.

Figure 1. A representation of the latent construct distribution and test information (reliability) distributions for appropriate assessments (top) and inappropriate assessments (bottom)
An important question that now arises is whether these characteristics extend to more complex IRT models such as the two- and three-parameter logistic models (dichotomous models that add a discrimination parameter and a guessing parameter, respectively) and polytomous models. Although the Rasch model demonstrates desirable measurement characteristics (i.e., true parameter invariance; Embretson & Reise, 2000; Fischer, 1995; Perline, Wright, & Wainer, 1979), it is sometimes too restrictive to use in practical contexts. The consensus in the literature is that non-Rasch models are also likely to achieve interval-level scaling properties (Embretson & Reise, 2000; Hambleton, Swaminathan, & Rogers, 1991; Harwell & Gatti, 2001; Reise et al., 2005). Investigations into the scaling properties of these more complex IRT models are thus necessary.
In one extension of this sort, Kang and Waller (2005) simulated the scaling properties of simple raw scores, estimated theta scores derived from a two-parameter logistic IRT model, and assessment appropriateness with the interaction term in a moderated multiple regression (MMR) analysis. Similar to the findings of Embretson (1996), Kang and Waller discovered that using simple raw scores to operationalize a latent construct resulted in substantial inflations of the Type I error rate (>50% or p > .50) for the interaction term in MMR under conditions of assessment inappropriateness. However, the IRT-derived theta score estimates were found to mitigate the Type I error rate to acceptable levels (<10% or p < .10) under the same conditions. This extension demonstrated that the estimated theta scores from a non-Rasch IRT model could be used to better fit the assumptions of parametric statistical models involving an interaction term. Finally, Harwell and Gatti (2001) investigated the congruence of estimated (theta) and actual construct scores using a popular polytomous IRT model, the graded response model (GRM; Samejima, 1969, 1996). The authors posited that if the estimated construct (theta) scores were sufficiently similar to the actual construct (theta) scores, which can be defined (albeit arbitrarily) as a metric with interval-level scaling properties, then the GRM results in scores that are sufficiently interval level. The results of their study supported this relationship. However, a concrete endorsement of the scaling properties should be made with caution due to the inherently arbitrary metric of the theta scale in most (if not all) IRT models.
The preceding theoretical and simulation evidence suggests that scale of measurement violations accompanied by suboptimal psychometric conditions (i.e., assessment inappropriateness) may have nonnegligible effects on the accuracy of common parametric analyses. However, this evidence still has limited generalizability to the majority of psychological research due to the nature of the simulated data. Specifically, Embretson (1996) and Kang and Waller (2005) simulated dichotomous data and fit logistic IRT models appropriate for such data. However, the majority of psychological research uses polytomous data, or data with multiple response options, such as Likert-type scales. Therefore, the purpose of the current study is to extend the understanding of scaling and assessment appropriateness on parametric statistical analyses by simulating polytomous data and fitting an appropriate polytomous IRT model. The authors’ primary null hypothesis, against which the performance of number-correct scores and estimated theta scores is being compared, is that there is no significant interaction on the actual theta scale. Thus, any significant interaction identified in the simulation results represents a spurious observed effect (Type I error).
Method
This study used a Monte Carlo simulation to identify the psychometric conditions that lead to an elevated risk of Type I errors for interaction effects in MMR when the theta scale is considered to be the true metric. This simulation was similar to the simulation conducted by Kang and Waller (2005) and extends this work into polytomous scales indicative of those commonly used in applied psychological research.
The GRM
The GRM (Samejima, 1969, 1996) is an IRT model suitable for modeling data with ordered categories such as Likert-type scales, and is an extension of the two-parameter logistic model. The GRM is considered a difference family model and was developed specifically to model polytomous data that represent the psychological processes underlying multicategory decision making (Ostini & Nering, 2006). In addition, theta estimates derived using the GRM may show evidence of interval-level scaling properties (Harwell & Gatti, 2001).
Using the GRM, an individual’s likelihood of responding in a particular response category is derived using a two-step process. First, category boundary response functions (CBRFs) are calculated to determine the boundary decision probabilities for the j − 1 boundaries between the j response categories of each item. The CBRFs in the GRM can be derived with Equation 1 (adapted from Embretson & Reise, 2000).
In Equation 1, P*ix(θ) is the probability that an individual with a trait (construct) level θ will respond positively at the boundary of category j for item i, where x = j = 1 . . . mi. Theta (θ) represents the individual’s trait (construct) level, ai represents the item discrimination or slope, and bij represents the category location or difficulty parameter with respect to the trait continuum. Importantly, in well-functioning items the values of bij should be successively ordered, reflecting increased difficulty in progressing through the response options.
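For reference, the standard logistic form of the GRM boundary function, written with the symbols defined above, is given below. This is the usual textbook statement and is assumed, not confirmed, to match the authors' Equation 1.

```latex
P^{*}_{ix}(\theta) = \frac{\exp\left[a_i(\theta - b_{ij})\right]}{1 + \exp\left[a_i(\theta - b_{ij})\right]},
\qquad x = j = 1, \ldots, m_i
```

Each boundary function is a two-parameter logistic curve with a common slope a_i within the item and its own location b_ij.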
In the second step of the GRM, the probability of responding in a particular category is determined using category response functions (CRFs), which are derived by subtracting the boundary probability of the following category, P*i(x+1)(θ), from P*ix(θ). This process is illustrated in Equation 2 (adapted from Embretson & Reise, 2000).
The probability of the first category is obtained by subtracting P*i1(θ) from 1.0, and the probability of the last category is simply P*im(θ). The GRM was used as a model for the number-correct score algorithm as well as to estimate theta scores in this study.
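The two-step calculation can be sketched in a few lines of Python. This is a minimal illustration under the standard logistic form of the boundary function, not the authors' R code; the function names and the example parameter values are hypothetical.

```python
import math

def boundary_prob(theta, a, b):
    """Probability of responding at or above one category boundary (logistic form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def grm_category_probs(theta, a, thresholds):
    """Return the probabilities of the len(thresholds) + 1 ordered response categories."""
    # Pad the boundary curves with 1.0 (below the first boundary) and 0.0 (above the last).
    cum = [1.0] + [boundary_prob(theta, a, b) for b in thresholds] + [0.0]
    # Each category probability is the difference between adjacent boundary curves.
    return [cum[x] - cum[x + 1] for x in range(len(cum) - 1)]

# Hypothetical item: discrimination 1.0, four ordered boundaries, person at theta = 0.
probs = grm_category_probs(theta=0.0, a=1.0, thresholds=[-1.5, -0.5, 0.5, 1.5])
print([round(p, 3) for p in probs])
```

Padding the boundary curves with 1.0 and 0.0 handles the first and last categories with the same subtraction as the interior ones, so the five probabilities always sum to 1.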
Independent Variables
The independent variables in this study were respondent sample size (n: two levels), scale length (k: two levels), item discrimination (ai: two levels), item difficulty (bi,1 . . . j−1: three levels), scale bandwidth (fidelity: two levels), and the regression coefficients (β1 and β2: two levels). The structure of this study was therefore a 2 × 2 × 2 × 3 × 2 × 2 design comprising 96 conditions.
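The crossing of factors can be checked by enumerating it. The sketch below is illustrative only: the level labels are placeholders chosen here, and the sample sizes of 250 and 750 are taken from the Procedure section.

```python
from itertools import product

# Factor levels; labels are illustrative placeholders, not the authors' variable names.
factors = {
    "n": [250, 750],
    "k": [15, 30],
    "discrimination": ["moderate", "high"],
    "difficulty": ["easy", "moderate", "difficult"],
    "fidelity": ["normal", "high"],
    "beta": [0.30, 0.50],
}

# Full factorial crossing: one dictionary per simulated condition.
conditions = [dict(zip(factors, levels)) for levels in product(*factors.values())]

appropriate = [c for c in conditions if c["difficulty"] == "moderate"]
inappropriate = [c for c in conditions if c["difficulty"] != "moderate"]
print(len(conditions), len(appropriate), len(inappropriate))  # 96 32 64
```

The split into 32 appropriate (moderate difficulty) and 64 inappropriate (easy or difficult) conditions is the one used when the results are collapsed by assessment appropriateness.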
Sample size (n)
Two respondent sample sizes were simulated on the basis of recent evidence on the stability of parameter estimates in polytomous IRT and of actual sample sizes in MMR studies in applied psychology. Ostini and Nering (2006) reported that stable estimates for polytomous IRT models could be obtained with as few as 250 individuals, but that samples between 500 and 1,000 are still considered to be desirable. In addition, Aguinis, Beaty, Boik, and Pierce (2005) indicated that the average sample size for MMR studies in applied psychological research is
Scale length (k)
Two scale lengths of k = 15 and k = 30 items were simulated in this study to model typical scales used in applied psychological research. In a review of validated scales used in applied psychological research, Fields (2002) indicated a modal scale length of 15 items, with a mean of 15.43 and a standard deviation of 10.43. The distribution of scale lengths is also slightly positively skewed, indicating the existence of several very long scales.
Discrimination (ai)
To derive the highest level of generalizability from this study, item parameter values were randomly selected from specified distributions as opposed to using constant values. Following the structure of Kang and Waller (2005), item discrimination values were selected from uniform distributions ranging from 0.31 to 0.58 for moderate discrimination and from 0.58 to 1.13 for high discrimination. Sampling discrimination values from a uniform distribution has been demonstrated to appropriately represent empirically determined item discrimination values (Reise & Waller, 2003), and the particular cutoff values of 0.31, 0.58, and 1.13 were demonstrated to appropriately represent low, moderate, and high factor loadings for items (Kang & Waller, 2005; Takane & De Leeuw, 1987). Because the GRM is a polytomous extension of the two-parameter logistic model, these values can be deemed appropriate for use in this study. Furthermore, the decision to retain the values from the Kang and Waller (2005) study was made to maintain a basis of comparison for the extension to polytomous data.
Item difficulty (bi,1 . . . j−1)/assessment appropriateness
Three item difficulty conditions were simulated to represent a “difficult,” “moderate,” and “easy” scale with respect to the simulated distribution of construct scores (see Figure 1). The item difficulty conditions are also analogous to the assessment appropriateness conditions with the “difficult” and “easy” conditions representing assessment inappropriateness and the “moderate” condition representing assessment appropriateness. Item difficulty values were randomly selected from a N(−1.5, 1.0) distribution for the easy (inappropriate) conditions, a N(0.0, 1.0) distribution for the moderate (appropriate) conditions, and a N(1.5, 1.0) distribution for the difficult (inappropriate) conditions.
Four item difficulty parameters were randomly selected from the appropriate distribution for each item to represent the j − 1 CBRFs specified in Equation 1 for the GRM. An important aspect of the difficulty parameters in polytomous IRT models is that the difficulty parameters of an item's CBRFs are sequentially ordered. Therefore, the difficulty parameters were modeled with the sequential ordering restriction imposed, similar to the approach implemented by Meade, Lautenschlager, and Johnson (2007).
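One simple way to obtain four ordered boundary difficulties per item is to draw them independently and sort them. The sketch below makes that assumption explicitly; the procedure of Meade et al. (2007) may impose the ordering differently, and the function name and seed are hypothetical.

```python
import random

def sample_ordered_thresholds(mean, sd, n_boundaries=4, rng=random):
    """Draw boundary difficulties from N(mean, sd) and sort them in ascending order.

    Sorting is an assumption of this sketch for imposing the ordering restriction.
    """
    draws = [rng.gauss(mean, sd) for _ in range(n_boundaries)]
    return sorted(draws)

rng = random.Random(2024)  # fixed seed so the example is reproducible
easy_item = sample_ordered_thresholds(-1.5, 1.0, rng=rng)       # easy, normal fidelity
difficult_item = sample_ordered_thresholds(1.5, 0.5, rng=rng)   # difficult, high fidelity
print(easy_item)
print(difficult_item)
```

The same function covers the high-fidelity conditions by passing the smaller standard deviation of 0.50.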
Scale fidelity
An assessment’s fidelity is measured as the inverse of variability (i.e., bandwidth) in the difficulty of the items (Stocking, 1987). Fidelity contributes to assessment appropriateness by either restricting (high fidelity) or expanding (low fidelity) the width of the item difficulty distribution. The high-fidelity conditions were simulated in this study by generating a second set of item difficulty values from more restricted normal distributions: N(−1.50, 0.50) for easy scales, N(0.00, 0.50) for moderate scales, and N(1.50, 0.50) for difficult scales. These restricted distributions create the high-fidelity, low-bandwidth situation in which Kang and Waller (2005) observed the highest prevalence of Type I errors. As in the previous difficulty parameter selection, four (j − 1) difficulty values were sampled for each item from within the specified distribution, with the sequential ordering restriction imposed.
Regression weights
In accordance with Kang and Waller (2005), regression weights were set at a value of 0.30 or 0.50 for both β1 and β2. An intercept of 0 was used and is therefore omitted from the regression models. It should be noted that these regression weights are fixed only for the purposes of simulating the dependent variables.
Fixed Effects
Item response categories (j)
Five item response categories were used to simulate a five-category Likert-type response scale. Fields (2002) identified 134 validated construct assessments that are used in applied psychological research of which five-category Likert-type response scales were the most common (n = 57).
Regression models
The purpose of this study was to observe the prevalence of Type I errors in MMR in three different pairs of models. In the first regression model pair, actual latent trait scores θ were analyzed (see Equations 3a and 3b). In the second regression model pair, number-correct scores (X) were analyzed (see Equations 4a and 4b). In the third regression model pair, estimated theta scores
The first model of each pair is the additive model, and the second model in each pair contains a multiplicative (interaction) term. Each model pair was structured as a hierarchical regression analysis where the interaction term is entered at the second step (Aiken & West, 1991; Cohen, Cohen, West, & Aiken, 2003). A significant change in variance accounted for (ΔR2) between the first and second model indicated the existence of a spurious interaction effect based on the null hypothesis that the data were created with no significant interaction on the actual theta scale.
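As a reading aid, the three model pairs can be written out as follows. This is a reconstruction from the verbal description (additive model first, product term added second, intercept omitted), so the exact notation is an assumption; the coefficients are estimated separately in each model, and the hats denote estimated theta scores.

```latex
\begin{align}
\theta_3 &= \beta_1\theta_1 + \beta_2\theta_2 + \varepsilon \tag{3a}\\
\theta_3 &= \beta_1\theta_1 + \beta_2\theta_2 + \beta_3\theta_1\theta_2 + \varepsilon \tag{3b}\\
X_3 &= \beta_1 X_1 + \beta_2 X_2 + \varepsilon \tag{4a}\\
X_3 &= \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon \tag{4b}\\
\hat{\theta}_3 &= \beta_1\hat{\theta}_1 + \beta_2\hat{\theta}_2 + \varepsilon \tag{5a}\\
\hat{\theta}_3 &= \beta_1\hat{\theta}_1 + \beta_2\hat{\theta}_2 + \beta_3\hat{\theta}_1\hat{\theta}_2 + \varepsilon \tag{5b}
\end{align}
```

The change in variance accounted for when one term is added is conventionally tested with

```latex
F(1,\, df_{\text{res}}) = \frac{\Delta R^2}{\left(1 - R^2_{\text{mult}}\right)/df_{\text{res}}},
\qquad \Delta R^2 = R^2_{\text{mult}} - R^2_{\text{add}}
```

where df_res is the residual degrees of freedom of the multiplicative model.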
Regression Main Effects
Two continuous predictor variables were simulated for each regression model specified in Equations 3a through 5b. Predictor variables θ1 and θ2 were randomly sampled for the number of observations (n) from standard normal distributions, N(0.00, 1.00). These variables served as the main effect scores in the regression models. It is important to note that θ1 and θ2 were sampled from identical but independent distributions; thus, no multicollinearity was modeled.
Regression Criterion Variables
One continuous criterion variable was calculated for each regression model specified in Equations 3 through 5b. In accordance with Kang and Waller (2005), the general form of the criterion variables is given by the following equation, which represents a multiple regression model with two significant main effects and no interaction on the actual theta scale.
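A form consistent with this description (two main effects, a zero intercept, and no product term) is shown below. It is offered as an assumed reconstruction; in particular, the distribution and scaling of the error term are not specified here.

```latex
\theta_3 = \beta_1\theta_1 + \beta_2\theta_2 + \varepsilon \tag{6}
```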
In Equation 6, β1 and β2 are the simulated regression weights and ε is an error term. Note that the intercept term, β0, was set to equal 0 and thus omitted from the model. The term
Number-correct scores
To generate the number-correct scores, X1, X2, and X3, the values of the previously defined construct scores θ1, θ2, and θ3 were entered into the GRM algorithm (Equations 1 and 2) for each simulated participant.
A matrix of response scores was generated by reporting the number-correct score (1, 2, 3, 4, or 5) corresponding to the highest-category response likelihood for each simulated participant on each item. These values were derived using an algorithm written by the first author in the R language based on response probabilities calculated in Equations 1 and 2. Actual number-correct score responses were generated by comparing a randomly selected value from a uniform distribution, U(0.0, 1.0), with the relative response probabilities that are generated for each level of theta (or individual) and each item. This process can be thought of as determining the relative likelihood of a category response given the item and person parameters with a realistic level of decision-making error (Kang & Waller, 2005; Stone, 1992). This integration of response error is important so as to not assume perfect responding by simulated individuals. A mean score for X1, X2, and X3 for each simulated individual was calculated from the number-correct score response matrices for analysis in the regression models.
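The response-generation step can be sketched as follows: a uniform draw is compared with the cumulative category probabilities, and the item responses are averaged into a scale score. This is an illustrative Python sketch under the standard logistic GRM, not the first author's R algorithm; all function names, the seed, and the example item parameters are hypothetical.

```python
import math
import random

def grm_category_probs(theta, a, thresholds):
    """Category probabilities as differences of adjacent logistic boundary curves."""
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    return [cum[x] - cum[x + 1] for x in range(len(cum) - 1)]

def simulate_response(theta, a, thresholds, rng):
    """Draw one category score (1, 2, ...) by comparing a U(0, 1) value
    with the running sum of the category probabilities."""
    u = rng.random()
    probs = grm_category_probs(theta, a, thresholds)
    running = 0.0
    for score, p in enumerate(probs, start=1):
        running += p
        if u < running:
            return score
    return len(probs)  # guard against floating-point round-off in the running sum

def mean_scale_score(theta, items, rng):
    """Average the simulated item responses for one person."""
    responses = [simulate_response(theta, a, thresholds, rng) for a, thresholds in items]
    return sum(responses) / len(responses)

rng = random.Random(1)
# Hypothetical 15-item scale: high discrimination, moderate difficulty, normal fidelity.
items = [
    (rng.uniform(0.58, 1.13), sorted(rng.gauss(0.0, 1.0) for _ in range(4)))
    for _ in range(15)
]
x = mean_scale_score(0.5, items, rng)
print(x)
```

Because the category is drawn rather than set to the most probable one, two simulated people with the same θ can respond differently, which is the response error the text describes.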
Estimated Theta Scores
Finally, theta scores
Iterations
For the purposes of estimating Type I error rates in Monte Carlo studies, Robey and Barcikowski (1992) specify that approximately 1,000 iterations will achieve a power equal to .90 when approximating an alpha level of α = .05 and using the interval of
Simulation-Dependent Variables
Type I errors
The primary dependent variable for this study was the empirical Type I error rate (π) observed for the interaction term of the MMR models. The specific value of π was identified in a three-step process. First, in each iteration of the simulation, the variance in θ3 accounted for by θ1 and θ2 was recorded as the R2 value for the additive and multiplicative regression models specified in Equations 3a through 5b. Second, the significance of the change in variance accounted for, ΔR2, between the respective additive and multiplicative models was tested at an alpha level of
Procedure
The simulation for the current study was conducted in the R environment (version 2.9.0; Ihaka & Gentleman, 1996; R Development Core Team, 2008) using a series of functions written by the authors, contributed code from Kang and Waller (2005), and PARSCALE 4.1 (Muraki & Bock, 2003). For ease of interpretation, four separate simulations were conducted. The four simulations were separated based on sample size (n = 250, 750) and scale fidelity (normal, high). In each simulation, the independent variables of scale length, regression weights, discrimination, and difficulty were systematically varied. Therefore, the summary statistics for each simulation are included in four tables, each with 24 rows.
Each simulation was run using the following process. First, using the pseudorandom number generator in R, theta vectors were sampled from a standard normal distribution N(0.0, 1.0) for θ1 and θ2. Next, corresponding vectors for θ3 were calculated using Equation 6. These vectors were saved as the actual latent construct scores. To calculate the number-correct score matrices, X1, X2, and X3, each of these three score vectors was evaluated in an algorithm written by the first author that implements Equation 1 and Equation 2 to determine the probability of a category response. Final number-correct score values were determined by the comparison with a randomly selected value from a uniform distribution as previously described. Finally, the estimated theta scores
The nine score vectors were then entered into the corresponding additive and multiplicative regression models specified in Equations 3a through 5b, and the change in variance accounted for between the two corresponding models was recorded. The final summary statistics and tables were generated using portions of code provided by Niels Waller and used in the Kang and Waller (2005) study.
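The hierarchical step for one iteration can be sketched without external libraries. This is a minimal illustration, not the code used in the study: the helper names, the seed, the standard normal error term, and the decision to estimate an intercept in the fitted models are all assumptions made here.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def r_squared(columns, y):
    """R^2 of an OLS fit of y on the predictor columns plus an intercept (assumed)."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    XtX = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)] for r in range(p)]
    Xty = [sum(X[i][r] * y[i] for i in range(n)) for r in range(p)]
    beta = solve(XtX, Xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    y_bar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def delta_r2_f(x1, x2, y):
    """Return (delta R^2, F) for adding the x1 * x2 product term to the additive model."""
    n = len(y)
    interaction = [a * b for a, b in zip(x1, x2)]
    r2_add = r_squared([x1, x2], y)
    r2_mult = r_squared([x1, x2, interaction], y)
    delta = r2_mult - r2_add
    f_stat = delta / ((1.0 - r2_mult) / (n - 4))  # 1 and n - 4 degrees of freedom
    return delta, f_stat

rng = random.Random(7)
n = 250
theta1 = [rng.gauss(0, 1) for _ in range(n)]
theta2 = [rng.gauss(0, 1) for _ in range(n)]
# No interaction in the generating model; the N(0, 1) error term is an assumption.
theta3 = [0.3 * a + 0.3 * b + rng.gauss(0, 1) for a, b in zip(theta1, theta2)]
delta, f_stat = delta_r2_f(theta1, theta2, theta3)
print(delta, f_stat)
```

In a full simulation the F statistic would be compared with the critical value of the F(1, n − 4) distribution at the chosen alpha level, and the proportion of significant results across iterations would give the empirical Type I error rate for that condition.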
Results
Using the
Table 1. Results of Simulation 1 (Normal Fidelity, Distribution of Latent Construct Scores = Standard Normal N(0, 1))
Note: c = condition; n = number of individuals; bi,j−1 = item category difficulty distribution, Easy (assessment inappropriateness) = N(−1.5, 1), Moderate (assessment appropriateness) = N(0, 1), Difficult (assessment inappropriateness) = N(1.5, 1); ai = item discrimination distribution, Low = U(.31,.58), High = U(.58,1.13); β = regression weight; k = number of items;
Significant Type I Error rate based on the results of a binomial test.
Significant Type I Error rate based on the alpha +/- .5alpha criterion.
Table 2. Results of Simulation 2 (Normal Fidelity, Distribution of Latent Construct Scores = Standard Normal N(0, 1))
Note: c = condition; n = number of individuals; bi,j−1 = item category difficulty distribution, Easy (assessment inappropriateness) = N(−1.5, 1), Moderate (assessment appropriateness) = N(0, 1), Difficult (assessment inappropriateness) = N(1.5, 1); ai = item discrimination distribution, Low = U(.31,.58), High = U(.58,1.13); β = regression weight; k = number of items;
Significant Type I Error rate based on the results of a binomial test.
Significant Type I Error rate based on the alpha +/- .5alpha criterion.
Table 3. Results of Simulation 3 (High Fidelity, Distribution of Latent Construct Scores = Standard Normal N(0, 1))
Note: c = condition; n = number of individuals; bi,j−1 = item category difficulty distribution, Easy (assessment inappropriateness) = N(−1.5, 0.5), Moderate (assessment appropriateness) = N(0, 0.5), Difficult (assessment inappropriateness) = N(1.5, 0.5); ai = item discrimination distribution, Low = U(.31,.58), High = U(.58,1.13); β = regression weight; k = number of items;
Significant Type I Error rate based on the results of a binomial test.
Significant Type I Error rate based on the alpha +/- .5alpha criterion.
Table 4. Results of Simulation 4 (High Fidelity, Distribution of Latent Construct Scores = Standard Normal N(0, 1))
Note: c = condition; n = number of individuals; bi,j−1 = item category difficulty distribution, Easy (assessment inappropriateness) = N(−1.5, 0.5), Moderate (assessment appropriateness) = N(0, 0.5), Difficult (assessment inappropriateness) = N(1.5, 0.5); ai = item discrimination distribution, Low = U(.31,.58), High = U(.58,1.13); β = regression weight; k = number of items;
Significant Type I Error rate based on the results of a binomial test.
Significant Type I Error rate based on the results of the alpha +/- .5alpha criterion.
In addition, Figure 2 represents the frequency of the empirical Type I error rates for the number-correct scores and estimated theta scores. For the number-correct scores, these data indicate a positively skewed distribution (skew = 2.04) ranging from 3.1% to 84.9%, with a mean empirical Type I error rate of 17.5%, median of 8.7%, and a standard deviation of 20%. Restricting the summary to only those occurrences outside of the

Figure 2. Distribution of spurious interactions for number-correct scores and estimated theta scores
These results indicate that there were instances of spurious interactions regardless of the scoring method. However, it is clear that the number-correct scores performed much worse than the estimated theta scores in comparable conditions. Finally, an important finding to highlight is that, of the conditions with meaningfully inflated Type I error rates for the estimated theta scores, none were unique with regard to the number-correct scores (see Tables 1-4). In other words, no meaningful inflations existed for the estimated theta scores that did not also exist for the number-correct scores.
Assessment Appropriateness and Type I Errors
The results of this simulation also clearly indicate the anticipated effect of scoring method and assessment appropriateness on the occurrence of Type I errors for the interaction term of an MMR analysis. Figure 3 represents the mean and maximum Type I error rates for each scoring method collapsed across the 32 assessment appropriateness conditions. Under these conditions, there is no significant departure from the nominal Type I error rate, regardless of whether one uses simple raw scores or estimated theta scores in the MMR analysis. These results are consistent with previous findings related to scaling effects on Type I error rates for moderated statistical models (Davison & Sharma, 1990; Embretson, 1996; Kang & Waller, 2005).

Figure 3. Empirical Type I error rates for the interaction term of a simulated moderated multiple regression model under conditions of assessment appropriateness
However, striking differences in the empirical Type I error rate can be observed for each scoring method when the assessment is inappropriate for the individuals. Figure 4 represents the mean and maximum Type I error rates for each scoring method collapsed across the 64 assessment inappropriateness (easy/difficult) conditions. Number-correct scores resulted in empirical Type I error rates that were above the acceptable interval in 53 of the 64 (83%) inappropriate assessment conditions. At the iteration level, a direct logistic regression analysis indicated that the odds of committing a Type I error were 8.13 times greater when number-correct scores from inappropriate assessments were used, χ2(1, N = 96,000) = 5,008.55, p < .001, odds ratio = 8.13. In addition, estimated theta scores resulted in empirical Type I error rates that were above the acceptable interval in 33 of the 64 (51%) inappropriate assessment conditions. A direct logistic regression analysis indicated that the odds of committing a Type I error were 2.4 times greater when estimated theta scores from inappropriate assessments were used, χ2(1, N = 96,000) = 918.15, p < .001, odds ratio = 2.40.

Figure 4. Empirical Type I error rates for the interaction term of a simulated moderated multiple regression model under conditions of assessment inappropriateness
Impact of the Independent Variables on the Empirical Type I Error Rate
Table 5 represents the mean empirical Type I error rate as well as direct logistic regression tests for the levels of each independent variable. The dependent variable in the logistic regression analyses was the occurrence of a Type I error at the iteration level, coded as 1 if the ΔR2 between the additive and multiplicative model was significant (p < .05) and as 0 if it was not significant. All of the independent variables were entered into the model simultaneously as categorical predictors. A general pattern can be identified in these results such that higher empirical Type I error rates were observed for the stronger level of each independent variable. This pattern would indicate that each psychometric characteristic that was varied in the simulations had an overall effect on the empirical Type I error rates for the interaction term.
Table 5. Impact of Individual Predictors on Empirical Type I Error Rates
Note: OR = odds ratio.
Omnibus full model, χ2 (1, N = 96,000) = 17,157.51, p < .001, R2 = .27.
Omnibus full model, χ2 (1, N = 96,000) = 3,571.47, p < .001, R2 = .08.
In each case, the OR reported corresponds to increases in the predictor variable (e.g., increased assessment inappropriateness results in higher likelihoods of Type I errors, increases in discrimination result in higher likelihoods of Type I errors, etc.).
p < .001.
Several important findings can be identified from these results. First, the psychometric characteristics that were manipulated in this simulation had a stronger overall effect on Type I errors when the variables were operationalized as number-correct scores than when they were operationalized as estimated theta scores. These results suggest that number-correct scores are more sensitive to measurement effects in parametric analyses than are IRT-derived theta estimates. For both scoring methods, assessment appropriateness was the most impactful predictor of Type I errors, followed by item discrimination and regression weights. This result confirms and extends the effects of assessment appropriateness identified by Kang and Waller (2005), as well as arguments raised by Busemeyer (1980) on the role of assessment difficulty in parametric statistics.
Strength of Spurious Interaction Effects
Finally, the authors were interested in understanding how assessment appropriateness affected the strength of spurious interactions for the different scoring methods. The columns labeled ΔR² for each scoring method in Tables 1 through 4 indicate the average strength of the interaction when a spurious interaction was identified. Because sample size is known to affect the strength of interaction effects, the authors used a multivariate analysis of covariance (MANCOVA) to determine the effect of assessment appropriateness, using sample size as a covariate. After adjusting for the effects of sample size, the results indicated a significant effect of assessment appropriateness on the strength of spurious interaction effects for number-correct scores, F(1, 93) = 51.92, p < .001, partial η² = .36, such that the average interaction strength in the inappropriate assessment conditions (M = 0.015, SD = 0.007) was significantly greater than in the appropriate assessment conditions (M = 0.011, SD = 0.006). A similar, albeit much weaker, result was also identified for estimated theta scores, F(1, 93) = 4.47, p < .05, partial η² = .05, such that the average interaction strength in the inappropriate assessment conditions (M = 0.013, SD = 0.006) was significantly greater than in the appropriate assessment conditions (M = 0.012, SD = 0.006). No significant difference was identified for actual theta scores. These results indicate that assessment appropriateness has an effect on the strength of spurious interaction effects for number-correct scores and estimated theta scores and that the effect is considerably stronger in the number-correct score conditions.
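The reported partial η² values can be checked against the F statistics and degrees of freedom with the standard identity partial η² = (F × df_effect) / (F × df_effect + df_error). The short sketch below applies it to the two results reported above; the function name is illustrative.

```python
def partial_eta_squared(f_stat, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its df."""
    return (f_stat * df_effect) / (f_stat * df_effect + df_error)


eta_number_correct = partial_eta_squared(51.92, 1, 93)
eta_estimated_theta = partial_eta_squared(4.47, 1, 93)

print(round(eta_number_correct, 2))   # 0.36
print(round(eta_estimated_theta, 2))  # 0.05
```

Both values agree with those reported in the text to two decimal places.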
Discussion
Theoretical and empirical evidence has emerged to suggest that using IRT to operationalize an individual’s standing on a latent construct has important measurement implications over the use of number-correct scores (Borsboom, 2008; Embretson, 1996, 2006; Embretson & DeBoeck, 1994; Harwell & Gatti, 2001; Kang & Waller, 2005; Perline et al., 1979; Reise & Haviland, 2005; Reise et al., 2005; Wainer, 1982). Specifically, IRT-derived theta scores have been demonstrated to be resistant to inflated Type I error rates in moderated statistical models potentially due to achieving an interval, or nearly interval, scale of measurement (Embretson, 1996; Kang & Waller, 2005). This previous work has been limited to applications of dichotomous data and restrictive IRT models. Therefore, the authors’ goal was to extend their understanding of these potentially beneficial measurement properties by modeling multicategory data and implementing a polytomous IRT model. These studies represent a generalizability trend such that each successive study branches further away from the measurement ideal and into the realities of psychological data.
It is imperative to point out that under certain conditions, significantly inflated Type I error rates were observed for both the estimated theta scores and the number-correct scores. This result for the number-correct scores was expected; however, this result for the estimated theta scores was somewhat unexpected. Strong psychometric influences still resulted in inflated Type I error rates for the estimated theta scores from the GRM. However, it was often the case that the Type I error rate of the number-correct scores far exceeded that of the estimated theta scores. For example, in conditions 7, 8, 23, and 24 in Table 1, the Type I error rates for the number-correct scores ranged from .296 to .386 whereas the respective Type I error rates for the estimated theta scores ranged from .089 to .120. Clearly, given the alternative, the estimated theta scores would be more attractive to researchers in these conditions. In addition, in cases where the Type I error rates were grossly inflated, such as conditions 31, 32, 47, and 48 in Table 2 and 79, 80, 95, and 96 in Table 4, the Type I error rate for the number-correct scores was approximately 200% to 450% higher than the Type I error rate for the estimated theta scores. An examination of Figure 2 clearly reveals these disparities between the number-correct scores and estimated theta scores. Although the estimated theta scores did not perform perfectly within the acceptable limits, these results demonstrate a clear preference for their use in applied research when certain psychometric conditions exist.
Another prominent result concerned the role of assessment appropriateness in Type I error rates for both number-correct scores and estimated theta scores. Assessment appropriateness is defined as the congruence between the reliability of an assessment and the latent construct distribution of the individuals responding to an assessment (see Figure 1). The results of these simulations demonstrated that, under conditions of assessment appropriateness, there is no concern as to unacceptable Type I error rates for any psychometric condition or scoring technique. The complementary result, that spurious interactions were observed only under some conditions of assessment inappropriateness, was also true. Embretson (1996) and Kang and Waller (2005) also identified assessment appropriateness and inappropriateness as the primary factor in their simulations of spurious interaction effects. Embretson (1996) determined that the degree and direction of the inappropriateness fully accounted for the nature of the interaction with regard to treatment groups in a simulated factorial ANOVA. Prior to these studies, Maxwell and Delaney (1985) demonstrated how various distributional shapes of latent constructs can interact with assessment difficulty (appropriateness) to result in artificial group mean differences when the observed scores and latent scores were related through a nonlinear, monotonic relationship. However, the expectation that this effect would not occur when using estimated theta scores was not fully supported. This finding suggests that certain conditions may reduce the level of linearity in the theta–estimated theta relationship. Specifically, an examination of the root mean square error (RMSE) values in these conditions indicates less precise parameter recovery. This would suggest that the degree of congruence between the actual and estimated theta scores was being eroded in exceedingly easy or difficult assessments.
Ferrando (2009) explored the issue of measurement inappropriateness in which the appropriateness of an item can be constrained to a particular range of the trait continuum, thus degrading the fit of a linear model above and below the ceiling and floor limits. Indices for determining this range are provided for binary items in a unidimensional, two-parameter model as well as in multidimensional cases and can provide useful information for determining an acceptable range of measurement appropriateness.
These results also suggest a more complex relationship underlying data structures and the assessment of moderators in MMR analyses than perhaps previously thought. Paunonen and Jackson (1988) conducted a simulation in which Type I error rates were compared between ordinary least squares regression (OLS) and principal components regression (PCR) in relation to the multicollinearity of the predictors. Their results indicated that OLS performed much better than did PCR with regard to accurate moderator detection, and that linear transformations of the data had little effect on the Type I error rates for either procedure. In their study, the researchers simulated random effects data from normal distributions just as in this study, but did not investigate any influences of psychometric characteristics on the data (i.e., difficulty, discrimination, assessment appropriateness, etc.). Conceptually, Paunonen and Jackson (1988) generated data as if they were able to collect actual theta scores. It is not surprising, therefore, that their results were well within the normal Type I error rate for MMR. An examination of the empirical Type I error rates for the actual theta scores in Tables 1 through 4 replicates these findings. The generalizability of the results from Paunonen and Jackson’s simulation is therefore limited because actual latent trait scores are not directly observable. The results of this study indicate that psychometric characteristics such as assessment appropriateness and the operationalization of the scores have a significant influence on the performance of MMR analyses.
Limitations and Extensions
Violations of the normality assumption in MMR models were present in these simulations, lending some caution to the interpretation of the results. Specifically, skewness was observed in the simulated dependent variables, and the magnitude of the skew was positively related to the empirical Type I error rate for both number-correct scores and estimated theta scores. In addition, the results of the Shapiro–Wilk tests indicated that cases of residual nonnormality also increased with the empirical Type I error rate for both number-correct scores and estimated theta scores (see Tables 1 through 4). To examine these results more closely, score distributions were generated for three separate conditions in which the skew of the scores appeared to be strongly related to their respective Type I error rates. Figure 5 contains distributions for the actual theta scores and the derived scores (estimated theta and number-correct scores) from conditions 12, 44, and 96. These conditions represent cases in which number-correct scores and theta scores performed equally well (condition 12), number-correct scores performed poorly but theta scores performed well (condition 44), and neither number-correct scores nor theta scores performed well (condition 96). The commonality that can be observed here is that the poor performance is clearly aligned with the nonnormal score distributions, and the greatest amount of nonnormality is observed under conditions of assessment inappropriateness in which the latent construct distribution is poorly matched with the test information function (reliability) of the assessment.

Figure 5. Score distributions for conditions 12, 44, and 96
These findings create a potential confound such that the empirical Type I error rates may be partially related to these violations of normality, which were observed as a result of the simulations rather than a controlled factor. Kang and Waller (2005) identified a similar pattern in their study involving dichotomous data, and in a brief investigation, the researchers were able to correct some of the effects of nonnormality using the Box-Cox transformation. Specifically, moderate empirical Type I error rates responded well to this correction, but higher rates were not fully mitigated to the nominal level of .05 (Kang & Waller, 2005).
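For readers unfamiliar with the Box-Cox transformation, the following minimal sketch shows the transform and its effect on skew. It is not the procedure used by Kang and Waller (2005); the scores and the choice of λ = 0 are hypothetical and were picked only because they make the result easy to verify.

```python
import math


def box_cox(values, lam):
    """Box-Cox power transform; requires strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def skewness(values):
    """Moment-based sample skewness (third central moment over m2 ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed scores (an assumption, not simulation output).
scores = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
transformed = box_cox(scores, 0)  # lambda = 0 is the log transform

print(skewness(scores) > 0)                # True: raw scores are right-skewed
print(abs(skewness(transformed)) < 1e-9)   # True: log scores are symmetric
```

In practice λ is estimated from the data, and a monotonic transform of this kind cannot restore information lost to a hard ceiling or floor, which is consistent with the incomplete correction noted above.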
It is conceivable that variations in observed nonnormality are a result of variations in nonlinearity between the overall scale (test) response function and the latent trait. A preliminary investigation into scatterplots of individual iterations confirmed this relationship (results available on request). Thus, these relationships could be used as an indicator of a risk for spurious interactions. An additional indication of this effect is the pattern of parameter recovery values (RMSE) reported in Tables 1 through 4. As nonnormality increased (and in inappropriate assessment conditions), the IRT parameter estimation of the theta scores worsened.
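The nonlinearity between the scale response function and the latent trait can be illustrated with a small sketch of the expected summed score under the GRM. Under the GRM, the probability of responding in category k or higher is a logistic function of theta, and the expected item score (categories scored 0 to K - 1) is the sum of these cumulative probabilities. The item parameters below are hypothetical and describe a deliberately easy scale; they are not the parameters used in the simulations.

```python
import math


def grm_expected_item_score(theta, a, thresholds):
    """Expected score (0..K-1) on one graded response model item.

    P(X >= k | theta) = 1 / (1 + exp(-a * (theta - b_k))); summing these
    cumulative probabilities over the K-1 thresholds gives the expected score.
    """
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds)


def scale_response_function(theta, items):
    """Expected number-correct (summed) score across all items."""
    return sum(grm_expected_item_score(theta, a, bs) for a, bs in items)


# Hypothetical "too easy" scale: ten four-category items whose thresholds
# all sit below the mean of a standard normal trait distribution.
easy_scale = [(1.5, (-2.5, -1.5, -0.5))] * 10

gain_low = scale_response_function(-1.0, easy_scale) - scale_response_function(-2.0, easy_scale)
gain_high = scale_response_function(2.0, easy_scale) - scale_response_function(1.0, easy_scale)

# The same one-unit change in theta yields a much smaller change in the
# summed score near the ceiling than in the lower part of the trait range.
print(round(gain_low, 2), round(gain_high, 2))
```

The summed score remains a monotonic function of theta, but equal intervals on the latent scale are compressed near the ceiling, which is the kind of nonlinear, monotonic relationship discussed above.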
Finally, it is pertinent to discuss the inherent arbitrariness of the scaling metrics and the use of the GRM to generate the response score matrices. The actual theta scores were generated without a significant interaction and derived scores were tested for a significant interaction. Although this practice is common in measurement simulations involving the translation from IRT scores to other scoring methods (Embretson, 1996; Harwell et al., 1996; Kang & Waller, 2005), an argument can be made that the authors are favoring the GRM scores as a nonarbitrary trait scale and implicitly satisfying the assumptions of the IRT scale but not the number-correct score scale. A limiting factor that may arise here is that the results only extend to empirical situations in which the theta scale is regarded as the true metric. This limitation highlights the possibility that the interaction null hypothesis is scale specific and the findings may not generalize to situations in which another metric is considered to be the true measurement scale.
Conclusion
Overall, the results of this study provide support for the use of IRT for the operationalization of latent constructs. However, this support is not ubiquitous and the direct utility of IRT-derived theta scores for parametric statistics is limited to suboptimal psychometric conditions (i.e., assessment inappropriateness). These conditions resulted in ceiling and floor effects, which can be identified in preliminary data analyses; however, some evidence indicates that extreme cases will not be fully corrected with typical score transformations.
If a researcher can demonstrate that the reliability of the assessment and the distribution of construct scores for the individuals are reasonably matched, there is no evidence here that the Type I error rate will reach an unacceptable level for any scoring technique. However, Embretson (1996) cautioned that these conditions are difficult to anticipate, and the use of theta estimates could be justified as a default measurement technique to assuage any concerns. Furthermore, De Boeck and Wilson (2004) suggested that experimental hypotheses can be tested within an appropriate IRT model. This integration of measurement and experimental testing could provide another avenue for greater decision-making accuracy.
The investigation of the performance of polytomous IRT models in a variety of contexts is still an important avenue of measurement research. Harwell and Gatti (2001) specifically identified a need for a deeper understanding of the properties of the scores generated from complex models such as the GRM as well as various item and scale properties that can influence those scores. The simulations conducted in this study represent an important step forward in filling these needs.
Appendix
The error variance term can be derived in the following manner. First, the predictor variables θ1 and θ2 and the criterion variable θ3 are normally distributed as N(0.00, 1.00), that is, with a mean of 0 and a standard deviation of 1. Because the standard deviation is simply the square root of the variance, the variance of the predictor and criterion variables is equal to one. Given these conditions, the following derivation gives the error term for the regression models:
where
Finally, given that the two levels of regression weights β1 and β2 were simulated as .3 and .5, Equation 6 can be further reduced to the following form.
In the simulation, the criterion variable θ3 was operationalized with Equations A2a and A2b for the two levels of the regression weights, .3 and .5, respectively. An alternative way of specifying this derivation would be to say that the error associated with each regression model is being sampled from an N(0.00, 0.9055) distribution when the regression weights are .3 and from an N(0.00, 0.7071) distribution when the regression weights are .5.
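The derivation described above can be sketched as follows. This sketch assumes an additive model with both regression weights set to the same level and with uncorrelated predictors; the zero covariance is an assumption made here, chosen because it reproduces the error standard deviations reported in the text.

```latex
\begin{align*}
\theta_3 &= \beta_1\theta_1 + \beta_2\theta_2 + \varepsilon \\
\sigma^2_{\theta_3} &= \beta_1^2\sigma^2_{\theta_1} + \beta_2^2\sigma^2_{\theta_2}
  + 2\beta_1\beta_2\,\sigma_{\theta_1\theta_2} + \sigma^2_{\varepsilon} \\
\intertext{where $\sigma^2_{\theta_1} = \sigma^2_{\theta_2} = \sigma^2_{\theta_3} = 1$
  and $\sigma_{\theta_1\theta_2} = 0$ (assumed), so that}
\sigma^2_{\varepsilon} &= 1 - \beta_1^2 - \beta_2^2 \\
\beta_1 = \beta_2 = .3:\quad \sigma_{\varepsilon} &= \sqrt{1 - .09 - .09} = \sqrt{.82} \approx 0.9055 \\
\beta_1 = \beta_2 = .5:\quad \sigma_{\varepsilon} &= \sqrt{1 - .25 - .25} = \sqrt{.50} \approx 0.7071
\end{align*}
```

Under these assumptions, the larger regression weights leave less residual variance, so the error term for the .5 condition has the smaller standard deviation.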
Acknowledgements
The authors would like to acknowledge Niels Waller for his assistance with the R programming code for portions of the simulation. We would additionally like to thank Dr. Paula Popovich, Dr. Jeff Vancouver, and Dr. Victor Heh for their comments on a previous version of this manuscript.
The data presented in this study were generated as a part of the first author’s doctoral dissertation.
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
The author(s) received no financial support for the research, authorship, and/or publication of this article.
