The generalized S-X²–test is a test of item fit for items with a polytomous response format. The test is based on a comparison of the observed and expected numbers of responses in strata defined by the test score. In this article, we make four contributions. We demonstrate that the performance of the generalized S-X²–test depends on how sparse cells are pooled. We propose alternative implementations of the test within the framework of limited information testing. We derive the distribution of the residuals on which the test is based, which can be used for post hoc analyses. We suggest a diagnostic plot that visualizes the form of the misfit. The performance of the alternative implementations is investigated in a simulation study, which suggests that they control the Type-I error rate well and have high power. An empirical application concludes this article.
Item response theory is a field of psychology that deals with the development of item response models. Item response models specify how a latent trait is related to the responses of a test taker in a series of test items that are supposed to be manifest indicators of the trait. The models are used for deductive statements about the test taker’s trait levels and provide a mathematical basis for trait inference, adaptive testing, and test linking. Such applications, however, are justified only if the item response model used for assessment is capable of representing the true relation between the responses and the latent trait well. The guidelines of the American Psychological Association (2014) demand that evidence be given for the validity of an item response model before it is applied in psychological assessment.
The fit of an item response model to specific data is often assessed with a test of model fit. Tests of model fit assess whether certain aspects of the distribution of the responses in a sample conform to predictions from the model. A distinction is sometimes made between tests of global fit and tests of item fit. The tests of global fit are omnibus tests that assess whether the item response model fits the observed data in all items. The tests of item fit assess whether the model is capable of representing the distribution of the responses in a single item. Overviews of tests of model fit have been given by Swaminathan et al. (2006), Mavridis et al. (2007), Maydeu-Olivares (2013), Glas (2016), and Ranger and Much (2020). In this article, we focus on one specific test of item fit, the S-X²–test proposed by Orlando and Thissen (2000). The S-X²–test is among the most popular tests of item fit. It adheres to the nominal Type-I error rate well and has high power (Orlando & Thissen, 2003; Sinharay & Ying, 2007). Although originally proposed for the three-parameter logistic model, the S-X²–test has been extended to polytomous (Kang & Chen, 2008; LaHuis et al., 2011) and multidimensional item response models (Zhang & Stone, 2008; Zhang et al., 2018).
In this article, we make four additions to the knowledge in the field. First, we investigate the performance of the S-X²–test in polytomous item response models. We demonstrate that the performance of the test crucially depends on how sparse table cells are pooled. Pooling too many cells or pooling too few cells increases the Type-I error rate of the test beyond the intended level α. Second, we propose two alternative implementations of the test that adhere to the nominal Type-I error rate even when no cells are pooled. Third, we derive the distribution of the residuals the S-X²–test is based on. This distribution can be used for exploratory post hoc tests in case the S-X²–test indicates a misfit. Fourth, we suggest a graphical check of item fit that is similar to the residual plots used in regression analysis. The graphical check allows for an exploratory analysis of the form of misfit.
The outline of this article is as follows. In the first part of this article, we review the logistic model for graded response data, describe the S-X²–test, and give an overview of the literature in the field. Then, we discuss the alternative implementations of the test and derive their asymptotic distribution. We also describe the graphical check of item fit. In the second part of this article, we investigate the performance of the tests in a simulation study. The third part contains an empirical application.
1. The Logistic Model for Graded Response Data
Item response models represent the distribution of the responses in the items of a test for a population of test takers. In this article, we focus on one specific item response model, the logistic model for graded response data (GRM model) of Samejima (1997).
Let a psychological test consist of I items with graded response options coded from 0 to Mi. The response xi of a randomly drawn test taker in item i is regarded as a realization of the discrete random variate Xi, which has a probability mass function on the sample space {0, …, Mi}. Denote by x = (x1, …, xI) the response pattern of the test taker in all items of the test. The response pattern is the realization of the discrete random vector X = (X1, …, XI), whose probability mass function is defined on the Cartesian product of the sample spaces of the items. In item response theory, each test taker can be characterized by a latent trait. The trait value of a randomly chosen test taker is the realization of a continuous random variate with a density function on the real line. Here, we consider the latent trait as standard normally distributed. Conditionally on the latent trait, the responses and the response pattern have conditional distributions with corresponding conditional probability mass functions. The GRM model specifies the functional relation between the values of the conditional probability mass functions and the latent trait. Its integral parts are the cumulative category response function and the category response function.
The cumulative category response function (CCF function) maps the latent trait and the response to the conditional probability of responding with graded response option x or higher in item i. In the GRM model, the cumulative category response function is
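Under a slope-intercept parameterization, which is an assumption made here for illustration, the logistic cumulative category response function can be sketched as follows. The symbols θ for the latent trait, P* for the function, and b for the second kind of item parameters are hypothetical names; only ai and Mi are taken from the text.

```latex
P^{*}_{i}(x \mid \theta) = P(X_i \ge x \mid \theta)
  = \frac{\exp(a_i \theta + b_{ix})}{1 + \exp(a_i \theta + b_{ix})},
  \qquad x = 1, \ldots, M_i,
\qquad P^{*}_{i}(0 \mid \theta) = 1 .
```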
The CCF function depends on two kinds of item parameters. The item parameter ai is the item discrimination that determines the amount of influence the latent trait has on the response in item i. The remaining item parameters, one for each response option x = 1, …, Mi, determine the value of the function at a latent trait value of zero. Note that these item parameters are assumed to be ordered. The category response function (CRF function) links levels of the latent trait and the responses to the conditional probability of responding with response option x in item i. The category response function can be derived from the CCF function as follows:
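In generic notation, the category response function is the difference of two adjacent cumulative category response functions. The symbols P, P*, and θ below are assumed names for the category response function, the cumulative category response function, and the latent trait.

```latex
P_{i}(x \mid \theta) = P^{*}_{i}(x \mid \theta) - P^{*}_{i}(x + 1 \mid \theta),
  \qquad x = 0, \ldots, M_i,
\qquad \text{with } P^{*}_{i}(0 \mid \theta) = 1
  \text{ and } P^{*}_{i}(M_i + 1 \mid \theta) = 0 .
```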
In the GRM model, the responses to different items are assumed to be independent when conditioning on the latent trait. This assumption determines—in combination with the CCF functions—the probability mass function of the discrete random vector X representing the response pattern. The GRM model implies the probability mass function
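The implied probability mass function can be illustrated numerically. The following is a minimal sketch, not the implementation used in this article: it assumes a slope-intercept form of the item parameters, replaces Gauss-Hermite quadrature with a plain equally spaced grid over the standard normal density to stay dependency-free, and uses made-up item parameter values. All function names are hypothetical.

```python
import math
from itertools import product


def category_probs(theta, a, b):
    """Category response probabilities of one item at trait level theta.

    a: discrimination; b: list of M intercepts, assumed to be decreasing
    (slope-intercept form, an assumption of this sketch).
    """
    # Cumulative probabilities P(X >= x | theta) for x = 0, ..., M + 1.
    cum = [1.0] + [1.0 / (1.0 + math.exp(-(a * theta + bx))) for bx in b] + [0.0]
    return [cum[x] - cum[x + 1] for x in range(len(b) + 1)]


def trait_grid(n_nodes=81, lo=-6.0, hi=6.0):
    """Equally spaced nodes with normalized standard normal weights."""
    step = (hi - lo) / (n_nodes - 1)
    nodes = [lo + k * step for k in range(n_nodes)]
    dens = [math.exp(-0.5 * t * t) for t in nodes]
    total = sum(dens)
    return nodes, [d / total for d in dens]


def pattern_probs(items, n_nodes=81):
    """Marginal probability of every response pattern.

    items: list of (a, b) tuples. Returns a dict {pattern: probability}.
    """
    nodes, weights = trait_grid(n_nodes)
    # Category probabilities of all items at every node, computed once.
    cond = [[category_probs(t, a, b) for (a, b) in items] for t in nodes]
    out = {}
    for pattern in product(*[range(len(b) + 1) for (_, b) in items]):
        p = 0.0
        for w, probs in zip(weights, cond):
            lik = 1.0
            for i, x in enumerate(pattern):
                lik *= probs[i][x]  # conditional independence given the trait
            p += w * lik            # integrate over the trait distribution
        out[pattern] = p
    return out


# Hypothetical three-item test with three response options per item.
items = [(1.2, [1.0, -0.5]), (0.8, [0.5, -1.0]), (1.5, [0.0, -1.5])]
probs = pattern_probs(items)
```

Enumerating all response patterns is only feasible for short tests; it is used here because it mirrors the definition directly.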
2. The Generalized S-X²–Test of Item Fit
One aspect of model fit is the question of whether the category response function of an item response model is representative of the true conditional mass functions of the items. Statistical tests intended to assess this aspect of model fit are usually denoted as tests of item fit, although in a strict sense, it is the whole model that fits or not. The first tests of item fit date back to Bock (1972), Yen (1981), and McKinley and Mills (1985). The general procedure in these tests is very similar. First, the unknown latent traits of the test takers are replaced by estimates (e.g., maximum likelihood estimates or empirical Bayes estimates). The test takers are then grouped into strata of increasing trait level. For each stratum characterized by a typical trait level, the observed frequencies of the responses are determined as well as the frequencies that are expected according to the assumed category response function and the typical trait level. Item fit is then assessed by quantifying the correspondence of the observed and the expected frequencies over all strata. This is accomplished with one of the test statistics originally proposed for the analysis of frequency tables. The test statistics are usually compared to the quantiles of a χ² distribution. This procedure, however, lacks a mathematical justification. More importantly, the χ² distribution does not work well as a reference distribution in practice. This is, at least partly, due to the fact that the test takers are grouped on the basis of an error-prone trait estimate. For this reason, later implementations of the tests approximate the distribution of the test statistic via a parametric bootstrap (Rizopoulos, 2006) or account for the measurement error in the trait estimates explicitly (Stone, 2003). In an influential paper, Orlando and Thissen (2000) proposed an alternative test of item fit that does not require the grouping of the test takers into strata on the basis of an error-prone trait estimate. The strata are formed on the basis of the observed test scores instead. The test was originally proposed for tests with binary responses but later generalized to tests with graded responses (Kang & Chen, 2008). This extended test, the generalized S-X²–test of item fit, is considered in the following.
Let the data consist of the response patterns of N randomly chosen test takers in a test with I items. The response patterns of the test takers are considered independent realizations of the random vector X, whose probability mass function is given in Equation 3 if the GRM model holds. The number of possible response patterns is the product of the numbers of response options, Mi + 1, over all I items. We assume that the different response patterns have been numbered in an arbitrary order, with the index g indicating a specific response pattern xg. For simplicity, we use the placeholder G for the cardinality of the sample space in the following. Each response pattern xg occurs in the sample with a certain absolute frequency. The vector of all G frequencies is the realization of a random vector with a multinomial distribution whose parameters are denoted as N and p. The values of p are the values of the probability mass function given in Equation 3, provided that the GRM model holds.
The theoretical counterparts to the observed frequencies of the response patterns are their model-based expectations. As the frequencies of the response patterns have a multinomial distribution, the model-based expectations are just Np, where p is given in Equation 3. In practice, the model-based expectations are not available because the values of the item parameters (Equation 1) are not known. The model-based expectations, however, can be estimated on the basis of Equation 3 when the unknown values of the item parameters are replaced by marginal maximum likelihood estimates (Baker & Kim, 2004). These estimated model-based expectations are sample statistics and, thus, realizations of corresponding random variates. Here, as in the following, we do not explicitly indicate the dependency of the estimated model-based expectations on the marginal maximum likelihood estimates of the item parameters.
In the S-X²–test, the observed frequencies are compared to the estimated model-based expectations. In order to avoid sparse tables, the data are aggregated as follows. First, the frequency distribution of the test scores is determined. The test score is simply the sum of the graded responses in a response pattern; note that in a test with I items, the possible test scores range from 0 to the sum of the maximal responses Mi over all items. We denote a specific test score as s and the maximal test score as S in the following. The frequency with which test score s occurs in the sample is determined by summing the frequencies of all response patterns with a test score of s; formally, these response patterns are selected with an indicator function that compares the sum of the responses in a pattern with s. Additionally, the frequency of the joint occurrence of a test score s and a response x in a selected target item i is determined by summing the frequencies of all response patterns with test score s and response x in item i. The corresponding model-based expectations are determined similarly by summing the estimated model-based expectations of the same response patterns. Note that the joint frequency and its estimated model-based expectation are zero whenever the response x exceeds the test score s or the remaining score s − x exceeds the maximal score attainable in the remaining items. The sample statistics are realizations of corresponding random variates.
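The aggregation described above can be sketched in a few lines. The function name and the toy frequencies are hypothetical; the same function applies to observed frequencies and to estimated model-based expectations, because both are defined at the level of response patterns.

```python
def aggregate_by_score(values, item):
    """Sum pattern-level values into score-level and score-by-response tables.

    values: dict {response pattern (tuple): value}, where the value is either
            an observed frequency or an estimated model-based expectation.
    item:   zero-based index of the target item.
    Returns two dicts: {s: total} and {(s, x): total}.
    """
    by_score = {}
    by_score_and_response = {}
    for pattern, value in values.items():
        s = sum(pattern)   # test score of the pattern
        x = pattern[item]  # response in the target item
        by_score[s] = by_score.get(s, 0.0) + value
        key = (s, x)
        by_score_and_response[key] = by_score_and_response.get(key, 0.0) + value
    return by_score, by_score_and_response


# Toy observed frequencies for a two-item test (made-up numbers).
observed = {(0, 0): 3, (0, 1): 5, (1, 0): 4, (1, 1): 6, (2, 1): 2}
n_s, n_sx = aggregate_by_score(observed, item=0)
```

Cells that never appear in the returned dictionaries correspond to combinations of test score and response that no pattern can produce.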
The S-X²–test is an item-specific test that evaluates the correspondence of the observed and expected frequencies. Its test statistic is defined for item i as
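In frequency notation, the statistic of Kang and Chen (2008) can be sketched as a Pearson-type sum over test scores and response options. The symbols are assumed names introduced for illustration: n_s and n_sx are the observed frequencies of test score s and of the joint occurrence of test score s and response x, and ê_s and ê_sx are the corresponding estimated model-based expectations.

```latex
S\text{-}X^{2}_{i} = \sum_{s} \sum_{x = 0}^{M_i}
  \frac{\left( n_{sx} - n_{s}\, \hat{e}_{sx} / \hat{e}_{s} \right)^{2}}
       { n_{s}\, \hat{e}_{sx} / \hat{e}_{s} } .
```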
In Equation 4, the sum excludes those levels of the test scores that imply at least one observed and one expected frequency of zero. In practice, some estimated model-based expectations are small. In this case, it is recommended to pool the frequencies over responses or test scores (Kang & Chen, 2008; Orlando & Thissen, 2000).
The S-X²–test has similarities to the standard X²–tests for frequency tables (Moore, 1978). To see this, the observed frequencies have to be arranged in a table where columns denote the test score and rows the response in item i. The ratio of the estimated model-based expectation of the joint occurrence of test score s and response x to the estimated model-based expectation of test score s is the predicted probability of choosing response x in item i conditionally on receiving test score s. This is multiplied by the observed frequency of the test score. The S-X²–test, thus, compares the observed frequencies of the responses to their model-based predictions for strata defined by the test score. Despite this similarity, the S-X²–test and the standard X²–tests differ fundamentally with respect to the underlying sampling models, such that the asymptotic theory for X²–tests cannot be applied to the S-X²–test. On the basis of heuristic considerations, the statistic is evaluated with respect to a χ² distribution. The degrees of freedom are the number of test scores considered in Equation 4 multiplied by the number of response categories minus one, reduced by the number of item parameters in item i. In practice, some test scores have a frequency of zero. In this case, the test score is not considered when determining the degrees of freedom. Pooling cells within a test score or merging rows of adjacent test scores reduces the degrees of freedom further. The S-X²–test usually controls the Type-I error rate well. It does not, however, perform well in small samples and tests with many items (Kang & Chen, 2008).
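The computation of the statistic and of the heuristic degrees of freedom can be sketched as follows. This is a simplified illustration without pooling of sparse cells, not the implementation of the mirt package. It assumes that a test score is skipped when it was not observed or when one of its cells has an expected frequency of zero; the function name, argument names, and numbers are hypothetical.

```python
def s_x2(n_s, n_sx, e_s, e_sx, n_cat, n_par):
    """Pearson-type item fit statistic with heuristic degrees of freedom.

    n_s, e_s:   dicts {s: observed frequency / estimated expectation}.
    n_sx, e_sx: dicts {(s, x): observed frequency / estimated expectation}.
    n_cat:      number of response categories of the target item.
    n_par:      number of item parameters of the target item.
    """
    stat = 0.0
    used_scores = 0
    for s in sorted(e_s):
        cells = [e_sx.get((s, x), 0.0) for x in range(n_cat)]
        # Skip scores that were not observed or that contain empty cells.
        if n_s.get(s, 0) == 0 or min(cells) <= 0.0:
            continue
        used_scores += 1
        for x, e in enumerate(cells):
            # Predicted conditional probability times observed score frequency.
            predicted = n_s[s] * e / e_s[s]
            stat += (n_sx.get((s, x), 0) - predicted) ** 2 / predicted
    df = used_scores * (n_cat - 1) - n_par
    return stat, df


# Made-up frequencies for a binary target item and three test scores.
e_s = {1: 10.0, 2: 10.0, 3: 5.0}
e_sx = {(1, 0): 6.0, (1, 1): 4.0, (2, 0): 3.0, (2, 1): 7.0, (3, 0): 1.0, (3, 1): 4.0}
n_s = {1: 10, 2: 20, 3: 5}
n_sx = {(1, 0): 6, (1, 1): 4, (2, 0): 10, (2, 1): 10, (3, 0): 1, (3, 1): 4}
stat, df = s_x2(n_s, n_sx, e_s, e_sx, n_cat=2, n_par=2)
```

The resulting statistic would then be compared with the upper quantiles of a χ² distribution with df degrees of freedom.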
3. Alternative Implementations of the Test and the –Residuals
In this section, we present two alternative versions of the S-X²–test. The first version, denoted as the LI–test in the following, is a linear function of limited information statistics. The second version, referred to as the modified test in the following, is a modification of the original S-X²–test.
3.1. LI Test
The formula of the S-X² statistic given in Equation 4 can be modified as follows. The statistic contains, as a factor, the ratio of the observed frequency of test score s to its estimated model-based expectation. This factor converges to 1 in probability provided that the GRM model holds. For this reason, it is natural to consider simply the differences between the observed frequencies of the joint occurrence of a test score s and a response x and their estimated model-based expectations. Note that the components of the differences are linear combinations of the observed frequencies of the response patterns and their estimated model-based expectations in the sample. They are, thus, limited information statistics and can be interpreted as generalized residuals (e.g., Haberman & Sinharay, 2013; Maydeu-Olivares & Joe, 2005). This is beneficial as all theoretical findings for limited information statistics can be applied; see Online Appendix A for more details. As an alternative to the S-X² statistic, we suggest assessing item fit by simply adding the squared differences over the test scores and the response options:
Similar to the S-X² statistic, we exclude test scores where observed frequencies are necessarily zero. The value of the LI statistic in the sample is the realization of a random variate whose distribution can be approximated as follows. The differences are realizations of random variates that have a multivariate normal distribution in the limit when they are divided by √N and the GRM model holds. The expected values of the differences are zero. This follows from findings for limited information statistics; see Online Appendix A for more details. The value of the statistic in the sample is, thus, the realization of a sum of squared centered multivariate normally distributed random variates. The asymptotic distribution of such sums is a mixture of χ² distributions when scaled properly (Yuan & Bentler, 2010). The asymptotic distribution can be approximated as follows. Consider the variance covariance matrix of all differences that enter Equation 5; a description of how this matrix can be estimated is given in Online Appendix A. From the eigenvalues of this matrix, a scaling factor and degrees of freedom are determined. The distribution of the statistic is then approximated by that of a scaled χ² random variate with these degrees of freedom. The scaling factor and the degrees of freedom are chosen in order to match the moments of the scaled χ² random variate with those of the statistic; see Yuan and Bentler (2010) for more details. Similar approximations are widely used in categorical data analysis and date back to Welch (1938).
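The moment matching can be sketched as follows, assuming the usual two-moment approach in which the mean and the variance of the scaled χ² variate are equated with those of the weighted sum of χ² variates. The function name is hypothetical, and the covariance matrix is assumed to be given on the same scale on which the differences enter the statistic. The eigenvalues need not be computed explicitly, because their sum equals the trace of the matrix and the sum of their squares equals the sum of its squared entries.

```python
def scaled_chi2_parameters(cov):
    """Scaling factor and degrees of freedom of the matching scaled chi-square.

    cov: symmetric covariance matrix as a list of lists.
    Returns (scale, df) such that scale * chi2(df) has the same mean and
    variance as the weighted sum of chi-square variates with one degree of
    freedom whose weights are the eigenvalues of cov.
    """
    k = len(cov)
    # Sum of the eigenvalues: trace of the matrix.
    sum_eig = sum(cov[j][j] for j in range(k))
    # Sum of the squared eigenvalues: trace of cov times cov, which for a
    # symmetric matrix is the sum of all squared entries.
    sum_eig_sq = sum(cov[j][l] ** 2 for j in range(k) for l in range(k))
    scale = sum_eig_sq / sum_eig
    df = sum_eig ** 2 / sum_eig_sq
    return scale, df


# Toy matrix with eigenvalues 3 and 1.
scale, df = scaled_chi2_parameters([[2.0, 1.0], [1.0, 2.0]])
```

With these two quantities, the mean of the approximating variate is scale × df and its variance is 2 × scale² × df.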
As a reviewer pointed out, a scaled χ² distribution is a special case of the gamma distribution. Hence, the asymptotic distribution of the LI statistic can also be approximated by a gamma distribution whose shape and scale parameters follow from the degrees of freedom and the scaling factor:
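The relation between the two representations can be written down directly. With c denoting the scaling factor and d the degrees of freedom (both assumed symbol names), a scaled χ² variate is a gamma variate in the shape-scale parameterization:

```latex
c \, \chi^{2}_{d} \;\sim\; \mathrm{Gamma}\!\left( \text{shape} = \frac{d}{2},\; \text{scale} = 2c \right),
\qquad
\mathrm{E} = c\,d, \qquad \mathrm{Var} = 2\,c^{2} d .
```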
This approximation is more natural as the degrees of freedom are usually not a natural number. Although the components of the S-X²–test and the LI–test appear to be similar and the factor that distinguishes them converges to one in probability, their asymptotic distributions are different. This is due to the fact that the S-X²–test and the LI–test differ in how the distribution of the test score affects the test. In the S-X²–test, the sum of the predicted frequencies over x is necessarily identical to the observed frequency of test score s. This is not the case in the LI–test; for further differences, see Online Appendix A.
3.2. Modified Test
The modified test is based on the residuals that appear in the numerator of the S-X² statistic:
All components of the residuals are limited information statistics that result from aggregating the frequencies of certain response patterns (Maydeu-Olivares & Joe, 2005). The components can, thus, be interpreted as generalized residuals (Haberman & Sinharay, 2013); see Online Appendix A for more details. The observed residual is the realization of a random variate that is a function of four random variates: the observed frequency of the test score, the observed frequency of the joint occurrence of test score and response, and the estimated model-based expectations of both. As these random variates are linear functions of the frequencies of the response patterns, they have a multivariate normal distribution in the limit provided they are properly scaled; see Section 3.1. In contrast to the differences used in the LI–test, the residuals are not a linear function of their constituents. The asymptotic distribution of the residuals can be derived by an application of the delta method, where Equation 7 is approximated by a truncated Taylor expansion (Agresti, 1990, p. 424). We refer to Online Appendix A for more details. In summary, it can be shown that the residuals, when divided by √N, are asymptotically distributed according to a multivariate normal distribution with an expectation of zero and a variance covariance matrix that follows from the delta method.
As an alternative to the S-X² statistic, we suggest summing the squared residuals:
In Equation 8, we exclude the same residuals as in the original statistic. We additionally exclude the residuals in the highest response option (Mi) as they are linearly dependent on the remaining residuals. The value of the statistic is the realization of a random variate whose asymptotic distribution is a mixture of χ² random variates when scaled properly, for the same reasons as given above. We suggest the same approximation for this distribution. Consider the variance covariance matrix of all residuals that enter the calculation of the test statistic and the eigenvalues of this matrix. With a scaling factor and degrees of freedom determined from the eigenvalues as in Section 3.1, the distribution of the statistic is approximated as
a gamma distribution, where the shape parameter and the scale parameter follow from the degrees of freedom and the scaling factor. The modified test differs from the S-X²–test in how the discrepancy is measured. There is, however, another difference. The modified test is based on the interpretation of the test scores as multinomially distributed random variates. Our framework is, thus, the framework of multinomial sampling. The S-X²–test, on the other hand, is based on a different sampling model as the distribution of the test score is considered as fixed.
4. A Graphical Check of Item Fit
Tests of item fit can be used in order to identify items that require a closer inspection. The residuals of Equation 7 can be used for post hoc tests that might clarify for which test scores the observed frequencies and the model-based expectations diverge. Such analyses are sometimes useful in order to clarify at what trait level the assumed category response function is invalid. For a further analysis of the source of the misfit, we suggest a plot. The plot is based on the interpretation of the graded responses as realizations of originally continuous but then categorized random variates. For the plot, the values of the residuals of the originally continuous random variates, the so-called augmented residuals, are recovered. The augmented residuals are then used for graphical checks of model fit in a similar way as the residuals in regression analysis. The idea of using augmented residuals was introduced by Liu and Zhang (2018) for the visualization of model fit in the ordinal logit model. Here, we adapt this approach to item response theory. Graphical checks of model fit have been proposed before; see, for example, the suggestions of Sijtsma (1998), Douglas and Cohen (2001), Haberman et al. (2013), or Sinharay (2006). The latter two suggestions even allow for inferential statements. Our plot is not supposed to replace these suggestions. We consider the plot a complement that gives additional insight into the source of misfit. The proposed plot is simple to interpret as it is not based on estimates of the different category response functions. Furthermore, it allows for a distinction between different sources of misfit. This is similar to the residual plots in regression analysis, where linearity and variance homogeneity can be evaluated separately.
The GRM model can be interpreted as a linear factor model with logistically distributed residuals where the originally latent continuous responses are categorized at thresholds (e.g., Takane & de Leeuw, 1987). In this interpretation, the latent response of a test taker in an item is a linear function of the latent trait plus a residual U that has a standard logistic distribution; the intercept and the slope of the linear function are item parameters. The observable graded responses result from the categorization of the latent continuous responses. Assume that the response options in item i determine Mi ordered thresholds. The observed graded response xi is just the number of thresholds that are exceeded by the latent response. It can be shown that the conditional probability of a response can be represented by the CRF function given in Equation 2 when adequate item parameter values and thresholds are chosen; see Online Appendix B for more details. For the plot, we recover values of U, the so-called augmented residuals, for all test takers in an item.
The procedure is as follows. For each test taker in the sample, we draw a latent trait value from the standard normal distribution and augmented residuals ui (i = 1, …, I) from the standard logistic distribution. Then, the corresponding response pattern of graded responses is determined by categorizing the implied continuous responses at the thresholds of the items. The intercepts, the slopes, and the thresholds are replaced by the values that are implied by the marginal maximum likelihood estimates of the GRM model parameters; more details on this are given in Online Appendix B. When the generated data are coherent with the observed response pattern of the test taker, the augmented residual in target item i is retained. The generated data are called coherent when the generated response in the target item is identical to the response of the test taker and the sum of the remaining generated responses is identical to the test score minus the response in the target item. When the generated data are not coherent, new augmented residuals are simulated. This procedure is repeated for all test takers in the sample. In order to reduce the randomness, one can generate several replications of augmented residuals for each test taker.
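The steps above amount to a rejection sampler, which can be sketched as follows. The sketch makes several assumptions that are not taken from the article: the intercept of the latent response is set to zero, each item is described by a slope and a list of increasing thresholds, and a cap on the number of attempts is added. All names and parameter values are hypothetical.

```python
import math
import random


def draw_logistic(rng):
    """Draw from the standard logistic distribution by inversion."""
    p = rng.random()
    while p == 0.0:  # avoid log(0)
        p = rng.random()
    return math.log(p / (1.0 - p))


def categorize(y, thresholds):
    """Graded response: number of thresholds exceeded by the latent response."""
    return sum(1 for t in thresholds if y > t)


def augmented_residual(pattern, target, items, rng, max_tries=100000):
    """Return one augmented residual for the target item of one test taker.

    pattern: observed response pattern (tuple of graded responses).
    target:  zero-based index of the target item.
    items:   list of (slope, thresholds) tuples, thresholds increasing.
    """
    rest_score = sum(pattern) - pattern[target]
    for _ in range(max_tries):
        theta = rng.gauss(0.0, 1.0)
        u = [draw_logistic(rng) for _ in items]
        generated = [categorize(a * theta + u_j, thr)
                     for (a, thr), u_j in zip(items, u)]
        # Coherent: same response in the target item, same rest score.
        if (generated[target] == pattern[target]
                and sum(generated) - generated[target] == rest_score):
            return u[target]
    raise RuntimeError("no coherent draw found")


rng = random.Random(1)
items = [(1.0, [-1.0, 0.0, 1.0]), (1.2, [-0.5, 0.5, 1.5]), (0.8, [-1.5, 0.0, 1.0])]
residual = augmented_residual((2, 1, 3), target=0, items=items, rng=rng)
```

Repeating the call for every test taker, possibly several times each, yields the augmented residuals that are plotted against the rest score.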
The augmented residuals of the test takers in a target item i are plotted against the test score of the test takers in the remaining items. Here, in contrast to the tests of model fit discussed above, the target item i is excluded when summing the responses. In case the model holds (and the real item parameters are used when generating the augmented residuals), the augmented residuals are distributed according to a standard logistic distribution in all strata of different test scores; a proof of this is given in Online Appendix B. Whether this distribution holds for the augmented residuals can be checked graphically with any plot developed for the evaluation of model fit in linear regression models. We suggest plotting the augmented residuals and the squared augmented residuals against the test score. The residuals should scatter randomly around zero and the squared residuals randomly around π²/3 (approximately 3.29), the variance of the standard logistic distribution.
5. Simulation Study
In order to compare the performance of the tests, we conducted a small-scale simulation study consisting of two scenarios. In the first scenario, we assessed the size of the tests, in the second scenario their power.
5.1. Simulation Scenario I
In the first simulation scenario, the data were generated and analyzed with the GRM model. The data consisted of the responses to items with five graded response options. The data were generated as follows. The latent traits of the fictitious test takers were drawn from a standard normal distribution. Then, the conditional probabilities of choosing one of the five response options were determined according to Equation 2. Having determined the probabilities, a response was generated for all combinations of a test taker with an item. Three different sample sizes were considered, namely, sample sizes of N = 125, N = 250, and N = 1,000. The length of the test was either I = 5, I = 10, or I = 20. These test lengths and sample sizes are typical for data in personality assessment and were chosen in accordance with previous simulation studies (Kang & Chen, 2008; LaHuis et al., 2011). Item parameter values were chosen such that the distribution of the responses varied systematically, being right skewed, symmetric, or left skewed; the exact values of the item parameters are reported in Online Appendix C. Fully crossing the three sample sizes with the three lengths of the test defined nine simulation conditions, for each of which 1,000 simulation samples were generated.
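The data generation can be sketched as follows. The item parameter values are made up for illustration and are not the values reported in Online Appendix C; the slope-intercept form of the item parameters and all function names are assumptions of this sketch.

```python
import math
import random


def category_probs(theta, a, b):
    """Category response probabilities of one item at trait level theta.

    a: discrimination; b: decreasing intercepts (slope-intercept form,
    an assumption of this sketch).
    """
    cum = [1.0] + [1.0 / (1.0 + math.exp(-(a * theta + bx))) for bx in b] + [0.0]
    return [cum[x] - cum[x + 1] for x in range(len(b) + 1)]


def simulate_responses(n_persons, items, rng):
    """Generate graded responses of n_persons test takers to all items."""
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)  # latent trait, standard normal
        row = []
        for a, b in items:
            probs = category_probs(theta, a, b)
            # Draw one response option according to its probability.
            row.append(rng.choices(range(len(probs)), weights=probs)[0])
        data.append(row)
    return data


rng = random.Random(2024)
# Five hypothetical items with five response options (0 to 4) each.
items = [(1.0, [1.5, 0.5, -0.5, -1.5]) for _ in range(5)]
data = simulate_responses(125, items, rng)
```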
The data were analyzed with the GRM model. The item parameters were estimated by marginal maximum likelihood estimation using the mirt package (Chalmers, 2012) from the statistical software environment R (R Development Core Team, 2009). Three tests were performed, namely, the original S-X²–test, the LI–test, and the modified test. For the S-X²–test, we used the implementation in the mirt package (Chalmers, 2012). Three different versions of the S-X²–test were requested. In the first version, no cells were pooled, that is, the observed frequencies and model-based expectations were compared for all test scores that occurred at least once. In the second version, cells with small model-based expectations were combined such that all combined cells had an expected frequency of at least one. This is the default implementation of the S-X²–test in the mirt package. In the third version, cells were combined such that all combined cells had an expected frequency of at least five. This is the size that is usually recommended for the analysis of frequency tables (Agresti, 1990, p. 246). The LI–test and the modified test were implemented as described above. The marginal probabilities of the response patterns (Equation 3) were determined with Gauss-Hermite quadrature (Stroud, 1971). The simulation study was implemented in the software environment R.
The results for the different tests and the different simulation conditions are reported in Table 1. Here, the empirical rejection rates of the tests are tabulated for different nominal Type-I error rates α. We also report the Kolmogorov–Smirnov (KS) statistic that measures the difference between the empirical distribution function of the p values and the uniform distribution for the different tests. A small value of the KS statistic implies that a test works as expected. In Table 1, the results have been averaged over the items. Results were similar for the individual items.
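The KS statistic for the uniformity of the p values can be sketched as follows. The function names are hypothetical, and the critical value uses the standard large-sample approximation for a level of .05, which is chosen here only as an example.

```python
import math


def ks_uniform(p_values):
    """One-sample KS statistic of p values against the uniform distribution."""
    p = sorted(p_values)
    n = len(p)
    # Largest distance of the empirical distribution function above and
    # below the uniform distribution function F(v) = v.
    d_plus = max((k + 1) / n - v for k, v in enumerate(p))
    d_minus = max(v - k / n for k, v in enumerate(p))
    return max(d_plus, d_minus)


def ks_critical_value_05(n):
    """Large-sample approximation of the critical value at a level of .05."""
    return 1.358 / math.sqrt(n)


ks = ks_uniform([0.125, 0.375, 0.625, 0.875])
crit = ks_critical_value_05(1000)
```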
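Table 1. Type-I Error Rates of the Three Tests of Item Fit for Different Amounts of Aggregation, Nominal Type-I Error Rates (α), Sample Sizes (N), and Test Lengths (I) as Well as the Statistic of the Kolmogorov–Smirnov (KS) Test Assessing Uniformity

| Test | Agg. | α | I = 5, N = 125 | I = 5, N = 250 | I = 5, N = 1,000 | I = 10, N = 125 | I = 10, N = 250 | I = 10, N = 1,000 | I = 20, N = 125 | I = 20, N = 250 | I = 20, N = 1,000 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Original | None | .10 | .120 | .123 | .116 | .137 | .144 | .141 | .150 | .158 | .162 |
| | | .05 | .088 | .088 | .068 | .103 | .117 | .107 | .110 | .121 | .126 |
| | | .01 | .053 | .049 | .026 | .065 | .080 | .070 | .064 | .071 | .084 |
| | | KS | .167 | .134 | .060 | .197 | .218 | .218 | .126 | .159 | .216 |
| Original | e = 1 | .10 | .125 | .107 | .100 | .134 | .111 | .101 | .157 | .116 | .093 |
| | | .05 | .066 | .049 | .050 | .064 | .056 | .051 | .076 | .057 | .045 |
| | | .01 | .016 | .012 | .009 | .012 | .011 | .011 | .014 | .011 | .009 |
| | | KS | .112 | .050 | .030 | .144 | .073 | .031 | .177 | .094 | .032 |
| Original | e = 5 | .10 | .572 | .228 | .108 | .469 | .274 | .120 | .419 | .254 | .135 |
| | | .05 | .383 | .125 | .052 | .303 | .159 | .062 | .258 | .144 | .070 |
| | | .01 | .127 | .028 | .010 | .093 | .040 | .013 | .075 | .034 | .014 |
| | | KS | .598 | .250 | .051 | .512 | .305 | .068 | .464 | .285 | .097 |
| LI | None | .10 | .098 | .099 | .104 | .117 | .105 | .112 | .104 | .112 | .103 |
| | | .05 | .052 | .048 | .052 | .061 | .053 | .060 | .054 | .058 | .055 |
| | | .01 | .010 | .011 | .014 | .012 | .012 | .014 | .014 | .013 | .011 |
| | | KS | .041 | .027 | .028 | .065 | .056 | .053 | .042 | .042 | .033 |
| Modified | None | .10 | .090 | .093 | .092 | .094 | .104 | .114 | .086 | .098 | .102 |
| | | .05 | .046 | .050 | .047 | .046 | .051 | .059 | .043 | .052 | .051 |
| | | .01 | .010 | .012 | .013 | .010 | .014 | .012 | .009 | .012 | .012 |
| | | KS | .034 | .029 | .028 | .035 | .041 | .049 | .043 | .029 | .034 |

Note. Results are based on 1,000 simulation samples. Results have been averaged over the different items. Original = original S-X²–test given in Equation 4; LI = LI–test given in Equation 5; Modified = modified test given in Equation 8; Agg. = aggregation of sparse cells (None = no pooling; e = 1 and e = 5 = pooling to an expected frequency of at least one and at least five, respectively); KS = statistic of the Kolmogorov–Smirnov test.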
The results in Table 1 suggest that the LI–test and the modified test control the Type-I error rate well under all conditions. The distribution of the p values is also close to the uniform distribution; the KS statistics can be judged against the critical value of the KS test for 1,000 replications. Findings are different for the S-X²–test, whose performance depends on the sample size, the length of the test, and the amount of pooling. In samples of N = 125 test takers, the empirical rejection rates are too high irrespective of the amount of pooling. In the samples of N = 250 and N = 1,000 test takers, the S-X²–test controls the Type-I error rate well provided that cells are pooled to an expected frequency of one. The KS statistics of the S-X²–test are still rather large in longer tests and often larger than the KS statistics of the alternative implementations. The S-X²–test performs less well when cells are not pooled at all or when cells are pooled to an expected frequency of five. This is remarkable as the last finding contradicts the typical recommendation for the standard X²–test.
In addition to the performance of the tests, we analyzed the distribution of the residuals given in Equation 7; note that these residuals could be used for a post hoc analysis in order to determine the region where the CRF functions are misspecified. For the analysis, we tested whether the residuals’ expectations deviated from zero on the basis of Wald tests; note that an estimate of the covariance matrix of the residuals is a by-product of the modified test. The empirical rejection rates are visualized in Figure 1 separately for test scores from 4 upward and the five response categories. For lack of space, only the results for one prototypical item are reported (Item 3). Results were similar for the remaining items. Figure 1 also contains a 95% confidence band within which the empirical rejection rates should be located.
Figure 1. Empirical rejection rates of a test of whether the expectation of the residuals is zero in Item 3 in the first simulation scenario, for different test scores s ranging from 4 upward. Results are reported separately for the different combinations of the test length I and the sample size N. The straight line indicates the nominal level, and the dotted lines indicate a 95% confidence band within which the empirical rejection rates should be located. Different plotting symbols refer to the different response options x (0, 1, 2, 3, 4).
Figure 1 implies that the empirical rejection rates are close to the intended nominal Type-I error rate of . Only in low and high test scores with low frequency, the Wald test is not capable of controlling the Type-I error rate well. Overall, these findings suggest that the –residuals can be used for post hoc analyses.
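A confidence band of the kind shown in Figure 1 can be approximated as follows. The sketch assumes a normal approximation to the binomial distribution of the number of rejections and a nominal level of .05 with 1,000 replications; the band in the figure may have been computed differently (e.g., from exact binomial quantiles), and the function name is hypothetical.

```python
import math


def rejection_rate_band(alpha, n_replications, z=1.96):
    """Approximate 95% band for an empirical rejection rate whose expectation is alpha."""
    # Standard error of a proportion estimated from n_replications independent tests.
    half_width = z * math.sqrt(alpha * (1.0 - alpha) / n_replications)
    return max(0.0, alpha - half_width), min(1.0, alpha + half_width)


# Assumed values: nominal level .05 and 1,000 simulation samples.
lower, upper = rejection_rate_band(alpha=0.05, n_replications=1000)
print(f"band: [{lower:.4f}, {upper:.4f}]")
```

Empirical rejection rates outside this band would be hard to reconcile with a correctly controlled Type-I error rate.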
5.2. Simulation Scenario II
In the second simulation scenario, we investigated the power of the tests. For this purpose, we generated data that violate the GRM model. As the model can be interpreted in terms of a linear factor model with originally continuous, but then categorized indicators—see the section on the augmented residuals above and Online Appendix B—we studied model misspecification in terms of a misspecified linear factor model. Two forms of misspecification were considered, variance heterogeneity (Condition A) and nonlinearity (Condition B).
5.2.1. Condition A
The GRM model can be derived from a linear factor model with standard logistically distributed responses when the continuous responses are categorized (e.g., Takane & de Leeuw, 1987). In factor analysis, the conditional expectation of the latent continuous responses is a linear function of the latent trait, and the conditional variance is constant. In Condition A, we misspecified the GRM model by relating both the conditional expectation and the conditional variance of the latent continuous response to the latent trait. This is similar to heteroscedasticity in a linear regression model. Such violations are due to unexplained heterogeneity.
Data were generated for a test of 10 items and 1,000 test takers. Other sample sizes and numbers of items were not considered anymore. We chose this setting as in 10 items and 1,000 test takers all tests adhered to the nominal Type I error rate well; see Table 1. This is necessary in order to compare the power of the tests. The data were generated as follows. We first drew a latent trait from the standard normal distribution for each test taker. We then generated a latent continuous response to each item by a linear combination of the trait and a residual that was drawn randomly from the logistic distribution. The residuals were drawn from a standard logistic distribution in eight of the items. These were the regular items. In two items, Item 3 and Item 8, respectively, the residuals were drawn from a logistic distribution with expectation of zero and standard deviation of . These were the irregular items; note that is the standard deviation of the standard logistic distribution. The responses were then categorized at thresholds that corresponded to the graded response options in the items. When generating the data, we ensured that the implied category response functions were identical to the category response functions of the first simulation scenario in the regular items. The true category response functions as well as their best approximations in the Kullback–Leibler sense (White, 1982) that are provided by the GRM model are visualized in Online Appendix C for the irregular items. Altogether 1,000 simulation samples were generated.
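The data-generating steps described above can be sketched for a single item as follows. This is an illustration under stated assumptions, not the authors' R implementation: the loading, the thresholds, and in particular the way the residual scale depends on the trait (`math.exp(0.5 * theta)`) are hypothetical choices, since the exact values are not recoverable from the text.

```python
import math
import random


def logistic_residual(rng, scale=1.0):
    """Draw from a logistic distribution with mean zero via the inverse CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); rng.random() never returns 1.0
        u = rng.random()
    return scale * math.log(u / (1.0 - u))


def categorize(y, thresholds):
    """Map a continuous response to a graded response 0, ..., len(thresholds)."""
    return sum(y > t for t in thresholds)


def simulate_item(thetas, loading, thresholds, scale_fn, rng):
    """Latent response = loading * theta + residual, then categorized at the thresholds."""
    return [
        categorize(loading * theta + logistic_residual(rng, scale_fn(theta)), thresholds)
        for theta in thetas
    ]


rng = random.Random(2021)
thetas = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # latent traits of 1,000 test takers
thresholds = [-2.0, -0.5, 0.5, 2.0]                  # hypothetical thresholds (five categories)

# Regular item: residual scale is constant (standard logistic distribution).
regular = simulate_item(thetas, 1.5, thresholds, lambda theta: 1.0, rng)
# Irregular item: residual scale varies with the trait (hypothetical functional form).
irregular = simulate_item(thetas, 1.5, thresholds, lambda theta: math.exp(0.5 * theta), rng)
```

Fitting a GRM to data that contain such an irregular item gives a misspecified model for that item, which the tests of item fit are meant to detect.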
Data were analyzed as in the first simulation scenario. A GRM model was fitted to the data via marginal maximum likelihood estimation. Then, the tests of item fit were performed, namely, the –test, the –test, and the –test. For the –test, the cells were aggregated in order to achieve an expected frequency of one. As we noted that the power of the tests increased by aggregation, we also aggregated cells for the –test and the –test for the sake of comparability. We, however, also performed the tests without aggregation. The simulation study was implemented in the software environment R as in the first simulation scenario. All scripts are available from the authors on request.
The empirical rejection rates of the tests are reported in Table 2 for different nominal Type-I error rates . Results are reported for the irregular items (Power) and the regular items (Type-I error). Results have been aggregated over the different items.
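Table 2. Empirical Rejection Rates of Different Tests in the Second Simulation Scenario and Condition A Reported Separately for Regular and Irregular Items, Different Levels , and Different Amounts of Aggregation (Agg./Com)

| Aggregation | Irregular items: .10 | Irregular items: .05 | Irregular items: .01 | Regular items: .10 | Regular items: .05 | Regular items: .01 |
|---|---|---|---|---|---|---|
| Agg. | .439 | .324 | .140 | .106 | .055 | .010 |
| Com | .301 | .190 | .063 | .121 | .068 | .016 |
| Agg. | .309 | .206 | .074 | .103 | .052 | .010 |
| Com | .264 | .172 | .052 | .123 | .065 | .016 |
| Agg. | .342 | .232 | .098 | .086 | .041 | .009 |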
Note. Results are based on 1,000 simulation samples. Results have been averaged over the different items. = original –test given in Equation 4; = LI–test given in Equation 5; = modified –test given in Equation 8; Agg. = aggregation of sparse cells; Com = no aggregation of sparse cells.
The tests have moderate power to detect the irregular items. The –test is most powerful. The –test and the –test without aggregation have the lowest power. Aggregating cells increases the power slightly. In the regular items, the empirical rejection rate is close to . This suggests that the tests are capable of distinguishing between regular and irregular items. Note, however, that classifying items as regular or irregular is not fully justified: It is always the whole model that fits or does not fit. In a strict sense, one therefore cannot interpret the rejection frequencies in the regular items as Type-I error rates and those in the irregular items as power. Note also that in practice, one would probably use a correction for multiple testing, which would decrease the power of the tests further.
In addition to the global tests of item fit, we performed local tests of item fit on the basis of the –residuals. For this purpose, we tested whether the –residuals have an expectation of zero. We used Wald’s tests and a nominal Type-I error rate of . The empirical rejection rates are visualized in Figure 2 for the different response categories, test scores, and four items. Item 3 and Item 8 are irregular items. Item 2 and Item 9 are regular items with similar item parameters. Figure 2 also contains a 95% confidence band within which the observed rejection rates should fall when their expectation is .
Figure 2. Empirical rejection rates of a Wald test assessing whether the expectation of the –residuals is zero as a function of the test score s and the response category in four items of a test for and simulation scenario II, Condition A. The test score ranges from 4 to . The straight line denotes the level of , and the dotted lines denote a 95% confidence band within which the observed rejection rates should fall when their expectation is . Different plotting symbols refer to different response options x (, , , , ).
For the regular items, the empirical rejection rates are close to . Similar to the first simulation scenario (Figure 1), there is some deviation in small and large test scores that occur infrequently. In the irregular items, the empirical rejection rates exceed . The highest rejection rates are achieved in test scores from to and response category .
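The local tests can be sketched as univariate Wald tests. The sketch assumes that each residual is tested separately, that its variance is taken from the diagonal of the estimated covariance matrix, and that the residual is approximately normally distributed; the function name and the numerical inputs are hypothetical.

```python
from statistics import NormalDist


def wald_test(residual, variance):
    """Wald test of H0: E(residual) = 0 for a single, approximately normal residual.

    Returns the Wald statistic (chi-square with one degree of freedom under H0)
    and the corresponding two-sided p value.
    """
    z = residual / variance ** 0.5
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z * z, p_value


# Hypothetical residual and variance for one combination of test score and response category.
statistic, p_value = wald_test(residual=0.031, variance=0.0002)
print(f"W = {statistic:.2f}, p = {p_value:.3f}")
```

Repeating the test for every combination of test score and response category gives the rejection rates that are plotted against the test score.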
We also investigated the misfit graphically with the augmented residual plots suggested in the previous section. Figure 3 contains a plot of the augmented and the squared augmented residuals against the test score in the regular Item 2 and the irregular Item 3 for an exemplary data set. The plots also include a curve (red line) depicting an estimate of the conditional expectation that was determined with a loess smoother. Any deviation of the curve from its expectation under model fit (black line) indicates a form of misspecification. The plots are interpreted in the same way as the residual plots in regression analysis. The plots reveal that the expected values of the squared augmented residuals depend on the test score in Item 3. This accords with the way the misspecification was induced.
Figure 3. Plot of augmented residuals U (left side) and squared augmented residuals U² (right side) against the test score s in regular Item 2 and irregular Item 3 of simulation scenario II, Condition A. The red line visualizes an estimate of the conditional expectation as a function of the test score that was determined with a loess smoother, and the black line visualizes the expected value under correct model specification.
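The smoothed curve in such a plot can be approximated with a simplified LOESS-type estimator. The sketch below uses a local linear fit with tricube weights and omits the robustness iterations of a full loess implementation; the residuals are simulated stand-ins, and all names are hypothetical.

```python
import random


def local_linear_fit(x0, xs, ys, span=0.75):
    """Tricube-weighted local linear estimate of E[y | x = x0]."""
    n = len(xs)
    k = min(n, max(2, round(span * n)))           # number of neighbours in the window
    h = sorted(abs(x - x0) for x in xs)[k - 1]    # bandwidth: distance to the k-th neighbour
    h = h if h > 0 else 1e-12
    ws = [(1.0 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    sw = sum(ws)
    if sw == 0.0:                                 # degenerate window: fall back to the mean
        return sum(ys) / n
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)


rng = random.Random(3)
scores = [rng.randint(4, 36) for _ in range(500)]     # hypothetical test scores
residuals = [rng.gauss(0.0, 1.0) for _ in scores]     # stand-in for augmented residuals

# Smoothed conditional expectation of the residual at each test score.
curve = {s: local_linear_fit(s, scores, residuals) for s in range(4, 37)}
```

Under model fit, the smoothed curve should stay close to the expected value; applying the same smoother to the squared residuals gives the curve for the variance check.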
5.2.2. Condition B
In Condition B, we violated the linearity assumption of the GRM model with respect to the underlying continuous, but then categorized latent response. In two items, Item 3 and Item 8, we assumed that the expected values decreased quadratically in the range , but then increased linearly for ; for evidence for nonmonotonous relations between the trait and the response tendency in personality and attitudinal scales, see Meijer and Baneke (2004) and Stark et al. (2006). The nonmonotonous relation in the latent continuous response implies nonmonotonous CCF functions.
Apart from the form of misspecification, the procedure in Condition B was similar to the procedure in Condition A. Data were generated for a test of 10 items and a sample size of 1,000 test takers. Data were simulated as follows. We first drew a latent trait from the standard normal distribution for each test taker. We then generated a latent continuous response to each item by using a linear combination of the trait and a logistically distributed residual in the regular items or by adding a quadratic function of the trait and a logistically distributed residual in the irregular items. The responses were then categorized at thresholds that corresponded to the graded response options in the items. When generating the data, we ensured that the implied category response functions of the correctly specified items were identical to the ones of the first simulation scenario. The true category response functions as well as their best approximations in the Kullback–Leibler sense (White, 1982) provided by the GRM model are visualized in Online Appendix C for the irregular items. Altogether 1,000 simulation samples were generated.
Data were analyzed as in the first simulation scenario. A GRM model was fitted to the data via marginal maximum likelihood estimation. Then, the tests of item fit were performed, namely, the –test, the –test, and the –test. For the –test, the cells were aggregated in order to achieve an expected frequency of one. As we noted that the power of the tests increased by aggregation, we also aggregated cells for the –test and the –test for comparability, but also performed the tests without aggregation.
The empirical rejection rates of the tests are reported in Table 3 for different nominal Type-I error rates . Results are reported for the irregular items (Power) and the regular items (Type-I error). Results have been aggregated over the different items.
Table 3. Empirical Rejection Rates of Different Tests in the Second Simulation Scenario and Condition B Reported Separately for Regular and Irregular Items, Different Levels , and Different Amounts of Aggregation (Agg./None)

| Aggregation | Irregular items: .10 | Irregular items: .05 | Irregular items: .01 | Regular items: .10 | Regular items: .05 | Regular items: .01 |
|---|---|---|---|---|---|---|
| Agg. | .242 | .146 | .037 | .110 | .057 | .011 |
| None | .214 | .126 | .042 | .185 | .110 | .034 |
| Agg. | .528 | .418 | .216 | .177 | .103 | .031 |
| None | .146 | .091 | .030 | .126 | .067 | .016 |
| Agg. | .393 | .283 | .120 | .113 | .058 | .012 |

Note. Results are based on 1,000 simulation samples. Results have been averaged over the different items. = original –test given in Equation 4; = LI–test given in Equation 5; = modified –test given in Equation 8; Agg. = aggregation of sparse cells; None = no aggregation of sparse cells.
The power of the –test, the power of the –test without aggregation, and the power of the –test without aggregation are very low. The power increases when sparse cells are aggregated. In this case, the –test detects the irregular items with a power of at . The –test has a power of . The rejection rates in the regular items are slightly above in all tests, an exception being the –test. This implies that the –test is not very specific. Although it is sensitive to the presence of misspecification, it cannot identify the items that are misspecified. The –test with aggregation is better in this respect.
In addition to the global tests of item fit, we performed local tests of item fit on the basis of the –residuals. We tested whether the –residuals have an expectation of zero with a Wald test. The nominal Type-I error rate was set to . The empirical rejection rates are visualized in Figure 4 for the different response categories, test scores, and four items. Item 3 and Item 8 are irregular items, and Item 2 and Item 9 are regular items. Figure 4 also contains a 95% confidence band within which the observed rejection rates should fall when their expectation is .
Figure 4. Empirical rejection rates of a Wald test assessing whether the expectation of the –residuals is zero as a function of the test score s and the response category in four items of a test for and simulation scenario II, Condition B. The test score ranges from 4 to . The straight line denotes the level of , and the dotted lines denote a 95% confidence band within which the observed rejection rates should fall when their expectation is . Different plotting symbols refer to different response options x (, , , , ).
The power to detect deviations from the assumed category response function is rather low. In the irregular items, the empirical rejection rates are highest around a test score of where they reach a value of about . The empirical rejection rates are close to for test scores below .
We also investigated the type of misfit graphically with the augmented residual plot suggested in the previous section. Figure 5 contains a plot of the augmented and the squared augmented residuals in the regular Item 7 and the irregular Item 8 for an exemplary data set. The plots also contain a curve (red line) representing the estimated conditional expectation that was determined with a LOESS smoother. The conditional expectation under model fit is represented by a black line. The plots are interpreted in the same way as the standard residual plots in regression analysis. Any deviation between the two curves indicates a misspecification of the model. The plots reveal that the relation between the latent trait and the conditional expectation of the augmented residuals is not linear in the irregular items.
Figure 5. Plot of augmented residuals U (left side) and squared augmented residuals U² (right side) against the test score s in regular Item 7 and irregular Item 8 of simulation scenario II, Condition B. The red line visualizes an estimate of the conditional expectation as a function of the test score that was determined with a loess smoother. The black line visualizes the expected value of the augmented residual under correct model specification.
6. Empirical Example
The use of the proposed tests is illustrated with a real-data application. The data set contained the responses of participants to the 20-item German-language version of the Fear of Negative Evaluation Scale (FNES; Watson & Friend, 1969; German: Vormbrock & Neuser, 1983). The FNES is a frequently used instrument for assessing individual differences in the fear of being judged disparagingly by others. Responses are given on a 5-point Likert-type scale (0 = almost never; 4 = almost always); a sample item is “I am afraid that others will not approve of me.” The raw data are available from the authors upon request. Test scores ranged from 0 to . Extreme test scores, however, were very infrequent. In the sample, 79% of all test scores were between and .
Results are summarized in Table 4. Table 4 contains the parameter estimates and the results of the -test, the -test, and the -test. In all tests, the cells were pooled. Reversed items were recoded before performing the tests; note that results are different when reversed items are not recoded. Due to the different degrees of freedom, the original values of the test statistics cannot be compared directly. For this reason, we report the p values of the tests and quantiles that correspond to the tests’ p values for a reference distribution with degrees of freedom. These allow the comparison of the quantiles over the different tests.
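The conversion of a p value into a quantile of a common reference distribution can be sketched as follows. The sketch assumes that the reference distribution is a chi-square distribution and replaces the exact quantile function with the Wilson–Hilferty approximation so that only the Python standard library is needed; the degrees of freedom used below are a hypothetical placeholder, not the value used in the article.

```python
from statistics import NormalDist


def chi2_quantile_wh(prob, df):
    """Wilson-Hilferty approximation to the chi-square quantile (requires 0 < prob < 1)."""
    z = NormalDist().inv_cdf(prob)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3


def p_to_reference_quantile(p_value, df_ref):
    """Quantile Q of the reference distribution whose upper-tail probability equals p_value."""
    return chi2_quantile_wh(1.0 - p_value, df_ref)


df_ref = 100  # hypothetical degrees of freedom of the reference distribution
for p in (0.50, 0.05, 0.01):
    print(p, round(p_to_reference_quantile(p, df_ref), 2))
```

Because all quantiles refer to the same distribution, a smaller p value always maps to a larger Q, which makes the values comparable across tests with different degrees of freedom. Rounded p values of exactly 0 or 1 cannot be converted this way.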
Table 4. Item Parameter Estimates, p Values, and Corresponding Quantiles Q of a Reference Distribution in the Items of the Fear of Negative Evaluation Scale

| Item | a_i | Threshold 1 | Threshold 2 | Threshold 3 | Threshold 4 | p (1st test) | Q (1st test) | p (2nd test) | Q (2nd test) | p (3rd test) | Q (3rd test) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.04 | −0.92 | 0.57 | 2.04 | 3.99 | .49 | 292.05 | .01 | 351.08 | .23 | 309.83 |
| 2 | 2.85 | −1.23 | −0.31 | 0.56 | 1.56 | .08 | 325.79 | .02 | 344.37 | .09 | 324.36 |
| 3 | 2.20 | −1.51 | −0.51 | 0.47 | 1.68 | .17 | 315.24 | .00 | 361.16 | .05 | 332.83 |
| 4 | 0.53 | −5.52 | −2.63 | −0.16 | 2.64 | .06 | 331.19 | .00 | 358.40 | .21 | 311.20 |
| 5 | 2.35 | −2.07 | −1.00 | 0.29 | 1.36 | .57 | 286.84 | .23 | 309.53 | .95 | 253.58 |
| 6 | 2.23 | −1.63 | −0.73 | 0.29 | 1.56 | .04 | 334.60 | .00 | 383.37 | .02 | 343.83 |
| 7 | 0.67 | −4.85 | −1.20 | 0.33 | 2.63 | .04 | 336.45 | .00 | 363.33 | .08 | 327.10 |
| 8 | 2.27 | −1.08 | −0.01 | 0.98 | 1.90 | .58 | 286.37 | .20 | 311.81 | .91 | 259.68 |
| 9 | 3.66 | −1.30 | −0.36 | 0.45 | 1.43 | .06 | 330.90 | .12 | 320.16 | .72 | 277.78 |
| 10 | 2.11 | −1.04 | 0.09 | 0.89 | 1.84 | .73 | 276.58 | .08 | 327.27 | .55 | 288.36 |
| 11 | 0.60 | −4.14 | −1.09 | 0.85 | 3.51 | .04 | 337.20 | .00 | 396.64 | .00 | 381.33 |
| 12 | 2.13 | −1.73 | −0.70 | 0.20 | 1.54 | .16 | 315.79 | .02 | 346.38 | .20 | 312.41 |
| 13 | 2.71 | −0.94 | −0.05 | 0.74 | 1.74 | .13 | 318.94 | .12 | 320.91 | .47 | 293.24 |
| 14 | 3.17 | −0.88 | 0.10 | 0.87 | 1.78 | .01 | 353.65 | .07 | 329.37 | .20 | 311.88 |
| 15 | 1.46 | −2.42 | −0.93 | 0.06 | 1.51 | .16 | 316.03 | .00 | 365.01 | .05 | 333.25 |
| 16 | 1.89 | −1.74 | −0.67 | 0.33 | 1.54 | .00 | 367.06 | .01 | 351.96 | .20 | 311.69 |
| 17 | 0.99 | −3.40 | −1.69 | −0.49 | 1.09 | .01 | 357.61 | .09 | 325.57 | .48 | 292.68 |
| 18 | 2.67 | −0.82 | 0.03 | 0.64 | 1.33 | .41 | 296.98 | .18 | 314.05 | .44 | 295.13 |
| 19 | 2.91 | −1.08 | −0.27 | 0.42 | 1.52 | .34 | 301.68 | .18 | 313.98 | .59 | 285.71 |
| 20 | 2.04 | −1.64 | −0.61 | 0.21 | 1.48 | .52 | 290.06 | .02 | 344.84 | .22 | 310.18 |
The –test flags six items, the –test 11 items, and the –test two items as misfitting at . Two items, Item 6 and Item 11, are flagged by all tests. The p values of the –test were moderately correlated to the p values of the –test and the –test (Kendall : ). The p values of the –test and the –test were more similar (Kendall : ). This suggests that the two tests agree on which items are relatively more suspicious. We further evaluated the fit of Item 11, which had the highest Q value. We performed local tests of item fit on the basis of the –residuals. Findings, however, were not clear. Wald tests on whether the expected residual is zero were significant for some combinations of test score and response. There was, however, no interpretable pattern. This is not really surprising as some forms of misfit have heterogeneous effects on the category characteristic curves; see Figure C1 and Figure C2 in Online Appendix C. We finally determined the augmented residual plot for Item 11; see Figure 6. The plot suggests that there is a form of variance heterogeneity. The variance seems to be too small in average test scores.
Figure 6. Plot of augmented residuals U (left side) and squared augmented residuals U² (right side) against the test score s in Item 11 of the Fear of Negative Evaluation Scale. The red line visualizes an estimate of the conditional expectation as a function of the test score that was determined with a loess smoother, and the black line visualizes the expected value under correct model specification.
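The numbers of flagged items can be retraced from the rounded p values in Table 4. The sketch below assumes a significance level of .05 with a strict inequality, which reproduces the reported counts of six, 11, and two items; the dictionary keys simply denote the order of the three tests in the table.

```python
# Rounded p values from Table 4, Items 1 to 20, in the order of the three tests.
p_values = {
    "first": [.49, .08, .17, .06, .57, .04, .04, .58, .06, .73,
              .04, .16, .13, .01, .16, .00, .01, .41, .34, .52],
    "second": [.01, .02, .00, .00, .23, .00, .00, .20, .12, .08,
               .00, .02, .12, .07, .00, .01, .09, .18, .18, .02],
    "third": [.23, .09, .05, .21, .95, .02, .08, .91, .72, .55,
              .00, .20, .47, .20, .05, .20, .48, .44, .59, .22],
}

alpha = 0.05  # assumed significance level

# Item numbers (1-based) with p < alpha, separately for each test.
flagged = {
    test: {item for item, p in enumerate(ps, start=1) if p < alpha}
    for test, ps in p_values.items()
}
counts = [len(flagged[test]) for test in ("first", "second", "third")]
common = sorted(set.intersection(*flagged.values()))

print(counts)  # number of flagged items per test
print(common)  # items flagged by all three tests
```

Items 3 and 15 have a rounded p value of .05 in the third test and are therefore not counted as flagged under the strict inequality.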
In addition to Item 11, the tests indicated that Item 6 and, to a lesser degree, Items 3, 7, and 15 show statistically significant deviations from the underlying model. An analysis of the items’ content revealed no overlap in wording or meaning across all flagged items that would clearly explain why those items were flagged. However, three of the five flagged items (i.e., 7, 11, and 15) are worded negatively. Evidence that reversed items produce ambiguous results by eliciting changes in response behavior, acquiescence, or misunderstandings of negations (e.g., Lindwall et al., 2012; Roszkowski & Soven, 2010; Weijters et al., 2013), together with prior findings on the negatively worded FNES items (Carleton et al., 2007; Rodebaugh et al., 2004), might help to explain why the tests flag negatively worded items as deviating from the GRM model. One also might speculate that the flagged items were not adequate for the nonclinical sample we used. The latter would indicate that researchers should be cautious when using the FNES for comparing clinical and nonclinical samples in studies.
7. Discussion
Identifying items that undermine the fit of the data to the employed item response model is of great importance for the development and validation of psychological tests. One of the most popular tests of item fit is the –test that was proposed by Orlando and Thissen (2000) and extended to polytomous item response models by Kang and Chen (2008). This test compares the observed frequencies of responses with the corresponding model-based expectations in strata defined by the test scores. Numerous findings suggest that the –test works very well in tests with binary items (Orlando & Thissen, 2003; Sinharay & Ying, 2007) but is liberal in tests with graded response format (Kang & Chen, 2008; LaHuis et al., 2011). In this article, we have expanded the previous findings on the –test in several ways. We compared the –test to two alternative variants, we derived the distribution of the associated residuals, and we proposed a graphical check of item fit that can be used as a supplement to the statistical tests.
As the –test does not have the same mathematical foundation as, for example, the limited information tests (e.g., Haberman & Sinharay, 2013; Maydeu-Olivares & Joe, 2005), we considered two alternative tests, the LI–test and the –test. The two alternative tests differ from the –test in several aspects. The first alternative, the LI–test, is based on the joint distribution of the test score and the response in an item. As it is based on limited information statistics, its asymptotic distribution can be derived within the limited information framework developed by Maydeu-Olivares and Joe (2005). The –test differs from the –test in the way the model-based expectations are determined. In the –test, the sum of the model-based expectations over the responses x is constrained to be identical to the frequency of the test score in the data. This is not the case in the –test where the expected frequencies are not restricted. The –test, therefore, taps two aspects, the congruence of the theoretical and the empirical distribution of the test score and the congruence of the theoretical and empirical distribution of the response in an item when conditioning on the test score. This small modification reduces the specificity of the test. The second test, the –test, is based on the same quantities as the –test. It differs from the –test in the way the test statistic is implemented and in the statistical model. When deriving the sampling distribution of the test statistic, we do not condition on the test score. The test score’s random fluctuation from one sample to the other is explicitly considered. Contrary to the –test, the –test controls the Type-I error rate well even when samples are small and tests consist of up to items.
As a by-product, we derived the distribution of the –residuals. This result can be used for post hoc checks in order to locate the source of misfit. One can also test for a linear or quadratic trend by using polynomial contrasts. Whether such more specific tests can be used in order to increase the power to detect misspecification requires future research. We finally suggested a graphical check of item fit that is intended to supplement the statistical tests. The graphical check of item fit is similar to the residual plots used in regression analysis. The plot visualizes how the data deviate from the model.
Supplemental Material
Supplemental Material, sj-docx-1-jeb-10.3102_10769986211050304 for On the Generalized –Test of Item Fit: Some Variants, Residuals, and a Graphical Visualization by Jochen Ranger and Kay Brauer in Journal of Educational and Behavioral Statistics
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iD
Kay Brauer
References
Agresti, A. (1990). Categorical data analysis. Wiley.
American Psychological Association. (2014). Standards for educational and psychological testing. AERA Publications.
Bock, R. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29–51. https://doi.org/10.1007/BF02291411
Carleton, R., Collimore, K., & Asmundson, G. (2007). Social anxiety and fear of negative evaluation: Construct validity of the BFNE-II. Journal of Anxiety Disorders, 21, 131–141. https://doi.org/10.1016/j.janxdis.2006.03.010
Chalmers, R. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48, 1–29. https://doi.org/10.18637/jss.v048.i06
Douglas, J., & Cohen, A. (2001). Nonparametric item response function estimation for assessing parametric model fit. Applied Psychological Measurement, 25, 234–243. https://doi.org/10.1177/01466210122032046
Glas, C. (2016). Frequentist model-fit tests. In W. van der Linden (Ed.), Handbook of item response theory (Vol. 2, Statistical tools, pp. 343–361). Chapman and Hall/CRC Press.
Haberman, S., & Sinharay, S. (2013). Generalized residuals for general models for contingency tables with application to item response theory. Journal of the American Statistical Association, 108, 1436–1444. https://doi.org/10.1080/01621459.2013.835660
Haberman, S., Sinharay, S., & Chon, K. (2013). Assessing item fit for unidimensional item response theory models using residuals from estimated item response functions. Psychometrika, 78, 417–440. https://doi.org/10.1007/s11336-012-9305-1
Kang, T., & Chen, T. (2008). Performance of the generalized S-X² item fit index for polytomous IRT models. Journal of Educational Measurement, 45, 391–406. https://doi.org/10.1111/j.1745-3984.2008.00071.x
LaHuis, D., Clark, P., & O’Brien, E. (2011). An examination of item response theory item fit indices for the graded response model. Organizational Research Methods, 14, 421–434. https://doi.org/10.1177/1094428109350930
Lindwall, M., Barkoukis, V., Grano, C., Lucidi, F., Raudsepp, L., Liukkonen, J., & Thogersen-Ntoumani, C. (2012). Method effects: The problem with negatively versus positively keyed items. Journal of Personality Assessment, 94, 196–204. https://doi.org/10.1080/00223891.2011.645936
Liu, D., & Zhang, H. (2018). Residuals and diagnostics for ordinal regression models: A surrogate approach. Journal of the American Statistical Association, 113, 845–854. https://doi.org/10.1080/01621459.2017.1292915
Mavridis, D., Moustaki, I., & Knott, M. (2007). Goodness-of-fit measures for latent variable models for binary data. In S.-Y. Lee (Ed.), Handbook of latent variable and related models (pp. 135–161). Elsevier.
Maydeu-Olivares, A., & Joe, H. (2005). Limited- and full-information estimation and goodness-of-fit testing in 2^n contingency tables: A unified framework. Journal of the American Statistical Association, 100, 1009–1020. https://doi.org/10.1198/016214504000002069
Meijer, R., & Baneke, J. (2004). Analyzing psychopathology items: A case for nonparametric item response theory modeling. Psychological Methods, 9, 354–368. https://doi.org/10.1037/1082-989X.9.3.354
Moore, D. (1978). Chi-square tests. In R. Hogg (Ed.), Studies in statistics (pp. 66–106). Mathematical Association of America.
Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24, 50–64. https://doi.org/10.1177/01466216000241003
Orlando, M., & Thissen, D. (2003). Further investigation of the performance of S-X²: An item fit index for use with dichotomous item response theory models. Applied Psychological Measurement, 27, 289–298. https://doi.org/10.1177/0146621603027004004
R Development Core Team. (2009). R: A language and environment for statistical computing [Computer software manual]. http://www.R-project.org
Rizopoulos, D. (2006). ltm: An R package for latent variable modeling and item response theory analysis. Journal of Statistical Software, 17, 5. https://doi.org/10.18637/jss.v017.i05
Rodebaugh, T., Woods, C., Thissen, D., Heimberg, R., Chambless, D., & Rapee, R. (2004). More information from fewer questions: The factor structure and item properties of the original and brief fear of negative evaluation scale. Psychological Assessment, 16, 169–181. https://doi.org/10.1037/1040-3590.16.2.169
Roszkowski, M., & Soven, M. (2010). Shifting gears: Consequences of including two negatively worded items in the middle of a positively worded questionnaire. Assessment & Evaluation in Higher Education, 35, 113–130. https://doi.org/10.1080/02602930802618344
Samejima, F. (1997). Graded response model. In W. van der Linden & R. Hambleton (Eds.), Handbook of modern item response theory (pp. 237–266). Springer. https://doi.org/10.1007/978-1-4757-2691-6
Sijtsma, K. (1998). Methodological review: Nonparametric IRT approaches to the analysis of dichotomous item scores. Applied Psychological Measurement, 22, 3–31. https://doi.org/10.1177/01466216980221001
Sinharay, S. (2006). Bayesian item fit analysis for unidimensional item response theory models. British Journal of Mathematical and Statistical Psychology, 59, 429–449. https://doi.org/10.1348/000711005X66888
Stark, S., Chernyshenko, O., Drasgow, F., & Williams, B. (2006). Examining assumptions about item responding in personality assessment: Should ideal point methods be considered for scale development and scoring? Journal of Applied Psychology, 91, 25–39. https://doi.org/10.1037/0021-9010.91.1.25
Stone, C. (2003). Empirical power and type I error rates for an IRT fit statistic that considers the precision of ability estimates. Educational and Psychological Measurement, 63, 566–583. https://doi.org/10.1177/0013164402251034
Stroud, A. (1971). Approximate calculation of multiple integrals. Prentice-Hall.
Swaminathan, H., Hambleton, R. K., & Rogers, H. J. (2006). Assessing the fit of item response theory models. In C. Rao & S. Sinharay (Eds.), Handbook of statistics: Vol. 26. Psychometrics (pp. 683–718). Elsevier.
Takane, Y., & de Leeuw, J. (1987). On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 53, 393–408. https://doi.org/10.1007/BF02294363
Vormbrock, F., & Neuser, J. (1983). Konstruktion zweier spezifischer Trait-Fragebogen zur Erfassung von Angst in sozialen Situationen (SANB und SVSS) [Construction of two specific trait questionnaires to assess anxiety in social situations (SANB and SVSS)]. Diagnostica, 29, 165–182.
Watson, D., & Friend, R. (1969). Measurement of social-evaluative anxiety. Journal of Consulting and Clinical Psychology, 33, 448–457. https://doi.org/10.1037/h0027806
Welch, B. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362. https://doi.org/10.2307/2332010
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, 1–25. https://doi.org/10.2307/1912526
Yuan, K.-H., & Bentler, P. (2010). Two simple approximations to the distributions of quadratic forms. British Journal of Mathematical and Statistical Psychology, 63, 273–291. https://doi.org/10.1348/000711009X449771
Zhang, B., & Stone, C. (2008). Evaluating item fit for multidimensional item response models. Educational and Psychological Measurement, 68, 181–196. https://doi.org/10.1177/0013164407301547
Zhang, B., Wang, C., & Tao, J. (2018). Assessing item-level fit for higher order item response theory models. Applied Psychological Measurement, 42, 644–659. https://doi.org/10.1177/0146621618762740