Abstract
A new approach to identify item clusters fitting the Rasch model is described and evaluated using simulated and real data. The proposed method is based on hierarchical cluster analysis and constructs clusters of items that show a good fit to the Rasch model. It thus gives an estimate of the number of independent scales that can be constructed from an item set and that satisfy the postulates of sufficiency of the total number of correctly answered items for a person’s proficiency, unidimensionality, and local independence. The method is also compared with a principal components analysis based on tetrachoric correlations. In general, the proposed method provided practically usable results, especially for large person samples.
Introduction
In this article, a statistical method is described that allows the identification of item scales showing a good fit to the unidimensional Rasch model (Rasch, 1960) in a multidimensional item set. It thus gives an estimate of the number of independent scales that can be constructed from an item set and that satisfy the postulates of sufficiency of the total number of correctly answered items for a person’s proficiency, unidimensionality, and local independence (Fischer, 1974).
The proposed method is intended to be used prior to the application of other model tests that have power against specific model violations. Appropriate test statistics have been described elsewhere (e.g., Andersen, 1973; Glas, 1988; Martin-Löf, 1973; Ponocny, 2002; van den Wollenberg, 1982; Wright & Panchapakesan, 1969). A discussion of additional test procedures in the context of Rasch measurement was provided by Linacre (1992), E. V. Smith (2002), and other writers.
Because the application of the Rasch model requires the unidimensionality of the data, several statistical tests have been proposed or used in the research literature to test this assumption prior to Rasch analysis. These procedures include principal components analysis (PCA; R. M. Smith, 1996) and other approaches based on factor analysis (e.g., Wirth & Edwards, 2007). Lord (1980), among others, already discussed some of the problems associated with this approach when analyzing binary data. Also, in the application of exploratory factor analysis and PCA, determining the correct number of factors or components plays a crucial role. Recent writers (e.g., Tran & Formann, 2009; Weng & Cheng, 2005) evaluated the performance of parallel analysis, an approach possibly initially suggested by Horn (1965), in solving this problem. Although Weng and Cheng (2005) found that parallel analysis performed well in determining the correct number of factors, Tran and Formann (2009) concluded that the usefulness of classical linear factor analysis and PCA is diminished in the presence of binary data.
In contrast to PCA and methods based on factor analysis, the procedure described in the article at hand does not assume that a specific model underlies the analyzed item set. Instead, it is based on the idea of a partial hierarchical cluster analysis, which uses a test statistic for the Rasch model as a measure of similarity.
There have already been numerous approaches for applying cluster analysis in item response theory (for an overview, see Reckase, 2009, or van Abswoude, van der Ark, & Sijtsma, 2004). Many of them try to assign each item to one of several clusters, and often they provide no clear criterion to determine the number of clusters underlying an item set (for an illustration, see again Reckase, 2009). The application of cluster analysis described in this article addresses this issue by providing a statistic of model fit which is used as a strict criterion for selecting items to yield a unidimensional cluster.
The described method could also serve as an alternative to other approaches for the assessment of unidimensionality in the context of Rasch measurement. If an item set fits the Rasch model, it is to be expected that the suggested procedure assigns all items to a single cluster. For a comprehensive review of methods for assessing the dimensionality of an item set, see Hattie (1985).
The procedure described in this article will be evaluated using simulation studies. As will be shown, it provides results applicable under many circumstances but generally requires large person samples. To compare the new approach with other methods, its results will be compared with those of a PCA and parallel analysis based on tetrachoric correlations.
The remainder of this article is organized as follows: In the next section, the procedure will be described, which includes a discussion of the R1c statistic of Glas (1988). The subsequent section contains a description of the method of a simulation study that was used to evaluate the procedure. This is followed by the “Results of the Simulation Study” section. The penultimate section contains results from an empirical study in which the results of the procedure are compared with other tests of fit to the Rasch model. The article concludes with a discussion of the procedure and directions for future research.
Description of the Procedure
The Basic Structure of the Procedure
The method used to assign items to scales that fit the Rasch model will first be formally described. Its principal idea is similar to that of a partial hierarchical cluster analysis. Given an item set O, let O_n be the set of all subsets of O that consist of n items. By calculating a statistic of fit to the Rasch model, a function f is defined that assigns to each subset of O the p value of the model fit statistic. The procedure starts with the analysis of all item subsets that are elements of O_3. The initial subset A_3 in the scale construction is the subset for which f reaches its maximum.
After defining A_3, the procedure begins to expand this subset. Let A_n be a subset of n items already constructed by the procedure. The procedure constructs a new item subset A_{n+1}, containing (n + 1) items, by analyzing all elements of O_{n+1} that contain all elements of A_n. A_{n+1} is defined as the item subset among these for which f is maximized. The procedure terminates as soon as the maximum of f, calculated for the candidate subsets of a given size k, falls below a predefined threshold, or as soon as A_k cannot be expanded any further because it already contains all items of O. In the study at hand, the R1c statistic of Glas (1988) was used as the statistic of model fit. This test statistic will be reviewed in the next section.
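The expansion scheme described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' program: the function name construct_scale, the parameter names, and the toy p value function are hypothetical, and returning an empty set when even the best three-item subset falls below the threshold is an assumption. In practice, fit_p_value would wrap a Rasch model test such as R1c.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable

FitFunction = Callable[[FrozenSet[int]], float]


def construct_scale(items: Iterable[int], fit_p_value: FitFunction,
                    alpha: float = 0.05) -> FrozenSet[int]:
    """Greedy scale construction: start from the best 3-item subset,
    then add one item at a time while the best p value stays >= alpha."""
    pool = list(items)
    if len(pool) < 3:
        return frozenset()
    # Initial subset: the 3-item subset with the highest p value.
    current = max((frozenset(c) for c in combinations(pool, 3)), key=fit_p_value)
    if fit_p_value(current) < alpha:
        return frozenset()  # assumption: no scale is reported in this case
    while len(current) < len(pool):
        # Candidates: all supersets of the current scale with one more item.
        candidates = [current | {i} for i in pool if i not in current]
        best = max(candidates, key=fit_p_value)
        if fit_p_value(best) < alpha:
            break  # no expansion keeps the fit above the threshold
        current = best
    return current


# Toy stand-in for a model test: subsets drawn from one dimension "fit".
DIM_A = frozenset({0, 1, 2, 3})
DIM_B = frozenset({4, 5, 6, 7})


def toy_p_value(subset: FrozenSet[int]) -> float:
    if subset <= DIM_A:
        return 0.8
    if subset <= DIM_B:
        return 0.5
    return 0.01


scale = construct_scale(range(8), toy_p_value)
```

With the toy function, the sketch recovers the four items of the first dimension and stops because every further expansion mixes the two dimensions. Note that the number of candidate subsets grows only linearly per expansion step, whereas the initial search over all three-item subsets is cubic in the number of items.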
Assessing the Model Fit
The R1c statistic of Glas (1988) is based on the comparison of the expected and observed frequencies of persons giving a positive or negative response to item i and obtaining a score of exactly r. Following Glas (1988), the R1c statistic is calculated by the formula in Equation (1):
In Equation (1), Nr denotes the number of persons obtaining a raw score of r,
In this formula, γr denotes the rth elementary symmetric function, and
Following a proof presented by Glas (1988), the R1c statistic can be regarded as being asymptotically χ2 distributed with (k − 1)(k − 2) degrees of freedom, with k being the number of items in the test. The R1c statistic was shown to have power against multiple violations of assumptions of the Rasch model, such as the axioms of unidimensionality and parallel item characteristic curves (Glas & Verhelst, 1995; Suárez-Falcón & Glas, 2003). For this reason, the R1c statistic was chosen as the test of global model fit for the study at hand.
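As a hedged illustration of the building blocks behind such a comparison of observed and expected frequencies, the following Python sketch computes the elementary symmetric functions and the probability, under the Rasch model, of a positive response to item i given a raw score of r. It assumes the common parameterization with item easiness exp(−β_i); the function names are hypothetical, and the sketch does not implement the full R1c statistic, whose weighting of the deviations is not shown here. Multiplying the returned probability by Nr would give an expected frequency for score group r.

```python
import math
from typing import List, Sequence


def elementary_symmetric(eps: Sequence[float]) -> List[float]:
    """Return [gamma_0, ..., gamma_k] for the item easiness values eps."""
    gamma = [1.0] + [0.0] * len(eps)
    for j, e in enumerate(eps, start=1):
        # Descending order ensures each item enters a product at most once.
        for r in range(j, 0, -1):
            gamma[r] += e * gamma[r - 1]
    return gamma


def prob_positive_given_score(betas: Sequence[float], i: int, r: int) -> float:
    """P(X_i = 1 | raw score r) = eps_i * gamma_{r-1}^{(i)} / gamma_r,
    where gamma^{(i)} is computed without item i (valid for 1 <= r <= k)."""
    eps = [math.exp(-b) for b in betas]
    gamma_all = elementary_symmetric(eps)
    gamma_without_i = elementary_symmetric(eps[:i] + eps[i + 1:])
    return eps[i] * gamma_without_i[r - 1] / gamma_all[r]


# Hypothetical item difficulties, for illustration only.
example_betas = [-1.0, 0.0, 0.5, 1.2]
probs_score_2 = [prob_positive_given_score(example_betas, i, 2)
                 for i in range(len(example_betas))]
```

A useful sanity check is that the conditional probabilities for a given raw score r sum to r over all items, because a person with score r has exactly r positive responses.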
Method of the Simulation Study
We now conduct a simulation study to analyze the extent to which the algorithm described in the previous section is correctly able to detect and reconstruct subsets of items that fit the Rasch model.
To evaluate the algorithm, item samples containing two subsets, both of which fit the Rasch model, were simulated. In each simulation, it was assessed whether the scale-constructing algorithm was able to distinguish between the items of the two scales. One reason for choosing this study design was that, to our knowledge, no study has been published so far that investigated the power of the R1c statistic to detect between-item multidimensionality (Adams, Wilson, & Wang, 1997).
In all these simulations, response data were generated by a computer program using the following algorithm. First, the item and person parameters were drawn from distributions with previously set means and standard deviations. The person parameters were normally distributed, while the item parameters were normally or uniformly distributed, depending on the simulation. After the parameters of every simulated item and person were set, the probability of a positive response was calculated and a random number between 0 and 1 was drawn for every person–item pair. If the random number was smaller than the calculated probability, the response was set to be positive for that person–item pair; otherwise, it was set to be negative.
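The data generation step can be sketched as follows. This is an illustration under stated assumptions rather than the program used in the study: the function name simulate_rasch is hypothetical, both parameter means are assumed to be 0 (the text only says they were previously set), and only the case of normally distributed item parameters is shown.

```python
import math
import random
from typing import List


def simulate_rasch(n_persons: int, n_items: int, person_sd: float,
                   item_sd: float, seed: int = 0) -> List[List[int]]:
    """Simulate a persons x items matrix of dichotomous Rasch responses."""
    rng = random.Random(seed)
    # Assumption: both parameter distributions are centered at 0.
    thetas = [rng.gauss(0.0, person_sd) for _ in range(n_persons)]
    betas = [rng.gauss(0.0, item_sd) for _ in range(n_items)]
    data = []
    for theta in thetas:
        row = []
        for beta in betas:
            # Rasch probability of a positive response.
            p = 1.0 / (1.0 + math.exp(-(theta - beta)))
            # Positive response if the uniform draw is below p.
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data


# Example with the standard deviations of a Type A data set.
responses = simulate_rasch(n_persons=250, n_items=5, person_sd=1.0, item_sd=0.5)
```

Fixing the seed makes a simulated data set reproducible, which is convenient when the same data are to be analyzed by two competing methods.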
In every simulation, a set of items was analyzed. Of these items, the first and the second half were independent item sets that fit the Rasch model. A computer program was written that implemented the algorithm described in the previous sections. In every simulation, it was assessed whether one of the two item sets fitting the Rasch model was reconstructed correctly by the algorithm.
The simulations differed in six aspects: the distribution of the item parameters (i.e., approximately normal or uniform distribution), the standard deviations of the item and the person parameters, the size of the person sample, the size of the item set, and the correlation between the person parameters of the scales fitting the Rasch model.
Four different types of data sets were defined that varied in the standard deviations of their item and person parameters. In the first type of data set (defined as Type A), the person and the item parameters were set to be normally distributed with standard deviations of 1.0 and 0.5 for the person and item parameters, respectively. In the second type of data set (defined as Type B), the person and item parameters were set to have standard deviations of 1.5 and 0.5, respectively. In the third type of data set (defined as Type C), the standard deviations of the person and item parameters were set to be 2.0 and 1.5, respectively. In the fourth type of data set (defined as Type D), the person and item parameters were set to have standard deviations of 2.5 and 1.5, respectively.
These four types of data sets were combined in pairs, including pairs of the same type, such that 10 different combinations of parameter distributions were analyzed. The size of the item set was 10, 30, or 50, and the size of the person sample was 250, 500, or 1,000. In the first half of the simulated data sets, the correlation between the person parameters was set to 0.0, and in the second half, it was set to .5.
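The count of 10 combinations is consistent with forming unordered pairs of the four data set types while allowing a type to be paired with itself. The short sketch below enumerates the resulting conditions; it assumes a fully crossed design for one item parameter distribution, which the text does not state explicitly, and all variable names are hypothetical.

```python
from itertools import combinations_with_replacement, product

# (person SD, item SD) for each data set type as described in the text.
TYPES = {"A": (1.0, 0.5), "B": (1.5, 0.5), "C": (2.0, 1.5), "D": (2.5, 1.5)}

# Unordered pairs of types, including same-type pairs such as ("B", "B").
type_pairs = list(combinations_with_replacement(sorted(TYPES), 2))

ITEM_SET_SIZES = (10, 30, 50)
PERSON_SAMPLE_SIZES = (250, 500, 1000)
CORRELATIONS = (0.0, 0.5)

# Assumption: every pair is crossed with every size and correlation.
conditions = list(product(type_pairs, ITEM_SET_SIZES,
                          PERSON_SAMPLE_SIZES, CORRELATIONS))
```

Under this assumption there are 10 × 3 × 3 × 2 = 180 conditions per item parameter distribution.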
To perform the simulation study, a computer program was written that implemented the scale construction procedure with the R1c test statistic of Glas (1988) as the statistic of model fit, as described in the previous sections. As a termination criterion, the scale construction was set to halt as soon as the p value would fall below .05 for any further expansion of the scale.
To compare the results of the new method with those of a traditional method, all simulations were repeated with a PCA based on tetrachoric correlations. After each PCA, a varimax rotation was carried out. It was assessed whether the application of PCA resulted in two components, with all items of each item subset fitting the Rasch model showing their highest loading on the same component.
In our study, we calculated the tetrachoric correlations using an algorithm proposed by Brown (1977). In the simulations, parallel analysis (Horn, 1965) was used to determine the number of components to extract. Following previous studies (e.g., Tran & Formann, 2009; Weng & Cheng, 2005), the 95th percentile eigenvalues calculated from 10,000 random data matrices were chosen as the criteria for comparison. To obtain stable results, 10,000 simulations were carried out under each condition.
Results of the Simulation Study
In this section, the results of the simulation study are presented. For each simulated condition, the percentage of correct results over 10,000 replications is presented. Because of space constraints, only the results of simulations with normally distributed item parameters are reported. Under all simulated conditions, the form of the distribution of the item parameters had only negligible effects on the results of the simulations.
To assess the accuracy of the results, the maximum standard error over all simulations was calculated. For all simulations, the standard error of the reported percentages did not exceed 0.5 percentage points.
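The bound of 0.5 is what one would expect if the standard error is the usual binomial standard error of a proportion, which is an assumption here because the text does not state how it was computed. The sketch below, with a hypothetical function name, evaluates this standard error on the percentage scale; it is largest when the percentage of correct results is 50.

```python
import math


def se_of_percentage(percent_correct: float, n_replications: int) -> float:
    """Binomial standard error of an observed percentage, in percentage points."""
    p = percent_correct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_replications)


# Worst case: 50% correct results over 10,000 replications.
worst_case_se = se_of_percentage(50.0, 10_000)
```

For 10,000 replications the worst case evaluates to about 0.5 percentage points, and percentages close to 0 or 100 have considerably smaller standard errors.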
The time needed to analyze each data set depended on the size of the analyzed person sample and item set. As a typical example, analyzing the responses of 1,000 persons to 50 items took 7 seconds on average on an Intel Core i7 processor when two data sets of Type B were combined.
Results of the Application of the Cluster Analytical Algorithm
The percentages of correct scale reconstructions under each simulated condition are given in Table 1 for simulations where the item parameters were approximately normally distributed.
Table 1. Percentage of Correct Scale Reconstructions With No Errors Under Different Conditions
Note. Values are percentages of correct scale reconstructions with no errors under different conditions of person sample size (n), item set size (i), and correlation between the person parameters (r) for each combination of data sets A, B, C, and D when the item parameters were normally distributed.
Results of the Application of Principal Components Analysis
In the application of the PCA, a solution was considered as correct if (a) a two-component solution was obtained and (b) for each subset fitting the Rasch model, all items showed their highest loading on the same component. The results of the application of PCA and parallel analysis for simulations where the item parameters were approximately normally distributed are shown in Table 2.
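The two-part correctness criterion can be expressed as a small check on a rotated loading matrix. The sketch below is illustrative only: the function name and the example loadings are hypothetical, and comparing absolute loadings is an assumption, since the text does not say how negative loadings were treated. It implements conditions (a) and (b) exactly as stated and does not add a further requirement that the two subsets load on different components.

```python
from typing import Sequence


def pca_solution_correct(loadings: Sequence[Sequence[float]],
                         subsets: Sequence[Sequence[int]],
                         n_components_retained: int) -> bool:
    """Check (a) two retained components and (b) a common dominant
    component for all items within each subset."""
    if n_components_retained != 2:
        return False
    for subset in subsets:
        # Index of the component with the largest absolute loading per item.
        dominant = {max(range(2), key=lambda c: abs(loadings[i][c]))
                    for i in subset}
        if len(dominant) != 1:
            return False
    return True


# Hypothetical rotated loadings for four items on two components.
example_loadings = [[0.8, 0.1], [0.7, 0.2], [0.1, 0.9], [-0.2, 0.6]]
example_subsets = [[0, 1], [2, 3]]
is_correct = pca_solution_correct(example_loadings, example_subsets, 2)
```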
Table 2. Percentage of Correct Results After Application of Principal Components Analysis Under Different Conditions
Note. Values are percentages of correct results after application of principal components analysis under different conditions of person sample size (n), item set size (i), and correlation between the person parameters (r) for each combination of data sets A, B, C, and D when the item parameters were normally distributed.
A Practical Application
In this section, a practical application of the method described in this article is presented. To further evaluate the method, it was applied to dichotomous data obtained by testing 298 persons with a general battery of intelligence tests, the Basic Intelligence Functions (Intelligenz-Basis-Funktionen or IBF; Blum et al., 2005). The purpose of this analysis was to test whether the scales constructed by the algorithm would also pass traditional tests of fit to the Rasch model.
Description of the Sample
The analyzed sample contained the data of 298 persons (151 male, 147 female) all of whom participated in the IBF subtesting. The mean age of the sample was 23.9 years, with a standard deviation of 3.27. In all, 31 persons (10.4%) had an EU educational level of 2, 49 persons (16.4%) had an EU educational level of 3, 188 persons (63.1%) had an EU educational level of 4, and 30 persons (10.1%) had an EU educational level of 5.
From the original sample of 298 persons, persons were excluded if they (a) cancelled the test, (b) showed very short response times combined with poor test performance, or (c) did not answer at least 75% of the items in a subtest. After the exclusion of persons showing deviant response behavior, the data of between 281 and 284 persons were used for the analysis of the four subtests of the IBF.
Description of the Test
The IBF test battery is a computerized intelligence test that consists of six subtests assessing verbal and numerical intelligence functions, long-term memory, and visualization. Verbal and numerical intelligence functions are each assessed by two subtests, and long-term memory and visualization are each assessed by one subtest. For each subtest there is a time limit. Items that are not answered before the end of the time limit are counted as not solved by the test participant.
In the IBF test, the test result contains the number of correct answers for each subtest and the factor scores for the verbal, numerical, visualization, and long-term memory tasks. The two subtests assessing numerical intelligence were not analyzed because fewer than 240 test participants were able to answer 75% of the items within the test’s time limit; we regard the resulting samples as too small for an analysis with the method described in this article. Items with missing responses were excluded from the test analysis. After the exclusion, 13 of the 17 items of the visualization subtest, 15 of the 20 items of the long-term memory subtest, 12 of the 16 items of the first verbal intelligence functions subtest, and 15 of the 19 items of the second verbal intelligence functions subtest were analyzed.
Procedure
A computer program called Raschcon, which implemented the scale-constructing algorithm analyzed in the simulation study, was used to analyze the IBF data set. The item set of each subtest was analyzed separately. The scale construction stopped as soon as the p values corresponding to the R1c test statistics reported by Raschcon became less than .05. After four item sets had been constructed by Raschcon, their fit to the Rasch model was assessed by calculating Andersen likelihood ratios (Andersen, 1973), using the partitioning criteria of age, gender, and mean split on the basis of the test performance. The Andersen tests were computed using the eRm software package (Mair & Hatzinger, 2006). A level of significance of .01 was chosen, and an alpha adjustment was performed using the method of Holm–Bonferroni. Additionally, several item fit statistics for the Rasch model were calculated using Winsteps (Linacre, 2007), and a PCA of the residuals was calculated for each subtest. As in the simulation study, a PCA and parallel analysis of tetrachoric correlations were also performed for the data of each subtest.
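For readers unfamiliar with the Holm–Bonferroni adjustment, the following sketch shows the step-down rule for a family of p values, for example the p values of several Andersen tests. The function name and the example p values are hypothetical and are not results from the study; how the tests were grouped into families is not specified in the text.

```python
from typing import List, Sequence


def holm_bonferroni(p_values: Sequence[float], alpha: float = 0.01) -> List[bool]:
    """Return, for each p value, whether its null hypothesis is rejected
    under the Holm-Bonferroni step-down procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p value is compared with alpha / m, the next with
        # alpha / (m - 1), and so on.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test is retained, all larger p values are too
    return reject


# Hypothetical p values, evaluated at alpha = .05 for illustration.
example_rejections = holm_bonferroni([0.001, 0.04, 0.03, 0.004], alpha=0.05)
```

In the example, the two smallest p values pass their thresholds of .0125 and about .0167, the third (.03) exceeds .025, and the procedure stops, so only the first and last hypotheses are rejected.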
Results
For all subtests, the scale constructed by Raschcon was identical to the analyzed item set of the respective subtest. Each of the item sets constructed by Raschcon was subsequently analyzed for its fit to the Rasch model.
The Andersen tests indicate that the scales constructed by Raschcon fit the Rasch model at a level of significance of .01 in each subtest. It should be noted, however, that if a level of significance of .05 had been used, the test statistics would have detected violations in three of the four subtests. The mean square infit and outfit statistics ranged between 0.65 and 1.33 for all four scales. After performing a PCA of the residuals, the eigenvalues of the first components reached values of 1.4 or less for the two verbal intelligence subtests and the visualization subtest, indicating that no additional dimensions were present in the data. In the case of the long-term memory subtest, the eigenvalue of the first component was 2.0, which could indicate that a second dimension is present in this subtest. In line with these results, a PCA of tetrachoric correlations led to a two-component solution for this subtest. For the visualization subtest, the PCA of tetrachoric correlations led to a one-component solution, whereas indefinite correlation matrices were obtained in the remaining two subtests. In general, the results of the analysis with Winsteps and eRm indicate that the scales constructed by Raschcon show a good fit to the Rasch model.
Discussion
In this article, a new algorithm was presented for the construction of scales that show a good fit to the Rasch model. The algorithm is based on a partial hierarchical cluster analysis and makes no specific assumptions regarding the model underlying the analyzed item set.
The R1c test statistic of Glas (1988) was chosen as a test statistic to evaluate the practical use of the algorithm in a simulation study. The same algorithm was later implemented in the Raschcon computer program to apply it to real data. To assess its usefulness for practical research, a simulation study was carried out that compared the cluster analytic approach with another well-known method of exploratory data analysis, a PCA of tetrachoric correlations with varimax rotation.
We now consider the conditions under which the algorithm leads to correct results and the reasons for the observed incorrect results. In the simulation study, the algorithm performed best if the simulated person sample was large and if the correlation between the person parameters was low. This trend is demonstrated by the high rate of correct scale reconstruction in simulations with a person sample of 1,000 and in simulations with a correlation of 0.0 between the person parameters. By comparison, the distribution of the item parameters seems to have only small effects on the correctness of the results of the algorithm. These results are in line with the results of previous studies on the R1c test statistic (e.g., Suárez-Falcón & Glas, 2003).
It can also be observed that the algorithm performed better in data sets with smaller scales than in data sets that required the reconstruction of large scales. To some extent, this result can be explained by the higher probability in large item samples that at least one item has a response vector that is improbable under the assumption of the Rasch model, so the algorithm does not add it to the scale it constructs. The scale-constructing algorithm generally combines items that show highly probable response patterns under the assumption of a Rasch model. Therefore, items that happen to show an improbable response pattern are not included in the scale constructed by the algorithm, even if they are an a priori part of a scale that fits the Rasch model. The probability of such errors increases in larger item pools.
The simulation study revealed two different conditions under which the algorithm failed to correctly reconstruct one of the two scales that fitted the Rasch model. First, the algorithm leads to incorrect results more often if the person sample is small. Suárez-Falcón and Glas (2003) have obtained similar results by showing that the power of the R1c test statistic to detect multiple model violations decreases in data sets with small samples. The second condition that leads to a failure of the algorithm is the combination of small variances of the item and person parameters and a significant correlation between the person parameters of the two scales.
The comparison of the cluster analytical algorithm with PCA and parallel analysis of tetrachoric correlations showed that the cluster analytical algorithm was in some cases to be preferred over this classical approach. This is most notably the case with the analysis of large item sets and small person samples, since the application of PCA was often not possible in these cases because of the occurrence of indefinite correlation matrices. Similar results have been previously reported by Weng and Cheng (2005) and Tran and Formann (2009). As a comparison of the results of both approaches suggests, the application of PCA may be preferred if there is a significant correlation between the person parameters. The generalizability of these results is limited to the conditions simulated in our study. Since the application of PCA on tetrachoric correlations and the cluster analytical algorithm are based on different theoretical assumptions, there might be further conditions, such as the presence of guessing, that influence the relative efficiency of both approaches as well.
Notably, the result of the algorithm should not conclude the analysis process. The scales constructed by the algorithm meet several of the conditions that are necessary for fitting the Rasch model, but the application of additional statistical test procedures is recommended to determine the fit of the scales constructed by the Raschcon algorithm to the Rasch model. Several authors (e.g., Linacre, 1992; van der Linden & Hambleton, 1997) have already emphasized the importance of checking model assumptions with multiple model tests.
Future research could investigate the performance of the cluster analytical algorithm if alternative test statistics for the fit of the Rasch model are used. Based on the research by Suárez-Falcón and Glas (2003), it can be assumed that most of the test statistics analyzed in their study (e.g., the likelihood ratio test of Andersen, 1973) would lead to comparable results. Future research could also investigate the application of test statistics of other IRT models, such as the three-parameter logistic model (Birnbaum, 1968).
The application of Raschcon to the IBF data showed that the scales constructed by Raschcon may fit the Rasch model, even if the sample used for analysis is relatively small. The results of the simulation study still suggest that the use of larger samples would lead to more reliable results.
Footnotes
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
The author(s) received no financial support for the research, authorship, and/or publication of this article.
