Abstract
Rasch mixture models can be a useful tool when checking the assumption of measurement invariance for a single Rasch model. They provide advantages compared to manifest differential item functioning (DIF) tests when the DIF groups are only weakly correlated with the manifest covariates available. Unlike in single Rasch models, estimation of Rasch mixture models is sensitive to the specification of the ability distribution even when the conditional maximum likelihood approach is used. It is demonstrated in a simulation study how differences in ability can influence the latent classes of a Rasch mixture model. If the aim is only DIF detection, it is not of interest to uncover such ability differences as one is only interested in a latent group structure regarding the item difficulties. To avoid any confounding effect of ability differences (or impact), a new score distribution for the Rasch mixture model is introduced here. It ensures that the estimation of the Rasch mixture model is independent of the ability distribution and thus restricts the mixture to be sensitive to latent structure in the item difficulties only. Its usefulness is demonstrated in a simulation study, and its application is illustrated in a study of verbal aggression.
Introduction
Based on the Rasch model (Rasch, 1960), Rost (1990) introduced what he called the “mixed Rasch model,” a combination of a latent class approach and a latent trait approach to model qualitative and quantitative ability differences. As suggested by Rost (1990), it can also be used to examine the fit of the Rasch model and check for violations of measurement invariance such as differential item functioning (DIF). Since the model assumes latent classes for which separate Rasch models hold, it can be employed to validate a psychological test or questionnaire: if a model with two or more latent classes fits better than a model with one latent class, measurement invariance is violated and a single Rasch model is not suitable because several latent classes are present in the data that require separate Rasch models with separate sets of item difficulties. These classes are latent in the sense that they are not determined by covariates.
As the model assesses a questionnaire (or instrument, as it will be referred to in the following) as a whole, it works similarly to a global test like the likelihood ratio (LR) test (Andersen, 1972; Gustafsson, 1980), not an itemwise test like the Mantel–Haenszel test (Holland & Thayer, 1988). Hence, it is the set of item parameters for all items that is tested for differences between groups, rather than each item parameter being tested separately.
The mixed Rasch model—here called Rasch mixture model to avoid confusion with mixed (effects) models and instead highlight its relation to mixture models—has since been extended by Rost and von Davier (1995) to different score distributions and by Rost (1991) and von Davier and Rost (1995) to polytomous responses. The so-called “mixed ordinal Rasch model” is a mixture of partial credit models (PCM; Masters, 1982) and includes a mixture of rating scale models (RSM; Andrich, 1978) as a special case.
The original dichotomous model as well as its polytomous version have been applied in a variety of fields. Zickar, Gibby, and Robie (2004) use a mixture PCM to detect faking in personality questionnaires, while Hong and Min (2007) identify three types/classes of depressed behavior by applying a mixture RSM to a self-rating depression scale. Another large field of application is educational measurement. Baghaei and Carstensen (2013) identify different reader types from a reading comprehension test using a Rasch mixture model. Maij-de Meij, Kelderman, and van der Flier (2010) also apply a Rasch mixture model to identify latent groups in a vocabulary test. Cohen and Bolt (2005) use a Rasch mixture model to detect DIF in a mathematics placement test.
Rasch mixture models constitute a legitimate alternative to DIF tests for manifest variables such as the LR test or the recently proposed Rasch trees (Strobl, Kopf, & Zeileis, 2013). These methods are usually used to test DIF based on observed covariates, whereas Maij-de Meij et al. (2010) show that mixture models are more suitable to detect DIF if the “true source of bias” is a latent grouping variable. The simulation study by Preinerstorfer and Formann (2011) suggests that parameter recovery works reasonably well for Rasch mixture models. While they did not study in detail the influence of DIF effect size or the effect of different ability distributions, they deem such differences relevant for practical concern but leave it to further research to establish just how strongly they influence estimation accuracy.
As the Rasch model is based on two aspects, subject ability and item difficulty, Rasch mixture models are sensitive not only to differences in the item difficulties (as in DIF) but also to differences in abilities. Such differences in abilities are usually called impact and do not infringe on measurement invariance (Ackerman, 1992). In practice, when developing a psychological test, one often follows two main steps. First, the item parameters are estimated, for example, by means of the conditional maximum likelihood (CML) approach, and checked for model violations; problematic items are then possibly excluded or modified. Second, the final set of items is used to estimate person abilities. The main advantage of the CML approach is that, for a single Rasch model, the estimation and check of item difficulties are (conditionally) independent of the abilities and their distribution. Other global assessment methods like the LR test and the Rasch trees are also based on the CML approach to achieve such independence. However, in a Rasch mixture model, the estimation of the item difficulties is not independent of the ability distribution, even when employing the CML approach. DeMars and Lau (2011) find that a difference in mean ability between DIF groups affects the estimation of the DIF effect sizes. Other DIF detection methods are affected by impact as well: for example, inflated Type I error rates occur in the Mantel–Haenszel and logistic regression procedures if impact is present (DeMars, 2010; Li, Brooks, & Johanson, 2012).
When using a Rasch mixture model for DIF detection, an influence of impact alone on the mixture is undesirable as the goal is to uncover DIF groups based on item difficulties, not impact groups based on abilities. To avoid such confounding effects of impact, we propose a new version of the Rasch mixture model specifically designed to detect DIF, which allows for the transfer of the crucial property of CML from a single Rasch model to the mixture: estimation and testing of item difficulties is independent of the abilities and their distribution.
A simulation study is conducted to illustrate how previously suggested versions and this new version of the Rasch mixture model react to impact, either alone or in combination with DIF, and how this affects the suitability of the Rasch mixture model as a DIF detection method.
In the following, we briefly discuss the Rasch model and Rasch mixture models to explain why the latter are sensitive to the specification of the score distribution despite employing a conditional maximum likelihood approach for estimation. This section is concluded with our suggested new score distribution. We illustrate and discuss the behavior of Rasch mixture models with different options for the score distribution in a Monte Carlo study in the next section. Then, the suggested approach for DIF detection via Rasch mixture models is illustrated through an empirical application to a study on verbally aggressive behavior. Concluding remarks are provided in the last section.
Theory
The Rasch Model
The Rasch model, introduced by Georg Rasch (1960), models the probability for a binary response
depending on the subject’s ability
Since joint maximum likelihood (JML) estimation of all abilities and difficulties is not consistent for a fixed number of items
Due to this separation, consistent estimates of the item parameters
with
If not only the conditional likelihood but the full likelihood is of interest—as in Rasch mixture models—then the score distribution
Based on this density, the following subsections first introduce mixture Rasch models in general and then discuss several choices for
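As a minimal sketch of the separation that CML relies on, the following Python code computes the Rasch response probability and the probability of a response pattern conditional on its raw score. The names `theta` (ability), `beta` (item difficulties), and `gamma` (elementary symmetric functions) are notation assumed for this illustration, and the code is not taken from any package used in the study. The last function recomputes the conditional probability "the long way" for a given ability, to show numerically that the ability cancels.

```python
import math
from itertools import product


def rasch_prob(y, theta, beta):
    """P(Y = y | theta, beta) for a binary response y in {0, 1}."""
    eta = theta - beta
    return math.exp(y * eta) / (1.0 + math.exp(eta))


def elementary_symmetric(beta):
    """gamma_0, ..., gamma_m of the terms exp(-beta_j), via the summation recursion."""
    gamma = [1.0] + [0.0] * len(beta)
    for j, b in enumerate(beta, start=1):
        eps = math.exp(-b)
        for r in range(j, 0, -1):  # descending, so each item enters a term only once
            gamma[r] += eps * gamma[r - 1]
    return gamma


def conditional_prob(y, beta):
    """Probability of response pattern y given its raw score; no ability involved."""
    num = math.exp(-sum(b for yj, b in zip(y, beta) if yj == 1))
    return num / elementary_symmetric(beta)[sum(y)]


def conditional_prob_via_ability(y, theta, beta):
    """Same quantity computed as P(y | theta) / P(score | theta)."""
    def pattern_prob(pattern):
        return math.prod(rasch_prob(v, theta, b) for v, b in zip(pattern, beta))

    score_prob = sum(pattern_prob(p)
                     for p in product((0, 1), repeat=len(beta))
                     if sum(p) == sum(y))
    return pattern_prob(y) / score_prob
```

For any fixed set of difficulties, `conditional_prob_via_ability` returns the same value for every `theta`, which is the property that makes the item parameters estimable without reference to the abilities in a single Rasch model.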
Rasch Mixture Models
Mixture models are essentially a weighted sum over several components, that is, here over several Rasch models. Using the Rasch model density function from Equation 3, the likelihood
where the
This kind of likelihood can be maximized via the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977), which alternates between maximizing the component-specific likelihoods for obtaining parameter estimates and computing expectations for each observation belonging to each cluster.
More formally, given (initial) estimates for the model parameters
In the M-step of the algorithm, these posterior probabilities are used as the weights in a weighted ML estimation of the model parameters. This way, an observation deemed unlikely to belong to a certain latent class does not contribute strongly to its estimation. Estimation can be done separately for each latent class. Using CML estimation for the Rasch Model, the estimation of item and score parameters can again be done separately. For all components
Estimates of the class probabilities can be obtained from the posterior probabilities by averaging:
The E-step (Equation 5) and M-step (Equations 6 and 7) are iterated until convergence, always updating either the weights based on current estimates for the model parameters or vice versa.
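The E-step and the update of the class probabilities can be sketched generically as follows. This is an illustration under assumptions, not the implementation used in the study: `log_dens[i][k]` is assumed to hold the log-density of observation `i` under the current parameters of class `k` (however that density is specified), and `pi` the current class probabilities. The weighted CML estimation of the item parameters in the M-step is not shown.

```python
import math


def e_step(log_dens, pi):
    """Posterior probability of each observation belonging to each class."""
    posteriors = []
    for row in log_dens:
        log_w = [math.log(p) + ld for p, ld in zip(pi, row)]
        shift = max(log_w)                      # guard against underflow
        w = [math.exp(v - shift) for v in log_w]
        total = sum(w)
        posteriors.append([v / total for v in w])
    return posteriors


def update_class_probs(posteriors):
    """Class probabilities as the average posterior probability per class."""
    n = len(posteriors)
    n_classes = len(posteriors[0])
    return [sum(row[k] for row in posteriors) / n for k in range(n_classes)]
```

In a full EM loop, the posteriors returned by `e_step` would serve as the weights in the class-specific estimation, after which `log_dens` is recomputed and the two steps are repeated until convergence.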
Note that the above implicitly assumes that the number of latent classes
Score Distribution
In a single Rasch model, the estimation of the item parameters is invariant to the score distribution because of the separation in Equation 3. In the mixture context, this invariance property holds only given the weights in Equation 6. However, these posterior weights depend on the full Rasch likelihood, including the score distribution (Equation 5). Therefore, the estimation of the item parameters in a Rasch mixture model is not independent of the score distribution for
Saturated and Mean-Variance Specification
In his introduction of the Rasch mixture model, Rost (1990) suggests a discrete probability distribution on the scores with a separate parameter for each possible score. This requires
Realizing that this saturated specification requires a potentially rather large number of parameters, Rost and von Davier (1995) suggest a parametric distribution with one parameter each for mean and variance.
Details on both specifications can be found in Rost (1990) and Rost and von Davier (1995), respectively. Here, the notation of Frick, Strobl, Leisch, and Zeileis (2012) is adopted, which expresses both specifications in a unified way through a conditional logit model for the score
with different choices for
and the 1 at position
Restricted Specification
In the following we suggest a new specification of the score distribution in the Rasch mixture model, which aims at obtaining independence of the item parameter estimates from the specification of the score distribution and therefore enabling the Rasch mixture model to distinguish between DIF and impact. Other global DIF detection methods like the LR test and Rasch trees are able to make this distinction (Ankenmann, Witt, & Dunbar, 1999; Strobl et al., 2013) because they are based only on the conditional part of the likelihood (Equation 2). Analogously, we suggest a mixture of only this conditional part rather than the full likelihood (Equation 3) of the Rasch model so that the mixture model will only be influenced by differences in the item parameters.
Mixing only the conditional likelihood
because then the factor
This equivalence and independence from the score distribution can also be seen easily from the definition of the posterior weights (Equation 5): If restricted,
Subsequently, we adopt the restricted perspective rather than omitting
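The cancellation argument can be checked numerically with a small sketch. Here `h` holds hypothetical values of the conditional likelihood of one response pattern in each class, `g` the corresponding score probabilities, and `pi` the class probabilities; all names and numbers are assumptions for illustration only.

```python
def posterior_weights(pi, h, g):
    """Posterior class weights proportional to pi_k * h_k * g_k."""
    w = [p * hk * gk for p, hk, gk in zip(pi, h, g)]
    total = sum(w)
    return [v / total for v in w]


pi = [0.4, 0.6]
h = [0.03, 0.01]          # hypothetical conditional likelihood values per class

# Score probability restricted to be equal across classes: the factor cancels.
restricted = posterior_weights(pi, h, g=[0.2, 0.2])
conditional_only = posterior_weights(pi, h, g=[1.0, 1.0])

# Class-specific score probabilities: the weights now also reflect the scores.
unrestricted = posterior_weights(pi, h, g=[0.05, 0.3])
```

With a common score probability, `restricted` equals `conditional_only`, so only the item difficulties drive the class assignment. With class-specific score probabilities, `unrestricted` can even favor the other class.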
Overview
The different specifications of the score distribution vary in their properties and implications for the whole Rasch mixture model.
The saturated model is very flexible. It can model any shape and is thus never misspecified. However, it needs a potentially large number of parameters, which can be challenging in model estimation and selection.
The mean-variance specification of the score model is more parsimonious as it only requires two parameters per latent class. While this is convenient for model fit and selection, it also comes at a cost: since it can only model unimodal or U-shaped distributions (see Rost & von Davier, 1995), it is partially misspecified if the score distribution is actually multimodal.
A restricted score model is even more parsimonious. Therefore, the same advantages in model fit and selection apply. Furthermore, it is invariant to the latent structure in the score distribution. If a Rasch mixture model is used for DIF detection, this is favorable as only differences in the item difficulties influence the mixture. However, it is partially misspecified if the latent structure in the scores and item difficulties coincides.
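The difference in flexibility between the saturated and the mean-variance specification can be sketched with a conditional logit over the scores. Several details below are assumptions of this sketch rather than statements from the text above: only the non-extreme scores 1 to m-1 are modeled, the saturated model uses the first score as reference category, and the mean-variance model uses the regressors r/m (location) and 4r(m-r)/m² (dispersion).

```python
import math


def score_probs(designs, delta):
    """Conditional logit: probability of each score proportional to exp(z_r . delta)."""
    lin = [sum(z * d for z, d in zip(z_r, delta)) for z_r in designs]
    shift = max(lin)
    w = [math.exp(v - shift) for v in lin]
    total = sum(w)
    return [v / total for v in w]


def mean_variance_design(m):
    """Two regressors per score r = 1, ..., m-1 (assumed parametrization)."""
    return [(r / m, 4.0 * r * (m - r) / m ** 2) for r in range(1, m)]


def saturated_design(m):
    """One indicator per score 2, ..., m-1; score 1 is the reference (assumed)."""
    return [tuple(1.0 if c == r else 0.0 for c in range(2, m)) for r in range(1, m)]


m = 20
unimodal = score_probs(mean_variance_design(m), (0.0, 3.0))
u_shaped = score_probs(mean_variance_design(m), (0.0, -3.0))

# The saturated model can place separate peaks at, e.g., scores 5 and 15.
delta_sat = [0.0] * (m - 2)
delta_sat[5 - 2] = 2.0
delta_sat[15 - 2] = 2.0
bimodal = score_probs(saturated_design(m), delta_sat)
```

Under this parametrization, a positive dispersion coefficient yields a single peak and a negative one a U shape, while two separated peaks are only reachable with the saturated design.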
Monte Carlo Study
The simple question “DIF or no DIF?” leads to the question whether the Rasch mixture model is suitable as a tool to detect such violations of measurement invariance.
As the score distribution influences the estimation of the Rasch mixture model in general, it is of particular interest how it influences the estimation of the number of latent classes, the measure used to determine Rasch scalability.
Motivational Example
As a motivation for the simulation design, consider the following example: The instrument is a knowledge test that is administered to students from two different types of schools who have been prepared for the test by one of two different courses. Either of the two groupings might be the source of DIF (or impact). If the groupings are available as covariates to the item responses of the students, then a test for DIF between either school types or course types can be easily carried out using the LR test. However, if the groupings are not available (or not even observed) as covariates, then a DIF test is still possible by means of the Rasch mixture model. The performance of such a DIF assessment is investigated in our simulation study for different effects of school and course type, respectively.
In the following we assume that the school type is linked to ability difference (i.e., impact but not DIF) while the course type is the source of DIF (but not impact). This can be motivated in the following way (see also Figure 1): When the students from the two school types differ in their mean ability, this is impact between these two groups. The courses might be a standard course and a new specialized course. While the standard course covers all topics of the test equally, the specialized course gives more emphasis to a relatively advanced topic and due to time constraints less emphasis to a relatively basic topic. This may lead to DIF between the students in the standard and the specialized course. See the left panel of Figure 2 for illustrative item profiles of the standard course (in dark gray) and the specialized course (in light gray).

Grouping structure in the motivational example.

Scenario 2. Left: Item difficulties with DIF (
Finally, the ability groups by school and the DIF groups by course can either coincide or not. If all students in the first school type are being taught the standard course while all students in the second school type are being taught the specialized course, the DIF groups coincide with the ability groups. The DIF and ability groups do not coincide but only overlap partly if both course types are taught in both school types: each DIF group (based on the type of course taught) consists of a mix of students from both schools and therefore from both ability groups. An illustration of coinciding and not coinciding ability and DIF groups is provided in the upper and lower rows of Figure 1, respectively. Ability groups, based on school type, are shown in the columns, while DIF groups, based on course type, are illustrated with dark and light gray for the standard course and specialized course, respectively. This difference of coinciding or not coinciding DIF and ability groups might have an influence on the Rasch mixture model’s ability to detect the DIF because in the former case the score distributions differ between the two DIF groups while in the latter case they do not.
Subsequently, a Monte Carlo study is carried out to investigate how the Rasch mixture model performs in situations where such groupings are present in the underlying data-generating process but are not available as observed covariates. Moreover, we vary whether or not all students come from the same school type (i.e., from the same ability distribution), whether or not all students receive the standard course (i.e., whether there is DIF), and whether both school types use the same or different courses (i.e., whether the groupings coincide or not). For all computations, the R system for statistical computing (R Core Team, 2013) is used along with the add-on packages
Simulation Design
The simulation design combines ideas from the motivational example with aspects from the simulation study conducted by Rost (1990). Similar to the original simulation study, the item parameters represent an instrument with increasingly difficult items. Here, 20 items are employed with corresponding item parameters
To introduce DIF, a second set of item parameters
In the simulations below, the DIF effect size
while the impact
Impact and DIF, or lack thereof, can be combined in several ways. Table 1 provides an overview and Figures 2, 3, and 4 show illustrations. In the following, the different combinations of impact and DIF are explained in more detail and connected to the motivational example:
If the simulation parameter
In the example: Only the standard course is taught and hence no DIF exists.
If
In the example: Both courses are taught, thus leading to DIF. The standard course corresponds to the straight line as the item profile while the specialized course corresponds to the spiked item profile with relatively difficult Item 16 being easier and the relatively easy Item 5 being more difficult for students in this specialized course than for students in the standard course.
If the simulation parameter
In the example: All students are from the same school and hence there is no impact. However, both types of courses may be taught in this one school, thus leading to DIF as in Scenario 2.
If
In the example: Only the standard course is taught in both school types. Hence no DIF is present but impact between the school types.
If there is DIF (i.e.,
These groups can coincide: For subjects with low mean ability
Additionally, the DIF groups and ability groups can also not coincide: Subjects in either DIF group may stem from both ability groups, not just one. This is simulated in Scenario 4 and labeled Impact and DIF, not coinciding. The resulting score distribution is illustrated in the left panel of Figure 4. Again, subjects for whom item difficulties
In the example: Students from both school types and from both course types are considered, thus leading to both impact and DIF. Either both courses are taught at both schools (Scenario 4, not coinciding) or the standard course is only taught in the first school and the specialized course is only taught at the second school (Scenario 5, coinciding).
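The scenarios can be mimicked with a small data generator. This is a sketch under assumptions, not the simulation code of the study: the difficulty values, the normal ability distributions with standard deviation 0.3, the equal group sizes, and the parameter names `impact` and `dif` are all hypothetical choices. Only the direction of the DIF (Item 5 harder, Item 16 easier in the specialized course) follows the example above.

```python
import math
import random


def make_scenario(n, impact, dif, coinciding, rng):
    """Return responses plus the true ability group and DIF group of each subject."""
    m = 20
    standard = [-1.9 + 0.2 * j for j in range(m)]   # hypothetical increasing difficulties
    specialized = list(standard)
    specialized[4] += dif       # Item 5: harder after the specialized course
    specialized[15] -= dif      # Item 16: easier after the specialized course
    profiles = (standard, specialized)

    responses, ability_group, dif_group = [], [], []
    for i in range(n):
        a = i % 2                                   # school type (ability group)
        d = a if coinciding else (i // 2) % 2       # course type (DIF group)
        theta = rng.gauss((a - 0.5) * impact, 0.3)  # group means -impact/2 and +impact/2
        row = [int(rng.random() < 1.0 / (1.0 + math.exp(b - theta)))
               for b in profiles[d]]
        responses.append(row)
        ability_group.append(a)
        dif_group.append(d)
    return responses, ability_group, dif_group
```

Setting `dif=0` removes the DIF, `impact=0` removes the impact, and `coinciding` switches between the two ways of combining both effects.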
Simulation Design. The Latent-Class-Specific Item Parameters

Scenario 3. Left: Item difficulties without DIF (

Stacked histograms of score distributions for Scenarios 4 (left) and 5 (right) with DIF (
Note that Scenario 1 is a special case of Scenario 2 where
For each considered combination of
False Alarm Rate and Hit Rate
The main objective here is to determine how suitable a Rasch mixture model, with various choices for the score model, is to recognize DIF or the lack thereof.
For each data set and type of score model, models with
In the following subsections, the key results of the simulation study will be visualized. The exact rates for all conditions are included as a data set in the R package
Scenario 2: No Impact With DIF
This scenario is investigated as a case of DIF that should be fairly simple to detect. There is no impact as abilities are homogeneous across all subjects so the only latent structure to detect is the group membership based on the two item profiles. This latent structure is made increasingly easy to detect by increasing the difference between the item difficulties for both latent groups. In the graphical representation of the item parameters (left panel of Figure 2) this corresponds to enlarging the spikes in the item profile.
Figure 5 shows how the rate of choosing a model with more than one latent class (

Rate of choosing a model with
The number of iterations in the EM algorithm that are necessary for the estimation to converge is much lower for the mean-variance and the restricted model than for the saturated model. Since the estimation of the saturated model is more demanding due to the higher number of parameters required by this model, it does not converge in about 10% of the cases before reaching the maximum number of iterations, which was set to 400. The mean-variance and restricted model usually converge within the first 200 iterations.
Brief summary
The mean-variance and restricted model have higher hit rates than the saturated model in the absence of impact.
Scenario 3: Impact Without DIF
Preferably, a Rasch mixture model should not only detect latent classes if the assumption of measurement invariance is violated but it should also indicate a lack of latent structure if indeed the assumption holds. In this scenario, the subjects all stem from the same class, meaning each item is of the same difficulty for every subject. However, subject abilities are simulated with impact resulting in a bimodal score distribution as illustrated in Figure 3.
Here, the rate of choosing more than one latent class can be interpreted as a false alarm rate (Figure 6). The restricted score model is invariant against any latent structure in the score distribution and thus almost always (≤0.2%) suggests

Rate of choosing a model with
Brief summary
If measurement invariance holds but ability differences are present, the mean-variance model exhibits a high false alarm rate while the saturated and restricted model are not affected.
Scenario 4: Impact and DIF, Not Coinciding
In this scenario, there is DIF (and thus two true latent classes) if
Figure 7 again shows the rate of choosing

Rate of choosing a model with
As Rasch mixture models with

Rates of choosing the correct number of classes (
Brief summary
If impact is simulated within DIF groups, the mean-variance model has higher hit rates than the saturated and restricted models. However, the latent classes estimated by the mean-variance model are mostly based on ability differences if the DIF effect size is low. If the DIF effect size is high, the mean-variance model tends to overestimate the number of classes.
Scenario 5: Impact and DIF, Coinciding
In Scenario 5, there is also DIF (i.e.,
Again, small ability differences do not strongly influence the rate of choosing more than one latent class (rates for low levels of impact, such as
As impact increases (Figure 9), the hit rates of all models increase as well because the ability differences contain information about the DIF groups: separating subjects with low and high abilities also separates the two DIF groups (not separating subjects within each DIF group as in the previous setting). However, for the mean-variance model these increased hit rates are again coupled with a highly increased false alarm rate at

Rate of choosing a model with
Finally, the potential issue of overselection can be considered again. Figure 8 (solid symbols) shows that this problem disappears for the mean-variance specification if both DIF effect size
Brief summary
If abilities differ between DIF groups, the mean-variance model detects the violation of measurement invariance for smaller DIF effect sizes than the saturated and restricted model. While the mean-variance model does not overselect the number of components in this scenario, the high hit rates are connected to a high false alarm rate when no DIF is present but impact is high. This does not affect the other two score models.
Quality of Estimation
Although here the Rasch mixture model is primarily used analogously to a global DIF test, model assessment goes beyond the question whether or not the correct number of latent classes is found. Once the number of latent classes is established/estimated, it is of interest how well the estimated model fits the data. Which groups are found? How well are the parameters estimated? In the context of Rasch mixture models with different score distributions, both of these aspects depend heavily on the posterior probabilities
This is a standard task in the field of cluster analysis and we adopt the widely used Rand index (Rand, 1971) here: Each observation is assigned to the latent class for which its posterior probability is highest yielding an estimated classification of the data which is compared to the true classification. For this comparison, pairs of observations are considered. Each pair can either be in the same class in both the true and the estimated classification, in different classes for both classifications, or it can be in the same class for one but not the other classification. The Rand index is the proportion of pairs for which both classifications agree. Thus, it can assume values between 0 and 1, indicating total dissimilarity and similarity, respectively.
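The computation described above can be sketched directly; the function names are chosen for this illustration. `hard_assign` maps each observation to the class with the highest posterior probability, and `rand_index` counts the pairs on which two classifications agree.

```python
from itertools import combinations


def hard_assign(posteriors):
    """Assign each observation to the class with the highest posterior probability."""
    return [max(range(len(p)), key=p.__getitem__) for p in posteriors]


def rand_index(true_labels, est_labels):
    """Proportion of pairs treated alike (together or apart) by both classifications."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_est = est_labels[i] == est_labels[j]
        agree += int(same_true == same_est)
        total += 1
    return agree / total
```

Because only pairs are compared, the index does not depend on how the latent classes are numbered, which matters for mixture models where class labels are arbitrary.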
In the following, the Rand index for models with the true number of

Average Rand index for models with
However, in Scenario 5 where the score distribution contains information about the DIF groups, the three score specifications perform very differently as the bottom row of Figure 10 shows. Given the correct number of classes, the mean-variance model is most suitable to uncover the true latent classes, yielding Rand indices close to 1 if both DIF effect size and impact are large. The saturated specification follows a similar pattern albeit with poorer results, reaching values of up to 0.87. However, the classifications obtained from the restricted score specification do not match the true groups well in this scenario, remaining below 0.52 if impact is high. The reason is that the restricted score model is partially misspecified as the score distributions differ substantially across DIF groups.
Summary and Implications for Practical Use
Given various combinations of DIF and ability impact, the score models are differently suitable for the two tasks discussed here—DIF detection and estimation of item parameters in subgroups. Starting with a summary of the results for DIF detection:
The saturated score model has much lower hit rates than the other two specifications, that is, violation of measurement invariance remains too often undetected. Only if high impact and high DIF effect sizes coincide does the saturated model perform about as well as the restricted model.
The mean-variance model has much higher hit rates. However, if impact is present in the abilities, this specification has highly inflated false alarm rates. Hence, if the mean-variance model selects more than one latent class it is unclear whether this is due to DIF or just varying subject abilities. Thus, measurement invariance might still hold even if more than one latent class is detected.
The restricted score model also has high hit rates, comparable to the mean-variance model if abilities are rather homogeneous. But unlike the mean-variance specification, its false alarm rate is not distorted by impact. Its performance is not influenced by the ability distribution and detecting more than one latent class reliably indicates DIF, that is, a violation of measurement invariance.
Hence, if the Rasch mixture model is employed for assessing measurement invariance or detecting DIF, then the restricted score specification appears to be most robust. Thus, the selection of the number of latent classes should only be based on this specification.
DeMars (2010) illustrates how significance tests based on the observed (raw) scores in reference and focal groups suffer from inflated Type I error rates with an increased sample size if impact is present. This does not apply to the false alarm rate of Rasch mixture models because model selection via BIC, rather than a significance test, is carried out. The rate of the BIC selecting the correct model increases with larger sample size if the true model is a Rasch mixture model. Since consistent estimates are employed, a larger sample size also speeds up convergence, which is particularly desirable for the saturated model if the number of latent classes and thus the number of parameters is high.
Given the correct number of classes, the different score models are all similarly suitable to detect the true classification if ability impact does not contain any additional information about the DIF groups. However, if ability impact is highly correlated with DIF groups in the data and the ability groups thus coincide with the DIF groups, this information can be exploited by the unrestricted specifications while it distracts the restricted model.
Thus, while the selection of the number of latent classes should be based only on the restricted score specification, the unrestricted mean-variance and saturated specifications might still prove useful for estimating the Rasch mixture model (after
We therefore recommend a two-step approach for DIF detection via a Rasch mixture model. First, the number of latent classes is determined via the restricted score model. Second, if the estimation of the item difficulties is also of interest, the full selection of score models can then be utilized. While the likelihood ratio test is not suitable to test for the number of latent classes, it can be used to establish the best fitting score model, given the number of latent classes. If this approach is applied to the full range of score models (saturated and mean-variance, both unrestricted and restricted), the nesting structure of the models needs to be kept in mind.
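The bookkeeping of the two steps can be sketched as follows, assuming the mixture models have already been fitted elsewhere and only their log-likelihoods and parameter counts are available. The function names and the input layout are assumptions of this sketch.

```python
import math


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion; smaller is better."""
    return -2.0 * loglik + n_params * math.log(n_obs)


def select_num_classes(restricted_fits, n_obs):
    """Step 1: restricted_fits maps K to (loglik, n_params); return K with minimal BIC."""
    return min(restricted_fits,
               key=lambda k: bic(restricted_fits[k][0], restricted_fits[k][1], n_obs))


def lr_statistic(loglik_nested, df_nested, loglik_full, df_full):
    """Step 2: LR statistic and degrees of freedom for two nested score models."""
    return 2.0 * (loglik_full - loglik_nested), df_full - df_nested
```

The statistic returned by `lr_statistic` would then be compared with a chi-squared reference distribution with the returned degrees of freedom, and only for pairs of score models that are actually nested at the selected number of classes.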
Empirical Application: Verbal Aggression
We use a data set on verbal aggression (De Boeck & Wilson, 2004) to illustrate this two-step approach of first assessing measurement invariance via a Rasch mixture model with a restricted score distribution and then employing all possible score models to find the best fitting estimation of the item difficulties.
Participants in this study are presented with one of two potentially frustrating situations (S1 and S2)
S1: A bus fails to stop for me
S2: I miss a train because a clerk gave me faulty information
and a verbally aggressive response (cursing, scolding, shouting). Combining each situation and response with either “I want to” or “I do” leads to the following 12 items:
First, we assess measurement invariance with regard to the whole instrument: we fit a Rasch mixture model with a restricted score distribution for
DIF Detection by Selecting the Number of Latent Classes
The BIC for a Rasch mixture model with more than one latent class is smaller than the BIC for a single Rasch model, thus indicating that measurement invariance is violated. The best fitting model has
Selection of the Score Distribution Given the Number of Latent Classes
As
To visualize how the three classes found in the data differ, the corresponding item profiles are shown in Figure 11.

Item profiles for the Rasch mixture model with
The latent class in the right panel (with 108 observations) shows a very regular zig-zag pattern: for any type of verbally aggressive response, actually “doing” the response is considered more extreme than just “wanting” to respond that way. This is reflected in the higher item parameter for the second item of each pair (the “do-item”) than for the first (the “want-item”). The three types of response (cursing, scolding, shouting) are considered increasingly aggressive, regardless of the situation (first six items vs. last six items).
The latent class in the left panel (with 112 observations) distinguishes more strongly between the types of response. However, the relationship between wanting and doing is reversed for all responses except shouting. It is more difficult to agree to the item “I want to curse/scold” than to the corresponding item “I do curse/scold.” This could be interpreted as generally more aggressive behavior where one is quick to react a certain way rather than just wanting to react that way. However, shouting is considered a very aggressive response, both in wanting and doing.
The remaining latent class, depicted in the middle panel, is considerably smaller (53 observations) and does not distinguish as clearly between response types, situations, or wanting versus doing.
Therefore, not just a single item or a small number of items have DIF but the underlying want/do relationship of the items is different across the three classes. This instrument thus works differently as a whole across classes.
In summary, the respondents in this study cannot be scaled on one single Rasch scale but instead need several scales to represent them accurately. A Rasch mixture model with a restricted score distribution is used to estimate the number of latent classes. Given that number of classes, any type of score model is conceivable. Here, the various versions are all fairly similar and the restricted mean-variance specification is chosen based on likelihood ratio tests. Keep in mind that the resulting fits can be substantially different from each other as shown in the simulation study, in particular for the case of impact between DIF classes. The latent classes estimated here differ mainly in their perception of the type and the “want/do” relationship of a verbally aggressive response.
Conclusion
Unlike in a single Rasch model, item parameter estimation is not independent of the score distribution in Rasch mixture models. The saturated and mean-variance specifications of the score model are both well established. A further option is the new restricted score specification introduced here. In the context of DIF detection, only the restricted score specification should be used as it prevents confounding effects of impact on DIF detection while exhibiting hit rates positively related to DIF effect size. Given the number of latent classes, it may be useful to fit the other score models as well, as they might improve estimation of group membership and therefore estimation of the item parameters. The best fitting model can be selected via the likelihood ratio test or an information criterion such as the BIC. This approach enhances the suitability of the Rasch mixture model as a tool for DIF detection as additional information contained in the score distribution is only employed if it contributes to the estimation of latent classes based on measurement invariance.
Computational Details
An implementation of all versions of the Rasch mixture model mentioned here is freely available under the General Public License in the R package
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Austrian Ministry of Science BMWF as part of the UniInfrastrukturprogramm of the Focal Point Scientific Computing at Universität Innsbruck.
