Abstract
The assessment of test data for the presence of differential item functioning (DIF) is a key component of instrument development and validation. Among the many methods that have been used successfully in such analyses is the mixture modeling approach. Using this approach to identify the presence of DIF has been touted as potentially superior for gaining insights into the etiology of DIF, as compared to using intact groups. Recently, researchers have expanded on this work to incorporate multilevel mixture modeling, for cases in which examinees are nested within schools. The current study further expands on this multilevel mixture modeling for DIF detection by using a multidimensional multilevel mixture model that incorporates multiple measured dimensions, as well as the presence of multiple subgroups in the population. This model was applied to a national sample of third-grade students who completed math and language tests. Results of the analysis demonstrate that the multidimensional model provides more complete information regarding the nature of DIF than do separate unidimensional models.
Differential item functioning (DIF) refers to the case where performance on an item differs for members of different subgroups from the population, when examinees have been matched on the underlying latent trait being measured by the instrument. Typically (Camilli & Shepard, 1994), DIF is described as taking one of two forms: (a) uniform DIF, in which the conditional (on the trait being measured) probability of a correct item response differs between members of two groups, and this difference is the same across all values of the measured trait; and (b) nonuniform DIF, in which the conditional probability of a correct item response differs between groups, but this difference is not the same across all values of the measured trait. Assessment of DIF based on examinee traits such as gender and ethnicity is a very common component of instrument development and validation (Camilli & Shepard, 1994). When DIF is detected, the developers of the assessment may take measures to address the potential bias engendered by such items, including removing them from the scale or rewriting them. DIF may be associated not only with demographic characteristics such as race or gender, which are commonly studied, but also with whether an individual has some type of learning or emotional disability that might affect the manner in which they access certain test items (Finch, Barton, & Meyer, 2009). When students have been identified with particular disabilities that are known or suspected to affect their test performance, specific accommodations may be employed to bring balance to the testing process and improve the accessibility of test items for these students (Kettler et al., 2005).
The purpose behind such accommodations is to remove barriers for examinees who, because of an identified disability, may have difficulty successfully completing certain items in the standard testing format originally envisioned by the instrument developers. Such difficulties can, in turn, threaten the validity of the items as they may not measure the intended construct properly. Testing accommodations might include the allowance of more time to take the test than is afforded to members of the general population, the provision of an alternative test setting that contains fewer distractions than the standard testing room, the reading of testing directions and/or items to examinees, and the use of calculators for solving math problems that are not focused on examinee ability to do calculations by hand (Sireci, Scarpati, & Li, 2005).
Prior Research on DIF, Disability Status, and Testing Accommodations
Prior research has examined the relationship of testing accommodations and identified disability status to DIF, though no clear patterns have yet emerged. For example, in a review of this work, Sireci et al. (2005) found that several prior studies comparing the performance of students with disabilities who received accommodations (SWDA) and those with disabilities who did not receive any testing accommodations (SWDN) yielded an inconsistent picture of DIF results: in some cases the receipt of testing accommodations was associated with DIF, while in others it was not. Willingham et al. (1988) found no DIF when comparing items across students with and without disabilities when testing accommodations were in place, though DIF did emerge when accommodations were not present. Prior research focused specifically on a comparison of SWDA and SWDN examinees provides a similar pattern of mixed results with regard to the presence and magnitude of DIF (Barton & Finch, 2004; Bielinski, Thurlow, Ysseldyke, Freidebach, & Freidebach, 2001; Bolt, 2004; Cahalan-Laitusis, Cook, & Aicher, 2004; Cohen, Gregg, & Deng, 2005; Finch et al., 2009), so that no clear pattern regarding the provision of accommodations or the presence of an identified disability can be discerned. While most prior research did not focus on specific types of disabilities or accommodations, Cohen et al. (2005) did examine DIF for a mathematics exam between SWDA and nondisabled students with respect to an extended time accommodation. Using a mixture Rasch model, they found that differential performance on items did not appear to be related to the accommodation but was associated with differences in skill development in specific areas of mathematics. This result may be seen as supporting the view that the accommodation effectively “evened the playing field” for SWDA examinees.
Multidimensional Assessments Made at Multiple Levels
Large-scale testing programs, such as those developed by states, often administer exams measuring more than one construct. For example, most large-scale federal and state testing programs in the United States incorporate assessments of reading and language skills, mathematics, and perhaps other content areas such as science or social studies (e.g., “National Assessment of Educational Progress,” 2008). Furthermore, decisions made on the basis of these exams regarding examinees, teachers, and schools incorporate scores on all assessments together (e.g., total test score, or a weighted combination of test scores), so that performance on all of them is important. With respect to examinee disability status, some individuals may be identified with a reading-specific learning disability only, a math-specific learning disability only, or disabilities in both areas of learning (Individuals with Disabilities Education Improvement Act of 2004; Department of Education, 2004). Given this potential differentiation in learning disability status, it is not sufficient for psychometricians, measurement practitioners, and educators simply to identify an examinee as having a specific learning disability; the nature of the disability must also be understood. Given prior evidence regarding the nature of DIF associated with particular disabilities, it is crucial that the multifaceted nature of disability status be incorporated into any investigation of DIF. In addition, because many testing programs test multiple students from the same schools, all DIF analyses, including those involving the mixture model, should incorporate the multilevel nature of the data: prior research (French & Finch, 2012) has demonstrated that ignoring this structure inflates Type I error rates, leading to incorrect determinations that DIF is present.
The goal of this study, therefore, is to introduce a multidimensional mixture modeling approach for identifying subgroups in the population that might be associated with the presence of DIF. This model incorporates both the multilevel nature of many testing programs (e.g., students nested in schools), as well as the multidimensional nature of many testing programs (e.g., the assessment of both math and language skills).
Differential Item Functioning and Mixture Item Response Models
There are several statistical methods that can be used for the assessment of uniform DIF, which is the focus of the current study. For the vast majority of these methods, it is assumed that the groups for which DIF testing is of interest are observable, and individual examinees can be identified as belonging to one of them. Common examples of such observable grouping variables include gender, ethnicity, and disability status. In this context, DIF hypotheses focus on the extent to which members of one such group (e.g., males) might find specific items more or less difficult than members of a second group (e.g., females), when they are matched on the trait being measured by the instrument (e.g., reading). A potential disadvantage to using such an approach is that it presumes the source of DIF is largely, if not totally, confounded with observed group membership. In other words, if a researcher tests for DIF based on gender, she/he is assuming that DIF is gender based, thereby precluding other potential sources. Thus, if the mechanism causing DIF is not directly associated with gender, or only partially so, these observed group methods may not be effective tools for detecting it. In response to this conundrum, Cohen and Bolt (2005) described a mixture model approach to detecting uniform DIF, which was different in some fundamental ways from the earlier methods. This mixture Rasch model is expressed as
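$$P(y_{ijg} = 1 \mid g, \theta_{jg}) = \frac{\exp(\theta_{jg} - b_{ig})}{1 + \exp(\theta_{jg} - b_{ig})} \tag{1}$$

where $y_{ijg}$ is the response to item $i$ by examinee $j$ in group $g$; $\theta_{jg}$ is the latent trait value for examinee $j$ in group $g$; and $b_{ig}$ is the difficulty for item $i$ in group $g$.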
In particular, they demonstrated how a mixture Rasch model could identify examinee subgroups with different difficulty parameter values for one or more items, and how this approach has great potential for explaining sources of DIF beyond those associated with readily observed variables such as gender and ethnicity. In such a case, the mixture Rasch model is used to identify latent classes of examinees with similar item response profiles, and then item difficulty parameters estimated by the model can be compared across classes to determine whether DIF is present. Thus, rather than assuming that DIF is associated strictly with membership in an observed category, researchers can first identify groups in the data that differ in terms of item response profiles and then ascertain whether uniform DIF is present for one or more of the items. Furthermore, these latent classes can be characterized in terms of observable characteristics associated with the examinees in them, such as gender or disability status. In this way, Cohen and Bolt, as well as other researchers (e.g., Cohen et al., 2005; Samuelsen, 2005), demonstrated that this mixture Rasch approach can provide greater insights regarding antecedents of DIF than do methods that rely strictly on assessing DIF associated with observed groups. In addition, because this approach does not depend on a predetermined sorting of examinees, which could itself be biased in some respects, it has the potential to provide more well-rounded DIF analyses.
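To make the logic of the mixture Rasch model concrete, the following minimal sketch computes class-specific response probabilities and the posterior probability that a response pattern belongs to each latent class. It is an illustration only: the item difficulties and class weights are hypothetical, the function names are introduced here, and the latent trait is held fixed rather than integrated out as a full estimation routine would do.

```python
import math

def rasch_prob(theta, b):
    """Rasch probability of a correct response given trait theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def pattern_likelihood(responses, theta, difficulties):
    """Likelihood of a 0/1 response pattern under one class's item difficulties."""
    like = 1.0
    for y, b in zip(responses, difficulties):
        p = rasch_prob(theta, b)
        like *= p if y == 1 else 1.0 - p
    return like

def posterior_class_probs(responses, theta, class_difficulties, class_weights):
    """Posterior class probabilities, holding theta fixed (a simplification)."""
    joint = [w * pattern_likelihood(responses, theta, bs)
             for bs, w in zip(class_difficulties, class_weights)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical two-class example: the third item is harder in class 2 (uniform DIF).
class_difficulties = [[-1.0, 0.0, 0.5], [-1.0, 0.0, 2.0]]
class_weights = [0.6, 0.4]

# An examinee who misses only the third item becomes more likely to be in class 2.
post = posterior_class_probs([1, 1, 0], 0.5, class_difficulties, class_weights)
print([round(p, 3) for p in post])
```

Comparing the class-specific difficulties for the third item (0.5 versus 2.0 in this made-up example) is the kind of between-class contrast that the DIF analyses examine.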
Multilevel Mixture Rasch Model
In addition to mixture modeling, another fairly recent addition to the DIF literature has been the emergence of methods for dealing with the multilevel data structure that is common in such assessments (French & Finch, 2010). For example, data for DIF detection studies, particularly in the context of large-scale assessments, are quite often collected from examinees nested within schools. In such cases, it must be assumed that, at least to some extent, schools affect examinee item responses. Such an impact will be expressed in the form of nontrivial intraclass correlation (ICC) values. When such a multilevel data structure is ignored and the ICCs are not 0 (or very close to 0), the resulting analyses will likely yield inaccurate estimates of item parameters and their associated standard errors, which in turn will lead to erroneous DIF detection results. Researchers have begun developing and adapting multilevel methods for DIF detection in the context of observed groupings (e.g., French & Finch, 2012).
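As a rough illustration of how the degree of school-level clustering might be gauged, the sketch below computes ICC(1) from a one-way random-effects ANOVA on continuous scores. This is an assumption made for illustration: the scores are hypothetical, the function name is introduced here, and with dichotomous item responses the ICC would more typically be estimated on a latent scale.

```python
def icc1(groups):
    """ICC(1) from a one-way random-effects ANOVA.

    groups: list of lists, one list of scores per cluster (e.g., school).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    # Effective cluster size, which also handles unequal cluster sizes.
    n0 = (n_total - sum(len(g) ** 2 for g in groups) / n_total) / (k - 1)
    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)

# Hypothetical total scores for three schools with strong between-school differences.
schools = [[10, 12, 11], [20, 22, 21], [15, 17, 16]]
print(round(icc1(schools), 3))
```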
In the context of the mixture modeling paradigm described above, Cho and Cohen (2010) introduced a multilevel mixture Rasch model (MMixRM) for DIF assessment when, for example, examinees (Level 1) are nested within schools or classrooms (Level 2). For the context of examinees nested within schools, the MMixRM model, which is a direct extension of the model in Equation (1), takes the form:
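$$P(y_{ijt} = 1 \mid g, k, \theta_{jtgk}) = \frac{\exp(\theta_{jtgk} - b_{igk})}{1 + \exp(\theta_{jtgk} - b_{igk})} \tag{2}$$

where $y_{ijt}$ is the response to item $i$ by examinee $j$ in school $t$ (1 = correct); $g$ is the Level 1 (e.g., examinee) latent class; $k$ is the Level 2 (e.g., school) latent class; $\theta_{jtgk}$ is the latent trait for examinee $j$ in school $t$ from latent classes $g$ and $k$; and $b_{igk}$ is the difficulty of item $i$ for latent classes $g$ and $k$.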
Based on a simulation study, and analysis of an existing data set, Cho and Cohen (2010) demonstrated that the MMixRM was an effective tool both for identifying the presence of DIF as well as exploring its causes through the inclusion of examinee- and school-level covariates in the model. In addition to revealing its potential for successfully identifying latent classes based on examinee item response patterns, the study by Cho and Cohen also demonstrated how results from the MMixRM could be used in conjunction with other statistical methods for the testing of uniform DIF between examinee classes within school classes, and likewise the testing of school-based DIF within examinee classes.
Multidimensional Multilevel Mixture Modeling for DIF Assessment
The prior work described above using the mixture item response theory (IRT) and MMixRM for investigating DIF focused on the case where a single latent trait was measured. However, as noted previously, many real-world assessment situations involve the assessment of examinees on multiple constructs simultaneously. Such scenarios may involve the administration of subject area tests in which several subskills are assessed simultaneously, as might occur if a mathematics exam assesses multiple aspects of math ability (e.g., computation, problem solving, and geometry). In another context, examinees might be given tests of two separate constructs at the same time, such as language and math. In such instances, the standard IRT assumption of unidimensionality of the latent trait will likely not be valid, leading to potential difficulties in the estimation of both item and person parameters (Ackerman, 1994; Reckase, 1985). In addition, when the data are multidimensional, standard DIF detection procedures (e.g., the Mantel–Haenszel [MH] test, logistic regression, SIBTEST, and the likelihood ratio test) may also not be effective, given that they condition examinee abilities on a single latent trait. In such cases, the multidimensional IRT (MIRT) model is recommended, as it provides accurate parameter estimation as well as additional information about the separate constructs (Reckase, 2007).
Given the potential advantages provided by the MIRT model, and the aforementioned strengths of the MMixRM for DIF assessment with multilevel data, the goal of this study was to describe an extension of the MMixRM that accommodates both multidimensional and multilevel data. The multidimensional multilevel mixture Rasch model (MMMixRM) is designed to be used in much the same manner as the MMixRM discussed above, but in the case where multiple latent traits are assessed simultaneously. The MMMixRM can be expressed as
The terms in Equation (3) are as defined in Equation (2), with the addition that
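One plausible way to write such an extension is sketched below. It assumes a between-item multidimensional structure, in which each item measures a single dimension $d$ (here, language or math), and it should not be read as the authors' exact Equation (3); the subscript $d$ is notation introduced for this sketch.

$$P(y_{ijt} = 1 \mid g, k, \theta_{jtgkd}) = \frac{\exp(\theta_{jtgkd} - b_{igk})}{1 + \exp(\theta_{jtgkd} - b_{igk})}$$

Here $d$ indexes the dimension measured by item $i$, and $\theta_{jtgkd}$ is the latent trait on dimension $d$ for examinee $j$ in school $t$ from latent classes $g$ and $k$. Under this assumed form, setting the number of dimensions to one recovers Equation (2).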
Goals of the Current Study
In this study, the MMMixRM was used to identify latent classes at both the examinee and school levels with item response data from language and mathematics achievement tests. Of particular interest was the extent to which resulting classes could be understood in terms of identified learning disabilities and testing accommodations received by examinees, both of which served as Level 1 (examinee) covariates in the MMMixRM model. The proportion of examinees in each school identified with a disability and the proportion receiving accommodations were Level 2 covariates for predicting school latent class membership. We believe that by including multiple constructs in a single analysis, we might gain greater information regarding examinee typologies that are present in the population, which in turn should provide educators with insights into how better to work with individuals with different item response patterns. In addition, the multidimensional nature of the model should provide researchers and educators more information regarding the multifaceted nature of DIF than would separate analyses of scales in isolation.
Method
Participants
Achievement test data for mathematics (30 dichotomous items) and language (20 dichotomous items) were collected from 2,553 nationally representative Grade 3 examinees (50.5% male) in 65 U.S. schools during the standardization of a nationally normed achievement instrument. The examinees were sampled from schools across the nation and were selected so as to be representative demographically in terms of gender and ethnicity. In addition, schools were selected to represent the United States both geographically and in terms of school settings (i.e., urban, suburban, or rural). As well as demographic information, disability and testing accommodation status were also recorded for each examinee, as were the proportion of examinees in each school identified with a disability, and receiving one or more accommodations. At the examinee level, these variables were recorded for several specific disability and accommodation categories, and were coded as 1 if the disability/accommodation was present, and 0 if not. For purposes of this study, at the school level, disabilities and accommodations were not divided into specific categories because for many of the schools these proportions were too small to include in the statistical analyses described below. At the student level, the individual disabilities and accommodations were used because, as will be evident in the results, there were sufficient numbers of examinees in each category within the examinee-level latent classes. Rates of accommodation receipt along with rates of the various disability categories appear in Table 1. The most common accommodations granted to examinees were having directions read aloud, being granted extra time to take the exam, and being allowed the use of a calculator for portions of the math exam that were not directly focused on hand calculations. The most common disability categories were reading and math-specific learning disabilities, respectively, along with speech impairment. 
Across the entire sample, 265 (10.4%) examinees were classified as having a disability and 209 (8.2%) received an accommodation.
Table 1. Number (Percentage) of Examinees in Each Accommodation and Disability Category.
Mixture Modeling
To investigate the presence of uniform DIF at both the examinee and school levels for both math and language simultaneously, an MMMixRM was fit to the 30 math and 20 language exam items using Mplus version 6.11 (Muthén & Muthén, 2011). Maximum likelihood was used for parameter estimation, with 1,000 random sets of starting values in the initial stage and 100 optimizations in the final stage. Model convergence was attained using these settings. As noted by Cho and Cohen (2010), latent variable scales must be linked across latent classes for the comparison of item difficulty estimates across them to be accurate. The current study made use of the method outlined in the Cho and Cohen article, namely, anchoring the scales in terms of the distribution of the latent trait by setting the mean and variance of a reference class to 0 and 1, respectively. In addition to identifying latent classes in the data, and thus potential uniform DIF, the MMMixRM was also used to investigate possible sources of DIF through inclusion of covariates at both the examinee and school levels of the model. The examinee-level covariates used in this analysis were focused on disability and accommodations, which were of substantive interest in the study. Inclusion of these covariates in the model allowed for the determination of the extent to which different examinee latent classes were associated with the receipt of different specific accommodations, or with differences in rates of identification of particular disabilities. Therefore, if DIF were found for certain items, inclusion of these covariates might help explain its possible sources. As noted above, two covariates were used at the school level of the model: the proportion of disabled students at the school and the proportion of students receiving accommodations at the school. As with the student-level covariates, these variables provided insights into the nature of the school latent classes and school-level DIF.
The covariates were related to student and school group membership using a multilevel multinomial logistic regression model that was incorporated into the multilevel mixture model. The optimal MMMixRM was selected using a combination of the sample size adjusted Bayesian information criterion (aBIC) and the Akaike information criterion (AIC), based on guidelines for practice using the MMixIRT model (Cho & Cohen, 2010). A series of models was estimated for the sample, each producing the information criteria. For a given one of these criteria, the optimal model was the one producing the smallest value.
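The model selection step can be illustrated with a short sketch that computes the AIC and the sample size adjusted BIC from a model's log-likelihood and number of free parameters, and then picks the model with the smallest value on each criterion. The log-likelihoods, parameter counts, and model labels below are hypothetical and are not values from the study; only the sample size of 2,553 comes from the text.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion."""
    return -2.0 * log_likelihood + 2.0 * n_params

def adjusted_bic(log_likelihood, n_params, n_obs):
    """Sample size adjusted BIC, which uses ln((n + 2) / 24) in place of ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log((n_obs + 2.0) / 24.0)

# Hypothetical (log-likelihood, number of parameters) pairs; NOT values from the study.
candidate_fits = {"A": (-61000.0, 110), "B": (-60500.0, 165), "C": (-60480.0, 220)}
n_obs = 2553  # examinee sample size reported for this study

# Smaller is better for both criteria.
best_by_aic = min(candidate_fits, key=lambda m: aic(*candidate_fits[m]))
best_by_abic = min(candidate_fits, key=lambda m: adjusted_bic(*candidate_fits[m], n_obs))
print(best_by_aic, best_by_abic)
```

In this made-up example the two criteria agree; when they disagree, the choice among models requires additional judgment.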
Differential Item Functioning
The latent classes themselves were characterized in terms of item difficulty parameter estimates and examinee demographics. As recommended by Cho and Cohen (2010), examinee-level DIF analyses were conducted within each school-level latent class separately, while school-level uniform DIF detection was conducted by comparing school latent class item difficulty estimates within examinee levels. In the case where there were two latent classes the standard MH test was used, while for situations involving more than two latent classes, the generalized MH (GMH) was employed, as recommended by Penfield (2001) for use with more than two groups. As a post hoc follow-up to a significant GMH result, pairwise MH tests were used to assess uniform DIF for each pair of latent classes. The Benjamini–Hochberg False Discovery Rate (FDR) procedure was used to control the Type I error rate for the multiple hypothesis tests employed here, per findings in the literature showing it to be effective in conjunction with MH (Kim & Oshima, 2012). For both the standard MH and GMH, a purified observed score on the language or math assessment was used as the matching subscale. It is important to note that testing for DIF occurred separately for the two exams. Thus, when testing for uniform DIF on the language exam, individuals were matched on the language score only. Uniform DIF for math was assessed in the same fashion, with examinees matched only on the math score.
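The two-group MH test and the Benjamini–Hochberg step-up procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the stratum counts and p values are hypothetical, the function names are introduced here, and the sketch covers neither the generalized MH test for more than two groups nor the purification of the matching score.

```python
import math

def mantel_haenszel(strata):
    """MH chi-square (continuity corrected), p value, and common odds ratio.

    strata: list of (a, b, c, d) counts, one tuple per matching-score level, where
    a/b = reference group correct/incorrect and c/d = focal group correct/incorrect.
    """
    sum_a = sum_e = sum_v = num = den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        sum_a += a
        sum_e += (a + b) * (a + c) / n  # expected count under no DIF
        sum_v += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
        num += a * d / n
        den += b * c / n
    chi2 = max(abs(sum_a - sum_e) - 0.5, 0.0) ** 2 / sum_v
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square with 1 df
    return chi2, p_value, num / den

def benjamini_hochberg(p_values, q=0.05):
    """Return booleans marking the hypotheses rejected at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            cutoff = rank  # step-up rule: keep the largest rank meeting the bound
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected

# Hypothetical counts for one item at three levels of the matching score.
strata = [(30, 20, 20, 30), (40, 10, 30, 20), (45, 5, 38, 12)]
chi2, p_value, odds_ratio = mantel_haenszel(strata)
print(round(chi2, 2), round(odds_ratio, 2))

# Hypothetical p values from four item-level tests.
print(benjamini_hochberg([0.001, 0.04, 0.03, 0.20]))
```

A common odds ratio above 1 indicates that, at matched score levels, the item favors the group coded as the reference.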
Results
Characterization of Latent Classes
Based on the aBIC and AIC statistics, values of which appear in Table 2, the optimal model included 3 examinee and 2 school latent classes. Therefore, further discussion will focus on results for this model. Table 3 includes both the standardized and observed sum score latent class means for the language and math exams, respectively. An examination of these means reveals that Examinee Class 1 had high scores on both the language and math exams, whereas individuals in Class 2 had relatively low means for both tests. In contrast, examinees in Class 3 had the lowest mean performance on the language test and the highest mean performance on math. Latent Class 2 produced the most variable scores for the language and math tests, while the variances on the math exam for Classes 1 and 3 were comparable, and Class 1 had the lowest variance on the language test. Finally, the language and math exam scores were most strongly correlated within Examinee Latent Classes 1 and 2. They were also significantly correlated for Latent Class 3, though the correlation between the two tests was not as strong in this group as in the other two. A hypothesis test comparing correlation coefficients between independent groups revealed that the relationship between language and math scores was weaker for Latent Class 3 than for the other two examinee classes, while the correlations did not differ significantly between Classes 1 and 2.
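One common way to compare correlation coefficients from independent groups is a z test based on Fisher's r-to-z transformation. The text does not name the specific test that was used, so the sketch below is an assumption, and the correlations and class sizes in it are hypothetical.

```python
import math

def compare_correlations(r1, n1, r2, n2):
    """Two-sided z test for equality of two independent Pearson correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher r-to-z transformation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    return z, p_value

# Hypothetical correlations and sample sizes for two latent classes.
z, p_value = compare_correlations(0.80, 900, 0.55, 800)
print(round(z, 2), p_value < 0.05)
```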
Table 2. Relative Fit Indices for Fitted Models.
Note. AIC = Akaike information criterion; aBIC = adjusted Bayesian information criterion.
Table 3. Latent Class N, Standardized/Sum Score Test Means, Standardized/Sum Score Test Variances, and Correlations Between Latent Traits.
*Latent Class 1 for schools was set to 0 for identification purposes.
**Significantly different from 0, with p < .05.
With regard to the school level, Latent Class 1 consisted of high achieving schools when compared to Class 2, whose latent trait means were set to 0 for model identification purposes. This higher performance level was also evident in the observed language and math test scores. The language test variances were greater for School 2 than for School 1, whereas the opposite pattern was evident for math. Finally, the correlation between the two constructs was statistically significant, and large positive for both school latent classes, and slightly larger in Class 2.
Table 4 displays the distribution of examinees across the examinee and school latent classes. A χ² test of association indicated that membership in examinee latent class was significantly related to membership in school latent class (χ² = 96.53, df = 2, p < .001), with a φ coefficient value of 0.19, suggesting a moderately sized relationship between the two latent class variables. Approximately 50% of examinees in School Class 1 (higher performing schools) were also in the highest performing examinee group (Latent Class 1). Members of the other two examinee latent classes appeared in roughly equal proportions in the higher performing school latent class. In contrast, the lowest performing examinee latent class (Class 2) made up a plurality of the lower performing school latent class, followed closely by the highest performing examinee class. The third examinee latent class (high math/low language) was somewhat less likely to appear in School Class 2 than in Class 1.
Table 4. Examinee Latent Class by School Latent Class Membership, N (%).
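The association statistics reported for this table can be sketched as follows. The function computes a Pearson chi-square and the corresponding effect size (Cramér's V, which coincides with φ when one dimension of the table has two categories). The 2 × 2 counts used to exercise it are hypothetical. The final lines use only the reported χ² of 96.53, and they assume that all 2,553 examinees contributed to the table.

```python
import math

def chi_square_association(table):
    """Pearson chi-square, degrees of freedom, and Cramer's V for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    v = math.sqrt(chi2 / (n * min(len(table) - 1, len(table[0]) - 1)))
    return chi2, df, v

# Hypothetical 2 x 2 counts, used only to exercise the function.
print(chi_square_association([[10, 20], [20, 10]]))

# Effect size and p value implied by the reported chi-square (assumes N = 2,553).
reported_chi2, n_examinees = 96.53, 2553
effect_size = math.sqrt(reported_chi2 / n_examinees)
p_value_df2 = math.exp(-reported_chi2 / 2.0)  # chi-square upper tail, exact for df = 2
print(round(effect_size, 2), p_value_df2 < 0.001)
```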
Covariate Analysis
To gain further insights into the nature of the latent classes, particularly with respect to identified student disabilities and testing accommodation, which was a focus of this study, covariates were included in the MMMixRM model. The examinee and school latent classes, respectively, served as the outcome variables for these analyses. Included as covariates in the examinee-level model were examinee gender, and whether each individual had been identified with a disability in any of the following areas: emotional disability (ED), specific learning disability (LD; reading and math, respectively), speech disability, and mental retardation (MR). Accommodation types included in the model were extended time, having directions read aloud, having questions read aloud, having questions restated, being given an alternative test setting, being given a calculator, and having the teacher mark the answer given by the student. Each of these covariates was coded as either a 1 (yes) or a 0 (no). At the school level, two covariates were included in the model: the proportion of students who were identified with some disability, and the proportion of individuals receiving some type of testing accommodation.
The parameter estimates for each variable at both Levels 1 and 2 appear in Table 5. At the examinee level, the third latent class served as the reference category. Within each column the first number refers to the comparison of Class 1 with 3, and the second number refers to the comparison of Class 2 with 3. In addition, an asterisk signals a case where the coefficients for Classes 1 and 2 were significantly different from one another. These results reveal that, when compared to Class 1 (high math/high language), examinees in Class 3 (high math/low language) were statistically significantly more likely to be male, to be identified with a specific reading disability, and to be identified as having mental retardation. When compared with examinees in Latent Class 2 (low math/low language), those in Class 3 were less likely to be identified with a specific math disability, to be given additional testing time, or to be given a calculator for certain portions of the math test. In terms of comparisons of Examinee Classes 1 and 2, students in Class 1 were less likely to be identified with a reading or math disability, were less likely to be given additional time, or to be given a calculator on certain portions of the math exam.
Table 5. Covariate Parameter Estimates for Logistic Regression of Examinee- and School-Level Latent Classes.
Note. ED = emotional disability; LD = learning disability; MR = mental retardation.
Significant differences between Latent Class 3 (reference category) and Classes 1 and 2 are indicated in boldface.
*Significant differences between Classes 1 and 2.
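To show how coefficients of this kind translate into class membership probabilities, the sketch below evaluates a multinomial logistic model with the last class as the reference category, mirroring the use of Class 3 as the reference at the examinee level. The coefficients and covariate values are hypothetical, the function name is introduced here, and the school-level random effects of the multilevel model are omitted for simplicity.

```python
import math

def class_probabilities(intercepts, slopes, covariates):
    """Multinomial logit membership probabilities with the last class as reference.

    intercepts: one intercept per non-reference class
    slopes: one list of coefficients per non-reference class
    covariates: covariate values (e.g., 0/1 indicators) for one examinee
    """
    linear = [a + sum(b * x for b, x in zip(bs, covariates))
              for a, bs in zip(intercepts, slopes)]
    linear.append(0.0)  # reference class: all coefficients fixed at 0
    denom = sum(math.exp(v) for v in linear)
    return [math.exp(v) / denom for v in linear]

# Hypothetical coefficients for two 0/1 covariates (e.g., a disability and an accommodation).
intercepts = [0.2, -0.4]
slopes = [[0.5, -1.0], [1.2, 0.3]]
print([round(p, 3) for p in class_probabilities(intercepts, slopes, [1, 0])])
```

A positive coefficient for a class raises the odds of membership in that class relative to the reference class, and a negative coefficient lowers them.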
Binary logistic regression was used to ascertain whether the proportion of examinees identified with any disability or the proportion given any accommodations at a school was associated with school membership in a Level 2 latent class. Latent Class 2 was the target group in this analysis. The results at the bottom of Table 5 demonstrate that schools in Class 2 had significantly higher proportions of both identified and accommodated students than did schools in Class 1. The mean proportions of each variable by school latent class appear in Table 6. For both school classes, the proportions of disabled and accommodated students were low. However, schools in Class 2 had twice the proportion of disabled students and three times the proportion of accommodated students as did schools in Class 1. Cohen’s d values for the differences in proportion disabled and proportion accommodated were 0.44 and 0.50, respectively, suggesting moderately sized differences between these latent class means.
Table 6. School-Level Latent Class Mean (SD) Proportions, and Cohen’s d for Difference, of Students in Schools Identified With a Disability or Receiving an Accommodation.
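For reference, Cohen's d for two independent groups can be computed from group means, standard deviations, and sizes as sketched below, using the pooled standard deviation. The text does not state which variant of d was used, so the pooled form is an assumption, and the input values are hypothetical rather than taken from the table.

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d using the pooled standard deviation of two independent groups."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Hypothetical school-level proportions (mean, SD, number of schools) for two classes.
print(round(cohens_d(0.12, 0.10, 30, 0.06, 0.14, 35), 2))
```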
Differential Item Functioning
As described above, DIF was assessed at the examinee level using the GMH test with pairwise MH follow-ups, and at the school level using the MH test. DIF analyses at the examinee level were conducted separately within each school-level latent class, and DIF analyses at the school level were conducted separately within each examinee-level latent class. The results of these analyses were very comparable at each level of the data. In other words, the vast majority of the same items were identified as exhibiting DIF at the student level within each of the school latent classes, just as the vast majority of the same items were found to exhibit DIF at the school level within each of the examinee-level latent classes. Therefore, results are presented below for examinee- and school-level DIF without regard to the specific classes of the other level. The very few items for which results were not consistent across all groups are mentioned in the text.
With respect to examinee DIF, GMH identified 7 math and 12 language items displaying DIF between at least two of the three latent classes. For each of the significant GMH results, pairwise MH analyses were conducted to identify the group pairs for which there were significant differences in item difficulty. The item difficulty parameter estimates appear in Table 7, with notation as to which groups’ values were significantly different from one another. For the math exam, Items 10, 11, 22, 24, 25, 26, and 27 all displayed uniform DIF for at least two of the latent classes. In particular, individuals in Examinee Class 3 found Items 10, 11, 22, 24, and 25 to be significantly easier than did those in Examinee Class 1, who in turn found them to be significantly easier than did examinees in Latent Class 2. In addition, those in Classes 1 and 3 also found Items 26 and 27 to be significantly easier than did examinees in Class 2. In terms of content, Item 10 assessed measurement skills, while Item 11 assessed number relations. Items 22, 24, 25, 26, and 27 each assessed some aspect of data analysis, statistics, and probability. Therefore, it is possible to conclude that individuals in Examinee Class 2 had greater difficulty with items associated with data analysis, statistics, and probability, whereas Examinee Class 3 generally found these items to be easier than did members of the other classes. Indeed, of the 6 test items designed to assess this subject area, 5 were found to exhibit uniform DIF. In contrast, several of the subject areas, including geometry/spatial sense, patterns/functions, problem solving, computation, and operation concepts, had no items with DIF.
Table 7. Item Difficulty Estimates by Examinee Latent Class. Table notes: DIF associated with Group 1; DIF associated with Group 2; DIF associated with Group 3.
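As an illustration of the pairwise MH comparisons reported above, the following minimal Python sketch computes the Mantel-Haenszel common odds ratio and the continuity-corrected MH chi-square for a single dichotomous item. It is an illustration only and is not drawn from the analysis code of this study. The function name `mantel_haenszel`, the group labels `"ref"` and `"focal"`, and the use of a discrete matching variable (for example, banded trait estimates or total scores) are assumptions introduced here.

```python
from collections import defaultdict
import math


def mantel_haenszel(responses, groups, strata):
    """Return (common odds ratio, continuity-corrected MH chi-square).

    responses: 0/1 scores on the studied item, one per examinee
    groups:    "ref" or "focal" for each examinee (hypothetical labels,
               e.g., two latent classes being compared)
    strata:    level of the matching variable for each examinee (assumed discrete)
    """
    # Per-stratum 2x2 counts:
    # [ref correct, ref incorrect, focal correct, focal incorrect]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for y, g, s in zip(responses, groups, strata):
        idx = (0 if g == "ref" else 2) + (0 if y == 1 else 1)
        tables[s][idx] += 1

    num = den = 0.0                      # pieces of the common odds ratio
    sum_a = sum_exp = sum_var = 0.0      # pieces of the chi-square statistic
    for a, b, c, d in tables.values():
        n = a + b + c + d
        if n < 2:
            continue                     # stratum carries no information
        n_ref, n_foc = a + b, c + d      # group sizes in this stratum
        m1, m0 = a + c, b + d            # correct / incorrect totals
        num += a * d / n
        den += b * c / n
        sum_a += a
        sum_exp += n_ref * m1 / n        # E(a) under no DIF
        sum_var += n_ref * n_foc * m1 * m0 / (n * n * (n - 1))

    odds_ratio = num / den if den > 0 else math.inf
    # Continuity correction, floored at zero so tiny deviations give chi2 = 0
    dev = max(abs(sum_a - sum_exp) - 0.5, 0.0)
    chi2 = dev ** 2 / sum_var if sum_var > 0 else 0.0
    return odds_ratio, chi2
```

An odds ratio above 1 indicates that, among matched examinees, the item was easier for the group labeled `"ref"`. The chi-square value is referred to a chi-square distribution with 1 degree of freedom. With three examinee classes, this function would be applied to each pair of classes in turn, after an omnibus GMH test.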
With respect to the language test, an examination of Table 7 reveals that Items 1, 2, 5, 6, 9, 10, 11, 12, 15, 16, 19, and 20 were each found to exhibit uniform DIF for at least two of the examinee latent classes. Item 12 displayed DIF between examinee latent classes for school-level Latent Class 2 but not for school-level Latent Class 1. The item difficulty estimates show that Examinee Class 3 found each of these items more difficult than did those in Class 1, except for Item 16. Members of Class 3 also found Language Items 1, 10, 11, 12, 16, 19, and 20 to be more difficult than did examinees in Class 2. In addition, members of Class 2 had higher item difficulty parameter estimates than did those in Class 1 for Items 1, 2, 5, 6, 9, 11, 15, 16, 19, and 20. In terms of content, Items 1, 5, 6, 11, 12, 15, and 16 all pertained to the area of editing skills. Items 9, 10, 19, and 20 measured use of writing strategies, and Item 2 assessed sentence structure skills. On the other hand, the areas of basic understanding of reading, analysis of text, and evaluation of meaning had no items with DIF. Given these results, it can be concluded that examinees in Class 3 generally found editing skills more difficult than did individuals in Class 1, and in some cases Class 2 as well. Furthermore, those in Class 3 also found the proper use of writing strategies more difficult than did examinees in Classes 1 or 2.
In terms of the school latent classes, item difficulty estimates appear in Table 8. Based on the MH test, Math Item 5, along with Language Items 1, 6, and 19, was found to exhibit DIF. In each case, examinees attending schools in Latent Class 1 (higher achieving schools) found the items to be less difficult than did those in Latent Class 2. Math Item 5 assessed computation skills, whereas Language Items 1 and 6 measured editing skills and Item 19 measured use of writing strategies. The school-level results indicate that for certain items, DIF is associated with the school that an examinee attends and not with characteristics of the examinee himself or herself. In other words, because individuals were matched on the level of the latent trait being assessed, we can conclude that the school a student attends can, by itself, be associated with the presence of uniform DIF.
Table 8. Item Difficulty Estimates by School Latent Class. Table notes: DIF associated with Group 1; DIF associated with Group 2.
Discussion
The goals of this study were twofold. First, we wanted to demonstrate the MMMixRM model for investigating DIF when multidimensional testing data are collected in a multilevel context. Prior work has shown how the mixture Rasch model and multilevel mixture Rasch model can be effective tools for identifying the presence of DIF, as well as potential causes of it. The current study extends this earlier work by incorporating multiple latent traits into the modeling framework, thereby allowing for multilevel latent class estimation when multiple latent traits are measured simultaneously. This model has many applications to real-world data in which examinees are assessed on multiple constructs simultaneously, because it incorporates items measuring the various constructs when it identifies latent classes in the population. As will be described below, analysis of the same data using two separate MMixRMs yielded quite different results from the MMMixRM. In addition, while the current example focused on two clearly separate constructs, language and math, this model could also be applied in the context of, for example, a mathematics examination that is designed to measure several distinct latent traits within the broader umbrella of math. The second goal of this study was to investigate identified disability and provided accommodations as antecedents of uniform DIF, through their association with latent classes at both the examinee and school levels. To the extent that DIF is found to be associated with the classes, relationships between latent class membership and disability and accommodation status may help explain its causes.
The results presented above described the presence of 3 examinee-level and 2 school-level latent classes. Examinee Classes 1 and 2 were characterized by having uniformly high (Class 1) or uniformly low (Class 2) language and math performance. On the other hand, Examinee Class 3 had the highest mean performance on the math test and the lowest on the language assessment, and the correlation between the two constructs for this group, while certainly not small, was significantly lower than that of either Latent Class 1 or 2. In addition, individuals in Latent Class 1 were less likely to have been identified with a reading disability than those in either of the other two latent classes, and less likely to be identified as having a diagnosis of mental retardation than those in Class 3. Individuals in Latent Class 2 were the most likely to be given additional time to take the test and to be allowed the use of a calculator for some portions of the math test. Finally, members of Latent Class 3 were more likely to be males than were those in Latent Class 1. In terms of uniform DIF, results showed that individuals in Latent Class 3 found a number of math items easier than did those in other classes, particularly items pertaining to data analysis, probability, and statistics. On the other hand, they found language items assessing editing skills and the use of writing strategies to be more difficult than did examinees in Latent Class 1. In summary, at the examinee level, three distinct typologies emerged: individuals who tend to do well on both language and math; individuals who do poorly on both, and who are more likely to receive accommodations and be identified with math and reading disabilities; and individuals who excel at math but have difficulty with language, and who are more likely to be identified with a reading disability than those in the highest performing group.
This latter group also found items assessing relatively complex math skills such as data analysis and probability to be easier than did members of the other latent classes, while at the same time having greater difficulty with items assessing writing and editing skills. The results also showed that the school a student attends can contribute to the presence of uniform DIF. In particular, individuals attending lower performing schools, which contained more examinees with identified disabilities and more examinees receiving accommodations, found some items to be more difficult than did examinees in schools belonging to the higher performing latent class. In other words, even when examinees are matched on the latent trait being measured, the school they attend can be associated with the presence of DIF.
Though not reported in the results because they were not the focus of this study, separate MMixRMs were estimated for math and language with these data in order to provide a contrast for the current set of results. The results of these unidimensional analyses revealed a simple ordering of latent classes in terms of their performance on either the math or language exams. In other words, for both tests three examinee-level latent classes were identified, which could be characterized as having high, medium, and low performance on math and, separately, high, medium, and low performance on language. In addition, DIF results generally reflected this ordering such that when uniform DIF appeared, it was characterized by the lower performing groups finding the items to be more difficult. A similar ordering was found for the school-level latent classes based on the MMixRMs. In sum, estimation of latent classes based on the two sets of items separately did not reveal the same variegated pattern that became apparent when the constructs were included together in the multidimensional model. Thus, researchers who relied on separate unidimensional models would very likely miss the presence of individuals in the population who excel at mathematics, particularly complex tasks such as data analysis and probability, while having difficulty with more complex elements of language usage such as editing and writing. In this way, we believe the MMMixRM provides a unique addition to the measurement professional’s toolbox when multidimensional assessment data are present.
The results of this study demonstrate that when multiple constructs are assessed in a single examination session, MMMixRM may provide greater insights into uniform DIF at both Level 1 (examinee) and Level 2 (school) by examining the multiple dimensions simultaneously. This inclusion of multiple dimensions provides a more complete characterization of examinee and school classes by using their relative proficiencies in multiple constructs simultaneously, thereby reflecting the real-world contexts in which students learn and are assessed. In turn, it is then possible to develop a deeper understanding of examinee typologies than would be possible if the separate constructs were examined in isolation. For example, here we were able to distinguish examinees who were strong in math and weak in language from those who were strong in both, or weak in both.
Study Limitations and Directions for Future Research
As with all research, the current study has limitations that must be taken into consideration when interpreting the results, and which offer directions for future research. First, the two constructs included in this study were clearly different, thus potentially allowing for easier separation of individuals into latent classes with distinct item response and test score mean patterns. Future research should expand this work by examining the performance of the MMMixRM with a multidimensional scale designed to measure a more unitary construct, such as reading or math, as opposed to two clearly distinct constructs as was done here. Second, the proportions of individuals with particular identified disabilities and accommodations were relatively low in this sample, though it should be noted that these rates are comparable to those in the general population. Given the large size of the sample overall, model parameter estimation was not compromised. Nonetheless, covariates with larger numbers of individuals in each category might provide greater power than was the case here, potentially leading to a clearer picture of which covariates were associated with latent class membership. Third, given the increasing popularity of constructed response items in many assessments, a further direction for future research is the extension of the MMMixRM to the polytomous item case.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
