Abstract
Differential item functioning (DIF) and differential test functioning (DTF) analyses are routine procedures for detecting item- or test-level unfairness as an explanation for group performance differences. However, unequal and small sample sizes reduce the statistical power of DIF/DTF detection procedures. Furthermore, DIF/DTF cannot be used for two test forms without common items. One advantage of person-fit analysis (PFA) is that, even with small sample sizes and/or no common items, it can still be used to investigate whether different subgroups are on the same scale. This study used simulated data and empirical data from a large-scale state high school assessment to examine test fairness between nonaccommodated and accommodated forms. The results showed that the accommodated form was comparable with the nonaccommodated form within the same construct and that PFA performed well for scale comparability purposes regardless of subgroup size and number of common items.
The Michigan Merit Examination (MME) is used to assess Grade 11 and eligible Grade 12 students on high school content standards and expectations in English language arts (ELA), mathematics, science, and social studies. In Michigan, all students are required to participate in state-level assessment programs approved by the State Board of Education (SBE). Some students who customarily use accommodations in the classroom may also need accommodations during this assessment. According to the Code of Fair Testing Practices in Education (Joint Committee on Testing Practices, 2004), test developers should provide “tests that are fair to all test takers regardless of age, gender, disability, race, ethnicity, national origin, religion, sexual orientation, linguistic background, or other personal characteristics” (p. 1). The work of ensuring test fairness takes place at every stage of the test development process, including item writing and review, pretesting (i.e., pilot or field testing), form construction, and form review.
To ensure test fairness, items included in the MME are evaluated by two groups of external reviewers. One group is composed of content experts (including classroom teachers, college faculty, and curriculum specialists) who represent diversity in geographic region, race/ethnicity, and gender; their review focuses on content accuracy, item classifications, skill levels, and grade-level appropriateness. The other group is composed of bias and fairness reviewers, who are of diverse races/ethnicities, genders, and geographic backgrounds and are sensitive to issues of item and test fairness. The bias reviewers carefully examine all items and materials to make sure they do not contain any language, roles, situations, or contexts that could be considered offensive or demeaning to any population subgroup. However, bias review panels are usually not effective at identifying items that are statistically biased. In practice, therefore, pretest items are also examined statistically to determine whether they function equally among different population subgroups.
One factor that potentially contributes to an item functioning differentially among subgroups is the presence or absence of accommodation. Many researchers have investigated the effects of accommodation and response format on test scores of students with disabilities (SWD) using statistical methods on statewide assessment tests (Elliott, Kratochwill, & McKevitt, 2001; Fuchs, Fuchs, Eaton, Hamlett, & Karns, 2000; Helwig & Tindal, 2003; Koretz & Hamilton, 2000; McKevitt et al., 2000; Tindal, Heath, Hollenbeck, Almond, & Harniss, 1998). Although previous research has demonstrated that SWD performed better under accommodated conditions, there is no evidence that an accommodated form functions equivalently to a nonaccommodated form. Furthermore, most comparability studies of nonaccommodated and accommodated tests have focused only on mean differences between test scores (e.g., t tests). Mean score differences alone are not adequate evidence of scale comparability without evidence from measurement equivalence tests (Raju, Laffitte, & Byrne, 2002). The question of measurement invariance remains: “Do the test scores of students who took the accommodated form have the same meaning as the scores of those who took the nonaccommodated form?” This question cannot be addressed by comparing simple mean differences of scores between nonaccommodated and accommodated forms. Therefore, further psychometric investigation from a measurement equivalence perspective is needed.
Among the various ways to assess measurement equivalence, differential item functioning (DIF) analysis has been used most frequently. DIF analysis detects items that function in statistically different ways for distinct population subgroups (focal and reference groups), given that both groups have the same level of ability with respect to the content knowledge (Raju et al., 2002). For this reason, DIF has been commonly used to assess construct comparability in international, comparative, and cross-cultural research, and the comparability of translated and/or adapted measures in particular (Hambleton, Merenda, & Spielberger, 2006; Kristjansson, Desrochers, & Zumbo, 2003). The typical grouping variables for DIF have been ethnicity (e.g., Black and White) and gender (female and male). However, DIF can be used only if (a) the two forms being compared share common items and (b) the sample sizes of test takers from the two forms are approximately equal.
Previous research has found that unequal and small sample sizes affect the statistical power of DIF detection procedures (Awuor, 2008). Furthermore, DIF analysis identifies only individual items on which students in one group have a different probability of answering correctly than students in the other group. It does not indicate whether an assessment functions equivalently across two subgroups at the test or form level. For that purpose, differential test functioning (DTF; e.g., Raju, Oshima, & Wolach, 2005) has been used to determine whether two forms/tests are comparable under the same model. However, DTF still depends on the level of DIF in each item. If some items favor one group and other items favor the other group, the DIF partially or fully cancels out at the overall test level. As a result, DTF may have only a trivial impact on the assessment of group mean differences or on any particular examinee’s score (Reise & Flannery, 1996). Consequently, DIF/DTF may be of limited use for determining whether an accommodated form functions equivalently to the nonaccommodated form in a statewide assessment such as the MME.
Because the use of DIF/DTF for form and test comparability is limited by these specific requirements, person-fit analysis (PFA) has been applied to determine whether a test is comparable between two different forms/groups (e.g., Engelhard, 2009; Meijer, Egberink, Emons, & Sijtsma, 2008; Reise & Flannery, 1996). Person-fit is defined as a statistical method for evaluating the fit or misfit of individual test performance to a model-based response pattern of a population (Meijer & Sijtsma, 2001; Reise & Flannery, 1996). Because unusual responses threaten the validity of item and person ability estimates, the original purpose of PFA was to identify unusual responses to a set of test items. Possible causes of unusual responses include “guessing,” “cheating,” “alignment error,” and multidimensionality of a construct (Levine & Rubin, 1979; Wright & Stone, 1979).
In addition to detecting unusual response patterns, PFA has been applied to evaluate the comparability of test scores across different groups or forms (Reise & Flannery, 1996). Because the person-fit index represents the likelihood of an examinee’s response pattern given his underlying ability (
Study Objective
In standardized tests, some conditions of test administration could be unfair to certain SWD. To counter this unfairness arising from the test administration format, the MME offered accommodated test forms to meet the special needs of SWD. Although test administration conditions were accommodated, there is no guarantee that the accommodated test conditions function equivalently to the nonaccommodated (standard) test conditions. Furthermore, because of test security and test requirements, the accommodated and nonaccommodated forms of the MME contain different items, with only a few items in common. Therefore, rather than DIF/DTF, PFA was applied to the MME science data to assess scale comparability between the two forms.
Person-fit has typically been used to detect misfitting responses at the individual level. This study, however, applied the lz person-fit statistic to real data to investigate scale comparability at the group level. The purpose of this study was to examine whether test scores from the nonaccommodated and accommodated forms of a large-scale state science assessment are comparable by using lz PFA. To address this objective, lz PFA was applied to real data as follows: In the “Method” section, concurrent item parameter calibration and PFA using empirical cutoff values are described, and then the lz person-fit distributions of the nonaccommodated and accommodated forms are compared using 2010 MME science test data. The results imply that test scores from the two forms are comparable within the same construct. Limitations and directions for future research are described in the “Discussion” section.
Method
Data Sources
Data were drawn from the 2010 MME science assessment administered to Grade 11 students to investigate test fairness and form comparability. This science test was based on the Michigan high school content expectations (HSCEs) typically covered in the high school science curriculum. Materials were drawn from the biological sciences, earth/space sciences, physics, and chemistry. The test emphasized scientific reasoning skills rather than recall of specific scientific content, skill in mathematics, or skill in reading; only minimal arithmetic and algebraic computations were required to answer some items. The nonaccommodated test consisted of six forms. Each form consisted of a common set of items covering HSCEs, plus unique items that covered other HSCEs, and field test items. The nonaccommodated and accommodated forms were constructed independently. For the nonaccommodated assessment, in light of the American College Testing (ACT) linking studies, 20 items were selected from the 40 ACT science items, and 32 state-developed science items were selected from the standard operational test form. Likewise, 20 of the 40 ACT science items and 32 state-developed science items were selected and compiled as the operational accommodated form for SWD. In the 2010 science assessment, the nonaccommodated and accommodated forms shared 7 common items, and each form had 45 unique items. The two forms were built under the same test specification (blueprint), which mapped the state high school science curricula and standards. For the comparability study, data were used from students who took one of the six nonaccommodated forms and from students who took the accommodated form.
Accommodated Form
The decision to allow use of a particular accommodation was made on an individual basis, taking into account the needs of the student and whether the student routinely received the accommodation in classroom instruction and testing. If a student received special education services, all accommodations had to be documented in the student’s individualized education program (IEP) or English language learner (ELL) instructional plan. In this study, oral administration was an available accommodation for students taking the 2010 science tests. Students were accommodated by having a test administrator read a script aloud or by using a prerecorded audio version of the scripted test and Braille print. Oral accommodations were permitted only for students whose IEPs or ELL instructional plans specified that the student routinely used audio accommodations during classroom assessments. (“Audio versions” refers to audio cassettes, audio DVDs, and reader scripts.) For more information regarding specific accommodations and the specific guidelines that must be followed when audio cassette or DVD versions of the assessments are used, please refer to “Administration Manual for Students Testing With Accommodations” (Michigan State Board of Education, 2009).
Concurrent Calibration
To compare response patterns within the same structural model, concurrent calibration was used to place the two forms of the 2010 science data on the same scale by combining them in a single data matrix linked through the seven common items. Before applying concurrent calibration, a DIF analysis using the Mantel–Haenszel (MH) statistic (Holland & Thayer, 1988) was performed on the seven common items to identify any item on which students in the nonaccommodated group had a different probability of answering correctly than students in the accommodated group. None of the seven common items was identified as functioning differentially between the two forms, so the common items could be used to place the two forms on the same scale. Research (e.g., Hanson & Béguin, 2002; Kim & Cohen, 1998) has shown that concurrent calibration places the item parameters for two forms on the same scale. Hanson and Béguin (2002) showed that concurrent calibration produced more accurate parameter estimates than separate linking methods. In this study, BILOG-MG (Zimowski, Muraki, Mislevy, & Bock, 1996) was used to implement concurrent calibration with marginal maximum likelihood estimation (MMLE; Bock & Lieberman, 1970) for a three-parameter logistic model (3PLM). The priors in BILOG-MG were set to a normal distribution (M = 0, SD = 2) for b, a lognormal distribution (M = 0, SD = 0.5) for a, and a beta distribution (α = 5, β = 17) for c. The number of quadrature points was set to 100. Floating priors were used to reduce the possible effect of incorrectly specified prior distributions of item parameters (Baker, 1992). The 3PLM for dichotomous multiple-choice (MC) items with u = 0 or 1 was given as follows (Birnbaum, 1968):
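In its conventional textbook form, the model is reproduced here for reference; the item subscript i and the symbols a_i (discrimination), b_i (difficulty), and c_i (pseudo-guessing) are assumed notation chosen to match the a, b, and c priors described above.

```latex
P_i(\theta) = P(u_i = 1 \mid \theta)
            = c_i + (1 - c_i)\,
              \frac{\exp\left[ D a_i (\theta - b_i) \right]}
                   {1 + \exp\left[ D a_i (\theta - b_i) \right]}
```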
where D = 1.7,
Estimated Item Parameters Using Concurrent Calibration for 2010 MME Science Tests.
Note. MME = Michigan Merit Examination.
One problem with maximum likelihood estimation (MLE) for the 3PLM is that θ cannot be estimated unless the response pattern contains at least one correct and one incorrect response (Birnbaum, 1968). Therefore, the maximum a posteriori (MAP) method was used to estimate all students’ abilities, which were then used to compute lz values. The MAP estimator assumes a prior distribution of θ (here, the standard normal distribution) and computes the posterior as the product of the likelihood and the prior divided by the marginal probability. To obtain the MAP estimate of θ in this study, the Newton–Raphson procedure was used to locate the maximum of the posterior iteratively (Baker, 1992). When a change in
Summary of Students’
Note. MME = Michigan Merit Examination.
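The MAP step described in this section can be illustrated with a minimal sketch. It assumes a standard normal prior and the usual 3PLM form, and it uses the expected information (Fisher scoring) in place of the observed second derivative, which is a common simplification of Newton–Raphson and not necessarily what the scoring software does. The function names, the tolerance, and the iteration cap are hypothetical.

```python
import math

D = 1.7  # scaling constant stated in the text


def p3pl(theta, a, b, c):
    """3PLM probability of a correct response (conventional form)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def map_theta(responses, items, tol=1e-6, max_iter=100):
    """MAP ability estimate under a N(0, 1) prior.

    responses: list of 0/1 item scores.
    items: list of (a, b, c) tuples, one per item.
    tol and max_iter are illustrative choices, not values from the study.
    """
    theta = 0.0
    for _ in range(max_iter):
        grad = -theta  # derivative of the log N(0, 1) prior density
        info = 1.0     # precision contributed by the prior
        for u, (a, b, c) in zip(responses, items):
            p = p3pl(theta, a, b, c)
            w = (p - c) / (1.0 - c)  # logistic part of the 3PLM
            grad += D * a * w * (u - p) / p
            info += (D * a * w) ** 2 * (1.0 - p) / p
        step = grad / info
        theta += step
        if abs(step) < tol:
            break
    return theta
```

Because the prior contributes to both the gradient and the information, the estimate stays finite even for all-correct or all-incorrect response patterns, which is the reason given above for preferring MAP to MLE.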
lz Person-Fit Index
Various person-fit statistics and indices have been proposed to detect nonfitting examinees (e.g., Drasgow & Levine, 1986; Meijer & Sijtsma, 2001; Tatsuoka, 1984). Several researchers have used the person response function as a person-fit index and compared it with other person-fit indices (e.g., Nering, 1995; Trabin & Weiss, 1983). Meijer and Sijtsma (2001) reviewed a large number of statistics invented for the purpose of identifying nonfitting response patterns. Several studies have demonstrated that the lz statistic (the standardized version of the lo index) performed better than other person-fit statistics in many cases (Drasgow, Levine, & McLaughlin, 1987; Drasgow, Levine, & Williams, 1985) and is one of the most powerful person-fit statistics for the detection of nonfitting responses (Drasgow, Levine, & McLaughlin, 1991; Li & Olejnik, 1997). The lz person-fit statistic is a type of Z score of lo (Drasgow et al., 1985):
where lo represents the likelihood of an examinee’s response pattern given their
where n is the number of items on the test,
lz is asymptotically normally distributed with a mean of 0.0 and SD of 1.0 conditional on
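As an illustration of how the statistic is computed, the sketch below uses the standard definition of lz from Drasgow et al. (1985): the log-likelihood lo of the observed pattern, minus its expected value, divided by its standard deviation, all evaluated at a given θ. The function names and the item parameters in the usage lines are hypothetical, and the 3PLM form is the conventional one.

```python
import math

D = 1.7  # scaling constant stated in the text


def p3pl(theta, a, b, c):
    """3PLM probability of a correct response (conventional form)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def lz(responses, theta, items):
    """Standardized log-likelihood person-fit statistic.

    responses: list of 0/1 item scores.
    theta: ability value at which the statistic is evaluated.
    items: list of (a, b, c) tuples, one per item.
    """
    l0 = expected = variance = 0.0
    for u, (a, b, c) in zip(responses, items):
        p = p3pl(theta, a, b, c)
        q = 1.0 - p
        l0 += u * math.log(p) + (1 - u) * math.log(q)
        expected += p * math.log(p) + q * math.log(q)
        variance += p * q * math.log(p / q) ** 2
    return (l0 - expected) / math.sqrt(variance)


# Hypothetical five-item test ordered from easy to hard.
example_items = [(1.0, b, 0.2) for b in (-2.0, -1.0, 0.0, 1.0, 2.0)]
fitting = lz([1, 1, 1, 0, 0], 0.0, example_items)     # easy right, hard wrong
misfitting = lz([0, 0, 0, 1, 1], 0.0, example_items)  # easy wrong, hard right
```

Large negative values indicate response patterns that are unlikely under the model, which is why only the lower tail is used for flagging misfit.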
Selecting Empirical Cutoff
Instead of relying on critical values from the theoretical distribution, this study derived empirical cutoff values from real-parameter simulations (Seo & Weiss, 2013). In the real-parameter simulations, only the distribution of θ was specified because item parameters were obtained from real data. The probability of a correct response for each cell of the person-by-item data matrix was then computed using the 3PLM. To create dichotomous responses, each probability was converted to a scored item response by comparing it with a random number drawn from a uniform distribution U[0, 1]: if the random number was greater than the probability correct, the item response was 0; if it was less than or equal to the probability correct, the item response was 1. Item responses for real-parameter simulations were generated using R (R Development Core Team, 2007). Real-parameter simulations were conducted using the same numbers of items and examinees as in the MME science data, with θ distributed N(0, 1). All values of lz were computed using the PERSON z program (Choi, 2010). Finally, the cutoff value applied to the real data was the value marking the bottom 5% of the lz distribution from the real-parameter simulation.
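The simulation procedure can be sketched as follows. This is an illustrative Python version rather than the R and PERSON z workflow used in the study. The item parameters are random placeholders standing in for the calibrated MME estimates, the number of simulees is arbitrary, and lz is evaluated at the generating θ instead of at a MAP estimate to keep the example short. All names are hypothetical.

```python
import math
import random

D = 1.7  # scaling constant stated in the text


def p3pl(theta, a, b, c):
    """3PLM probability of a correct response (conventional form)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def lz(responses, theta, items):
    """Standardized log-likelihood person-fit statistic."""
    l0 = expected = variance = 0.0
    for u, (a, b, c) in zip(responses, items):
        p = p3pl(theta, a, b, c)
        q = 1.0 - p
        l0 += u * math.log(p) + (1 - u) * math.log(q)
        expected += p * math.log(p) + q * math.log(q)
        variance += p * q * math.log(p / q) ** 2
    return (l0 - expected) / math.sqrt(variance)


def simulate_responses(theta, items, rng):
    """Score 1 when the uniform draw is <= the probability correct, else 0."""
    return [1 if rng.random() <= p3pl(theta, a, b, c) else 0
            for a, b, c in items]


def empirical_cutoff(items, n_examinees, rng, alpha=0.05):
    """Return the lower-tail lz cutoff and the sorted simulated lz values."""
    values = []
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)  # theta ~ N(0, 1), as in the text
        responses = simulate_responses(theta, items, rng)
        values.append(lz(responses, theta, items))
    values.sort()
    # Simple order-statistic quantile: the value below which roughly
    # alpha * n_examinees simulated lz values fall.
    return values[int(alpha * len(values))], values


rng = random.Random(2010)  # fixed seed so the sketch is reproducible
# Placeholder parameters for a 52-item form (NOT the MME estimates).
placeholder_items = [(rng.uniform(0.6, 1.8), rng.uniform(-2.0, 2.0), 0.2)
                     for _ in range(52)]
cutoff, lz_values = empirical_cutoff(placeholder_items, 2000, rng)
```

An examinee in the real data would then be flagged as misfitting when his or her lz value falls below `cutoff`.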
Figure 1 shows how to obtain empirical cutoff values for real data based on real-parameter simulation. The left panel displays a scatterplot of lz statistics by

Figure 1. Theoretical cutoff value for real data (left side) and empirical cutoff from simulation data (right side).
Evaluation Criteria
In a comparison between the nonaccommodated and accommodated forms, two-sample t tests indicated a statistically significant form effect, which is unsurprising given that the sample size for the nonaccommodated and accommodated forms exceeded 8,000. With samples this large, t tests yield statistically significant results even for practically trivial differences. Therefore, this study used effect size (Cohen, 1988) to evaluate the form type effect. In addition, a two-sample proportion test was performed to test the equality of misfit ratios between the two forms.
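Both criteria are simple to compute, as the sketch below suggests. It assumes Cohen's d with a pooled standard deviation and a pooled two-sided z test for two proportions; the study does not state which variants were used, so these are assumptions, and the function names and the numbers in the usage lines are hypothetical.

```python
import math


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-sample z test for equality of proportions.

    x1, x2: numbers of misfitting examinees; n1, n2: group sizes.
    Returns the z statistic and the two-sided p value.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical group summaries, not the study's data.
d = cohens_d(0.30, 1.0, 5000, 0.25, 1.0, 400)
z, p = two_proportion_z(280, 5000, 21, 400)
```

The effect size is insensitive to sample size, whereas the p value from either test shrinks as the groups grow, which is the motivation for reporting both.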
Results
Because lz represents the likelihood of an examinee’s response pattern given his or her

Histograms of lz statistics for the students taking nonaccommodated and accommodated forms in MME science test.
Table 3 summarizes the descriptive statistics of the lz distributions from the real MME data and real-parameter simulations, and the empirical cutoff values. The mean and SD of lz for students who took the nonaccommodated form were 0.218 and 0.916, and the mean and SD of lz for students who took the accommodated form were 0.247 and 0.983, respectively. The means and SDs of the lz distributions from simulated data were lower than those from the real nonaccommodated and accommodated data. The effect size of
Table 3. Summary of lz Statistics for Nonaccommodated and Accommodated Forms Using Real Data and Simulation Data With Real Item Parameters.
Under theoretical conditions, correlation between lz and θ should be zero across all θ values (Drasgow et al., 1985). Correlations of lz with
The empirical cutoff values of −1.358 and −1.447 were applied to the nonaccommodated and accommodated forms, respectively. The overall misfit ratios were 5.6% for the nonaccommodated form and 5.2% for the accommodated form, both close to the nominal one-tailed Type I error rate of 5%. A two-sample proportion test was performed to determine whether the misfit ratios differed across the two forms. The misfit ratios between the two forms were not statistically different,
Further comparison of the lz distributions between the two forms was performed by breaking them down by students’ ability levels (Table 4). Overall, the misfit ratios across ability levels were similar for both forms. The highest misfit rates were 8.4% and 7.3% among the middle ability level for the nonaccommodated and accommodated forms, respectively, and the differences in misfit ratios between the two forms were within 1.3% across all ability levels. Thus, the discrepancy in misfit ratios between the two forms was trivial at every ability level. Two-sample proportion tests were performed to determine whether misfit ratios conditional on ability level differed across the two forms. As shown in Table 4, the misfit ratios did not differ significantly between the two forms at any ability level. The patterns of misfit ratios, however, were slightly different between the two forms at the medium ability level. The highest misfit ratio for the nonaccommodated form was 8.4% where the ability ranged from −1.0 to 0.0, whereas the highest misfit ratio for the accommodated form was 6.8% where the ability ranged from 0.0 to 1.0. This difference in pattern suggests that misfit varies with students’ ability level rather than with form type. The misfit ratios of medium-ability students were consistently higher than those of low- or high-ability students; that is, students in the medium ability range produced more response patterns that were unlikely under the model than students with low or high ability. As a result, the two forms functioned equivalently with respect to the person scale conditional on ability level, although misfit ratios at specific ability levels differed slightly between the two forms because of a group ability difference.
Table 4. Percentages of Misfit Across Students’ Ability Ranges for Nonaccommodated and Accommodated Forms.
Note. An opening square bracket indicates that the lower bound of the range is inclusive. A closing parenthesis indicates that the upper bound of the range is exclusive.
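The conditional misfit ratios can be tabulated with half-open ability bands of the kind described in the note. The sketch below is illustrative only: the function name, the band edges, and the toy data are hypothetical, and only the cutoff of −1.358 is taken from the text.

```python
def misfit_by_band(thetas, lz_values, cutoff, edges):
    """Misfit ratio within each half-open ability band [low, high).

    thetas: ability estimates, one per examinee.
    lz_values: lz statistics in the same order as thetas.
    cutoff: examinees with lz below this value are counted as misfitting.
    edges: increasing band boundaries, e.g. [-inf, -1, 0, 1, inf].
    Returns {(low, high): ratio}; bands with no examinees map to NaN.
    """
    tallies = {(low, high): [0, 0] for low, high in zip(edges[:-1], edges[1:])}
    for theta, value in zip(thetas, lz_values):
        for (low, high), tally in tallies.items():
            if low <= theta < high:  # lower bound inclusive, upper exclusive
                tally[0] += 1                 # examinees in the band
                tally[1] += value < cutoff    # misfitting examinees
                break
    return {band: (misfit / n if n else float("nan"))
            for band, (n, misfit) in tallies.items()}


# Toy data for illustration; not the MME results.
band_edges = [float("-inf"), -1.0, 0.0, 1.0, float("inf")]
toy_thetas = [-1.5, -1.0, -0.5, 0.0, 0.5]
toy_lz = [-2.0, 0.1, -1.5, 0.3, -1.4]
ratios = misfit_by_band(toy_thetas, toy_lz, -1.358, band_edges)
```

Note that an examinee with θ exactly equal to −1.0 falls in the band [−1.0, 0.0), not in the band below it, which matches the bracket convention in the table note.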
Discussion
This study showed that PFA can be used as an alternative to DIF/DTF for evaluating scale comparability. The lz person-fit index was used to compare the nonaccommodated and accommodated forms of the MME science assessment. This study reported lz values aggregated at the group level to demonstrate scale comparability between the accommodated form and the nonaccommodated form in a statewide assessment. When empirically derived cutoff values from the real-parameter simulations were applied to the real-data distributions, the observed misfit rates were uniformly close to the expected rate of 5% in the lower tail. These results indicate that item response data from students statewide who took the nonaccommodated and accommodated forms were modeled well by the 3PLM and that there were no scale differences between the two forms. Furthermore, misfit rates across the two forms were generally similar even within specific student ability levels. Overall, the results provide evidence that the two forms are comparable in terms of test scores. Based on this PFA, there was no evidence of a gross violation of the measurement invariance assumption. That is, the meaning of the scores at any point along the underlying ability continuum, as measured by the two forms of the assessment, is comparable and equally valid.
Because high-stakes K-12 assessments should provide the same scale regardless of which form is administered, a comparability study is an essential procedure. In educational and social science fields, additional evidence about scale comparability from a high-stakes assessment in a specific subject area is often needed for various purposes (e.g., for program evaluation and/or federal peer review), and the evidence often involves comparisons between groups of unbalanced sample size (e.g., one subgroup is extremely small compared with another). Although DIF/DTF analysis has been considered the most popular method for investigating scale comparability, its use is limited by the requirement for large subgroup samples of approximately equal size. Furthermore, DIF/DTF cannot be used unless the two test forms share common items. PFA, however, can be used to investigate scale comparability between two different forms/groups regardless of the size of the subgroups and the number of common items. By presenting an empirical example from a large-scale state assessment, this study can help test practitioners apply PFA in real testing programs. In addition, this study suggests PFA as an alternative approach to providing empirical validity evidence for test score comparability when traditional methods (e.g., DIF/DTF, parametric analysis) are not appropriate because of limiting test conditions.
A major contribution of this study is to illustrate a systematic and practical approach, based on PFA, for determining whether two test forms are comparable in statewide assessments. Typically, person-fit has been used to detect misfitting responses at the individual level. This study, however, applied the lz person-fit index at the group level to investigate form type effects, which served as a new approach to scale comparability. This study suggests the following steps for evaluating scale comparability when DIF/DTF analyses are not possible: First, the two forms/tests should be put on the same scale by a linking method such as concurrent calibration, and then a person-fit statistic should be computed for each examinee under each form/test. Next, the misfit ratios of the two forms can be compared to decide whether the forms/tests need separate scoring procedures. If the misfit ratios differ substantially between the two forms/tests, a separate scoring procedure should be applied to each form/test before reporting scores; if they do not, a single scoring procedure can be applied to both forms/tests. In this study, because the person-misfit ratios of the two forms were not statistically different, students’ scores can be directly compared between the two forms.
Future studies should consider models that fit the data more closely before applying PFA to real data. Model-data fit should be examined before item-fit and person-fit are examined. If the data as a whole do not fit the model, the item- and person-fit indices might be invalid for their intended use. Specifically, PFA is legitimate for person measurement provided that the data fit the model well at the item measurement level. Before applying person-fit statistics, this study therefore examined model-data fit by using the DETECT program (Kim, Zhang, & Stout, 1995). The items were judged to support an essentially unidimensional test because the reference D value was close to 0.1 for the MME science data. Furthermore, S-χ2 statistics were computed to investigate item-fit (Orlando & Thissen, 2000). Although not every item in this study fit the 3PLM exactly, this does not mean that the 3PLM does not fit the data, because the S-χ2 tests of item-fit are extremely sensitive to sample size (this study used more than 8,000 students). Another plausible reason for the observed item misfit is the inherent multidimensionality of a few items in the assessments. Therefore, more comparability studies using PFA might be performed under conditions in which the data fit a given model well at the item measurement level (e.g., a multidimensional item response theory [IRT] model or a nonparametric IRT model). To detect a biased form/test within the same construct, continued research on DIF/DTF is warranted, and more PFA research with a variety of person-fit indices and measurement models is also suggested to add supplemental validity evidence.
Footnotes
Acknowledgements
The authors would like to thank Gayle Dejong and Remle Crowe for comments they provided that helped to improve this article. The authors also appreciate the feedback and comments from Journal of Psychoeducational Assessment (JPA) editor and two anonymous reviewers.
Authors’ Note
A portion of this work was completed while the authors worked at the Michigan Department of Education. The conclusions, discussion, and views contained in this article are not necessarily the official position of the National Registry of Emergency Medical Technicians (NREMT) or the Michigan Department of Education. During publication of this article, corresponding author’s affiliation was changed, so correspondence concerning this article should be addressed to Dong Gi Seo, Department of Psychology, Hallym University 1, Hallymdaehak-gil, Chuncheon-si, Gangwon-do, 200-702, Republic of Korea. Electronic mail may be sent via
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
