Abstract
This study introduces a novel differential item functioning (DIF) method based on propensity score matching that tackles two challenges in analyzing performance assessment data, that is, continuous task scores and lack of a reliable internal variable as a proxy for ability or aptitude. The proposed DIF method consists of two main stages. First, propensity score matching is used to eliminate preexisting group differences before the test, ideally creating equivalent groups as in a randomized experimental study. Then, linear mixed effects models are adopted to perform DIF analysis based on the matched data set. We demonstrate this propensity DIF method using a high-stakes functional English language proficiency test. DIF due to education was investigated in the writing component, which consists of two continuously scored performance-based tasks. Although the proposed method is demonstrated in the context of language testing, it can be applied to other types of performance assessments.
Performance assessments, also known as direct or authentic assessments, have a long history in education as tools to measure students’ achievement. They have continued to gain in popularity as the need to evaluate complex skills increases. Compared with multiple-choice items, they have been recommended for their potential to gauge higher-level skills and promote deeper learning. However, without sound psychometric and measurement properties, their advantages may be jeopardized, and they may even potentially compromise test score validity, especially when used in high-stakes situations.
The validity and fairness of scores are essential in developing and maintaining a test. Because they are used to evaluate test takers’ ability and proficiency, test scores can have significant implications for individuals, organizations, and society. Inferences and decisions, such as university admission, immigration qualification, and physician certification, are made based on these scores. Test scores must be valid to support such decision making. According to the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014), “validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests” (p. 11). Measurement may be invalidated by the presence of items that show different psychometric properties across groups of people from diverse social, cultural, educational, or linguistic backgrounds. Indeed, differential item functioning (DIF) may indicate the existence of bias and threaten the validity of test score interpretation.
DIF occurs when test takers from different groups have different opportunities to succeed on an item even when they have the same level of an underlying ability, trait, or proficiency. Although DIF does not always imply test bias, it raises concerns about the comparability of test scores. Conventional DIF methods, such as Mantel–Haenszel (Mantel, 1963; Mantel & Haenszel, 1959) or logistic regression (Swaminathan & Rogers, 1990; Zumbo, 1999), have been widely used but have two limitations in the context of performance assessments. First, these methods usually require a reliable variable as a proxy for ability, most commonly an internal variable such as the (corrected) total score. However, it is nearly impossible to obtain such an internal ability proxy variable because of the small number of tasks in one performance assessment (e.g., a one- or two-task writing test). Second, scores in performance assessments are often on a continuous or ranking scale, whereas many DIF methods were developed for dichotomous variables.
The combination of the increasing popularity of performance assessments and a dearth of DIF analytical methods tailored for these instruments calls for developing new DIF methods. The present study proposes a novel propensity score DIF method that can handle continuous task scores and that can work in the absence of a reliable internal ability-proxy variable. In this article, we demonstrate this DIF method in the context of language testing, but it also applies to performance assessments in other content areas.
We start with a review of the literature, then provide a demonstration using this propensity score DIF method, and end by discussing its pros and cons. In the first section, we describe logistic regression DIF methods, summarize the challenges of conducting DIF analysis in performance assessments, briefly review the DIF studies using propensity score matching, and outline our solution to the challenges of DIF analysis in performance assessments. In the second section, we illustrate the proposed propensity score DIF method in detail through a demonstration that draws on data from a performance-based writing test of functional English language proficiency. In the last section, we discuss the strengths and limitations of our method.
Literature Review
Logistic Regression Methods for DIF Investigation
Generally, DIF is examined by comparing item scores for two or more groups of test takers while controlling for their ability levels. Simply comparing group differences in mean scores on a task cannot address fairness concerns because the observed divergence may be due to true differences in the construct of interest rather than the effect of grouping. To rule out this possibility, researchers must control for individuals’ ability levels, which is often achieved through stratification (e.g., Mantel–Haenszel), matching, or covariance adjustment (e.g., logistic DIF methods). The literature describes two forms of DIF: uniform and nonuniform. Uniform DIF occurs when the group difference is consistent across ability levels. Nonuniform DIF, on the other hand, occurs when the group difference changes in magnitude or direction depending on ability.
Various methods have been proposed to identify items with DIF, such as the Mantel–Haenszel procedure (Mantel, 1963; Mantel & Haenszel, 1959), the simultaneous item bias test (SIBTEST; Bolt & Stout, 1996; Shealy & Stout, 1993), area methods (Raju, 1988, 1990), methods based on the differential functioning of items and tests (DFIT) framework (Raju, van der Linden, & Fleer, 1995), and logistic regression (Swaminathan & Rogers, 1990; Zumbo, 1999). Among these statistical tools, logistic regression and the noncompensatory DIF (NCDIF) index (Oshima & Morris, 2008; Oshima, Raju, & Nanda, 2006) in the DFIT framework are arguably the most flexible. They can capture both uniform and nonuniform DIF, and they can be extended to compare more than two groups simultaneously. Logistic regression allows multiple variables to be included to approximate test takers’ ability or to be used as covariates to adjust for between-group differences, whereas NCDIF accomplishes this by integrating over multiple thetas using multidimensional item response theory (IRT).
Conceptually, logistic regression DIF methods are a statistical procedure where group membership (G), a proxy for ability (A), and the interaction between the two (A × G) predict the probability of a correct response to an item. The proxy variable is often referred to as a “matching variable” in the DIF literature. In this article, however, we use the term ability proxy or proxy variable to distinguish it from the concept of matching as used in the context of propensity score methods.
Typically, a conventional logistic regression DIF analysis has three steps. It begins with a baseline model in which a variable approximating test takers’ ability is included to predict item scores (Model A). The second step builds on the baseline model by adding a predictor—the grouping variable (Model B). The last step adds the interaction of the ability proxy and grouping variables (Model C). Item scores are generally on a binary or ordinal scale and the corrected total score is often used as a proxy for ability. These three models can be expressed as follows.
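Model A. Baseline model:
$$\ln\left(\frac{P}{1 - P}\right) = \beta_0 + \beta_1 A$$
Model B. Uniform DIF model:
$$\ln\left(\frac{P}{1 - P}\right) = \beta_0 + \beta_1 A + \beta_2 G$$
Model C. Nonuniform DIF model:
$$\ln\left(\frac{P}{1 - P}\right) = \beta_0 + \beta_1 A + \beta_2 G + \beta_3 (A \times G)$$
In DIF detection for two groups on dichotomously scored items (correct = 1 and incorrect = 0), $P$ is the probability of a correct response, $A$ is the ability proxy, $G$ is the group membership indicator, and $A \times G$ is their interaction.
One way to examine the DIF effect is to evaluate the p value of the regression coefficients. A statistically significant grouping variable ($\beta_2$ in Model B) suggests uniform DIF, and a statistically significant interaction term ($\beta_3$ in Model C) suggests nonuniform DIF.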
Challenges of DIF Investigation in Performance Assessment
Performance assessments pose challenges in DIF investigation due to the lack of an internal variable for ability approximation and the continuous scale of task scores. Many existing DIF methods have been developed around dichotomously or polytomously scored variables, often with fewer than five scoring points. As a result, these DIF methods are inappropriate for many performance assessments.
When investigating DIF in performance assessments, researchers often have difficulty deriving an appropriate ability proxy (Broer, Lee, Rizavi, & Powers, 2005; Penfield & Lam, 2000; Zwick, Donoghue, & Grima, 1993). In objective tests with a large number of items, researchers can use the total scores or the corrected total scores of test takers to represent their ability levels. These proxies, derived from other items from the same measurement instrument, are considered internal proxy variables. In contrast, performance assessments often involve a small set of tasks (e.g., a one- or two-task writing test). Using overall task scores as an internal proxy variable is problematic in such cases because the ability estimated based on scores from two or three tasks is heavily influenced by including the score of the task under investigation. It is also problematic to use the corrected total score because it is solely based on test takers’ performance on the other one or two tasks, which only partially represents their true ability. Moreover, if the test has only one task, the corrected total score cannot be computed.
Alternatively, external variables have been used as ability proxies (Penfield & Lam, 2000; Zwick et al., 1993), either together with the scores of the performance assessment or without them. These variables are considered external proxies because they are derived from other measures than the one under investigation. The external proxy variables must be closely related to test takers’ scores on the performance assessment being studied.
Taking performance-based writing tests as an example, the DIF studies reported in the literature are summarized as follows. To investigate gender DIF in an assessment of writing skill that consists of 40 multiple-choice questions and two writing exercises, Welch and Miller (1995) compared three sets of proxy variables: performance on the multiple-choice questions alone (i.e., only based on external variables), performance on the multiple-choice questions and one writing task (i.e., excluding the task under investigation), and performance on the multiple-choice questions and both writing tasks. They highlighted the crucial role of selecting appropriate proxy variables and concluded that external variables alone may not be sufficient to approximate ability levels in DIF studies of performance-based tasks.
Based on Welch and Miller’s (1995) findings, Broer et al. (2005) examined DIF due to gender, ethnicity, and language background in the Graduate Record Examination (GRE) writing tasks, combining test takers’ performance on the verbal section and one of the writing tasks to create their ability proxy variables. Broer and colleagues used three methods, a generalized Mantel–Haenszel test, logistic regression, and the polytomous standardization (polySTAND) statistic, to accommodate the polytomously scored writing tasks. They found that the results were fairly consistent across the three DIF methods and recommended using logistic regression to study both uniform and nonuniform DIF.
Relatedly, Lee, Breland, and Muraki (2004) and Breland and Lee (2007) used an English language ability score as their proxy variable by summing up the standardized scores from three multiple-choice question sections, namely, reading, listening, and structure. Both studies investigated uniform and nonuniform DIF effects using logistic regression methods on TOEFL computer-based test (CBT) writing tasks. Rather than creating a new variable as an ability proxy, Chen, Lam, and Zumbo (2016) applied multiple regression using the external variables of reading and listening scores as covariates to examine DIF for the Canadian English Language Proficiency Index Program–General (CELPIP-General) writing tasks.
In brief, only a handful of studies have investigated DIF in performance-based writing tests, and even fewer have discussed the DIF methods for task/item scores on continuous scales. It is essential to create comparable groups in DIF analysis and to choose a statistical model that can handle continuous task scores, as the existing methods do not address these two issues effectively. Hence, this article proposes a new DIF approach to tackle the challenges arising from the particular nature of performance assessments. In the following subsection, we review the use of propensity score methods in DIF investigation and describe our new DIF approach through propensity score matching.
DIF Analysis Using the Propensity Score Approach
Propensity score methods (Rosenbaum & Rubin, 1983) are used to ensure comparability between groups when making causal inferences in a quasi-experimental or observational study. They help to reduce the preexisting group differences to approximate a randomized experimental study. After successful matching, the treatment and control groups follow similar multivariate distributions of the observed covariates. Propensity score matching is popular in medicine and economics (Austin, 2008), and has been introduced in education, psychology, and social science (e.g., Graham & Kurlaender, 2011; Thoemmes & Kim, 2011).
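To make the idea concrete, a propensity score can be read as the estimated probability of belonging to the focal group given the observed covariates. The following is a minimal sketch, assuming a plain logistic regression fitted by gradient ascent; the function names (`fit_logistic`, `propensity`), the two binary covariates, and the toy data are hypothetical and are not taken from the study.

```python
import math

def fit_logistic(X, g, lr=0.1, epochs=2000):
    """Fit P(g = 1 | x) by gradient ascent on the logistic log-likelihood."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)  # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, gi in zip(X, g):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = gi - 1.0 / (1.0 + math.exp(-z))  # observed minus predicted
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w

def propensity(w, x):
    """Estimated probability of focal-group membership for covariates x."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical covariates: [first_language_english, repeater]
X = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [0, 0], [0, 1], [1, 1]]
g = [1, 1, 0, 0, 1, 0, 1, 0]  # 1 = focal group, 0 = reference group

w = fit_logistic(X, g)
scores = [propensity(w, x) for x in X]
```

In practice a dedicated package would estimate these scores; the sketch only shows what quantity the matching stage operates on.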
A few researchers have explored the application of propensity score matching in DIF analysis for dichotomous items using either empirical data or simulation studies (e.g., Arikan, van de Vijver, & Yagmur, 2018; Bowen, 2011; Joldersma & Bowen, 2010; Lee & Geisinger, 2014; Liu et al., 2016). Their primary goals were to eliminate preexisting group differences before conducting DIF analysis and to approximate causal claims regarding the DIF effects. These DIF studies were performed in two stages: typically, two groups were matched on propensity scores that were estimated by a combination of many key covariates; then, DIF analysis was performed on the matched data set. These researchers employed propensity score matching, together with Mantel–Haenszel and binary logistic regression methods, for DIF investigation.
It is worth noting that the independent observations assumption, which is essential for the conventional DIF methods, does not hold when the data are matched via propensity scores. Most existing DIF studies based on propensity score matching ignored the cluster effect arising from the matched data. To account for it, Liu et al. (2016) recommended using conditional logistic regression models for DIF analysis of dichotomous items with matched data. Indeed, most of these studies have focused on dichotomous items, but have not discussed how to handle tasks that are scored on continuous scales.
Based on the literature, we propose a modeling approach that can handle continuous task scores and the lack of an internal ability proxy variable. Like prior propensity score DIF methods (e.g., Lee & Geisinger, 2014; Liu et al., 2016), our method consists of two stages: (1) matching data through propensity score methods and (2) running a DIF analysis using linear mixed effects models. In the first stage, we match two groups of test takers using propensity scores. We do this in three steps: selecting covariates that are related to test takers’ task performance and group membership, estimating the propensity score and matching data, and evaluating the quality of the matched data. In the second stage, using the matched data, we investigate DIF through linear mixed effects models (also known as hierarchical linear models or multilevel models) and conduct a sensitivity analysis to evaluate the robustness of the group difference (or treatment effect) to hidden bias due to unobserved covariates.
We choose linear mixed effects models to handle cluster effects and create a proxy variable based on the scores of other related measures to deal with the lack of an internal ability proxy variable. In the demonstration, we calculate a language proficiency proxy by summing up the standardized scores of all other available language measures, including the scores of the reading, listening, and speaking tests and the other writing task score that is not used for DIF analysis. Detailed information on this new DIF approach is provided in the demonstration that follows.
A Demonstration of the Two-Stage Modeling Strategy for DIF Investigation
Participants
In this demonstration, we investigate the DIF effect due to disparate educational backgrounds. Our example uses data from 1,450 adults who took a high-stakes general English writing test consisting of two tasks. Besides the writing tasks, the participants completed tests of listening, reading, and speaking, and a background questionnaire. Among the participants, 487 had education levels below undergraduate (including postsecondary studies shorter than 2 years; coded 1) and 963 were at or above the undergraduate level (coded 0). About 21% were female, while the rest were male. The test takers came from a wide range of language backgrounds. The five most represented language groups, English, Tagalog, Chinese, Korean, and Hindi, made up 52% of the sample.
Measures
CELPIP-General Writing Test
The CELPIP-General test is a measure of functional English language proficiency in reading, listening, speaking, and writing. It is intended to evaluate a person’s language ability to work and live in societies where English is the main mode of communication. Each writing test consists of two tasks, one requiring writing an e-mail and the other a response to an open question. The scores of each writing task are calculated based on ratings from multiple raters (i.e., two or three raters per task) on four dimensions (i.e., content/coherence, vocabulary, readability, and task fulfillment), resulting in a continuous scale ranging from 0 to 12. For this demonstration, we used a task that requires test takers to write a 150- to 200-word e-mail to a restaurant. For each test taker, the task score was calculated by averaging the two ratings provided by independent raters. In terms of quality of the ratings, correlation between the ratings was 0.90, and the absolute differences between the two ratings were no greater than 1.5 points (out of a maximum of 12 points) in 91% of the cases.
Background Survey
Test takers answered a survey when they registered for the test. We selected some demographic and background questions from the survey for this demonstration because we hypothesized that these variables were relevant to the grouping and outcome variables we chose. Matching based on these covariates could potentially improve the comparability of the groups. These covariates are described as follows.
Employment status
Based on the responses to two of the survey questions, work roles and job sectors, we created five binary variables indicating participants’ current employment status. These were being a student, working in construction or factories, working in stores or restaurants, working in an office, and being unemployed.
Daily use of English
We generated 15 binary variables based on participants’ responses to four survey questions that asked them to select the activities in which they used English more than three times a week. These activities included speaking (e.g., talking to friends, coworkers, or family in English), listening (e.g., watching English TV and videos), reading (e.g., reading English books, reports, or news), and writing (e.g., writing e-mails, reports, assignments, or business correspondence in English).
Language background
Three categorical variables were included to represent participants’ language background: first language (English = 1 and non-English = 0), years of learning English (less than 2 years = 1, 3 to 5 years = 2, 6 to 10 years = 3, and over 10 years = 4), and years living in English-speaking countries (less than 1 year = 1, 1 to 2 years = 2, 3 to 5 years = 3, 6 to 10 years = 4, and over 10 years = 5).
Test-taking experience
We also included the survey question on whether a participant has taken this test before (repeater = 1 and first-time test taker = 0) as one of the covariates.
Data Analysis
We investigated the DIF due to different education levels (below undergraduate = 1; undergraduate or above = 0) using the propensity score approach in two stages. The purpose of the first stage was to match groups or subpopulations using propensity scores; the goal of the second was to conduct DIF analysis using matched data. Via this approach, we demonstrate how to deal with the two challenges in DIF investigation for performance assessments (i.e., lack of an internal proxy variable and the continuous nature of the task score). We used the software program R 3.6.0 for all the analyses. We employed the package MatchIt (Ho, Imai, King, & Stuart, 2011) for propensity score matching, lme4 (Bates, Mächler, Bolker, & Walker, 2015) for linear mixed effects regression DIF analyses, sjstats (Lüdecke, 2019) for R² approximations, and sensitivityfull (Rosenbaum, 2017) for sensitivity analysis.
Stage 1: Matching Through Propensity Score Methods
We started by selecting the covariates, then matched the data via two propensity score methods, and then finalized the matched data by evaluating the quality of the matching.
Step 1.1: Selecting covariates
Determining what to include in the propensity score model is crucial, as different sets of covariates may affect the estimation of propensity scores and thus form different matched groups (Gibson-Davis & Foster, 2006). In practice, the covariate selection is constrained by what is available in the data. In the context of educational testing, covariates are often collected from assessments of other (related) skills that are part of the testing battery as well as survey questions administered before or after the test.
A lack of consensus exists on the criteria for covariate selection in propensity score matching. Some researchers have argued for including all available covariates (e.g., Cuong, 2013). Others have recommended choosing relevant covariates based on data-driven approaches (e.g., Zigler & Dominici, 2014), expert judgments, the research literature, or theories (e.g., Graham & Kurlaender, 2011). Some studies found that overparameterization can bias the parameter estimate of the grouping variable in the final analysis (e.g., Zhao, 2008). A recent simulation study advised that researchers should include covariates related to both grouping and outcome variables when using propensity scores in DIF analysis (Liu et al., 2019). Following this recommendation, we selected covariates related to both grouping and outcome variables.
Step 1.2: Estimating propensity score and matching
Although various techniques exist for matching data based on propensity scores (see Guo & Fraser, 2014; Pan & Bai, 2015; Rosenbaum & Rubin, 1983), we adopted optimal matching methods because they are highly recommended (Rosenbaum, 1989). Two of the most commonly used, optimal pair matching and optimal full matching, are demonstrated in this study. In pair matching, each treated unit (a score from the focal group) is matched with a control unit (a score from the reference group) whose propensity score is close to that of the treated unit. The algorithm minimizes the total propensity score distance across all pairs, iteratively adjusting the pair-level assignments to do so (Rosenbaum, 1991). Alternatively, optimal full matching (Rosenbaum, 1991) allows one treated unit to be matched with multiple control units (i.e., one to many) or multiple treated units to be matched with a single control unit (i.e., many to one).
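The difference between optimal and nearest-neighbor (greedy) pair matching can be seen in a toy example. The sketch below is illustrative only: it finds the optimal pairing by brute-force enumeration, which is feasible only for very small samples and is not the algorithm used by matching software; the propensity scores and function names are hypothetical.

```python
from itertools import permutations

def optimal_pair_match(treated, control):
    """Pairing that minimizes the summed absolute propensity score distance."""
    best_pairs, best_total = None, float("inf")
    # Try every assignment of distinct controls to the treated units.
    for chosen in permutations(range(len(control)), len(treated)):
        total = sum(abs(treated[i] - control[j]) for i, j in enumerate(chosen))
        if total < best_total:
            best_pairs, best_total = list(enumerate(chosen)), total
    return best_pairs, best_total

def greedy_pair_match(treated, control):
    """Each treated unit, in order, takes its closest remaining control."""
    available = set(range(len(control)))
    pairs, total = [], 0.0
    for i, t in enumerate(treated):
        j = min(available, key=lambda c: abs(t - control[c]))
        available.remove(j)
        pairs.append((i, j))
        total += abs(t - control[j])
    return pairs, total

# Hypothetical propensity scores
treated = [0.50, 0.60]
control = [0.58, 0.30, 0.10]

opt_pairs, opt_total = optimal_pair_match(treated, control)
greedy_pairs, greedy_total = greedy_pair_match(treated, control)
```

Here the greedy approach lets the first treated unit take the control at 0.58, which forces a poor match for the second treated unit, whereas the optimal solution accepts a slightly worse first pair to reduce the total distance. The control at 0.10 stays unmatched in both cases.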
Many methods besides optimal matching have been shown to be effective in balancing covariates between groups (e.g., Austin, 2014). Because one method may outperform another under certain conditions, we encourage researchers to compare different matching methods on their data and select those that produce more adequately balanced data sets (e.g., Gu & Rosenbaum, 1993).
Step 1.3: Evaluating matching results
Once the matching is completed, researchers can assess its quality by examining the balance of the overall propensity score distributions and the balance of individual covariates. In this demonstration, propensity score distributions are evaluated visually using histograms and jitter plots (Schuler, 2015; Stuart & Rubin, 2008). The balance of covariates is examined using the percentage of bias reduction (PBR). PBR is calculated for each variable by comparing the between-group differences before and after matching (Bai, 2015; Cochran & Rubin, 1973). A positive value suggests an improvement, that is, the between-group difference becomes smaller in the matched data set than it is in the original.
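As an illustration of the balance check, the sketch below computes PBR under the assumption that it is defined as the relative reduction in the absolute between-group mean difference, expressed as a percentage. The function name and the input values are hypothetical.

```python
def pbr(diff_before, diff_after):
    """Percentage of bias reduction from the mean differences before and after matching."""
    return (abs(diff_before) - abs(diff_after)) / abs(diff_before) * 100.0

# Hypothetical mean differences (focal minus reference group)
well_balanced = pbr(0.20, 0.02)    # difference shrinks: large positive PBR
tiny_baseline = pbr(0.001, 0.02)   # near-zero starting difference: large negative PBR
```

The second call shows why PBR should be read alongside the raw mean differences: when the groups barely differed before matching, even a trivial post-matching difference yields an extreme negative value.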
Stage 2: Conducting DIF Analyses Via Linear Mixed Effects Models
We used a five-step procedure to investigate DIF using matched data. We employed linear mixed effects models for the DIF analysis to handle the continuous task scores as well as the cluster effect created by the matched data. We describe our solution to the lack of an internal ability proxy variable in Step 2.2. In the last step, we conduct a sensitivity analysis to examine the robustness of the findings to potentially hidden bias.
Following the tradition of logistic regression DIF, a variance-accounted-for measure of effect size is reported with the statistical tests of DIF (Zumbo, 1999, 2008). In the current setting, in which linear mixed effects models are used, however, there is no consensus on effect size calculation and reporting practice. Following the recommendations by Nakagawa and Schielzeth (2013), we report the marginal and conditional R² values as indicators of DIF effect size. Unlike R² for conventional linear models, the marginal R² describes the proportion of variance explained by the fixed effects, and the conditional R² concerns the proportion of variance explained by both fixed and random effects (Nakagawa & Schielzeth, 2013). While we report both types of R² approximations for each of the four mixed effects models, the marginal R² values for Models 2 to 4 are of particular interest because they provide an indicator of the magnitude of the DIF effect.
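The two R² values can be sketched from the variance components of a random-intercept model. The function below follows the definitions in Nakagawa and Schielzeth (2013) for a Gaussian model with a single random intercept; the function name and the variance values are hypothetical.

```python
def r2_mixed(var_fixed, var_random, var_resid):
    """Return (marginal R2, conditional R2) from variance components.

    var_fixed  : variance of the fixed-effects predictions
    var_random : random-intercept (between-cluster) variance
    var_resid  : residual (within-cluster) variance
    """
    total = var_fixed + var_random + var_resid
    marginal = var_fixed / total
    conditional = (var_fixed + var_random) / total
    return marginal, conditional

# Hypothetical variance components
marginal_r2, conditional_r2 = r2_mixed(2.0, 1.0, 1.0)
```

Under this reading, the change in marginal R² from one model to the next reflects how much additional variance the newly added fixed effect accounts for.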
Step 2.1: Establishing a null model (Model 1)
We started with a model without any predictors. To determine whether the cluster effects arising from matched data should be accounted for in DIF analysis, we calculated the intraclass correlation coefficient (ICC).
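For reference, a null model of this kind and the resulting ICC can be written as follows. This is a sketch that assumes a random intercept for each matched set; the symbols ($y_{ij}$ for the task score of test taker $i$ in matched set $j$, $\gamma_{00}$, $u_j$, $e_{ij}$, $\tau^2$, $\sigma^2$) are our notation and are not taken from the original analysis.

```latex
% Null model with a random intercept for matched set j
y_{ij} = \gamma_{00} + u_j + e_{ij}, \qquad
u_j \sim N(0, \tau^2), \quad e_{ij} \sim N(0, \sigma^2)

% ICC: share of the total variance that lies between matched sets
\mathrm{ICC} = \frac{\tau^2}{\tau^2 + \sigma^2}
```

An ICC clearly above zero indicates that scores within a matched set resemble one another, which supports modeling the cluster structure rather than treating observations as independent.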
Step 2.2: Running a baseline model with ability approximation variable(s) (Model 2)
Similar to the conventional logistic regression DIF analysis, the baseline model includes one predictor, that is, the ability proxy variable. As discussed in the literature review, total scores or corrected total scores are not always feasible in performance assessments. Instead, we created an approximation variable for an individual’s English proficiency, which is computed from all other components of the language test. We first standardized test takers’ scores on the other writing task (recall that the test comprises two writing tasks), as well as their scores on the listening, reading, and speaking tests. Then we created an English proficiency variable by summing up the standardized scores for each individual. Before entering the English proficiency variable in the models, we centered it using the cluster means (i.e., within-cluster centering) to facilitate interpretation and avoid potential collinearity issues when examining the interaction effect (Enders & Tofighi, 2007; Raudenbush & Bryk, 2002).
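The construction of the proxy can be sketched in three small steps: standardize each component, sum the standardized scores, and center the sum within matched sets. The example below is a minimal sketch with hypothetical scores and cluster labels; the use of the population standard deviation for standardization is an assumption, as the text does not specify it.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize a list of scores (population SD assumed)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def proficiency_proxy(*score_columns):
    """Sum the standardized scores across components for each test taker."""
    standardized = [zscores(col) for col in score_columns]
    return [sum(zs) for zs in zip(*standardized)]

def center_within_cluster(values, clusters):
    """Subtract each cluster's mean from the values in that cluster."""
    grouped = {}
    for v, c in zip(values, clusters):
        grouped.setdefault(c, []).append(v)
    cluster_means = {c: mean(vs) for c, vs in grouped.items()}
    return [v - cluster_means[c] for v, c in zip(values, clusters)]

# Hypothetical scores for six test takers in two matched sets
listening     = [8, 9, 7, 10, 6, 9]
reading       = [7, 9, 8, 10, 5, 8]
speaking      = [9, 8, 7, 11, 6, 9]
other_writing = [8, 9, 6, 10, 6, 8]
matched_set   = [1, 1, 1, 2, 2, 2]

proxy = proficiency_proxy(listening, reading, speaking, other_writing)
proxy_centered = center_within_cluster(proxy, matched_set)
```

After centering, the proxy has a mean of zero within every matched set, so the group coefficient in later models is interpreted at the average proficiency of each set.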
Step 2.3: Detecting uniform DIF (Model 3)
Based on the model specified in Step 2.2, we then added the grouping variable. We tested the uniform DIF effect via a model comparison approach, using a likelihood ratio test to compare Models 2 and 3.
Step 2.4: Detecting nonuniform DIF (Model 4)
To investigate the nonuniform DIF effect, we added the interaction term between the grouping and ability proxy variables to the model used in Step 2.3. As in the last step, the nonuniform DIF effect is tested by comparing Models 3 and 4 through a likelihood ratio test.
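Because each step adds a single fixed effect, both model comparisons are likelihood ratio tests with one degree of freedom. The sketch below shows the arithmetic, assuming the two log-likelihoods come from maximum likelihood fits of nested models; the function name and the log-likelihood values are hypothetical.

```python
import math

def lrt_1df(loglik_reduced, loglik_full):
    """Likelihood ratio test for nested models that differ by one parameter.

    Returns the chi-square statistic and its p value (1 degree of freedom).
    """
    stat = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    # For 1 df, P(chi-square > stat) equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical log-likelihoods for Model 2 (reduced) and Model 3 (full)
stat, p_value = lrt_1df(-2510.4, -2505.1)
```

A small p value indicates that the added term (the grouping variable for uniform DIF, or the interaction for nonuniform DIF) improves model fit beyond what the reduced model achieves.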
Step 2.5: Conducting a sensitivity analysis
To assess the robustness of the significant findings to hidden bias, we conducted a sensitivity analysis using the method proposed by Rosenbaum (2007, 2017). Based on Huber’s m-statistics for continuous variables, Rosenbaum’s sensitivity analysis uses the sensitivity parameter, Gamma (Γ), to measure the degree of departure from the random assignment mechanism. A study is highly sensitive to hidden bias if the conclusion changes for a Γ value only slightly larger than 1, whereas it is relatively robust if the conclusion changes only at a large Γ value. In a sensitivity analysis, a range of Γ values is usually examined starting from 1.0 (no hidden bias). In this demonstration, we raised the value of Γ from 1.0 in increments of 0.1 until the conclusion was altered, that is, until the p value of the test exceeded .05. See Rosenbaum (1995, 2007, 2013, 2014) for more details of sensitivity analysis.
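The search over Γ can be illustrated with a deliberately simpler test than the one used in this study. The sketch below substitutes a sign test on matched pairs for the m-statistic approach: under a given Γ, the chance that the focal-group member of a pair has the higher score is at most Γ/(1 + Γ), which yields an upper bound on the one-sided p value. The function names and the counts are hypothetical.

```python
from math import comb

def sign_test_upper_p(n_pairs, n_positive, gamma):
    """Upper bound on the one-sided sign test p value under hidden bias gamma."""
    p_plus = gamma / (1.0 + gamma)  # worst-case probability of a positive pair
    return sum(
        comb(n_pairs, k) * p_plus**k * (1.0 - p_plus) ** (n_pairs - k)
        for k in range(n_positive, n_pairs + 1)
    )

def critical_gamma(n_pairs, n_positive, alpha=0.05, max_steps=100):
    """Smallest gamma on a 0.1 grid, starting at 1.0, where the bound exceeds alpha."""
    for step in range(max_steps + 1):
        gamma = 1.0 + step / 10.0
        if sign_test_upper_p(n_pairs, n_positive, gamma) > alpha:
            return gamma
    return None  # conclusion unchanged over the whole grid

# Hypothetical result: the focal-group member scores higher in 70 of 100 pairs
gamma_star = critical_gamma(100, 70)
```

The returned value is the first grid point at which the finding is no longer significant; the larger it is, the more hidden bias would be required to overturn the conclusion.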
Results
Stage 1: Matching Through Propensity Score Methods
As described in the data analysis, propensity score matching was carried out in three steps, and two methods (optimal pair and full matching) were included for this demonstration.
Step 1.1: Selecting Covariates
Covariates were selected based on their theoretical relevance to writing proficiency and their response distributions. We excluded a covariate if it had extremely unbalanced response distributions (i.e., fewer than 5% of the test takers endorsed a response category) so that we could avoid issues of sparse data in the matching.
Step 1.2: Estimating Propensity Score and Matching
Two propensity score matching methods, optimal pair matching and optimal full matching, were performed. The R code is provided in the Supplemental Appendix (available online). In this demonstration, when optimal full matching was used, we allowed both one-to-many and many-to-one matched units but restricted the maximum ratio between the two groups to eight. If no maximum ratio is defined, the ratio has no upper limit. To optimize the estimation of propensity scores and make the algorithm run faster, we followed the recommendation by Hansen and Klopfer (2006) and Liu et al. (2016) to add an upper limit on the ratio of matched units. We chose this limit by examining the sample characteristics (e.g., sample sizes) and comparing the balance results of different ratios of treated units and controls.
Step 1.3: Evaluating Matching Results
Figure 1 shows the distributions of the estimated propensity scores of the two groups before and after matching, which are visualized by histograms (the middle column) and jitter plots (the last column). The upper panel shows the results of optimal pair matching, while the bottom panel shows the results of optimal full matching.

Figure 1. Propensity score distributions before and after optimal matching.
As the histograms indicate, the distributions of the two groups diverged before matching; the higher education group (labeled as “Raw Control” in the histograms) tended to have lower propensity scores. After matching, the discrepancy decreased. Although a noticeable disparity remained after optimal pair matching (histograms in the upper panel), suggesting that the covariate balance was unsatisfactory, the two groups had better matched distributions after optimal full matching (histograms in the bottom panel).
A circle in a jitter plot represents one estimated propensity score. For pair matching, the jitter plot shows that some control units (476 of the 963 test takers in the higher education group) were left unmatched and removed; these appear as “unmatched control units” (top right of Figure 1). For full matching, all the data points were used. The jitter plot shows that the matched control units (i.e., the higher education group) tended to have lower propensity scores.
Table 1 presents the percentage of bias reduction (PBR), which is computed from the ratio of the mean differences between the two groups after (numerator) and before (denominator) matching, for both matching methods. The results suggest that most of the covariates show a large bias reduction (>80%) under optimal full matching, whereas pair matching produces a less desirable bias reduction. Some PBR values are negative, indicating an increase in bias between the two groups after matching. However, the increase in bias for these covariates was fairly small. For example, the covariate Talk_Coworker has a large negative PBR value (−1927.4) even though the mean difference after matching is only 0.02. This is because the mean difference before matching, which is the denominator in the PBR calculation, is close to zero.
Table 1. Percentage of Bias Reduction (PBR) Using Optimal Pair Matching.
Note. For means and mean differences, the numbers are rounded to the nearest 100th; for PBR, the numbers are rounded to the nearest 10th.
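The sensitivity of PBR to a near-zero denominator can be shown with a short sketch. It assumes the common definition PBR = 100 × (1 − |difference after| / |difference before|), which may differ in detail from the computation used for Table 1, and the mean differences below are hypothetical rather than values from the table.

```python
def pbr(diff_before, diff_after):
    """Percentage of bias reduction, assuming PBR = 100 * (1 - |after| / |before|)."""
    if diff_before == 0:
        raise ZeroDivisionError("PBR is undefined when the pre-matching difference is zero")
    return 100.0 * (1.0 - abs(diff_after) / abs(diff_before))

# Hypothetical mean differences between the two groups (not taken from Table 1).
large_initial_gap = pbr(diff_before=0.50, diff_after=0.05)    # sizeable imbalance, mostly removed
tiny_initial_gap = pbr(diff_before=0.001, diff_after=0.02)    # groups nearly identical before matching

print(round(large_initial_gap, 1))
print(round(tiny_initial_gap, 1))
```

In the second case the post-matching difference is small in absolute terms, yet the PBR is a large negative number because it is divided by a pre-matching difference close to zero. For such covariates, the mean difference after matching is the more informative quantity.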
Stage 2: Conducting DIF Analyses Using Linear Mixed Effects Models
As shown by the above results, the covariates achieved a better balance between groups with optimal full matching than with optimal pair matching. Hence, we performed the following analyses on the matched data set created via full matching. The results are presented in Table 2, following the sequence described in the data analysis.
Table 2. Results for the Four Linear Mixed Effects Models for DIF Investigation.
Note. DIF = differential item functioning; AIC = Akaike information criterion; BIC = Bayesian information criterion; df = degrees of freedom.
Step 2.1: Establishing a Null Model (Model 1)
The null model estimated the variance components both within and between clusters. The ICC was 0.25, indicating that 25% of the variance in the writing task scores was attributable to the matched clusters. This result provides a rationale for using a mixed effects model to account for the nested nature of the matched data set.
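As a brief illustration of how the ICC follows from the two variance components of a null model, the sketch below uses hypothetical variance estimates chosen only to reproduce an ICC of 0.25; the actual estimates are not reported in this section.

```python
def icc(var_between, var_within):
    """Intraclass correlation: the share of total variance that lies between clusters."""
    total = var_between + var_within
    if total <= 0:
        raise ValueError("variance components must sum to a positive value")
    return var_between / total

# Hypothetical variance components (between matched clusters, within clusters).
example_icc = icc(var_between=1.0, var_within=3.0)
print(example_icc)
```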
Step 2.2: Running a Baseline Model With Ability Approximation Variable(s) (Model 2)
In Model 2, we added an ability proxy variable as a predictor. The positive coefficient for this variable (p < .001) suggests that a higher level of English proficiency is associated with a higher score on this writing task, which is expected. As Table 2 shows, with the inclusion of this ability proxy variable, the values of the Akaike information criterion (AIC), Bayesian information criterion (BIC), and deviance drop, suggesting an improved fit from Model 1 to Model 2.
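The fit indices used for model comparison can be computed from the log-likelihoods of two nested models, as in the following minimal sketch. The log-likelihoods, parameter counts, and sample size are hypothetical, and the helper names are our own. The sketch assumes that the two models differ by one parameter and were fitted by maximum likelihood, which is the usual requirement when models with different fixed effects are compared.

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion."""
    return n_params * math.log(n_obs) - 2 * log_lik

def lrt_1df(log_lik_reduced, log_lik_full):
    """Likelihood ratio test for nested models that differ by one parameter."""
    stat = 2 * (log_lik_full - log_lik_reduced)  # equals the drop in deviance
    # For a chi-square variable with 1 df, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2))
    return stat, p_value

# Hypothetical fits: a null model (3 parameters) and a model adding one predictor.
n_obs = 1000
ll_null, k_null = -1520.0, 3
ll_base, k_base = -1480.0, 4

print("AIC:", aic(ll_null, k_null), "->", aic(ll_base, k_base))
print("BIC:", bic(ll_null, k_null, n_obs), "->", bic(ll_base, k_base, n_obs))
stat, p_value = lrt_1df(ll_null, ll_base)
print("LRT statistic:", stat, "p value:", p_value)
```

Lower AIC, BIC, and deviance values indicate better fit, so a drop in all three from the reduced model to the fuller model points in the same direction as a significant likelihood ratio test.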
Step 2.3: Detecting Uniform DIF (Model 3)
The bottom row of Table 2 presents the results from the likelihood ratio tests, which suggest a significant uniform DIF effect (
Step 2.4: Detecting Nonuniform DIF (Model 4)
The model comparison results (Table 2) reveal a statistically significant nonuniform DIF effect (

Nonuniform differential item functioning (DIF) effect of education.
Step 2.5: Conducting a Sensitivity Analysis
In this example, the alternative hypothesis of this sensitivity analysis test was that the lower education group would perform worse than the reference group (i.e., higher education group) on the writing task. Hence, we chose a one-tailed test with a significance level of 0.05 in the lower tail of the distribution. Table 3 presents the results for the sensitivity parameter, Γ, from 1.0 to 1.5 in increments of 0.1. The sensitivity analysis yields a significant p value when Γ is 1.0. This is consistent with the DIF results, which suggest a significant group difference between the less and more educated test takers. This group difference becomes nonsignificant between Γ = 1.4 and Γ = 1.5. This shows that the conclusion about the DIF effect is easily altered in the presence of unobserved covariates that are related to the random assignment mechanism. This finding may also imply that the magnitude or effect size of the DIF for this writing task is small.
Table 3. Results for Sensitivity Analysis: Upper Bounds on the One-Sided Significance Testing.
Note. The alternative hypothesis is that the lower education group (treated or focal group), compared with higher education (control or reference group), has a lower score on the writing task.
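The logic of the upper bounds can be sketched with a simplified stand-in: a sensitivity analysis for the sign test on matched pairs. This is not necessarily the test statistic behind Table 3, and it does not handle the matched sets of unequal size produced by full matching. Under a hidden bias of at most Γ, the probability that the focal-group member of a pair has the lower score under the null hypothesis is at most Γ / (1 + Γ), which yields an upper bound on the one-sided p value. The counts below are hypothetical.

```python
from math import comb

def binom_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

def sign_test_upper_bound(n_pairs, n_focal_lower, gamma):
    """Upper bound on the one-sided sign-test p value under hidden bias gamma >= 1.

    n_pairs: number of matched pairs with untied scores.
    n_focal_lower: number of pairs in which the focal-group member scored lower.
    """
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    p_plus = gamma / (1 + gamma)  # worst-case probability under the null hypothesis
    return binom_upper_tail(n_pairs, n_focal_lower, p_plus)

# Hypothetical counts, not taken from the study.
n_pairs, n_focal_lower = 200, 118
for gamma in (1.0, 1.1, 1.2, 1.3, 1.4, 1.5):
    bound = sign_test_upper_bound(n_pairs, n_focal_lower, gamma)
    print(f"Gamma = {gamma:.1f}: upper-bound p = {bound:.4f}")
```

At Γ = 1.0 the bound equals the ordinary sign-test p value, and it grows as Γ increases. The value of Γ at which the bound first exceeds the significance level indicates how much hidden bias would be needed to overturn the conclusion.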
Discussion
The growing use of performance assessments must be supported by validity evidence, such as evaluating DIF. As Zwick et al. (1993) acknowledge, performance assessments are not free from bias. Like other types of items, they may tap construct-irrelevant factors. Psychometric procedures that evaluate test score reliability and validity, including DIF, are therefore needed to appraise performance assessments.
DIF analysis procedures are commonly used to rule out fairness concerns and ensure that the score interpretation is valid for test takers from different groups. Previous DIF studies on performance-based writing tests have consistently flagged a relatively large number of prompts as favoring female test takers (e.g., Breland & Lee, 2007; Welch & Miller, 1995). However, DIF effects due to other factors have not received much attention. Examples include test-taking experience (e.g., repeaters vs. first-time participants), processes (e.g., answering items sequentially vs. not following the preset sequence), and other variables (e.g., raters and rating rubrics) that may contribute to construct-irrelevant variance.
In our demonstration, we investigated the DIF effect due to educational background on a writing task targeting functional English proficiency. Although education, literacy, and language proficiency are related, for a test of general functional language proficiency, we would not expect different education levels (below vs. above an undergraduate degree) alone to significantly change test takers’ performance on a task. We encourage researchers to look beyond conventional gender and demographic groups in their DIF investigations.
Propensity score matching methods have been shown to help balance covariates and achieve comparable groups, making it possible to draw causal inferences about the DIF effect (e.g., Arikan et al., 2018; Lee & Geisinger, 2014; Liu et al., 2016). In the present study, we applied propensity score matching methods in a context where an internal ability proxy variable is not obtainable. Without an internal variable for ability approximation, it is often difficult to adequately adjust for the ability distributions on the construct, which, in turn, challenges the estimation and interpretation of DIF effects. In such cases, propensity score matching methods can reduce the pretest differences between groups, and thus, enhance their comparability in DIF analysis.
Once the two groups are successfully matched through propensity score matching, the initial distributional differences on the covariates are amended. Researchers may then employ conventional methods to perform the DIF investigation. However, the assumption of observation independence may be violated due to the nested data structure (or cluster effect) created by matching. As has been well documented in the literature on mixed effects and hierarchical models, ignoring data dependence issues may increase Type I error due to underestimated standard errors. In our demonstration, we used linear mixed effects regression models to account for the nested nature of the matched data set and the continuous writing task scores. In practice, researchers should examine whether the clustering effect exists and choose a statistical method accordingly.
As with any statistical method, the proposed method based on propensity score matching has its limitations. Propensity score matching methods may be subject to hidden bias and fail to achieve comparable groups if some important covariates are not included or not measured. Since it is impossible to measure every covariate and include it in the analysis, researchers always run the risk of having hidden bias influence their results. To minimize the negative effect of hidden bias, researchers should carefully design their studies to collect information on the key covariates that are suggested by the literature, theories, and experience.
This article introduces a new DIF method that handles two challenges arising from analyzing performance assessment data, that is, the lack of a reliable internal proxy for ability or aptitude and the continuous scale of task scores. While demonstrated on a language test, our proposed two-stage modeling approach via propensity score matching is a practical and flexible tool for investigating DIF in other similar assessments.
Two examples of the flexibility of the propensity score DIF method are noteworthy. First, although our demonstration focused on DIF due to test taker education level, we did not consider possible differences among raters because the rating design for the demonstration data treats raters as exchangeable, owing to rater training, certification, and monitoring; randomization of raters; and a large rater pool. However, with an appropriate rating design, one could apply the propensity score DIF method to study variation among the human raters involved in the rating process. Second, the propensity score DIF method could be applied to short scales that are often used in psychological research (e.g., the five-item Satisfaction with Life Scale; Diener, Emmons, Larsen, & Griffin, 1985) or to single-item measures that are found in questionnaires and surveys (Zumbo, 2008). Although DIF methods were developed in the context of multi-item measures, single-item measures are widely used in health psychology and social surveys (Bowling, 2005; Macias, Gold, Öngür, Cohen, & Panch, 2015), wherein measurement validity is still a concern (Zumbo & Padilla, 2020). Single-item measures share an important feature with the writing performance assessment: by design, there is no internal trait proxy variable for DIF analyses. For example, one could adapt the propensity score DIF method to study DIF across language groups for a single-item measure of quality of life reported on a visual analogue scale.
Supplemental Material
Supplemental material, Propensity_DIF_writing_Appendix_0822 for A Propensity Score Method for Investigating Differential Item Functioning in Performance Assessment by Michelle Y. Chen, Yan Liu and Bruno D. Zumbo in Educational and Psychological Measurement
Footnotes
Acknowledgements
The second and third authors acknowledge the support of The University of British Columbia–Paragon Research Agreement.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
