Abstract
This investigation examines the efficacy of multilevel analysis of individual heterogeneity and discriminatory accuracy (MAIHDA) over fixed-effects models when performing intersectional studies. The research questions are as follows: (1) What are typical strata representation rates and outcomes on physics research-based assessments? (2) To what extent do MAIHDA models create more accurate predicted strata outcomes than fixed-effects models? and (3) To what extent do MAIHDA models allow the modeling of smaller strata sample sizes? We simulated 3,000 data sets based on real-world data from 5,955 students on the LASSO platform. We found that MAIHDA created more accurate and precise predictions than fixed-effects models. We also found that using MAIHDA could allow researchers to disaggregate their data further, creating smaller group sample sizes while maintaining more accurate findings than fixed-effects models. We recommend using MAIHDA over fixed-effects models for intersectional investigations.
Originating in the 1980s, the concept of “intersectionality” highlights “the vexed dynamics of difference and the solidarities of sameness in the context of antidiscrimination and social movement politics” (Cho, Crenshaw, and McCall 2013:787). Following Crenshaw’s (1990) introduction of the term to understand the experiences of Black women, researchers have expanded applications of intersectionality to include diverse identities and disciplines (Carbado et al. 2013; Collins 2019; Collins and Bilge 2020; Crenshaw 1989; Harris and Patton 2019; Mena and Bolte 2019; Nash 2008; Teffera et al. 2018). Collins (2015) advocates for developing intersectionality as a critical social theory to understand and change the existing social order. To use intersectionality in this way, Collins details four principles of power and social structures: (1) Social strata (e.g., race, class, and gender) act as markers of power that are interdependent and mutually constructed; (2) power relations across the intersections of these power markers create complex, interdependent inequalities; (3) individuals’ and groups’ locations within these intersecting power relations shape their experiences and perspectives; and (4) solving social problems requires intersectional analyses specific to the context of the social problem.
Choo and Ferree's (2010) identification of group-, process-, and systems-centered approaches to studying intersectionality highlights the breadth of its possibilities. Group-centered research centers the voices of multiple marginalized groups, aligning with McCall’s (2005) anti-categorical or intracategorical paradigms. Anti-categorical complexity challenges the use of categories that may reinforce inequalities, whereas intracategorical complexity examines specific intersectional strata to understand lived experiences better. Process-centered research focuses on power dynamics and the intersecting oppressions or privileges, often corresponding with McCall's intracategorical and intercategorical paradigms. Intercategorical complexity involves analyzing existing strata to explore inequalities and their evolution. System-centered research examines the interactions among multiple forms of oppression, moving beyond a single-system focus to encompass complexities like racism, sexism, and classism (Willis 1981). This approach typically aligns with intercategorical complexity, facilitating comparisons across social strata and exploring various social locations, such as STEM fields or teaching methods.
Given the breadth of approaches to studying intersectionality, Cho et al. (2013) highlight the necessity for a dedicated Intersectionality Studies field, and Collins (2019) envisions a vibrant intersectional community that combines grassroots and top-down theories, embracing diverse perspectives, methodologies, and disciplines to overcome theoretical limitations. Both advocate for intersectionality as a vital practice aimed at challenging and reforming unjust systems. Quantitative intersectional research can contribute to these goals by creating pathways for marginalized groups in STEM, through innovative teaching methods, and by redefining participation in STEM fields. Collins’s (2015) four principles guide quantitative researchers to develop intersectional models that are “meaningful” rather than merely statistically “significant.” The term “meaningful” is crucial here because reliance on statistical significance for model selection can overlook educationally relevant differences (Wasserstein and Lazar 2016) and yield misleading outcomes (Van Dusen and Nissen 2022). Additionally, these principles emphasize the necessity of considering power dynamics beyond individual social strata in models, including instructional methods and school characteristics.
Salem (2018) describes intersectionality as a theory that travels across time, place, and space; it has been adapted and used differently across contexts and communities. For instance, the rise of critical quantitative theories has brought intersectionality into quantitative research (Tabron and Thomas 2023). However, challenges in adapting intersectionality quantitatively have hindered practical and theoretical advances (Bowleg 2008). Bauer et al. (2021) highlight this, noting a significant gap in theoretical consistency in quantitative intersectionality studies.
Quantitative STEM equity research often groups marginalized populations (e.g., underrepresented minorities [URMs]) and analyzes oppression separately (e.g., racism or sexism). These choices, largely driven by institutional norms rather than theory (e.g., the National Science Foundation's categorization of URMs), aim for statistical significance but can mask true inequities (Shafer, Mahmood, and Stelzer 2021; Wasserstein and Lazar 2016). Incorporating intersectionality could enhance understanding of inequities and inform equitable educational reforms (Van Dusen and Nissen 2020a, 2020b). However, this approach demands analytic methods capable of addressing the complex interplay of social identities, power dynamics, and discipline-specific practices.
In examining intersectionality in feminist literature, McCall (2005) outlines three approaches: anti-categorical, intracategorical, and intercategorical. Quantitative research, which inherently categorizes, typically uses intra- or intercategorical methods, with the latter being more prevalent (Bauer et al. 2021). The fixed-effects model is frequently used to assess intersectional outcomes, incorporating main effects for each aspect of intersectional social strata (e.g., race, gender) and all of their potential interactions (Evans 2019; Evans, Leckie, and Merlo 2020). Intersectional social strata are the provisionally adopted analytic categories (e.g., the combination of race, gender, and first-generation designation) used to document inequalities. This approach, while offering nuanced insights, becomes increasingly complex and data-intensive with each added identity stratum, leading to challenges in achieving sufficient statistical power for reliable outcomes.
Researchers have proposed multilevel analysis of individual heterogeneity and discriminatory accuracy (MAIHDA; Evans 2015, 2019; Evans et al. 2018; Merlo 2018) to address some of the shortcomings of prior methods. To improve model predictions across strata, MAIHDA nests individuals within their social identities (see Figure 1; Evans et al. 2020; Keller et al. 2023). By combining the main effects with a variance term for each stratum, MAIHDA can create more accurate predictions than a model that includes the main effects without interaction terms. MAIHDA offers several potential advantages over fixed-effects models. First, MAIHDA aligns with an intersectional perspective by including all strata in a model regardless of their sample size. Second, by reducing the number of terms in the model, MAIHDA reduces the statistical power requirements. Third, MAIHDA seeks to increase the accuracy of predictions by drawing on the shrinkage, or partial pooling, that nesting within each aspect of a stratum provides in a multilevel model (Raudenbush and Bryk 1986). Shrinkage allows multilevel models to make predictions for each stratum that are informed by the predictions made for other strata. Shrinkage benefits are likely to be strongest for small-N strata, where the small numbers limit the ability of intersectional models to disaggregate outcomes.
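The shrinkage described above can be illustrated with the textbook partial-pooling weight for a simple random-intercept model. The Python sketch below is not the estimation procedure used in this study: it treats the variances as known, and the 5-point between-stratum standard deviation is an assumed value chosen only for illustration. The function names are hypothetical.

```python
def pooling_weight(n, within_sd, between_sd):
    """Weight placed on a stratum's own mean under partial pooling.

    Textbook random-intercept result: tau^2 / (tau^2 + sigma^2 / n).
    """
    tau2 = between_sd ** 2
    sigma2 = within_sd ** 2
    return tau2 / (tau2 + sigma2 / n)


def shrunken_mean(stratum_mean, n, expected_mean, within_sd, between_sd):
    """Pull a stratum's observed mean toward the model's expected value."""
    w = pooling_weight(n, within_sd, between_sd)
    return w * stratum_mean + (1 - w) * expected_mean


# Assumed values for illustration only: a 20-point within-stratum SD
# and a 5-point between-stratum SD (the latter is not from the study).
for n in (4, 20, 400):
    print(n, round(pooling_weight(n, within_sd=20, between_sd=5), 2))
```

Under these assumed values, a stratum with 4 students keeps a weight of 0.20 on its own mean, whereas a stratum with 400 students keeps a weight of about 0.96, which is why the benefit of shrinkage concentrates in small-N strata.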

Figure 1. Multilevel analysis of individual heterogeneity and discriminatory accuracy (MAIHDA) nesting students within k strata.
MAIHDA offers significant potential for modeling intersecting social identities, but its empirical validation requires further simulation studies. Prior work has used simplified simulations (Bell, Holman, and Jones 2019; Evans et al. 2020; Lizotte et al. 2020) or focused on health outcomes (Mahendran, Lizotte, and Bauer 2022a, 2022b), leaving a gap in demonstrating MAIHDA's effectiveness in educational settings. Specifically, there is a need to show how MAIHDA can enhance prediction accuracy for educational strata, particularly when examining smaller sample sizes. Additionally, simulations have yet to explore MAIHDA's application to education's hierarchical data structure, such as measurements within students within courses within schools.
In our study, we compare MAIHDA and fixed-effects models to evaluate their performance in analyzing intersectional social strata outcomes. We use the Force Concept Inventory (FCI; Hestenes, Wells, and Swackhamer 1992) for our simulations, set against the backdrop of physics—a field noted for its inequities (Brewe and Sawtelle 2016). Our simulations, reflective of actual strata performance and participation, incorporate students within courses, course-level variation, and varied sample sizes typical of science equity research (Van Dusen and Nissen 2022). Although we simplified course-level variation, we meticulously modeled the common educational scenario of students nested within courses to understand its influence on outcome predictions (Van Dusen and Nissen 2019).
To understand MAIHDA's efficacy for examining outcomes for intersectional social strata, we asked the following three research questions:
Research Question 1: What are typical strata representation rates and outcomes in physics research-based assessments?
Research Question 2: To what extent do MAIHDA models create more accurate predicted strata outcomes than fixed-effects models?
Research Question 3: To what extent do MAIHDA models allow for the modeling of smaller strata sample sizes?
Intersectionality in STEM Higher Education
Most intersectional research in STEM higher education has used qualitative methods to investigate the double bind that racism and sexism pose to women of color in STEM fields. These studies reveal the covert norms and expectations of these disciplines, such as the emphasis on mastery, competitiveness, and individualism, that disproportionately exclude women of color (Carter et al. 2019; Carlone and Johnson 2007; Cochran, Boveda, and Prescod-Weinstein 2020; Dawson 2019; Ireland et al. 2018; McGee 2023; Ong 2023; Traweek 2009; Womack et al. 2023). Such norms not only further marginalize multiply marginalized groups, but they also allow White faculty to profess inclusivity while perpetuating color-evasive racism, undermining the struggles against racism and sexism faced by women of color (Dancy and Hodari 2023; Fries-Britt et al. 2010, 2013; King, Russo-Tait, and Andrews 2023; Robertson et al. 2023).
The few quantitative studies in STEM higher education that evoke intersectionality use interaction terms to build intersectional models (Van Dusen and Nissen 2022). These studies have found meaningful differences in content knowledge and beliefs before instruction (Nissen, Her Many Horses, and Van Dusen 2021; Van Dusen and Nissen 2020b; Van Dusen et al. 2021), opportunities and performance in AP physics and chemistry courses (Krakehl and Kelly 2021; Palermo, Kelly, and Krakehl 2022), and inequities in course failure rates (Van Dusen and Nissen 2020a) that represent the educational debt American society owes to Black, Brown, Indigenous, and poor students (Ladson-Billings 2006) and women in STEM. Yet the common use of p-value cutoffs to exclude interaction terms, combined with small sample sizes, results in many studies not building intersectional models at all (Stewart et al. 2021; Van Dusen and Nissen 2022).
Approaches to Modeling Intersectionality
Fixed-Effects Approach
Bauer (2014) notes that despite their mutual enhancement potential, intersectionality theory and quantitative methods have not fully converged. Health researchers have recognized this gap, acknowledging the complexity of inequities and the need for quantitative analysis to address these issues. In education studies, quantitatively applying intersectionality has provided insights by exploring how intersecting identities (race, gender, socioeconomic status) and their power dynamics affect academic outcomes (Riegle-Crumb and Grodsky 2010), disciplinary measures (Morris and Perry 2017), and financial planning for college (Quadlin and Conwell 2021). Intersectionality research also includes often overlooked factors, such as body size, alongside race and sex on education (Branigan 2017), showcasing the theory's breadth in examining diverse social strata effects.
In their examination of quantitative intersectional research, Mena and Bolte (2019) found that intersections in health studies are often analyzed using regression models that include interaction terms or through stratified analysis, a practice confirmed by Bauer et al. (2021) across various fields. This approach involves integrating interaction terms (e.g., Gender × Race) into regression equations to investigate the nuanced effects of intersecting identity strata (Evans 2019). For example, Conwell (2021) utilized cubic regression with intersectional terms, such as race and income, to study their effect on children's math scores. Similarly, Riegle-Crumb and Grodsky (2010) used multivariate regression to explore interactions between ethnicity and various socioeconomic factors, demonstrating the methodology's applicability in understanding complex social dynamics.
Researchers have also recognized that these intersectional approaches can address questions involving nested data and data that evaluate intervention treatments. For instance, Morris and Perry (2017) studied the interaction of race and gender on office referrals. They conducted a series of models including an interaction term for race/ethnicity and gender and using a three-level model, with observations (Level 1) nested within students (Level 2) nested within schools (Level 3).
Problems With Fixed-Effects Approaches
Using interaction terms to analyze intersectional social strata, although insightful, faces challenges with scalability and precision as the number of strata increases (Evans et al. 2018). Moreover, these approaches often require large data sets, typically over 1,000 students (Bauer et al. 2021). The necessity for ample sample sizes to adequately represent and disaggregate each stratum complicates the examination of intersectionality, especially in studies with diverse programs or treatments. For example, Riegle-Crumb and Grodsky (2010) tackled mathematical achievement disparities by conducting separate analyses for students in different levels of math courses using intersectional terms. This division dilutes statistical power, amplifies model uncertainty, and complicates the detection of inequities among strata.
MAIHDA
The challenges associated with fixed effects and interaction terms in intersectional studies highlight logistical and conceptual issues. Scott and Siltanen (2017) critiqued the alignment of multiple regression, a prevalent method in intersectional research, with the core principles of intersectionality, emphasizing the importance of contextual sensitivity, open-ended examination of inequities, and acknowledgment of the multifaceted and multilayered nature of inequity. They argued that traditional regression methods fall short of capturing these intricacies, recommending instead a focus on data that contextualize individuals within broader settings. Despite slow adoption, emerging evidence from simulations indicates that MAIHDA (Merlo 2018) could offer a more fitting approach by effectively incorporating these intersectional considerations. MAIHDA uses “[h]ierarchical and multilevel models to study large numbers of interactions and intersectional identities while partitioning the total variance between two levels—the between-strata (or between-category) level and the within-strata (or within-category) level” (Evans et al. 2018:64).
The idea of using MAIHDA for intersectional analyses originated in Evans’s (2015) dissertation and was further explored by Evans et al. (2018) in the context of health inequities. Their findings indicate MAIHDA's superiority over traditional fixed-effects methods in handling intersectional terms for several reasons: MAIHDA simplifies the inclusion of additional intersectional terms, improves accuracy for smaller strata, effectively considers the effects on both multiply marginalized and partially privileged individuals, offers more insightful statistics than merely the significance of interaction terms, and facilitates analysis of within-group differences. MAIHDA's multilevel structure also allows for the incorporation of experimental designs into the analysis, enhancing its capability to dissect complex data. Evans et al. underscored MAIHDA's efficiency by comparing Bayesian information criterion (BIC) scores between fixed-effects and MAIHDA models across simulations with strata numbers ranging from 4 to 384; they found MAIHDA models maintained consistent BIC scores despite increasing strata, unlike the exponential increase observed in fixed-effects models.
Independently, Jones, Johnston, and Manley (2016) examined an analogous technique to MAIHDA. They modeled voting rates using fixed effects for primary terms and random effects to account for interaction terms. They found that using random effects improved the level of detail in the model and protected researchers from overinterpreting their data.
Since these initial publications, simulation studies have examined different aspects of MAIHDA's efficacy and utility. This work evaluates MAIHDA's performance compared to alternatives in reducing false positives and yielding accurate models with varying sample sizes. Bell et al. (2019), for example, performed a simulation study examining false-positive rates for intersectional terms in fixed-effects and MAIHDA models. They found that MAIHDA models produced fewer false positives than did saturated fixed-effects models.
Lizotte et al. (2020) demonstrated that when given large sample sizes (100,000) and large strata populations (3,125), MAIHDA models and fixed-effects models were both highly accurate. They proposed that Evans (2019) had misinterpreted the MAIHDA model terms. Specifically, they claimed that the fixed effects had been mistaken for the grand means and that the residual terms had been mistaken for the intersectional effects. Evans et al. (2020) rebutted this suggestion. Using a simulation study, they demonstrated that, as with all multilevel models, the fixed effects in MAIHDA models are precision-weighted grand means, which, under some conditions, will be equal to the grand means.
In their comparative simulation studies, Mahendran et al. (2022a, 2022b) evaluated seven modeling techniques, including cross-classification, regression with interactions, MAIHDA, and various decision trees (CART, CTree, CHAID, and random forest), for analyzing intersectional health outcomes. Using logistic regression for binary outcomes, they discovered that although some methods could occasionally reach MAIHDA's level of accuracy and sensitivity, MAIHDA consistently outperformed all other techniques across different scenarios (Mahendran et al. 2022b). In their follow-up study focused on continuous outcomes, random forest emerged as the most robust, yet MAIHDA was still favored for its performance across various sample sizes (Mahendran et al. 2022a). Notably, these studies shifted from MAIHDA's usual Bayesian framework to frequentist models, with researchers arguing that the latter offered comparable results with the advantage of quicker estimation.
Keller et al. (2023) affirmed MAIHDA's advantages in education research: MAIHDA could efficiently scale to higher dimensional data, maintain model simplicity, and provide precise estimates for strata with few observations, consistent with findings in health research. However, they identified challenges specific to education, such as the difficulty of gathering sufficiently large samples to represent diverse intersectional strata of students. They also highlighted the importance of investigating MAIHDA's capacity to handle education's inherent nested data structures, such as students within courses and schools. Evans (2019) conducted a preliminary examination of nesting individuals within schools alongside social strata, but comprehensive studies using cross-classified multilevel MAIHDA models in educational contexts remain unexplored.
In this simulation study, we expand on the literature by creating a realistic model of science student outcomes—in which strata composition varies to mirror likely data sets—and comparing MAIHDA's accuracy against fixed-effects models of varying sample sizes. We improved the sophistication of the simulated data from prior studies by building it off of large-scale, real-world education data. We used cross-classified multilevel models that nest students in courses and social strata, a critical feature of large-scale educational studies (DiPrete and Forristal 1994; Niehaus, Campbell, and Inkelas 2014; Raudenbush and Bryk 1986; Van Dusen and Nissen 2019). We also examine how MAIHDA's improved accuracy can allow researchers to include more strata in their models without sacrificing the accuracy of their predictions.
Methods
To compare the accuracy and precision of fixed-effects and MAIHDA models across three sample sizes, our analysis proceeds in four steps (see Figure 2). Step 1 uses the FCI data to create a true model of test scores across 20 intersectional identities. Step 2 simulates data for the true model 1,000 times for each of the three sample sizes. Step 3 models the data using both fixed-effects and MAIHDA models. Step 4 compares the true error (predicted – true score) across the fixed-effects and MAIHDA models.

Figure 2. The four steps of our analysis.
Step 1: Data Collection
To achieve a realistic simulation reflective of actual education research, we utilize simulated data inspired by real-world data from the Learning About STEM Student Outcomes (LASSO) platform (Van Dusen 2018). LASSO, an online tool for STEM education assessment, offers detailed reports on student performance in core science subjects, providing anonymized student data for research purposes with participant consent. The data set does not capture every institutional context, but it surpasses the representativeness of data commonly used in studies on science higher education (Nissen et al. 2021). The Carnegie Classification of Institutions of Higher Education public 2021 database served as the basis for detailing institution types (see Table 1).
Table 1. Institution Information from Our Data Set.
Note: The subcategories do not add up to the total because two institutions were not in the Carnegie Classification of Institutions of Higher Education public 2021 database.
LASSO provided the social identities and pretest scores from 5,955 students in 171 courses at 40 institutions on the FCI. To assess the effectiveness of physics instruction, Hestenes et al. (1992) developed the FCI to probe student understanding of Newtonian forces. Researchers have applied many different quantitative methods to data from the FCI (see e.g., Eaton and Willoughby 2020). One strand of this quantitative research focuses on the fairness of the FCI and its ability to produce unbiased data across different strata. Morley, Nissen, and Van Dusen (2023) found the FCI was measurement invariant across the intersection of 10 racial and gender strata, but evidence indicates that several items on the FCI function differently for men (Traxler et al. 2018), White men in particular (Buncher et al. 2021), than for individuals identifying with other gender or racial strata. Buncher et al. (working paper), however, concluded that the differences in item performance across strata were likely due to differences in construct-relevant background knowledge rather than bias from construct-irrelevant knowledge. Nonetheless, researchers often use the FCI to investigate the effectiveness of instructional techniques (Bruun and Brewe 2013; Caballero et al. 2012; Han et al. 2015; Nissen, Her Many Horses, et al. 2022; Xiao et al. 2020) and equity in courses (Brewe, Kramer, and O’Brien 2009; Good, Maries, and Singh 2019; Van Dusen and Nissen 2020a, 2020b).
Step 2: Data Simulation
To reflect the range of sample sizes commonly encountered in equity research (Van Dusen and Nissen 2022), we generated simulated data sets with 500, 1,000, and 5,000 students, creating 1,000 data sets for each size, totaling 3,000 simulations. In these data sets, each student was placed within a course comprising 50 students; the course average scores had a standard deviation of 10 points, given the FCI score range of 0 to 100. This variability aligns with actual educational data and represents typical variance seen in educational studies (Condon, Lavery, and Engle 2016; Sun and Pan 2014; Van Dusen and Nissen 2019, 2020b). We avoided simulating complex student–classroom demographics, such as having men more represented in higher or lower performing contexts, to maintain the study's focus. Of the initial simulations, 44 data sets at the 500-student sample size were discarded due to having strata with no data, rendering them unsuitable for analysis with either fixed-effects or MAIHDA models. This adjustment led to a final count of 2,956 data sets available for our analysis.
Within each simulated data set, we included strata variables for five racial groups (i.e., Asian = 14 percent, Black = 6 percent, Hispanic = 7 percent, White = 63 percent, and White Hispanic = 10 percent), two gender groups (i.e., women = 36 percent, men = 64 percent), and two college-generation groups (i.e., first-generation [FG] = 36 percent, continuing-generation [CG] = 64 percent), creating 20 intersectional social strata combinations. Based on the LASSO data, we set each stratum's representation rate, true value, and standard deviation. Table 3 shows the true score and proportional representation for each stratum and the student-level standard deviations.
The simulation set the proportional representation of each identity category in the data set, but how the categories intersected varied across simulations. For example, we set 6 percent of students in each simulation as Black, 36 percent as women, and 36 percent as FG. We assigned each axis of social identifiers (i.e., race, gender, and FG/CG designation) independently, meaning we did not ensure that 36 percent of each racial group were women. So, although FG Black women made up approximately 0.78 percent (6 percent × 36 percent × 36 percent) of the data on average, in any given simulation, their representation could range from 0 percent (e.g., if none of the Black students were women and FG) to 6 percent (if all the Black students were women and FG). By allowing the proportion of students in each stratum to vary across simulations, we can examine the accuracy of predicted outcomes across a broader range of strata sample sizes.
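The independent assignment of identity axes can be sketched in a few lines of Python. This sketch assumes that the marginal shares are fixed exactly in every data set and that each axis is shuffled separately, which is one reading of the statement that representation can range from 0 to 6 percent. The function names and the seed are hypothetical.

```python
import random
from collections import Counter

# Marginal shares reported for the simulation.
RACE = {"Asian": 0.14, "Black": 0.06, "Hispanic": 0.07,
        "White": 0.63, "White Hispanic": 0.10}
GENDER = {"women": 0.36, "men": 0.64}
GENERATION = {"FG": 0.36, "CG": 0.64}


def axis_labels(shares, n, rng):
    """Return n shuffled labels whose counts match the marginal shares."""
    labels = []
    for label, share in shares.items():
        labels.extend([label] * round(share * n))
    # Guard against rounding drift: trim extras or pad with the last label.
    labels = labels[:n] + [label] * (n - len(labels))
    rng.shuffle(labels)
    return labels


def simulate_strata_counts(n, rng):
    """Assign each axis independently and count students per stratum."""
    race = axis_labels(RACE, n, rng)
    gender = axis_labels(GENDER, n, rng)
    generation = axis_labels(GENERATION, n, rng)
    return Counter(zip(generation, race, gender))


expected_share = RACE["Black"] * GENDER["women"] * GENERATION["FG"]
rng = random.Random(0)  # arbitrary seed so the sketch is reproducible
counts = simulate_strata_counts(500, rng)
print(f"expected share of FG Black women: {expected_share:.4f}")
print("FG Black women in this draw:", counts[("FG", "Black", "women")])
```

Because the axes are shuffled separately, the count for any one stratum differs from draw to draw even though every marginal count stays fixed.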
Our simulation also randomly assigned students to courses. For instance, 36 percent of the population were women, but within any given course, the share of women could range from 0 to 100 percent. We set the standard deviation for scores within a stratum to be 20 points.
Our student score simulation equation is as follows:
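Read from the description in Step 2, each simulated score appears to combine three parts: the true score of the student's stratum, a course-level deviation (standard deviation of 10 points), and a student-level deviation (standard deviation of 20 points). The Python sketch below generates scores under that reading. The function name, the seed, and the two-stratum example are illustrative assumptions. The sketch does not bound scores to the 0 to 100 range because the text does not say whether the simulation did.

```python
import random

COURSE_SIZE = 50    # students per course
COURSE_SD = 10.0    # SD of course average scores
STUDENT_SD = 20.0   # SD of scores within a stratum


def simulate_scores(strata, true_scores, rng):
    """Return (scores, courses), one entry per student.

    strata: list of stratum labels, one per student.
    true_scores: dict mapping each stratum label to its true mean score.
    """
    n_courses = -(-len(strata) // COURSE_SIZE)  # ceiling division
    course_effects = [rng.gauss(0.0, COURSE_SD) for _ in range(n_courses)]
    # Random course assignment: shuffle students, then fill courses in order.
    order = list(range(len(strata)))
    rng.shuffle(order)
    scores = [0.0] * len(strata)
    courses = [0] * len(strata)
    for position, student in enumerate(order):
        course = position // COURSE_SIZE
        courses[student] = course
        scores[student] = (true_scores[strata[student]]
                           + course_effects[course]
                           + rng.gauss(0.0, STUDENT_SD))
    return scores, courses


# Two strata with the mean scores reported in the findings; the stratum
# sizes here are made up for the example.
rng = random.Random(1)
strata = ["CG White men"] * 400 + ["CG Hispanic women"] * 100
true_scores = {"CG White men": 49.0, "CG Hispanic women": 28.0}
scores, courses = simulate_scores(strata, true_scores, rng)
print(len(scores), "scores in", max(courses) + 1, "courses")
```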
Step 3: Modeling
We analyzed each data set using a fixed-effects model and a MAIHDA model. Our fixed-effects intersectional model (see Figure 3) is a frequentist multilevel model that nests students (Level 1) within courses (Level 2). The fixed-effects intersectional model includes an interaction term for each combination of race, gender, and college-generation term:

Figure 3. The multilevel structure of our fixed-effects model with students (Level 1) nested within courses (Level 2).
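To give a sense of scale for a specification with every interaction, the Python sketch below counts the coefficients in a dummy-coded model for the three identity axes used here and compares that with a specification that keeps only additive main effects. Dummy coding is an assumption made for the count, and the function names are hypothetical.

```python
from itertools import combinations
from math import prod

# Number of categories on each identity axis in the simulated data.
LEVELS = {"race": 5, "gender": 2, "generation": 2}


def saturated_terms(levels):
    """Count coefficients in a dummy-coded model with all interactions."""
    total = 1  # intercept
    for order in range(1, len(levels) + 1):
        for axes in combinations(levels.values(), order):
            # Each axis contributes (k - 1) dummy columns; an interaction
            # among several axes needs the product of those counts.
            total += prod(k - 1 for k in axes)
    return total


def main_effect_terms(levels):
    """Count coefficients when only additive main effects are included."""
    return 1 + sum(k - 1 for k in levels.values())


print(saturated_terms(LEVELS), main_effect_terms(LEVELS))
```

With five racial groups, two gender groups, and two college-generation groups, the fully interacted specification has 20 coefficients, one per stratum, whereas the additive specification has 7. Adding another identity axis multiplies the first number but only adds to the second.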
Our MAIHDA model (see Figure 4) is a Bayesian cross-classified multilevel model that nests students (Level 1) within courses (Level 2a) and strata (Level 2b):

Figure 4. The multilevel structure of our cross-classified MAIHDA model with students (Level 1) nested within courses (Level 2a) and strata (Level 2b).
We follow Fielding and Goldstein's (2006) notation for multilevel models and cross-classified multilevel model equations. For example, the subscripts in
To understand the structure of the variance of our data, we ran an unconditional cross-classified model with no fixed effects. We use this model to calculate the variance partition coefficient (VPC) and the proportional change in variance (PCV). The VPC measures the variance attributed to a level in a multilevel model and is sometimes called the intraclass correlation coefficient. In MAIHDA models, the VPC for strata is usually less than 10 percent (Evans et al. 2020). The PCV measures the total between-stratum variation explained by including additive main effects. Our unconditional cross-classified multilevel model is written as follows:
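The VPC and PCV defined in this paragraph can be computed directly from fitted variance components. The Python sketch below uses the standard definitions; the variance values are hypothetical and are chosen only to illustrate the arithmetic, not taken from Table 2.

```python
def vpc(level_variance, all_variances):
    """Share of the total variance attributed to one level."""
    return level_variance / sum(all_variances)


def pcv(null_variance, adjusted_variance):
    """Proportional reduction in a variance after adding predictors."""
    return (null_variance - adjusted_variance) / null_variance


# Hypothetical variance components (strata, course, student) from an
# unconditional model, and a hypothetical strata variance after adding
# the additive main effects.
strata_var, course_var, student_var = 30.0, 100.0, 400.0
strata_var_with_main_effects = 3.0

print(round(vpc(strata_var, [strata_var, course_var, student_var]), 3))
print(round(pcv(strata_var, strata_var_with_main_effects), 2))
```

A high PCV indicates that most of the between-stratum variation is additive; the remainder is the part attributable to the interaction of identities.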
Comparing the variances from the real-world data with our simulated data shows that the share of the variance accounted for by strata (VPC) is very similar in both data sets (see Table 2). Including the fixed-effects terms in the MAIHDA model also accounts for similar shares of the variance due to strata (PCV) for the real-world and simulated data sets. The one place where the two data sets diverge is the share of the variance accounted for by the courses. We reran our simulation with a smaller standard deviation in course scores (4.5 points). This made the
Table 2. The Variance, Variance Partition Coefficient, and Proportional Change in Variance for the Real-World and Simulated Data Sets.
Note:
Our unconditional model shows the
Society’s educational debts owed to marginalized groups were calculated by subtracting the mean score for a stratum from that of CG White men.
Step 4: Model Comparison
To compare the precision and accuracy of MAIHDA and fixed-effects models and explore minimum sample sizes for robust estimation, we calculated the true error in the prediction for each of the 20 strata. The true error is the difference between the predicted and true scores (estimated score – true score). We compared their mean absolute true errors to determine which modeling technique yields more accurate predictions. To determine which modeling technique would allow for the modeling of smaller strata sample sizes, we examined the mean absolute true error for strata with sample sizes of 20 or fewer.
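The comparison metric in Step 4 can be sketched as follows. The function names and example values are hypothetical; the 20-student threshold is the one stated above.

```python
from statistics import mean


def true_error(estimated, true):
    """Signed error of one stratum prediction (estimated - true)."""
    return estimated - true


def mean_absolute_true_error(estimates, true):
    """Average |estimated - true| for one stratum across data sets."""
    return mean(abs(true_error(e, true)) for e in estimates)


def small_strata_error(records, max_n=20):
    """Mean absolute true error for strata with max_n or fewer students.

    records: iterable of (stratum_n, estimated, true) tuples.
    Returns None when no record meets the size threshold.
    """
    errors = [abs(true_error(est, tru))
              for n, est, tru in records if n <= max_n]
    return mean(errors) if errors else None


# Hypothetical predictions for one stratum whose true score is 28 points.
print(mean_absolute_true_error([30.0, 25.0, 28.5], 28.0))
```

Taking the absolute value before averaging keeps overestimates and underestimates from cancelling, so the metric reflects the typical size of a model's miss rather than its bias.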
Findings
Research Question 1: What Are Typical Strata Representation Rates and Outcomes on Physics Research-Based Assessments?
LASSO provided data from 5,955 students in 171 courses at 40 institutions on the FCI. These data include 20 intersectional social strata (see Table 3). The share of the sample for a stratum ranges from 26 percent (CG White men) to 1 percent (CG Black women, FG Black men, FG Black women, FG Hispanic women, and FG White Hispanic women). The mean score on the assessment for a stratum ranges from 49 points (CG White men) to 28 points (CG Hispanic women). Society's educational debts range from 21 points (CG Hispanic women) to 2 points (CG Asian men).
Table 3. The Share of the Total Sample, True Score, and Student-Level Standard Deviation for Each Intersectional Social Stratum as Set by the Simulation.
Note: CG = continuing-generation; FG = first-generation.
Research Question 2: To What Extent Do MAIHDA Models Create More Accurate Predicted Strata Outcomes Than Fixed-Effects Models?
We compare the true errors of the two modeling methods to determine their accuracy. Table 4 shows the mean absolute true error disaggregated by model, strata, and total sample size. The MAIHDA model has smaller mean absolute true errors in 56 of the 60 cases (i.e., 20 strata in each of the three total sample sizes). In the four exceptions in which the MAIHDA model is less accurate on average, the mean absolute true errors are only marginally larger, ranging from 0.0 to 0.4 points. As the total sample sizes increase, the amount by which MAIHDA outperforms the fixed-effects model decreases. MAIHDA's mean absolute true error is 31 percent (1.7 points) smaller than that of the fixed-effects model when the total sample size is 500 but only 13 percent (0.2 points) smaller when the total sample size is 5,000.
Each Strata's Mean Absolute True Error across the Three Total Sample Sizes.
Note: The delta mean absolute true error is the MAIHDA mean absolute true error minus the fixed-effects mean absolute true error. Negative delta values indicate the MAIHDA model generally produces less error. CG = continuing-generation; FG = first-generation; MAIHDA = multilevel analysis of individual heterogeneity and discriminatory accuracy.
The largest mean absolute true error for the fixed-effects model is for FG Black women (9.3 points), who also have the smallest mean N (3.7). In the MAIHDA model, the mean absolute true error is reduced to 4.9 points. Given that the standard deviation for student scores is 20 points, the shift from fixed-effects to MAIHDA models decreases the mean absolute true error from 0.47 SD to 0.25 SD.
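The conversion to standard deviation units is simple arithmetic, sketched here in Python using the values reported above. The helper name `in_sd_units` is hypothetical.

```python
STUDENT_SD = 20.0  # student-level standard deviation, in points

def in_sd_units(error_points, sd=STUDENT_SD):
    """Express an error measured in points as a fraction of the student SD."""
    return error_points / sd

fixed_effects_sd = in_sd_units(9.3)  # 0.465, reported as 0.47 SD
maihda_sd = in_sd_units(4.9)         # 0.245, reported as 0.25 SD

# Relative reduction in error when moving from fixed-effects to MAIHDA.
relative_reduction = (9.3 - 4.9) / 9.3  # roughly 0.47
```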
To understand how strata sample sizes affect the accuracy of the models, we examine the true error for each strata prediction for each model. Figure 5 shows the true error for all the predictions versus each model's strata sample size. Both models are similarly accurate when they have larger total and strata sample sizes. The magnitude of the true errors increases for both models with smaller total sample sizes and strata sample sizes. Where the models differ, however, is that the magnitude of the true error increases more quickly in the fixed-effects model as the strata sample sizes decrease. The improvement in performance for MAIHDA over fixed-effects models is most apparent for the strata with the smallest sample size (i.e., FG Black women). The reduction in the mean absolute true error for FG Black women when using MAIHDA ranges from 47 percent with a total sample size of 500 to 25 percent with a total sample size of 5,000.

Logarithmic scatter plot of the true error for each predicted strata outcome versus the strata sample size for each model.
Research Question 3: To What Extent Do MAIHDA Models Allow for Modeling of Smaller Strata Sample Sizes?
To understand the differences in each model's ability to handle smaller strata sample sizes, we examine the mean absolute true error and the mean true error for each strata sample size (see Table 5). Figure 6 shows the mean absolute true error values for strata sample sizes ranging from 1 to 20 across total sample sizes and modeling techniques. We limit the figure to this range because Simmons, Nelson, and Simonsohn (2016) argue for using 20 as a minimum strata sample size. Figure 6 includes a dashed line as a comparison tool indicating the mean absolute true error for the fixed-effects model when the strata sample size is 20 and the total sample size is 500 (4.4). For the MAIHDA model, the mean absolute true error does not reach 4.4 until the strata sample sizes are reduced to 5 or fewer.
Mean and Absolute Mean of the True Error Values for Each Strata Sample Size Disaggregated by Model Type and Total Sample Size.
Note: MAIHDA = multilevel analysis of individual heterogeneity and discriminatory accuracy.

Plot of the mean absolute true error values by strata sample size across total sample size and modeling techniques.
For models with very small strata sample sizes, shrinkage will decrease the size of the strata's variance term. This smaller variance term leads to the strata's predicted outcomes being closer to the prediction produced by only adding the model's primary coefficients for the strata. In our MAIHDA models, when strata sample sizes fall below 10 individuals, we begin to have a small but consistent negative bias in the mean true error scores (see Table 5). This bias is negative because the variance term for the least represented strata (FG Black women) is positive, on average. In other words, the additive effects of the coefficients for the intercept, FG, Black, and women predict scores lower than the true score for FG Black women. Having fewer data for that strata regresses their predicted score closer to the additive score from the model and creates a negative bias.
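The shrinkage mechanism can be illustrated with the textbook partial-pooling weight for a random-intercept model with known variances. This is a simplified sketch, not the Bayesian estimation used to fit the MAIHDA models. The strata-level standard deviation (5 points), the additive prediction (30 points), and the true score (34 points) are hypothetical values chosen for illustration; the student-level standard deviation of 20 points is taken from the text.

```python
def shrinkage_weight(n, strata_sd, student_sd):
    """Weight placed on the strata's own data: var_u / (var_u + var_e / n)."""
    var_u = strata_sd ** 2
    var_e = student_sd ** 2
    return var_u / (var_u + var_e / n)

def shrunken_prediction(additive_prediction, observed_mean, n,
                        strata_sd=5.0, student_sd=20.0):
    """Pull the strata's observed mean toward the additive prediction."""
    w = shrinkage_weight(n, strata_sd, student_sd)
    return additive_prediction + w * (observed_mean - additive_prediction)

ADDITIVE = 30.0    # hypothetical prediction from the additive coefficients
TRUE_SCORE = 34.0  # hypothetical true score, above the additive prediction

# Expected bias when the observed strata mean equals the true score.
sample_sizes = (2, 5, 10, 20, 100)
biases = [shrunken_prediction(ADDITIVE, TRUE_SCORE, n) - TRUE_SCORE
          for n in sample_sizes]

for n, bias in zip(sample_sizes, biases):
    print(n, round(shrinkage_weight(n, 5.0, 20.0), 3), round(bias, 2))
```

Under these assumed values, the weight on the strata's own data is 25/45 (about 0.56) at a strata sample size of 20 and falls as the sample size shrinks. The expected bias is therefore negative for every sample size and grows in magnitude for the smallest strata, which matches the direction of the bias described above.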
Discussion
A central goal of intersectional modeling is to disaggregate intersectional social strata outcomes as much as possible to represent the diversity of lived experiences. MAIHDA was developed to support further disaggregation of data across strata than is possible with fixed-effects regression models. To identify whether MAIHDA is superior to fixed-effects models at predicting outcomes for small strata, we compared the accuracy of both modeling techniques on simulated data sets that mirror real-world physics student outcomes.
Our findings show meaningful inequities in physics student scores on the FCI and that MAIHDA creates more accurate models of these inequities than do fixed-effects models (Table 4 and Figure 5). When using the MAIHDA model rather than the fixed-effects model, we see the most significant improvements in accuracy with predictions for small total sample sizes and strata with smaller sample sizes. These improvements support the goal of using MAIHDA to create models with smaller strata, which allows for more strata to reflect the scope of intersectional experiences within a population.
Future Research
Developing and running the MAIHDA and the fixed-effects models took similar effort with our simulated data set. However, one feature of our simulated data that differed from many educational data sets is that it lacked any missing data. When data are missing, multiple imputation is often recommended to maximize statistical power and limit the introduction of bias (Nissen, Donatello, and Van Dusen 2019; Rubin 1996). Examining multiply imputed data sets can complicate analyses. Future research should compare the practicality and efficacy of using MAIHDA and fixed-effects models with a real-world data set that includes missing values.
Although we did not examine it in this study, the Bayesian nature of MAIHDA models means their accuracy could be improved through informed priors. Informed priors allow researchers to directly include findings from prior research into their models, similar to what is done in a meta-analysis. Although an investigation may not have the statistical power to include a particular strata in a model, if it can draw on findings across other investigations, it can create reasonably accurate predictions for even smaller strata. Determining the effect of using informed priors with MAIHDA models will require further research.
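One way to see how an informed prior could help is the conjugate normal update for a single strata mean with a known student-level standard deviation. This is a deliberately simplified sketch and not a MAIHDA model. The function name `posterior_mean_sd`, both sets of prior settings, and the sample values are hypothetical.

```python
def posterior_mean_sd(prior_mean, prior_sd, sample_mean, n, student_sd=20.0):
    """Normal-normal conjugate update for a mean with known student-level SD."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / student_sd ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * sample_mean)
    return post_mean, post_var ** 0.5

# Hypothetical strata with only three students and a sample mean of 25 points.
weak = posterior_mean_sd(prior_mean=40.0, prior_sd=100.0,
                         sample_mean=25.0, n=3)
# Hypothetical informed prior, e.g., summarizing earlier investigations.
informed = posterior_mean_sd(prior_mean=33.0, prior_sd=3.0,
                             sample_mean=25.0, n=3)

print("weak prior:     mean=%.1f, sd=%.1f" % weak)
print("informed prior: mean=%.1f, sd=%.1f" % informed)
```

With so few students, the weakly informative prior leaves the posterior close to the noisy sample mean and highly uncertain, whereas the informed prior yields a much narrower posterior. Whether that narrower posterior is also more accurate depends on how well the prior matches the population, which is the open question raised above.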
This analysis did not examine how to account for interaction effects between strata and other factors. For example, when using a fixed-effects model to determine the differential effect of a classroom intervention across strata, researchers can add fixed interaction terms between the intervention and intersectional social strata variables. In a MAIHDA model, however, the interaction effect is between a fixed variable (e.g., the intervention) and a random variable (e.g., intersectional social strata). It must be included as a random-effect term. Modeling programs can run such models, but determining the effect of moving these interaction terms from fixed to random will require future research. Furthermore, this study examined continuous outcomes. The efficacy we found likely exists for logistic models of categorical outcomes, but a simulation study is warranted to test it empirically.
This study provides proof that MAIHDA can work with cross-classified multilevel models, but the student–course structures used in our model are reasonably simple. Further research is needed to examine the efficacy of MAIHDA in modeling student outcomes with more complex student, course, and institution structures (e.g., creating subsets of courses that match the makeup of high-, medium-, and low-selectivity institutions). Future simulation studies could also examine the effect of creating cross-classified multilevel models with levels for course and strata versus simply including the strata level.
Conclusions
The accuracy improvements observed by using MAIHDA over fixed-effects models are strong evidence that MAIHDA can improve the quality of predicted outcomes across intersectional social strata. These results also suggest that researchers may be able to disaggregate their models further, increasing the number of distinct strata represented by accommodating smaller strata sizes. Our examination of the mean absolute true error for small strata sample sizes bore out this idea (Table 5 and Figure 6). One recommended value for minimum strata sample sizes with fixed-effects models is 20 (Simmons et al. 2016). For our fixed-effects models, the mean absolute true error was very similar for total sample sizes of 500 and 1,000 when the strata sample sizes were 20 or fewer. The MAIHDA model maintained an accuracy equal to or better than that observed in the fixed-effects model until the strata sample sizes were 5 or fewer. These smaller error values indicate that researchers could model strata sample sizes below 20 without meaningful compromises in accuracy when using MAIHDA. We refrain from making specific minimum strata sample size recommendations because the context of each research project will determine how much potential error a researcher is willing to entertain to increase the number of strata a model includes.
Beyond being a more effective methodological tool, MAIHDA can change how intersectionality is theorized and reified in quantitative research. Introducing tools to a system can transform cognitive tasks (Hutchins 1995a, 1995b) and lead to novel outcomes (Alvesson and Kärreman 2007; Roth and Lee 2007). By allowing models to include strata with a sample size of 1, MAIHDA can include all participants as they self-identify without aggregating groups or engaging in data erasure. Opening up this methodological possibility will require researchers to reenvision intersectional theory and its place in quantitative research.
Our findings provide strong evidence of the superiority of using MAIHDA when modeling intersectional outcomes, but MAIHDA is appropriate for only some scenarios. Specifically, MAIHDA requires a sufficient number of strata to offer improved predictions. Evans et al. (2018) recommend having at least 20 strata in a model before using them as a level to nest students to ensure the model has sufficient Level 2 random effects. Researchers have begun to explore MAIHDA models with fewer strata (e.g., Evans 2019; Silva and Evans 2020), but the effect of having fewer strata in a model is currently unknown. Some education equity research uses models that include 20 or more strata (e.g., Jang 2018; Nissen, Van Dusen, and Kukday 2022; Van Dusen, Nissen, and Johnson 2024), but much of this work does not (Van Dusen and Nissen 2022). With the increased use of novel modeling methods, large-scale data sets (e.g., LASSO), and intersectional perspectives in quantitative research (Wofford and Winkler 2022), the trend may be shifting to models that offer more nuanced depictions of marginalized students’ outcomes through the inclusion of more strata.
Although researchers are investigating the implications of using MAIHDA to create intersectional models, the statistical technique of blending fixed and random effects to create predicted outcomes is applicable across many areas of quantitative research. Just as Jones et al. (2016) used an analogous technique to MAIHDA to investigate voting rates, judiciously replacing some fixed effects with random effects can improve model accuracy. When using MAIHDA, researchers have primarily examined replacing interaction terms with random effects (Keller et al. 2023), but the most efficacious replacement of fixed effects will be contextually dependent. Additional research is needed to identify the best method for selecting fixed-effect terms in MAIHDA or analogous nonintersectional models.
Equity-minded researchers are cautious about treating strata monolithically. Intersectional research that considers variables such as those introduced here somewhat addresses this concern, but more work is needed to explore the potential for disaggregation of data according to more specific racial designations, such as regional Hispanic populations. In this work, we illustrate the potential of MAIHDA using demographic data likely to exist in many surveys that ask standard demographic questions to report on broad categories related to race, gender, and socioeconomic identification (or its proxies, such as college generation). More work is needed to illustrate the potential of MAIHDA—and quantitative intersectionality work generally—using more precise demographic data, such as immigration generation, family country of origin, or country region.
Footnotes
Acknowledgements
We want to thank the leaders of the summer institute in advanced research methods for Science, Technology, Engineering, and Mathematics education research (Guanglei Hong, Kenneth Frank, Stephen Raudenbush, and Yanyan Sheng) for their support and feedback in the generation of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded, in part, by National Science Foundation Award 1928596.
Research Ethics
Because this research does not constitute human subjects research, institutional review board approval was not necessary to carry out this work.
