Abstract
At the end of 2018, the U.S. Department of Education revoked Obama-era disciplinary guidance aimed at reducing the use of suspensions in schools (especially for minorities and students with disabilities). A key piece of research supporting the decision was based on analyses of the Early Childhood Longitudinal Study, Kindergarten Class of 1998–1999 (ECLS-K), which concluded that the racial suspension gap was not really about race but resulted from differences in the behavior exhibited by Black and White students. We reanalyzed the public-use ECLS-K, provide the syntax for our analyses, and show that the original findings were primarily due to sample selection bias. Several alternative model specifications were tested, and race-based suspension gaps persisted regardless of the model or measure used.
At times, policymakers use published research to guide, support, and/or bolster their decision-making. In December 2018, the U.S. Department of Education revoked Obama-era disciplinary guidance aimed at reducing the use of suspensions in schools (especially for minorities and students with disabilities; Camera, 2019). The federal School Safety report cited Wright et al.'s (2014) analyses of the Early Childhood Longitudinal Study, Kindergarten Class of 1998–1999 (ECLS-K) data set as a key piece of research; Wright et al. had suggested that "the use of suspensions may not be as racially biased as many have argued" (p. 264).
In their study, Wright et al. (2014) compared the likelihood of suspension for eighth-grade Black and White students while controlling for several demographic, academic, behavioral, and school variables. In Wright et al.'s initial model, Black students were more likely than White students to receive a suspension, in line with findings going back several decades. However, after a purported measure of prior problem behaviors (PPB) was added to the initial model, the relationship between race and suspension ceased to be statistically significant. The authors concluded that PPB accounted for differences in the likelihood of suspension and that "the racial gap in suspensions was completely accounted for by a measure of the prior problem behavior of the student—a finding never before reported in the literature" (Wright et al., 2014, p. 257).
Disparate Suspension Rates and Differential Behavior
Over several decades, Black students have been suspended at much higher rates compared to their White counterparts (e.g., Fabelo et al., 2011; McCarthy & Hoge, 1987; Skiba, Peterson, & Williams, 1997). The use of national or statewide data sets has consistently shown that generally, Black students are 2 to 4 times more likely to be suspended than White students (Anyon et al., 2014; Huang, 2018; Huang & Cornell, 2018; Mendez et al., 2002; Skiba et al., 2014). For example, in school year 2013–2014, Black K–12 students were 3.8 times more likely to receive an out-of-school suspension (OSS) compared to their White counterparts (U.S. Department of Education, 2016).
One possible reason for the disparate suspension rates is the differential involvement (DI) hypothesis: Black and White students may engage differentially in the types of misbehavior that lead to a suspension (Skiba et al., 2002). In an early study of the DI hypothesis, McCarthy and Hoge (1987) found that Black students did not misbehave more than White students, although Black students generally received more severe sanctions for the same infractions. Later studies also investigated the DI hypothesis and showed that although misbehavior was a major reason for suspension, it did not fully account for the disparities in suspensions based on race/ethnicity (Bradshaw et al., 2010; Huang, 2018). Using a decomposition of hypothesized factors contributing to the disparate suspension rates in elementary school, Owens and McLanahan (2019) found that differences in student behavior accounted for only a relatively small portion of the racial suspension gap. In contrast to these studies, Wright et al. (2014) suggested that PPB could fully account for the racial disparities in suspensions.
Revisiting the Analyses: Possible Issues
However, Huang (2018) suggested two possible explanations for Wright et al.'s (2014) results: (a) sample sizes were not consistent between analytic models, and (b) the predictor of interest may not actually be a measure of behavior problems. Because of listwise deletion, the sample sizes used in Wright et al.'s analyses differed substantially between the initial and succeeding models (n1 = 4,101 vs. n2 = 2,737). For a proper comparison, samples should be consistent; otherwise, differences in findings could simply be the result of differential attrition or sample selection (survivorship) bias. In addition, the PPB measure, collected in the fall of kindergarten and the spring of first and third grades, may not actually measure prior problems. The measure was based on the Social Skills Rating System (Gresham & Elliott, 1990), which assessed approaches to learning (ATL), self-control, interpersonal skills, and externalizing problem behaviors. Huang indicated that only the last subscale was an actual manifestation of problem behaviors and that the other subscales measure related (e.g., self-regulation) but distinct constructs. For example, the ATL scale used in kindergarten and first grade consisted entirely of items related to eagerness to learn, interest in things, and task persistence, which are not indicators of problem behaviors. A measure excluding ATL may therefore be more suitable. If the results are driven by ATL rather than by behavior problems, they may weaken once ATL is removed; however, a "purer" construct of problem behaviors could also yield stronger findings.
In addition to the issues raised by Huang (2018), other concerns deserve attention. One basic issue is that the outcome (i.e., suspension, including both in- and out-of-school suspensions) was an eighth-grade parent report of whether the student had ever been suspended, meaning that the student could have been suspended even once at any grade up to the eighth grade. Although the use of suspensions may rise as a child progresses through school (i.e., suspensions are used more frequently in middle than in elementary school), the outcome is quite imprecise. Even as early as preschool, Black children were 3.6 times more likely than White children to receive one or more suspensions (U.S. Department of Education, 2016). Through its use of certain predictors, the Wright et al. (2014) analysis implicitly assumed that the suspensions occurred in the eighth grade (e.g., using eighth-grade predictor variables when the actual suspensions could have occurred before the eighth grade). However, this is a basic limitation of the ECLS-K data set, and others have used the suspension variable in a similar manner (e.g., Morgan et al., 2019).
Another potential issue is that the main predictor of interest, PPB, is teacher reported, and teachers may be biased reporters of student behavior (Gilliam et al., 2016). For Black students, the risk of receiving a suspension is much higher, even as early as prekindergarten, not necessarily because of problem behaviors but because teachers may expect Black boys to misbehave more and thus watch them more closely (Gilliam et al., 2016). Gilliam et al.'s (2016) experimental findings are particularly important and challenge the notion of teachers as completely unbiased reporters. Beyond biased reporting, another experimental study showed that even when Black and White students commit the same infraction, teachers often issue harsher sanctions (i.e., differential treatment) to Black students (Okonofua & Eberhardt, 2015). Owens and McLanahan (2019) indicated that almost half of the racial suspension gap can be attributed to the differential treatment of Black and White children who enter school with the same behaviors. If there is bias against Black students in both the assessment of behaviors and the administration of suspensions, then controlling for a teacher-reported PPB measure could appear to fully explain the racial disparities in disciplinary sanctions, because part of that bias would be built into the PPB measure itself.
Given that the ECLS-K is publicly available and that Wright et al. (2014) indicated that "our results await replication" (p. 263), we reanalyzed Wright et al.'s original findings and tested alternative model specifications in which (a) models used the same samples instead of shifting samples and (b) a measure of PPB was constructed excluding ATL. Additional models were tested that (a) used multiple imputation to account for missing data; (b) used the more proximal measure of fifth-grade problem behaviors, which is still a measure of PPB but should be stronger; (c) used parent-reported PPB; and (d) used externalizing behavior only as a measure of PPB. Parent-reported PPB may address some of the issues with teacher-reported PPB. Parents may also be biased reporters (e.g., they may be more positive about a child's ability), but they provide an alternative perspective on PPB.
Methods
Sample
The current study used data from the ECLS-K. The ECLS-K surveyed a nationally representative cohort of kindergarteners in the United States in the 1998–1999 school year and followed the sample until the eighth grade. The ECLS-K used a multistage probability design and sampled kindergarteners from both public and private schools. Information on the school, classroom, and home environments was obtained from school administrators, students, and parents as well as through observations in the schools. For the replication study (as in the original study), analyses were restricted to students in public schools. For reproducibility and transparency, we used the ECLS-K public-use data files, which are available online from the National Center for Education Statistics (NCES). Syntax files for data management and analyses are available from the author's website.
Data Management and Variable Recoding
We followed the data management procedures described in Wright et al. (2014; for descriptive statistics, see Table 1). The ECLS-K analytic sample consisted only of White or Black (RACE = 1 or RACE = 2) students who attended public schools in the eighth grade (S7PUPRI = 1). In the eighth grade, parents were asked whether their child had ever had an in- or out-of-school suspension (P7SUSPND; 1 = yes, 0 = no), which served as the primary outcome. The predictor of interest, PPB, was created from the teacher-reported Social Skills Rating System (SSRS; Gresham & Elliott, 1990). Each SSRS item was rated on a 4-point Likert scale (1 = never exhibits this behavior, 4 = very often/exhibits behavior most of the time). Each of the four teacher-reported SSRS scales (i.e., self-control [TxCONTRO], interpersonal skills [TxINTERP], approaches to learning [TxLEARN], and externalizing problem behaviors [TxEXTERN]) was the average of four to seven items, depending on the grade and subscale, where x denotes the survey wave (x = 1 for fall of kindergarten, x = 4 for spring of first grade, and x = 5 for spring of third grade). The first three scales were reverse-coded so that higher scores reflected worse behavior. The four scales were summed within each wave, and the kindergarten, spring first-grade, and spring third-grade totals were then summed and divided by 3 (i.e., averaged across the 3 years). Wright et al. indicated that they used a simple sum of the scales, with reverse-coding where appropriate. However, we were unable to obtain the mean score reported in their article (M = 12.67, SD = 1.83), although averaging the scores over the 3 years produced a similar spread of scores, as judged by the standard deviation (M = 7.18, SD = 1.79).
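The composite described above can be sketched in Python as follows. This is a minimal illustration, not the syntax distributed with this article: it assumes each subscale score is already the item average on the 1-4 SSRS metric, that reverse-coding on that metric is 5 - x, and that a missing subscale makes the whole composite missing (which is what listwise deletion then does to the student). The helper names and the `scales`, `reverse`, `waves`, and `prefix` parameters are hypothetical.

```python
# Hypothetical sketch of the PPB composite: reverse-code positively worded
# scales, sum the scales within each wave, then average the wave totals.
ALL_SCALES = ("CONTRO", "INTERP", "LEARN", "EXTERN")
NO_ATL_SCALES = ("CONTRO", "INTERP", "EXTERN")   # variant that drops ATL
TEACHER_REVERSE = {"CONTRO", "INTERP", "LEARN"}  # higher raw score = better behavior
TEACHER_WAVES = (1, 4, 5)                        # fall K, spring G1, spring G3


def reverse_1_to_4(score):
    """Reverse-code a score on the 1-4 SSRS metric (1 <-> 4, 2 <-> 3)."""
    return 5 - score


def ppb_composite(record, scales=ALL_SCALES, reverse=TEACHER_REVERSE,
                  waves=TEACHER_WAVES, prefix="T"):
    """Return the mean across waves of the within-wave sum of subscales.

    Variable names are built as prefix + wave + scale (e.g., "T1CONTRO").
    Returns None if any required subscale is missing.
    """
    wave_totals = []
    for wave in waves:
        total = 0.0
        for scale in scales:
            value = record.get(f"{prefix}{wave}{scale}")
            if value is None:
                return None
            total += reverse_1_to_4(value) if scale in reverse else value
        wave_totals.append(total)
    return sum(wave_totals) / len(wave_totals)


# Hypothetical student: every positively worded scale = 2.0, externalizing = 1.5.
student = {f"T{w}{s}": (1.5 if s == "EXTERN" else 2.0)
           for w in TEACHER_WAVES for s in ALL_SCALES}
print(ppb_composite(student))                        # 3 + 3 + 3 + 1.5 per wave
print(ppb_composite(student, scales=NO_ATL_SCALES))  # 3 + 3 + 1.5 per wave
```

Passing `scales=NO_ATL_SCALES` gives the ATL-free variant discussed earlier. Under the same 1-4 assumption, the parent-reported measure described later would correspond to `prefix="P"`, `waves=(1, 4)`, `scales=("CONTRO", "SOCIAL", "IMPULS")`, and `reverse={"CONTRO", "SOCIAL"}`. Dropping ATL can also change which students have a complete composite, and therefore who survives listwise deletion.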
Table 1. Descriptive Statistics (N = 4,360)
Note. IEP = Individualized Education Program; FL = free lunch; PPB = prior problem behaviors; ATL = approaches to learning.
Covariates
All variables were coded so that higher scores indicated a greater likelihood of being suspended. An eighth-grade parent-reported delinquency measure was included; it was the average of three items: whether the child steals (P7STEALS), often lies or cheats (P7CHEATS), and fights (P7FIGHTS). Response options were 1 = not true, 2 = somewhat true, and 3 = certainly true. A parent-reported measure of the child's grades (P7SCHGRD), where 1 = mostly As and 5 = mostly Fs, was included. Student race (RACE; 0 = White, 1 = Black), gender (GENDER; 0 = female, 1 = male), and whether the child had an Individualized Education Program (IEP) in Grade 5 (U6RIEP; 1 = yes) were included. Socioeconomic status (SES) was measured using two variables: level of parental education (reverse-coded W8PARED; 1 = doctoral degree, 9 = eighth-grade education or below) and household poverty level (W8POVRTY; 1 = below the poverty threshold, 0 = at or above the poverty threshold).
School-level covariates included school enrollment size (S7ENRLS), the percentage of students eligible for free lunch (S7FLCH_I), and the percentage of Black students enrolled at the school (S7BLKPCT). School enrollment was coded from 1 to 5, where 1 = 0–149 students and 5 = 750 or more students. The percentage eligible for free lunch ranged from 0 to 95. The percentage of Black students enrolled was coded from 1 = <1% to 5 = 25% or more. In addition, parents' opinions of the school formed a four-item scale: whether the parent thought the child's school was good (P7GOOD), whether the school emphasized learning (P7LEARNG), whether the school had an alcohol or drug problem (P7DRUGS), and whether the school had a problem with violence (P7VIOLNC). Response options ranged from 1 = strongly agree to 5 = strongly disagree. The latter two items were reverse-coded so that higher scores indicated greater school problems. Scores were then averaged to form what Wright et al. (2014) referred to as a "bad" school scale.
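The two parent-reported scales above can be sketched as simple item averages. This is a hypothetical illustration under the coding stated in the text (delinquency items on 1-3; school-opinion items on 1-5, with the two problem items reversed as 6 - x); the helper names are not from the article's syntax, and a missing item is assumed to make the scale missing.

```python
# Hypothetical sketch of the delinquency and "bad" school scales.

def reverse_1_to_5(score):
    """Reverse-code a 1-5 agreement item (1 <-> 5, 2 <-> 4)."""
    return 6 - score


def mean_or_none(values):
    """Item average, or None if any item is missing."""
    if any(v is None for v in values):
        return None
    return sum(values) / len(values)


def delinquency(record):
    # Items coded 1 = not true, 2 = somewhat true, 3 = certainly true.
    return mean_or_none([record.get("P7STEALS"), record.get("P7CHEATS"),
                         record.get("P7FIGHTS")])


def bad_school(record):
    # Items coded 1 = strongly agree ... 5 = strongly disagree. Disagreeing that
    # the school is good or emphasizes learning already scores high; the drug
    # and violence items are reversed so that agreeing a problem exists also
    # scores high.
    items = [record.get("P7GOOD"), record.get("P7LEARNG"),
             record.get("P7DRUGS"), record.get("P7VIOLNC")]
    if any(v is None for v in items):
        return None
    good, learning, drugs, violence = items
    return mean_or_none([good, learning,
                         reverse_1_to_5(drugs), reverse_1_to_5(violence)])


# Hypothetical parent responses.
parent = {"P7STEALS": 1, "P7CHEATS": 1, "P7FIGHTS": 2,
          "P7GOOD": 1, "P7LEARNG": 2, "P7DRUGS": 4, "P7VIOLNC": 5}
print(delinquency(parent), bad_school(parent))
```

Here a parent who strongly agrees that the school is good and strongly disagrees that it has a violence problem contributes low values on both items, so the scale stays near its "good school" end.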
Analytic Strategy
Given the binary outcome, we used logistic regression with cluster-robust standard errors to account for students nested within schools. Wright et al. (2014) reported clustering at the classroom level, but it is not clear which classroom this refers to: students had different reading and math/science classrooms, and students within the same classroom could have had different teachers from one another. Following Cameron and Miller (2015), we instead clustered at the higher (school) level, which is reasonable given the number of school-level predictors in the models.
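For readers unfamiliar with cluster-robust standard errors, the following numpy sketch fits a logistic regression by Newton-Raphson and then forms the sandwich covariance, summing each student's score contribution within his or her cluster (here, a school) before taking outer products. It runs on simulated data and is not the estimation code used in this article; the G/(G - 1) finite-cluster factor is one common adjustment, and statistical packages may apply slightly different corrections.

```python
import numpy as np


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-z))


def logit_cluster_robust(X, y, groups, max_iter=50, tol=1e-10):
    """Logistic regression with cluster-robust (sandwich) standard errors.

    X: (n, k) design matrix including an intercept column.
    y: (n,) 0/1 outcome.  groups: (n,) cluster labels (e.g., school IDs).
    Returns (coefficients on the log-odds scale, cluster-robust SEs).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    k = X.shape[1]
    beta = np.zeros(k)
    for _ in range(max_iter):  # Newton-Raphson on the log-likelihood
        p = expit(X @ beta)
        info = X.T @ (X * (p * (1.0 - p))[:, None])  # Fisher information
        step = np.linalg.solve(info, X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = expit(X @ beta)
    bread = np.linalg.inv(X.T @ (X * (p * (1.0 - p))[:, None]))
    labels = np.unique(groups)
    meat = np.zeros((k, k))
    for g in labels:
        in_g = groups == g
        score_g = X[in_g].T @ (y[in_g] - p[in_g])  # score summed within cluster
        meat += np.outer(score_g, score_g)
    n_clusters = len(labels)
    vcov = n_clusters / (n_clusters - 1) * bread @ meat @ bread
    return beta, np.sqrt(np.diag(vcov))


# Simulated example: 200 schools x 20 students, with a school effect that
# makes outcomes correlated within schools. All values are hypothetical.
rng = np.random.default_rng(2019)
school = np.repeat(np.arange(200), 20)
school_effect = rng.normal(0.0, 0.5, 200)[school]
black = rng.binomial(1, 0.15, school.size)
suspended = rng.binomial(1, expit(-2.0 + 0.7 * black + school_effect))
X = np.column_stack([np.ones(school.size), black])

beta, se = logit_cluster_robust(X, suspended, school)
or_black = np.exp(beta[1])
ci_low, ci_high = np.exp(beta[1] + np.array([-1.96, 1.96]) * se[1])
print(f"OR = {or_black:.2f} [{ci_low:.2f}, {ci_high:.2f}]")
```

The odds ratio and its 95% interval are obtained by exponentiating the coefficient and its Wald limits, which is how ORs with bracketed confidence intervals are typically reported.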
To be consistent with the original analyses, we also did not use the NCES-provided weights and used listwise deletion in the replication. Weights are used to generalize from the sample to the population, but the Wright et al. (2014) article makes no mention of weights. Although the use of sampling weights may be debated (Gelman, 2007; Lohr, 2007), unweighted results describe only the sample itself, which is generally not of interest, and do not generalize beyond it. In any case, the ECLS-K manual states that the "sample is not representative of children in eighth grade, classrooms, or schools" (p. xxx), and the initial plan of the ECLS-K did not include the extension to the eighth grade (p. 4-23).
Although we attempted to fully replicate Wright et al.'s (2014) findings using the restricted-use ECLS-K data file, we could not precisely reproduce them with the variables described, although the results were similar. We therefore used the public-use version of the ECLS-K to aid the transparency and reproducibility of findings. In the public-use data set, two variables (i.e., whether the student had an IEP in the eighth grade and the percentage of students eligible for reduced-price lunch at the school level) were either unavailable or not coded in the same manner, but this was not a concern for several reasons. First, in all of Wright et al.'s models, neither variable predicted suspension (all ps > .05). Second, although the public-use file has no eighth-grade IEP measure, we used the fifth-grade IEP measure, which was available and was correlated with the eighth-grade IEP measure (r = .68). Third, Wright et al. created a combined school-level measure of free- and reduced-price lunch eligibility, but in the public-use file free-lunch eligibility is a continuous measure and reduced-price-lunch eligibility is categorical (i.e., 1–5), which makes combining the two challenging. Instead, we used the continuous free-lunch-eligibility measure alone, which was highly correlated with the combined free- and reduced-price-lunch measure (r = .94). Finally, we did not include a measure of eighth-grade teacher race, because teacher race makes less sense as a predictor when middle schoolers often have multiple teachers rather than one. In Wright et al.'s analyses, only one teacher race variable was used even though the ECLS-K reported two teachers per child.
In addition, only Black and White teachers were included in Wright et al.'s analysis, which meant that students taught by teachers of other races (although few in number) were excluded. Teacher race was also not predictive of suspension in Wright et al.'s article. Moreover, including eighth-grade teacher race assumes that suspensions occurred in the eighth grade, which is not necessarily the case.
We first attempted to replicate Models 1 and 2 in Wright et al.'s (2014) article. Model 1 included all student- and school-level variables but not the PPB measure; Model 2 then added PPB. If the relationship between suspension and PPB were statistically significant and the original coefficient for race ceased to be meaningful, this would suggest that PPB accounted for the differences in suspension rates (see Skiba & Williams, 2014). Because sample selection may be driving the results, we compared the characteristics of students excluded from the final model (i.e., leavers) with the overall sample and with those included in the final model (i.e., stayers). After reasonably reproducing the original findings, we tested alternative models in which the sample remained consistent between Models 1 and 2 and a modified PPB measure was used.
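The difference between shifting and consistent samples can be made concrete with a small sketch. The rows below are toy, hypothetical records, and the abbreviated variable lists stand in for the full covariate set. The point is only that listwise deletion applied separately to each model's variables yields different samples, whereas deleting on the union of Model 2's variables fixes a single sample for both models and identifies the leavers.

```python
# Hypothetical, abbreviated variable lists standing in for the full models.
MODEL1_VARS = ("suspended", "black", "male")
MODEL2_VARS = MODEL1_VARS + ("ppb",)


def complete_cases(rows, variables):
    """Listwise deletion: keep rows with no missing (None) value on `variables`."""
    return [r for r in rows if all(r.get(v) is not None for v in variables)]


def split_samples(rows):
    """Return (shifting Model 1 sample, stayers, leavers)."""
    model1 = complete_cases(rows, MODEL1_VARS)   # Model 1 as originally fit
    stayers = complete_cases(rows, MODEL2_VARS)  # Model 2 sample; refit Model 1 here too
    stayer_ids = {r["id"] for r in stayers}
    leavers = [r for r in model1 if r["id"] not in stayer_ids]
    return model1, stayers, leavers


def suspension_rate(sample):
    return sum(r["suspended"] for r in sample) / len(sample)


# Toy records; None marks a missing value.
rows = [
    {"id": 1, "suspended": 1, "black": 1, "male": 1, "ppb": None},
    {"id": 2, "suspended": 0, "black": 0, "male": 1, "ppb": 7.0},
    {"id": 3, "suspended": 0, "black": 0, "male": 0, "ppb": 6.5},
    {"id": 4, "suspended": 1, "black": 1, "male": 0, "ppb": None},
    {"id": 5, "suspended": 0, "black": None, "male": 0, "ppb": 8.0},
]
model1, stayers, leavers = split_samples(rows)
print(len(model1), len(stayers), len(leavers))             # 4 2 2
print(suspension_rate(leavers), suspension_rate(stayers))  # 1.0 0.0
```

Fitting the Model 1 covariates on `stayers` is the consistent-sample comparison; comparing `leavers` with `stayers` on each covariate is the selection check.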
Given the issues in comparing coefficients across logistic regression models (LRMs; Mood, 2010), we also provide results using linear probability models (LPMs; see Appendix available on the journal website). Unlike coefficients from linear models, which can be compared across models, odds ratios (ORs) are not directly comparable, even between models estimated on the same sample, because LRM coefficients also reflect unobserved heterogeneity (e.g., including a predictor that is uncorrelated with the independent variable [IV] may change the OR of the IV). Standard errors for coefficients in LRMs are also not expected to decrease with the inclusion of relevant covariates, unlike in linear regression models (Robinson & Jewell, 1991). LPMs, however, are a viable alternative to LRMs when the predictor of interest is binary (Huang, 2019) and are not subject to the issues associated with LRMs (Breen et al., 2018; Mood, 2010).
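The noncollapsibility of the OR can be seen in a small worked example. Assume, hypothetically, a logistic model for a binary outcome with a binary predictor X and a second binary predictor Z that is independent of X and equally likely to be 0 or 1. Because Z is independent of X, it does not confound X; yet the OR for X depends on whether Z is conditioned on, whereas the risk difference averaged over Z equals the marginal risk difference. The coefficients below are arbitrary illustrative values.

```python
import math


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    return p / (1.0 - p)


# Hypothetical true model: logit P(Y = 1) = b0 + b_x*X + b_z*Z,
# with Z ~ Bernoulli(0.5) independent of X (so Z is not a confounder).
b0, b_x, b_z = -2.0, 1.0, 2.0


def p_given(x, z):
    return expit(b0 + b_x * x + b_z * z)


def p_marginal(x):
    # Average over Z, whose distribution is the same for X = 0 and X = 1.
    return 0.5 * p_given(x, 0) + 0.5 * p_given(x, 1)


conditional_or = math.exp(b_x)  # OR for X within either level of Z
marginal_or = odds(p_marginal(1)) / odds(p_marginal(0))

# Risk differences (the scale an LPM works on) are collapsible here.
conditional_rds = [p_given(1, z) - p_given(0, z) for z in (0, 1)]
average_conditional_rd = sum(conditional_rds) / 2
marginal_rd = p_marginal(1) - p_marginal(0)

print(f"conditional OR = {conditional_or:.2f}, marginal OR = {marginal_or:.2f}")
print(f"average conditional RD = {average_conditional_rd:.3f}, "
      f"marginal RD = {marginal_rd:.3f}")
```

Running it prints a conditional OR of 2.72 and a marginal OR of 2.23, while both risk differences are about 0.190. By the same logic, in a linear probability model the coefficient on X does not change when a predictor uncorrelated with X is added.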
Alternative Model Specifications
After replicating the original findings and testing the issues suggested by Huang (2018), additional model specifications were investigated.
Using multiple imputation
Using listwise deletion when data are not missing completely at random has been shown to produce biased estimates (Acock, 2005). To account for the missing data, we used multiple imputation (Rubin, 2004). The analytic sample included all students (N = 4,918) who were not missing the outcome variable. We used the jomoImpute function in the mitml package (Grund et al., 2019) together with the jomo package (Quartagno & Carpenter, 2019) to impute missing data at both the student and school levels. Following the guidelines of Allison (2012) and Bodner (2008), we imputed 40 data sets, imputed individual items and created the composite scales after imputation, and used auxiliary variables (Enders, 2017; e.g., parent-reported SSRS measures and Grade 5 SSRS measures) to aid the imputation.
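The pooling step that follows multiple imputation can be sketched with Rubin's rules: average the per-data-set estimates, then combine the average within-imputation variance with the between-imputation variance. This is a generic illustration, not the routine used in this study. It assumes each estimate is on an approximately normal scale (e.g., log-odds, exponentiated to an OR only after pooling), omits the degrees-of-freedom adjustment, and uses three hypothetical imputations rather than 40.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed data sets with Rubin's rules.

    estimates: per-imputation point estimates (e.g., log-odds coefficients).
    variances: per-imputation squared standard errors.
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    q_bar = mean(estimates)           # pooled point estimate
    u_bar = mean(variances)           # average within-imputation variance
    b = variance(estimates)           # between-imputation variance (m - 1 denominator)
    total = u_bar + (1.0 + 1.0 / m) * b
    return q_bar, math.sqrt(total)


# Hypothetical log-odds estimates for one predictor from three imputations.
log_or = [0.40, 0.45, 0.50]
squared_se = [0.01, 0.01, 0.01]
estimate, pooled_se = pool_rubin(log_or, squared_se)
print(f"pooled OR = {math.exp(estimate):.2f}, SE(log OR) = {pooled_se:.3f}")
```

The between-imputation term is what inflates the standard error relative to a single completed data set, reflecting uncertainty about the missing values.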
Using Grade 5 PPB
PPB could also be constructed from fifth-grade measures, which were available in the ECLS-K and were more proximal to the eighth grade; Wright et al.'s (2014) analysis used only data from kindergarten to third grade. Measurements closer together in time tend to be more highly correlated than those further apart (e.g., under a first-order autoregressive structure), so fifth-grade measures should be more predictive of eighth-grade outcomes than measures taken in kindergarten. In addition, differential attrition could be lessened because the time points are closer together (i.e., a sample spanning fifth to eighth grade vs. one spanning kindergarten to eighth grade), resulting in a larger sample and more power to detect effects. For the fifth-grade measure, we summed the reverse-coded self-control (T6CONTRO) and interpersonal skills (T6INTERP) scales and the externalizing problems scale (T6EXTERN).
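The autoregressive reasoning can be written out. Under a first-order autoregressive structure with a hypothetical year-to-year correlation ρ, the correlation between measurements t and s years apart decays geometrically; with an illustrative ρ = 0.8, the 3-year lag from spring of Grade 5 to spring of Grade 8 implies a noticeably stronger correlation than the 5-year lag from spring of Grade 3:

```latex
\operatorname{corr}(y_t, y_s) = \rho^{|t-s|}, \qquad
\rho = 0.8:\quad
\underbrace{0.8^{3} = 0.512}_{\text{Grade 5} \to \text{Grade 8}}
\quad \text{vs.} \quad
\underbrace{0.8^{5} \approx 0.33}_{\text{Grade 3} \to \text{Grade 8}}
```

The value of ρ is purely illustrative and is not estimated from the ECLS-K; the point is only that, under this structure, a shorter lag implies a stronger correlation.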
Parent-reported PPB
Because teachers may be biased reporters (Gilliam et al., 2016), one alternative is to use the parent-reported SSRS measures in the ECLS-K. Wright et al. (2014) indicated that they did not use these measures because they were available only in kindergarten and first grade; however, they did use teacher-reported measures from the same period, so this should not be an issue. We created a parent-reported PPB measure from three SSRS scales: self-control (PxCONTRO), social interactions (PxSOCIAL), and impulsiveness (PxIMPULS). The first two scales were reverse-coded. Measures were taken in the fall of kindergarten (x = 1) and the spring of first grade (x = 4). All the scales were summed and divided by 2 (i.e., averaged across the two waves).
Externalizing behaviors PPB
Because externalizing behaviors are actual manifestations of problem behaviors, we used the teacher-reported externalizing problem behaviors as a measure of PPB. We used the average of the three externalizing problem behavior variables (TxEXTERN) measured in the fall of kindergarten (x = 1), the spring of first grade (x = 4), and the spring of third grade (x = 5).
Results
Reproducing Wright et al.’s (2014) Analysis
In the analytic sample (N = 4,360), 33% of Black students (n = 188) were suspended compared with 12% of White students (n = 443). Without any covariates, the OR for Black students being suspended relative to White students was 3.71 (not shown). In Model 1 (with covariates; see Table 2), the OR for Black students was 1.92, p < .001. In Model 2, after PPB was included, the sample size dropped to n = 2,892 (compared with nWright = 2,737), and ORBlack fell to 1.18, p = .49. These results were similar to Wright et al.'s (2014), where the corresponding ORs were 3.78, 1.89, and 1.20. The OR for PPB in Model 2 was 1.31, p < .001, compared with ORWright = 1.30. As in the original analyses, once PPB was included in the model, the coefficient for Black students ceased to be statistically significant.
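As a check on the unadjusted OR, it can be recomputed from the reported counts. This sketch assumes that the 571 Black students reported for Model 1 in the comparison of stayers and leavers make up the Black portion of the N = 4,360 analytic sample, leaving 3,789 White students. These denominators are inferred rather than stated in this paragraph, but they reproduce the 33%, 12%, and 3.71 reported here.

```python
# Counts: 188 of 571 Black students and 443 of 3,789 White students suspended.
# The group totals (571 and 4,360 - 571) are inferred, as noted above.
BLACK_SUSPENDED, BLACK_N = 188, 571
WHITE_SUSPENDED, WHITE_N = 443, 4360 - 571


def odds_ratio(events_1, n_1, events_0, n_0):
    """OR from a 2x2 table: events_1 of n_1 in group 1, events_0 of n_0 in group 0."""
    return (events_1 / (n_1 - events_1)) / (events_0 / (n_0 - events_0))


p_black = BLACK_SUSPENDED / BLACK_N
p_white = WHITE_SUSPENDED / WHITE_N
or_black = odds_ratio(BLACK_SUSPENDED, BLACK_N, WHITE_SUSPENDED, WHITE_N)
risk_ratio = p_black / p_white
gap_points = 100 * (p_black - p_white)
print(f"{p_black:.0%} vs {p_white:.0%}; OR = {or_black:.2f}; "
      f"difference = {gap_points:.0f} points")
```

Running it prints `33% vs 12%; OR = 3.71; difference = 21 points`. The corresponding risk ratio is about 2.8, a reminder that an OR of 3.71 does not mean that Black students were 3.71 times as likely to be suspended.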
Table 2. Logistic Regression Results as Odds Ratios With 95% Confidence Intervals (in Brackets) Predicting Suspension
Note. All models used cluster-robust standard errors. IEP = Individualized Education Program; FL = free lunch; PPB = prior problem behaviors.
*p < .05. **p < .01. ***p < .001.
Testing Alternative Model Specifications
The simplest test is to estimate Model 1 on the same sample used in Model 2. Results, shown in Model 3 (see Table 2), indicate that even before PPB was included, the coefficient for Black students was already not statistically significant (OR = 1.40, p = .14). In short, the lack of statistical significance in Model 2 was not due to including PPB but resulted from the particular selection of students in the sample (i.e., the exclusion of eighth graders who were missing any one of the SSRS measures in kindergarten, first grade, or third grade).
These findings are similar when a linear probability model is used (see Appendix available on the journal website). In the model without PPB (n = 2,892), Black students' probability of being suspended was 5 percentage points higher than that of White students (see Model 3A LPM). Once PPB was included, the gap was approximately 4 percentage points, a change of approximately 1 percentage point (see Model 2A LPM).
Comparing Samples
Because the original Wright et al. (2014) results may be driven heavily by sample selection, we compared the students used in Model 2 (i.e., the stayers, who had a PPB measure) with those excluded from the analysis because they were missing a PPB measure (i.e., the leavers). Wright et al. (p. 262) noted that their initial model had 527 Black students and their second model just 289, excluding 45.1% of the Black students. In the present reanalysis, the corresponding counts were 571 and 311 Black students, a similar exclusion rate of 45.5%. In the overall analytic sample, 14.5% were Black (see Table 1); however, 17.7% of the leavers were Black versus 10.8% of the stayers, indicating that Black students were disproportionately excluded. Overall, students who were missing a PPB measure (and were thus excluded) were more likely to be Black and male, to come from households below the poverty threshold, to have poorer grades and higher levels of delinquency, and to attend schools with a higher percentage of Black students (see Table 3). Excluded students also had a higher suspension rate (18%) than those included in the PPB analyses (13%).
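The effect sizes in Table 3 are Cohen's d values. A minimal sketch of d with a pooled standard deviation follows. The pooled-SD form is an assumption, since the exact variant used for the table is not stated, and the two groups are hypothetical 0/1 suspension indicators rather than ECLS-K data.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled SD (n - 1 denominators)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Hypothetical suspension indicators for leavers and stayers.
leaver_flags = [1, 0, 0, 1, 0]
stayer_flags = [0, 0, 1, 0, 0]
d = cohens_d(leaver_flags, stayer_flags)
print(f"d = {d:.2f}")
```

A positive d here means the first group (leavers) has the higher mean; the sign convention in a published table depends on which group is subtracted from which.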
Table 3. Comparison of Means (SD) of the Samples in Model 1 Versus Model 2 (Stayers vs. Leavers)
Note. d = Cohen’s d as a measure of effect size. IEP = Individualized Education Program; FL = free lunch; PPB = prior problem behaviors.
*p < .05. **p < .01. ***p < .001.
Testing the Modified PPB
The next specification (Model 4) used a PPB measure that excluded ATL. On the same sample as Model 2, this modified measure was a stronger predictor (OR = 1.44, p < .001), and race was again not a statistically significant predictor (ORBlack = 1.16, p = .53), which is not surprising given that the coefficient for race in Model 3 was already not statistically significant.
Alternative Model Results
Three additional model specifications, using multiply imputed data sets, more proximal teacher-reported fifth-grade PPB, and parent-reported PPB, all showed that the coefficient for race remained statistically significant even after accounting for PPB. These findings indicate that the Wright et al. (2014) results were driven mainly by the specific sample used and not by PPB. Although problem behaviors are extremely important to account for and are a large contributor to suspensions (Huang & Cornell, 2018), they do not completely explain the racial differences in suspension, as originally claimed.
Using multiple imputation
The first alternative model specification tested used 40 multiply-imputed data sets where the sample was n = 4,918 (see Table 4). In Model 5, the ORBlack = 1.82, p < .001; and in Model 6, once PPB was included, the ORBlack = 1.56, p < .01. In this instance, the coefficient for Black was still statistically significant, suggesting that Black students still had a higher likelihood of being suspended even after the measure of PPB was included (OR = 1.28, p < .001).
Table 4. Logistic Regression Odds Ratios With 95% Confidence Intervals (in Brackets) Using Multiple Imputation (Models 5 and 6) and Grade 5 PPB (Models 7 and 8) Predicting Suspension
Note. All models used cluster-robust standard errors. IEP = Individualized Education Program; FL = free lunch; PPB = prior problem behaviors.
*p < .05. **p < .01. ***p < .001.
Using Grade-5 PPB
The next models used the more proximal fifth-grade PPB (excluding ATL). The sample was larger (n = 3,946) because the PPB measure required only fifth-grade data rather than data from kindergarten through third grade (multiple imputation was not used). Results in Model 7 were consistent with prior models (see Table 4), and Model 8 also showed a strong relationship between PPB and suspension (OR = 1.44). However, the coefficient for Black students in Model 8 also remained statistically significant (OR = 1.67, p < .01).
Parent-reported PPB
The next model specification (see Table 5) used the parent-reported SSRS in kindergarten and first grade (n = 3,773). The OR for Black students was 1.91, p < .001, in Model 9 and 1.90, p < .001, in Model 10. The OR for prior problem behaviors was 1.22, p < .001. Even after the PPB measure was included, the Black coefficient remained largely unchanged.
Table 5. Logistic Regression Odds Ratios With 95% Confidence Intervals (in Brackets) Using Parent-Reported PPB (Models 9 and 10) and Externalizing Behaviors as PPB (Models 11 and 12) Predicting Suspension
Note. All models used cluster-robust standard errors. IEP = Individualized Education Program; FL = free lunch; PPB = prior problem behaviors.
*p < .05. **p < .01. ***p < .001.
Externalizing behaviors as PPB
The final models, which used the average of externalizing behaviors as a proxy for PPB, showed that the OR for Black students in Model 11 (without PPB) was 1.52 (p = .06) and was reduced (OR = 1.28, p = .28) after PPB was added to the model (see Models 11 and 12). The OR for externalizing behavior as a measure of PPB was 2.75 (p < .001).
In the first three sets of alternative models, the coefficients for race remained statistically significant, indicating that PPB, regardless of how it was measured, did not fully account for the discrepancies in suspension rates. These results held when missing data were imputed, when fifth-grade PPB was used, and when parent reports replaced possibly biased teacher reports. In the last set of models, which used externalizing behavior as the PPB measure, the race coefficient was already much smaller than in the other models, and not statistically significant (p = .06), even at baseline.
Discussion and Conclusion
Although Wright et al.’s (2014) analyses of the ECLS-K suggested that the disparities in suspension rates could be attributed to the differential behavior between Black and White students, our reanalysis, using the public-use version of the ECLS-K, shows otherwise. Once the same sample was used in comparing models—one model without PPB and one model with PPB—findings showed that race was already not a meaningful predictor of suspensions prior to the inclusion of PPB in the model. Further analyses indicated that Wright et al.’s findings were driven primarily by sample selection bias. Additional investigation—using multiple imputation, a modified PPB measure using fifth-grade teacher reports, externalizing behavior as PPB, as well as a PPB based on parent reports—showed that disparities in suspension rates based on race could not be fully explained by PPB.
The shifting coefficients in the baseline models suggest that the samples differed as a result of survivorship bias. Because the baseline models used exactly the same predictors, point estimates (from a linear probability model) should not change substantially across subsamples if there were no differential attrition. However, when the same model was estimated after merely excluding students from the sample (see Appendix Models 1A–3A available on the journal website), the Black coefficient fell from 10 to 5 percentage points, a difference of 5 points. In contrast, including PPB (Model 2A) with a consistent sample reduced the Black coefficient from 5 to 4 percentage points, a difference of 1 point. This highlights the importance of using the same analytic sample when comparing model results.
Although the disparities in suspension rates are large based on race alone (i.e., 12% of White students vs. 33% of Black students were suspended, a difference of 21 percentage points), once additional covariates were included in the model (see Appendix available on the journal website), the difference was effectively halved: In the model with all covariates (but without PPB), Black students were 10 percentage points (down from 21) more likely than White students to be suspended. Other studies that included relevant variables known to be related to suspension (e.g., gender, SES) have shown a similar reduction in disparities (Huang, 2018; Huang & Cornell, 2018). That Black students remained more likely to be suspended after controlling for all other variables in the model is not, by itself, evidence of racial bias. However, implicit bias (Ispa-Landa, 2018) and differential treatment (Okonofua & Eberhardt, 2015; Owens & McLanahan, 2019) may contribute to the racial disparities in school discipline. Even though PPB could not fully account for the differences in suspension rates, researchers should keep in mind that behavior plays a large role in the issuance of suspensions (Huang, 2018; Huang & Cornell, 2018; McCarthy & Hoge, 1987; Wright et al., 2014; Wu et al., 1982). Problem behaviors are important to account for, but PPB does not fully explain race-based disparities in suspensions.
Supplemental Material
Appendix_LPM – Supplemental material for Prior Problem Behaviors Do Not Account for the Racial Suspension Gap
Supplemental material, Appendix_LPM for Prior Problem Behaviors Do Not Account for the Racial Suspension Gap by Francis L. Huang in Educational Researcher
