Predicting Time to Reclassification for English Learners: A Joint Modeling Approach

Abstract

The development of academic English proficiency and the time it takes to reclassify to fluent English proficient status are key issues in English learner (EL) policy. This article develops a shared random effects model (SREM) to estimate English proficiency development and time to reclassification simultaneously, treating student-specific random effects as latent covariates in the time to reclassification model. Using data from a large Arizona school district, the SREM resulted in predictions of time to reclassification that were 93% accurate compared to 85% accuracy from a conventional discrete-time hazard model used in prior literature. The findings suggest that information about English-language development is critical for accurately predicting the grade an EL will reclassify.

Keywords

joint modeling, longitudinal data, classification accuracy

English learner (EL) students often face considerable academic challenges as they attempt to develop content knowledge in core academic subjects and English language proficiency in tandem (Council of Chief State School Officers, 2016; Linquanti & Cook, 2013; Pompa & Villegas, 2017). The Every Student Succeeds Act (ESSA) of 2015 defines ELs as students whose difficulties in speaking, reading, writing, or understanding the English language may prevent them from meeting challenging state academic standards or succeeding in a classroom where the language of instruction is English (ESSA regulatory guidance). These difficulties often mean that ELs have lower achievement and higher dropout rates than their academic peers (Kao & Thompson, 2003; Pompa & Villegas, 2017). Given ELs constitute a rapidly growing student subgroup in U.S. schools (Cook, Linquanti, Chinen, & Jung, 2012), educators and policymakers have made the education of ELs a major emphasis in federal, state, and local policy (Hopkins, Thompson, Linquanti, Hakuta, & August, 2013; Pompa & Villegas, 2017). For example, under ESSA, districts and schools are held accountable for how quickly ELs develop English proficiency, as well as their achievement in core subjects like mathematics and reading (Council of Chief State School Officers, 2016).

A pivotal moment in ELs’ schooling is when they are reclassified as “fluent English proficient” (Umansky & Reardon, 2014), which is the point at which the student is no longer considered an EL. EL status is associated with academic supports like modified instructional content, English-language development classes, instruction from specialized teachers, and annual assessment of English language proficiency (Council of Chief State School Officers, 2016; Pompa & Villegas, 2017; Umansky & Reardon, 2014). However, once students are deemed proficient enough in English to succeed in mainstream English classrooms, they are reclassified and those supports either diminish or disappear entirely, as do most of the federal funds earmarked for those supports (Council of Chief State School Officers, 2016; Pompa & Villegas, 2017; Robinson, 2011). Depending on when students are reclassified, the transition could either slow their development if the new material is overwhelming or accelerate it if they gain access to advanced material they are prepared to learn (Abedi, 2008). Factors influencing a student’s readiness can include the type of educational program they were enrolled in (Umansky & Reardon, 2014) and the student’s native language (Slama, 2014).

Reclassification can be based on several factors, but virtually all states use some combination of reading achievement test scores and English language proficiency test scores as criteria (Pompa & Villegas, 2017). Given policymakers usually set cut scores on these tests for the purposes of reclassifying ELs, state and local reclassification criteria can impact when an EL reclassifies and, thereby, whether that transition is appropriately timed. A growing body of evidence documents the consequences of reclassifying ELs before or after they are ready. Robinson (2011) found that, in the district he studied, reclassification had a null effect on reading achievement for students in elementary and middle grades but a significant negative effect for students in high school. This suggests no evidence that the cutoff was inappropriately set in younger grades, but that students in later grades did not appear to benefit from having EL services reduced or discontinued. Robinson-Cimpian and Thompson (2016) also looked at how changes in reclassification criteria over time impacted achievement and graduation rates. They found that higher reclassification benchmarks were associated with a 0.18 standard deviation (SD) increase in reading achievement and an 11 percentage point increase in graduation rates.

Given the importance of the decision to reclassify and its timing, several studies use discrete time survival analysis to model time-to-reclassification (Kieffer & Parker, 2016; Motamedi, Singh, & Thompson, 2016; Thompson, 2015b; Umansky & Reardon, 2014). From a policy standpoint, such models are useful in three primary ways. First, they can be used to estimate an average time to reclassification that helps educators know how fast ELs entering the school system can be reasonably expected to reclassify (Hakuta, Butler, & Witt, 2000; Motamedi et al., 2016). Related information can inform policies to hold schools accountable for the achievement and language development of ELs, including how to avoid policies that expect students to be reclassified in an unreasonable amount of time (Parrish, Perez, Merickel, & Linquanti, 2006). Second, such estimates can inform early warning systems designed to identify and support ELs who are not on track to reclassify in roughly the average amount of time (Kieffer & Parker, 2016; Motamedi et al., 2016). Third, these models can help evaluate the effects of different educational programs for ELs on time to reclassification (Umansky & Reardon, 2014). For example, Umansky and Reardon (2014) showed that ELs in dual language programs were reclassified at a slower pace than ELs in other programs but that those same students had higher overall reclassification rates and English proficiency by 12th grade.

While studies using discrete time survival analysis to model time to reclassification have made valuable contributions to EL practice and policy, the models generally do not account for the rate at which English proficiency develops for individual students. For example, most of the models reviewed included controls for initial English language proficiency in Kindergarten but did not model growth in that language proficiency over time. This omission could represent a shortcoming of the models, given reclassification determinations under state policy are largely a function of when a student is deemed proficient in English, which is in turn a function of how fast the student’s English develops (Cook et al., 2012; Pompa & Villegas, 2017; Ramsey & O’Day, 2010). In fact, research shows it is likely unrealistic for English proficiency to develop at the same rate across students (Cook, Boals, & Lundberg, 2011), suggesting that including a student’s unique developmental trajectory in the time-to-reclassification model would likely be highly predictive of reclassification.

One reason English-language development is not included in discrete time survival analysis models for reclassification is that the observed test scores used to measure language proficiency over time would be endogenous time-varying covariates measured with error (Kalbfleisch & Prentice, 2011), which can induce bias in the parameter estimates (Prentice, 1982). To address this technical issue, and the gap in the EL literature, we fit two discrete-time survival models akin to those from prior studies (including using controls for initial English proficiency) and two shared random effects models (SREMs). The SREMs jointly estimate a growth model based on longitudinal English language proficiency test scores and a discrete-time survival model for the reclassification data that includes the random effects estimated from the growth model as latent covariates. In doing so, we examined whether growth in English language proficiency can improve predictions of time to reclassification. After fitting the models, we compared them based on their ability to accurately predict the time at which a student reclassified. As we show in the study, the SREMs substantially improved rates of accurate classification of students as either remaining in EL status or being reclassified at a given point in time.

Joint Models for Repeated Measures Data and Discrete-Time Event Data

In research using cross-sectional data, covariate endogeneity for a fitted model exists when the covariance between the covariate and the residual does not equal zero (Angrist & Pischke, 2008). In the longitudinal context, covariate endogeneity becomes more complex and has been well studied in both the analysis of repeated measures data (Diggle, Zeger, Liang, & Heagerty, 2002) and the analysis of survival data (Kalbfleisch & Prentice, 2011). For survival analysis, a time-dependent variable is exogenous (referred to as external in the survival literature) if its process influences the rate of event occurrence over time, but its future path is not affected by the occurrence of the event (Kalbfleisch & Prentice, 2011). Such variables include defined covariates where the values are established in the study design and ancillary covariates where the stochastic process is outside of the individual under study. In an experimental setting, examples of exogenous time-dependent covariates include varying treatments determined prior to randomization.

A time-dependent variable is endogenous (referred to as internal in the survival literature) if its future path is affected by the event occurrence (Kalbfleisch & Prentice, 2011). These variables are typically the output of a stochastic process associated with the participant and therefore require their own statistical model (Kalbfleisch & Prentice, 2011). The SREM provides a useful approach for incorporating endogenous time-varying covariates into survival analysis (Kalbfleisch & Prentice, 2011; Rizopoulos & Lesaffre, 2014). Below, we provide the formulation of the SREM and review its application in research.

The SREM provides a robust framework for describing the association between one or more repeated measures outcomes and event times (see Guo & Carlin, 2004; Proust-Lima, Séne, Taylor, & Jacqmin-Gadda, 2014; Tsiatis & Davidian, 2004). For a single continuous repeated measure outcome and a single discrete event occurrence, the SREM can be understood through the specification of a submodel for each outcome that highlights their dependence structure. Start by expressing a mixed model for the repeated measures outcomes. Let there be $j = 1, 2, \dots, n$ students and $i = 1, 2, \dots, n_{j}$ longitudinal measures for student j. Define $y_{j}$ as an $n_{j} \times 1$ vector containing the longitudinal measures for student j. We can then specify the mixed model generally as

y_{j} = X_{j} β + Z_{j} ζ_{j} + ϵ_{j}

where $ϵ_{j} \sim N (0, σ I_{n_{j}})$ , $ζ_{j} \sim N (0, T)$ , and

T = \begin{bmatrix} τ_{11} & τ_{12} & \dots & τ_{1 q} \\ τ_{21} & τ_{22} & \dots & τ_{2 q} \\ \vdots & \vdots & \ddots & \vdots \\ τ_{q 1} & τ_{q 2} & \dots & τ_{q q} \end{bmatrix} .

For Equation 1, $X_{j}$ is a known $n_{j} \times p$ design matrix corresponding to the p × 1 vector of fixed effects β, $Z_{j}$ is a known n_j × q design matrix corresponding to the q × 1 vector of random effects, ζ_j, and $ϵ_{j}$ is an n_j × 1 vector of residuals.

Because reclassification can only occur at discrete time points, $t = 1, 2, \dots, T$ , a discrete-time hazard model can be fit to the reclassification data. Let there be $j = 1, 2, \dots, n$ students with a student’s so-called duration indicated by $T_{j}$ . Those students who do not experience reclassification by time T, or $T_{j} > T$ , are considered censored. Using grouped-time survival parametrization (Allison, 1982; D’Agostino et al., 1990; Hedeker, Siddiqui, & Hu, 2000; Singer & Willett, 1993), each student contributes a $T_{j} \times 1$ vector of dichotomous indicators of event status for each discrete time point, indicated as $r_{j}$ . Students who experience reclassification at time $T_{j} \leq T$ will have a vector of $T_{j} - 1$ zeros followed by a 1 indicating event occurrence. Students who are censored have a $T_{j} \times 1$ vector of zeros. Define $h_{t j} = ℙ [T_{j} = t | T_{j} \geq t] = ℙ [r_{t j} = 1 | r_{t k} = 0, k < t]$ as the probability of reclassification for student j at time t given reclassification had not occurred prior to time t.
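
The grouped-time coding described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code; the function name `event_indicators` and its arguments are hypothetical.

```python
from typing import List


def event_indicators(duration: int, reclassified: bool) -> List[int]:
    """Build the T_j x 1 vector r_j of event indicators for one student.

    duration     -- number of discrete time points the student was observed (T_j)
    reclassified -- True if the event occurred at the last observed time point,
                    False if the student was censored
    """
    if duration < 1:
        raise ValueError("duration must be at least 1")
    r = [0] * duration
    if reclassified:
        r[-1] = 1  # the event is recorded at the final observed time point
    return r


# A student reclassified at the third time point, and a student censored after five.
print(event_indicators(3, True))   # [0, 0, 1]
print(event_indicators(5, False))  # [0, 0, 0, 0, 0]
```

Stacking these vectors across students gives the person-period data set to which a binary regression with a logit or clog-log link can be fit.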

When fitting discrete event time data, one has the option of using a logit link function or a complementary log-log (clog-log) link function. Use of the clog-log link leads to the proportional hazard model while opting for the logit link leads to the proportional odds model. Only when the temporal distance between the discrete events becomes small can the estimates be interpreted on the hazard scale (Singer & Willett, 2003). For this article, we use the logit link function, and because reclassification is an annual event, the model is a proportional odds model. The log odds of reclassification for student j at time t can be expressed as

\log [h_{t j} / (1 - h_{t j})] = logit (h_{t j}) = {w^{'}}_{t j} α + ζ_{j}^{'} λ,

and the probability of reclassification as:

h_{t j} = 1 / (1 + \exp [- logit (h_{t j})]) = 1 / (1 + \exp [- ({w^{'}}_{t j} α + ζ_{j}^{'} λ)]),

where $w_{t j}$ is a known p × 1 vector of covariates that includes an indicator for each of the T discrete time points along with additional manifest covariates, and α is the corresponding p × 1 vector of fixed effects, containing T intercept terms (one per discrete time point) and the coefficients for the manifest covariates. Equations 1 and 3 are linked by specifying the student-specific random effects estimated by Equation 1 as latent covariates in Equation 3. That is, the q × 1 vector of random effects $ζ_{j}$ from Equation 1 is “shared” with Equation 3, where it is paired with the q × 1 vector of fixed effects λ.
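
As a rough numerical sketch of the inverse-logit expression above, the following Python computes a hazard from a linear predictor and then converts a sequence of hazards into the probability of reclassifying at each time point. The function names and all input values are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def inv_logit(x: float) -> float:
    """Numerically stable inverse logit, 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def hazard(w: Sequence[float], alpha: Sequence[float],
           zeta: Sequence[float], lam: Sequence[float]) -> float:
    """h_tj = inv_logit(w'alpha + zeta'lambda) for one student at one time point."""
    linear_predictor = sum(wi * ai for wi, ai in zip(w, alpha))
    linear_predictor += sum(zi * li for zi, li in zip(zeta, lam))
    return inv_logit(linear_predictor)


def event_time_probabilities(hazards: Sequence[float]) -> List[float]:
    """P(T_j = t) = h_t * prod_{s < t} (1 - h_s) for each time point t."""
    probs, still_el = [], 1.0
    for h in hazards:
        probs.append(h * still_el)   # reclassify now, having remained an EL so far
        still_el *= 1.0 - h          # probability of remaining an EL past this point
    return probs


# Hypothetical student: time-1 indicator on, intercept 0, random intercept 0.5, lambda 2.
print(hazard([1.0, 0.0], [0.0, 1.0], [0.5], [2.0]))  # inv_logit(1.0)
print(event_time_probabilities([0.5, 0.5]))          # [0.5, 0.25]
```

A positive λ means that students with larger random effects, for example higher latent initial proficiency, have higher odds of reclassifying at every time point.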

Assuming the repeated measures are independent from the reclassification indicators given the random effects, we can express the conditional probability density for the SREM as $g (y_{i j}, r_{t j} | x_{i j}, w_{t j}, ζ_{j})$ and the marginal distribution for student j as

\int h (ζ_{j}) \prod_{i = 1}^{n_{j}} g (y_{i j} | x_{i j}, ζ_{j}) \prod_{t = 1}^{T_{j}} g (r_{t j} | w_{t j}, ζ_{j}) \, d ζ_{j},

where $h (ζ_{j})$ is the density of the student-specific random effects.
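
To make the integral concrete, the sketch below approximates one student's marginal likelihood by Gauss-Hermite quadrature for the simplest case of a single random intercept (q = 1). This is an illustrative assumption only: the models in this article were estimated by Hamiltonian Monte Carlo, not quadrature, and the function name, argument names, and example values are hypothetical.

```python
import numpy as np


def marginal_likelihood(y, mu_y, r, eta_r, lam, tau, sigma2, n_nodes=30):
    """Approximate the marginal likelihood for one student with a random intercept.

    y, mu_y  -- observed scores and their fixed-effect means x'beta
    r, eta_r -- event indicators and their fixed-effect linear predictors w'alpha
    lam      -- coefficient on the shared random intercept in the hazard submodel
    tau      -- variance of the random intercept; sigma2 -- residual variance
    """
    y, mu_y = np.asarray(y, float), np.asarray(mu_y, float)
    r, eta_r = np.asarray(r, int), np.asarray(eta_r, float)

    # Nodes and weights for integrals of the form int exp(-x^2) f(x) dx.
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    zeta = np.sqrt(2.0 * tau) * nodes  # change of variable for zeta ~ N(0, tau)

    # Normal density of each score given each candidate value of zeta.
    resid = (y - mu_y)[:, None] - zeta[None, :]
    dens_y = np.exp(-0.5 * resid**2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)

    # Bernoulli probability of each event indicator given zeta.
    h = 1.0 / (1.0 + np.exp(-(eta_r[:, None] + lam * zeta[None, :])))
    dens_r = np.where(r[:, None] == 1, h, 1.0 - h)

    integrand = dens_y.prod(axis=0) * dens_r.prod(axis=0)
    return float(np.sum(weights * integrand) / np.sqrt(np.pi))


# Hypothetical student: two scores, reclassified at the second time point.
print(marginal_likelihood(y=[0.4, 1.1], mu_y=[0.0, 0.8], r=[0, 1],
                          eta_r=[-1.0, 0.0], lam=1.5, tau=1.0, sigma2=0.5))
```

When `lam` is 0 the integrand factorizes, so the two submodels carry no information about each other; this corresponds to the separate estimation of the growth and reclassification submodels.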

The SREM paradigm has been well developed in the statistics and biostatistics literatures. The first SREM was published by Wu and Carroll (1988) to deal with what Little (1995) termed latent variable dependent missingness. The use of SREMs to understand the association between a set of repeated measures and the time to some event was popularized by early HIV clinical trials (De Gruttola & Tu, 1994; Tsiatis, Degruttola, & Wulfsohn, 1995; Wulfsohn & Tsiatis, 1997). These studies aimed to understand how CD4 T-lymphocyte counts were associated with onset of AIDS or death for those subjects with HIV. Henderson, Diggle, and Dobson (2000) utilized an SREM to reanalyze the effect of drug therapy for schizophrenia patients while simultaneously accounting for attrition.

The SREM framework has been less commonly used in education research. Muthén and Masyn (2005) fit a latent class growth model to students’ aggressive behavior in Grades 1 and 2 that was used to predict the time to removal in Grades 3 through 7 using a discrete time survival process with a latent class frailty. Feldman and Rabe-Hesketh (2012) employed an SREM to understand if achievement trajectories were impacted by possibly nonrandom dropout in a large national data set. Their discrete-time hazard submodel included separate parameters for the random intercept and random slope. Estimates from the SREM and a competing model fit to the data assuming the missing data mechanism was ignorable were consistent, suggesting the data were not sensitive to missing data assumptions (Feldman & Rabe-Hesketh, 2012). Finally, Thum and Matta (2015a, 2015b) employed an SREM for longitudinal interim assessments between Grades 4 and 9, SAT and ACT scores between Grades 10 and 12, and a logistic regression model for the probability of taking a college test. The parameter estimates were then used to establish college readiness benchmarks for the interim assessment.

Method

Data

The data were from a single cohort of ELs tracked longitudinally from third grade in academic year 2007–2008 through seventh grade in 2011–2012 from a large urban school district in Arizona. During this time, Arizona implemented an English-only instruction law resulting in homogeneity of language programs across schools (Gándara & Orfield, 2010). The Grade 3 EL cohort consisted of 277 students in 18 schools or 20.77% of all third grade students in the district. By Grade 7, there were 20 ELs in 5 schools remaining.

The outcome variables for this study included students’ repeated Arizona English Language Learner Assessment (AZELLA) total score as well as the binary indicator of student reclassification from EL to fluent English proficient. Each student has a collection of AZELLA total scores for each year they were classified as an EL. The AZELLA total score is a vertically scaled item response theory (IRT)-based score that is a composite of reading, writing, listening, and speaking scores (Harcourt, 2007). The vertical scale of the total score provides a foundation for analysis of the measures across grades, enabling the estimation of growth in English, or English-language development. The binary reclassification indicator was coded 0 for each grade a student was classified as an EL and was coded 1 for the grade in which a student met the reclassification criteria. After reclassification, the student was no longer tracked. For these data, and consistent with the Arizona reclassification policy at the time, a student was reclassified when they earned an AZELLA total score that exceeded the proficiency threshold (Harcourt, 2007).

For this study, the AZELLA scale was transformed by dividing observed scores by 10 for computational purposes. Figure 1 plots the sample median, first and third quartiles, and .025 and .975 percentiles at each grade for the AZELLA total score. Student with a disability (SWD) status was coded as 1 for those ELs who were ever classified as a SWD and as 0 for those ELs who were never classified as a SWD. Across the sample, 83.75% of students were never diagnosed with a disability. Finally, female ELs made up 48.38% of the sample and were coded as 0, while males were coded as 1. Potential idiosyncrasies of Arizona’s English language policies and practices, and their impact on generalizability, are discussed in the Limitations section.

Figure 1.

Sample quartiles for Arizona English Language Learner Assessment total scores by grade.

Analysis

In this study, we compared how accurately discrete time survival analysis models used in prior literature (Slama, 2014; Thompson, 2015a; Umansky & Reardon, 2014) and SREMs predicted when an EL student reclassified. For these data, there were $k = 1, \dots,18$ schools, $j = 1, \dots, n_{k}$ students in school k, and $t = 1, \dots, n_{j k}$ AZELLA total score measures for subject j in school k. For each student j in school k, consider the vector of repeated AZELLA total scores, $y_{j k}$ . Furthermore, consider the vector of reclassification indicators, $r_{j k}$ , where $r_{t j k} = 1$ for the time t a student reclassified to fluent English proficient, and 0 until reclassification occurred. Note that $y_{j k}$ and $r_{j k}$ are both of length $n_{j k}$ and use a common subscript t to denote the occasion. To that end, we fit four models, each including a submodel for the reclassification data $r_{j k}$ and a submodel for the longitudinal AZELLA measures, $y_{j k}$ .

The first two models used discrete time survival analysis, with Model 1 (M1) controlling for disability status and Model 2 (M2) controlling for disability status and initial English language proficiency. M2 was very similar to the models used in several other studies (Slama, 2014; Thompson, 2015a; Umansky & Reardon, 2014). The third and fourth models were SREMs. Model 3 (M3) included disability status and the random student intercept from the AZELLA growth model in the reclassification model, which is akin to including a latent estimate of initial language proficiency. Thus, M3 and M2 were similar, with the latter using a manifest covariate for initial status and the former using a latent covariate. Model 4 (M4), the second SREM, added the random slope coefficient from the AZELLA growth model (linear growth trajectory).

All four models were built from two submodels, one to estimate reclassification and the other to estimate English-language development as measured by the longitudinal AZELLA scores. In the SREMs (M3–M4), the two submodels shared student-specific random effects. In the traditional discrete time survival analysis models (M1–M2), the submodels were estimated simultaneously but were not connected by random effects (nor by any other parameter). The two submodels were estimated together in M1 and M2 so that the number of parameters was more comparable across discrete time survival analysis models and SREMs, which made comparing model fit more straightforward. In the remainder of the Methods section, we detail our four models, including submodels for reclassification and language development, and discuss how we compared model fit.

The four models have been depicted by two directed graphs in Figure 2. Figure 2a characterizes the simultaneous but separate estimation of M1 and M2. The directed path from g to y illustrates the regression of the repeated measure y on grade g at occasion t for student j. The student-specific random effects are characterized by the round nodes, the random intercept, $ζ_{1}$ , and the random slope, $ζ_{2}$ . The time-invariant growth-specific covariates, disability status and sex, are characterized by the regression of x onto the random effects. The directed path from g to r represents the regression of the reclassification indicator r on grade g at occasion t for student j. The regression of the time-invariant covariates, disability status for M1, and disability status and initial English proficiency score for M2, onto the reclassification indicator are characterized by the directed arrow from w to r. Figure 2b characterizes the SREMs that are M3 and M4. The only change from the previous figure is the addition of directed paths from random effects to r. For M3, where only the random intercept was used as a covariate in the reclassification model, the path from $ζ_{2}$ to r was fixed to equal 0.

Figure 2.

Directed graphs for English language proficiency (y) growth and time-to-reclassification (r). (a) Separate growth and reclassification models with covariates (M1 and M2). (b) Shared random effects model with covariates (M3 and M4).

The growth submodel estimated for the longitudinal AZELLA total scores can be expressed as a unit-level model:

y_{t j k} = {x^{'}}_{t j k} β + z_{t j k}^{(2)'} ζ_{j k}^{(2)} + z_{t j k}^{(3)'} ζ_{k}^{(3)} + ϵ_{t j k},

where β was the p × 1 vector of fixed effects corresponding to the known p × 1 vector of explanatory variables, $x_{t j k}$ ; $z_{t j k}^{(2)}$ was the known $q^{(2)} \times 1$ vector of explanatory variables with student-specific random effects $ζ_{j k}^{(2)}$ ; and $z_{t j k}^{(3)}$ was the known $q^{(3)} \times 1$ vector of explanatory variables with school-specific random effects $ζ_{k}^{(3)}$ . Student- and school-level random effects were assumed multivariate normal, $ζ_{j k}^{(2)} \sim N (0, T^{(2)})$ and $ζ_{k}^{(3)} \sim N (0, T^{(3)})$ , where $T^{(2)}$ and $T^{(3)}$ are $q^{(2)} \times q^{(2)}$ and $q^{(3)} \times q^{(3)}$ unstructured variance-covariance matrices, respectively, and the residual was assumed normally distributed with constant variance, $ϵ_{t j k} \overset{iid}{\sim} N (0, σ)$ . Because the AZELLA scores are measured with error, and this error is not constant, the iid assumption placed on $ϵ_{t j k}$ may be overly restrictive.

The final growth submodel was selected based on the evaluation of a series of univariate growth models using a systematic search for the best functional form. Because so many students in the sample were reclassified between third and fifth grade, however, the number of potential growth models was limited. Both linear and quadratic functions for the mean structure were assessed, although the data supported student- and school-level variance components for the intercept and linear growth component only. The final growth submodel used in all four models included main effects for female and disability status as well as a disability status by linear growth interaction.

All four reclassification submodels utilized a logit link, resulting in a proportional odds model for reclassification. The first two submodels can be expressed as

logit {ℙ (r_{t j k} = 1 | w_{t j k}, η_{k})} = logit (h_{t j k}) = {w^{'}}_{t j k} α + η_{k},

where $h_{t j k}$ was the probability of reclassification for student j in school k at the end of year t, $w_{t j k}$ was a u × 1 vector of covariates, including indicators for each discrete time $1, \dots, T$ , α was a u × 1 vector of corresponding fixed effects, and $η_{k}$ was an $N (0, ν)$ school-level frailty. The reclassification submodel for M1 included intercepts for Grades 3 through 7, a main effect for SWD, and a school-level frailty. For M2, the submodel was extended to include the grand-mean-centered initial (Grade 3) total AZELLA score.

Recall that M3 and M4 are the SREMs. As such, the reclassification submodels included student-specific random effects estimated by the growth model as latent covariates. We express the reclassification submodels for M3 and M4 as

logit (h_{t j k}) = {w^{'}}_{t j k} α + ζ_{j k}^{(2)'} λ + η_{k},

where $ζ_{j k}^{(2)}$ in Equation 8 was a $q^{(2)} \times 1$ vector of student-specific random effects estimated by the English proficiency growth model specified in Equation 6, and λ was a $q^{(2)} \times 1$ vector of fixed effects corresponding to the latent covariates. The M3 and M4 reclassification submodels were extensions of the M1 reclassification submodel. The M3 reclassification submodel included the random intercept (the latent version of initial status), while the M4 reclassification submodel included both the random intercept and slope (linear growth trajectory).

All models were fit within the probabilistic modeling language Stan (Gelman, Lee, & Guo, 2015) using Hamiltonian Monte Carlo estimation (Hoffman & Gelman, 2014). Unless otherwise noted, uninformative priors were specified such that a random variable, x, was $U (a, b)$ where $a \leq x \leq b$ . If x was an SD, $0 < x$ and $x < \infty$ . For computational efficiency, Equation 6 was specified such that the Cholesky-factorized random effects were given independent $N (0, 1)$ priors while $T^{(2)}$ and $T^{(3)}$ , respecified as correlation matrices, were given $L K J (1.5)$ priors. While an $L K J (1)$ prior results in a uniform density over all correlation matrices of a given order (Lewandowski, Kurowicka, & Joe, 2009), an $L K J (1.5)$ prior was found to drastically reduce the autocorrelation in the chains. Due to the complexity of the shared-parameter model, priors for Equation 8, in particular those for α and λ, were specified to constrain the support of the parameters (Gelman, Jakulin, Pittau, & Su, 2008). Specifically, α and λ were given $N (0, 6)$ priors, which, on the logit scale, provided ample space for the data to dominate the posterior estimates. The Stan model code can be found at the first author’s GitHub site, https://github.com/tmatta
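
The Cholesky-factorized specification can be illustrated outside of Stan. The NumPy sketch below is not the authors' Stan code (which is available at the GitHub site above); it only shows, under hypothetical SDs and a hypothetical correlation, how independent standard normal draws are scaled into correlated random effects.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values for a random intercept and random slope.
sd = np.array([1.5, 0.5])                   # random-effect SDs
corr = np.array([[1.0, 0.6],
                 [0.6, 1.0]])               # random-effect correlation matrix

L = np.linalg.cholesky(corr)                # lower-triangular factor, L @ L.T == corr
z = rng.standard_normal((2, 200_000))       # independent N(0, 1) draws

# Non-centered construction: zeta = diag(sd) @ L @ z.
zeta = (np.diag(sd) @ L @ z).T

implied_cov = np.diag(sd) @ corr @ np.diag(sd)
sample_cov = np.cov(zeta, rowvar=False)
print(np.round(implied_cov, 3))
print(np.round(sample_cov, 3))              # close to implied_cov, up to sampling noise
```

Sampling the uncorrelated `z` and transforming it, rather than sampling the correlated effects directly, is a common way to improve the geometry that Hamiltonian Monte Carlo has to explore in hierarchical models.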

Predictive Accuracy

Model performance was evaluated in two ways: by information criteria and by classification accuracy. We used approximate leave-one-out (LOO) cross-validation with Pareto smoothed importance sampling (Gelman, Hwang, & Vehtari, 2014; Vehtari, Gelman, & Gabry, 2016) as the information criterion-based metric. LOO computes the expected log pointwise predictive density for a new data set, elpd, which may be multiplied by $- 2$ to be placed on the deviance scale. The deviance-scaled elpd is referred to as the leave-one-out information criterion (LOOIC). It is a fully Bayesian information criterion that is viewed as an improvement over the deviance information criterion and is more robust than the widely applicable Watanabe-Akaike information criterion in the finite case with weak priors or influential observations. Models may be compared by taking the difference in LOOIC (ΔLOOIC) and evaluating that difference relative to its standard error. Vehtari, Gelman, and Gabry (2016) suspect that this method of model comparison provides a better sense of uncertainty than conventional approaches that evaluate the difference of deviances in comparison to a $χ^{2}$ distribution.
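
A minimal sketch of the comparison, assuming the pointwise elpd contributions for two models are already available (for example, from a LOO routine) and assuming the standard error is computed from the pointwise differences as described by Vehtari, Gelman, and Gabry (2016). The function name and the toy inputs are hypothetical.

```python
import numpy as np


def looic_difference(elpd_a, elpd_b):
    """Return (delta, se) where delta = LOOIC_A - LOOIC_B.

    elpd_a, elpd_b -- pointwise elpd contributions, one per observation,
                      for models A and B fit to the same data.
    A positive delta means model B has the lower (better) LOOIC.
    """
    a, b = np.asarray(elpd_a, float), np.asarray(elpd_b, float)
    if a.shape != b.shape:
        raise ValueError("both models need one elpd value per observation")
    diff = -2.0 * (a - b)                    # pointwise difference on the deviance scale
    n = diff.size
    delta = diff.sum()
    se = np.sqrt(n * diff.var(ddof=1))       # SE of a sum of n pointwise differences
    return float(delta), float(se)


# Toy pointwise values for two observations.
delta, se = looic_difference([-1.0, -2.0], [-1.0, -1.0])
print(delta, se)  # 2.0 2.0
```

Because the standard error is built from paired pointwise differences, it can be much smaller than the standard error of either model's LOOIC taken alone.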

The performance of the four models was also compared based on classification accuracy. ELs are eligible for reclassification at discrete times, $t = 1, 2, \dots, T$ . Consider the $T_{j k} \times 1$ vector of reclassification indicators $r_{j k}$ , where $r_{t j k} = 1$ when $t = T_{j k}$ indicating student j in school k was reclassified to fluent English proficient and was 0 until reclassification.

The classification accuracy of the models can be understood through the cross-classification of $(r_{t j k}, {\hat{r}}_{t j k})$ . Here, ${\hat{r}}_{t j k} = 1$ when $h_{t j k} \geq π$ where π is some probability threshold between 0 and 1. This cross-tabulation results in four conditions: (a) true positives, (b) false positives, (c) true negatives, and (d) false negatives as seen in Table 1. False positives, or Type I errors, are the counts or proportions of those students who have not reclassified at time t but were predicted to reclassify based on the model prediction $(r_{t j k} = 0, {\hat{r}}_{t j k} = 1)$ . False negatives, or Type II errors, are the counts or proportions of those students who do reclassify at time t but were not predicted to do so by the model $(r_{t j k} = 1, {\hat{r}}_{t j k} = 0)$ . The true positive rate (TPR) for the model, defined as $ℙ ({\hat{r}}_{t j k} = 1 | r_{t j k} = 1)$ , gives the probability that the model correctly predicted reclassification at time t. The true negative rate (TNR) for the model, defined as $ℙ ({\hat{r}}_{t j k} = 0 | r_{t j k} = 0)$ , gives the probability that the model correctly predicted those students who were not ready to reclassify at time t. The overall classification accuracy of the model is then given by the probability that the model made an accurate prediction,

ℙ ({\hat{r}}_{t j k} = 1 | r_{t j k} = 1) ℙ (r_{t j k} = 1) + ℙ ({\hat{r}}_{t j k} = 0 | r_{t j k} = 0) ℙ (r_{t j k} = 0) .

Table 1.

Classification Table of Cell Counts or Proportions

Predicted Outcome	True Outcome: $r_{t j k} = 1$	True Outcome: $r_{t j k} = 0$
${\hat{r}}_{t j k} = 1$	True positive (TP)	False positive (FP)
${\hat{r}}_{t j k} = 0$	False negative (FN)	True negative (TN)

The receiver operating characteristic (ROC) curve was generated by computing the TPRs and TNRs for $0 \leq π \leq 1$ . The area under the ROC curve was used to compare the predictive power of the reclassification models.
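
The classification statistics can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function names and toy inputs are hypothetical, and the area is obtained by the trapezoid rule over the thresholds at which predictions change.

```python
import numpy as np


def classification_rates(r, h, threshold=0.5):
    """Return (TPR, TNR, accuracy) for observed indicators r and predicted hazards h.

    Assumes r contains at least one 1 and at least one 0.
    """
    r, h = np.asarray(r, int), np.asarray(h, float)
    r_hat = (h >= threshold).astype(int)      # predict reclassification when h >= pi
    tp = np.sum((r == 1) & (r_hat == 1))
    fn = np.sum((r == 1) & (r_hat == 0))
    tn = np.sum((r == 0) & (r_hat == 0))
    fp = np.sum((r == 0) & (r_hat == 1))
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / r.size


def roc_auc(r, h):
    """Area under the ROC curve, sweeping the threshold pi from high to low."""
    r, h = np.asarray(r, int), np.asarray(h, float)
    thresholds = np.concatenate(([np.inf], np.sort(np.unique(h))[::-1], [-np.inf]))
    tpr = np.array([np.mean(h[r == 1] >= t) for t in thresholds])
    fpr = np.array([np.mean(h[r == 0] >= t) for t in thresholds])  # FPR = 1 - TNR
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# Toy person-period records: two reclassification events and two non-events.
r = [1, 1, 0, 0]
h = [0.9, 0.4, 0.6, 0.1]
print(classification_rates(r, h))  # TPR 0.5, TNR 0.5, accuracy 0.5
print(roc_auc(r, h))               # 0.75
```

Accuracy depends on the chosen threshold π and on the base rate of reclassification, whereas the area under the curve summarizes performance across all thresholds.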

Results

Four models were fit to understand whether, and to what extent, students’ English-language development improved predictions of time to reclassification. The first model, M1, predicted reclassification using students’ disability status as a time-invariant manifest covariate. The second model, M2, added initial (Grade 3) English proficiency scores as a manifest covariate. The third model, M3, the first SREM, replaced the manifest initial English proficiency score with the student-specific initial status random coefficient from the AZELLA growth model as a latent covariate. The final model, M4, the second SREM, added the student-specific linear growth random coefficient from the AZELLA growth model. The fitted models were compared based on their LOOIC and their ability to accurately predict the time a student reclassified. Table A1 in the Appendix provides posterior means and SDs for the parameters associated with the four models.

Model Fit

The LOOIC for each model is presented in Table 2. Comparing M1 to M2, the difference in the LOOIC, compared to its standard error, suggests there is an improvement in model fit when the Grade 3 English proficiency scores are used to predict reclassification, $Δ {LOOIC}_{M 1, M 2} = 100.4, (21.2)$ . Comparing M2 to M3, the latent Grade 3 English proficiency score outperforms the manifest Grade 3 English proficiency score of M2, $Δ {LOOIC}_{M 2, M 3} = 141.0, (17.6)$ . Finally, there is negligible improvement in model fit when the linear growth random coefficient is included alongside the intercept random coefficient, $Δ {LOOIC}_{M 3, M 4} = 1.4, (2.6)$ .

Table 2.

Model Fit and Classification Statistics for Fitted Models

Model	LOOIC, M (SD)	TPR, M (SD)	TNR, M (SD)	ACC, M (SD)	AUC, M (SD)
1	3,245.29 (153.48)	.73 (.07)	.83 (.03)	.80 (.01)	.78 (.02)
2	3,144.95 (154.97)	.76 (.03)	.88 (.01)	.85 (.01)	.82 (.01)
3	3,003.92 (157.09)	.89 (.02)	.95 (.01)	.93 (.01)	.92 (.01)
4	3,002.51 (157.23)	.89 (.02)	.95 (.01)	.93 (.01)	.92 (.01)

Note. Classification statistics were computed using π = .5. LOOIC is the deviance-scaled approximate leave-one-out cross validation, TPR is the true positive rate, TNR is the true negative rate, ACC is accuracy, AUC is the area under the curve. M is the posterior mean and SD is the posterior standard deviation.
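The LOOIC comparisons in the Model Fit section can be retraced from the posterior means in Table 2. The sketch below recomputes each difference and divides it by the standard error quoted in the text; the variable names are illustrative, and because the table entries are rounded, the recomputed differences may differ slightly from the quoted values.

```python
# LOOIC posterior means from Table 2
looic = {"M1": 3245.29, "M2": 3144.95, "M3": 3003.92, "M4": 3002.51}

# Standard errors of the differences, as reported in the Model Fit text
se_diff = {("M1", "M2"): 21.2, ("M2", "M3"): 17.6, ("M3", "M4"): 2.6}

comparison = {}
for (a, b), se in se_diff.items():
    delta = looic[a] - looic[b]  # positive values favor the second model
    comparison[(a, b)] = (delta, delta / se)
    print(f"{a} vs {b}: delta = {delta:.2f}, delta / SE = {delta / se:.2f}")
```

The first two differences are several standard errors from zero, whereas the M3 versus M4 difference is well within one standard error, which matches the conclusion that adding the linear growth coefficient yields negligible improvement.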

The likely reason M3 outperforms M2 by so much, and M4 shows little improvement over M3, is that the growth submodel estimates the random coefficients as bivariate normal with a strong positive correlation. As a result, the latent intercept covariate in the M3 reclassification submodel borrows information from the slope coefficient through the estimated correlation. This becomes clearer when we examine the parameter estimates of M3 and M4 in Table A1. For M3, the estimated correlation between the initial status and linear growth is $τ_{21}^{(2)} / \sqrt{τ_{11}^{(2)} τ_{22}^{(2)}} = .94$ . Nearly all of the information about a student’s linear growth is captured by the intercept. This is evident in the parameters from the reclassification submodels: under M3, λ₁ = 3.57, whereas for M4, λ₁ = 1.98 and λ₂ = 2.57. That is, for M4, by estimating a coefficient associated with the linear growth in the reclassification model, λ₂, much of the information formerly associated with the intercept parameter, λ₁, is attributed to the linear growth parameter.
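To make the "borrowing" argument concrete, the following sketch computes the correlation from variance components and the share of slope variance that is linearly predictable from the intercept under bivariate normality, which is the squared correlation. The variance components below are hypothetical values chosen only so that the correlation equals the reported .94; the actual estimates are in Table A1.

```python
import math

def corr_from_cov(tau11, tau21, tau22):
    """Correlation implied by two variances (tau11, tau22) and their covariance (tau21)."""
    return tau21 / math.sqrt(tau11 * tau22)

# Hypothetical variance components (not the estimates from Table A1),
# chosen only so that the implied correlation is .94.
rho = corr_from_cov(tau11=4.0, tau21=1.88, tau22=1.0)

# Under bivariate normality, the slope's conditional mean is linear in the
# intercept, and rho**2 is the share of slope variance that the intercept explains.
shared_variance = rho ** 2
residual_sd_ratio = math.sqrt(1.0 - rho ** 2)  # residual SD relative to marginal SD

print(round(rho, 2), round(shared_variance, 4), round(residual_sd_ratio, 3))
```

With a correlation of .94, roughly 88% of the variance in linear growth is recoverable from initial status alone, which is consistent with M4 adding little beyond M3.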

Classification Accuracy

The classification statistics regarding how well each of the four models predicted whether or not reclassification would occur for subject j at time t are presented in Table 2. Using π = .5 as a probability threshold, M1 was 80% accurate when predicting whether reclassification did or did not occur for the 720 observations across 277 ELs. The TPR was .73, indicating that 73% of the ELs who reclassified at the end of a given grade were predicted by the model to reclassify at the end of that grade. The TNR was .83, indicating that 83% of the students who did not reclassify at the end of a given grade were also predicted by the model to not reclassify at the end of that grade. The area under the curve (AUC) was .78, which is illustrated by the ROC curve in Figure 3. In the context of model comparison, the AUC and ROC curve for M1 provided a baseline to compare the predictive power of the remaining three models.

Figure 3.

Receiver operating characteristic curves for the time-to-reclassification models.

Including the manifest Grade 3 English proficiency scores in the reclassification model improved the classification accuracy by 5 percentage points over the baseline. The TPR was .76, the TNR was .88, and the AUC was .82. The first SREM with only a latent covariate for initial English proficiency, M3, predicted the time to reclassification with 93% accuracy. The TPR was .89, indicating that 89% of the ELs who reclassified at the end of a given grade were predicted by the model to reclassify at the end of that grade. The TNR was .95, indicating that 95% of the students who did not reclassify at the end of a given grade were also predicted by the model to not reclassify at the end of that grade. Furthermore, the AUC for the model was .92. Finally, the inclusion of the latent growth coefficient in the last model, M4, resulted in the same classification statistics as M3. The classification statistics corroborate the results from the $Δ LOOIC$ . The ROC curves in Figure 3 provide a graphical depiction of the four models’ predictive power.

Subject-Specific Predictions

Student-specific predictions from M4 are illustrated using the model estimates and observed data to plot both the subject-specific AZELLA total score growth trajectory and the subject-specific probability of reclassification for four individuals selected at random from the data set. Figure 4a to d illustrates the average developmental trajectory (dashed line) and the subject-specific developmental trajectory (solid line) for total English proficiency for four students in the data set. The five horizontal lines illustrate the reclassification benchmarks for each grade. Figure 4e to h illustrates the corresponding probability of reclassification at each grade for those same students. These plots demonstrate how, as a student’s estimated growth trajectory approaches the benchmark, his or her probability of reclassification increases.

Figure 4.

Top row: Marginal (dashed) and subject-specific (solid) Arizona English Language Learner Assessment total English proficiency trajectories for a random sample of students. Bottom row: Probability of reclassification for the same random sample of students.

Figure 4a and e illustrates how Student 34’s probability of reclassification increased from nearly 0 to nearly 1 over 2 years as predicted scores increased. Figure 4b and f illustrates that, as Student 88’s predicted score exceeded the Grade 6 benchmark, there was a 40% chance he or she would reclassify at the end of Grade 6. While the student met the benchmark in Grade 6, the probability of reclassification remained low because the student was quite far from the average growth trajectory. Thus, using π = .5, Student 88’s Grade 6 prediction contributed to the model’s false negatives. Figure 4d and h illustrates that, as Student 96’s predicted score approached the benchmark, the probability of reclassification increased. Finally, Figure 4 illustrates the situation where the growth model incorrectly predicted Student 244 to reach the benchmark in Grade 5, but the estimated probability of reclassification at the end of Grade 5 remained lower than .5. By Grade 6, when the predicted score far exceeded the threshold (as did his or her observed score), the student’s probability of reclassification increased above .5 (to nearly 1).
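The way a single student-by-grade prediction lands in one cell of the classification table can be sketched as follows. The function name `label_prediction` is hypothetical; the only values taken from the text are Student 88’s Grade 6 probability of .40, the observed reclassification in that grade, and the threshold π = .5.

```python
def label_prediction(r, h, pi=0.5):
    """Return the classification-table cell for observed outcome r and predicted probability h."""
    r_hat = 1 if h >= pi else 0  # predicted to reclassify when h >= pi
    if r == 1:
        return "true positive" if r_hat == 1 else "false negative"
    return "false positive" if r_hat == 1 else "true negative"

# Student 88, Grade 6: reclassified (r = 1) with a predicted probability of .40
student_88_grade_6 = label_prediction(r=1, h=0.40, pi=0.5)
print(student_88_grade_6)

# The same prediction under a lower, purely illustrative threshold
student_88_lower_pi = label_prediction(r=1, h=0.40, pi=0.40)
print(student_88_lower_pi)
```

The second call shows that the cell a prediction falls into depends on the chosen threshold, which is why the ROC curve, traced over all thresholds, complements the statistics reported at π = .5.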

Discussion

For students who enter the U.S. school system not speaking English, several factors influence when they are ready to benefit from English-only instruction in mainstream classrooms (Motamedi et al., 2016; Robinson-Cimpian & Thompson, 2016; Slama, 2014; Umansky & Reardon, 2014). Two of the most important factors are whether ELs have sufficient content mastery in core academic subjects like mathematics and whether their English language proficiency is sufficient to understand the instruction being given (Council of Chief State School Officers, 2016; Linquanti & Cook, 2013). The intent of reclassification policies is generally to identify students who meet these criteria and are ready to learn in mainstream classrooms. Learning core academic content and English are both complex developmental processes. Oftentimes, students learn colloquial English quickly but struggle to master some of the more complex language used in academic settings (Bailey, 2007; Scarcella, 2003), suggesting that language development is often nonlinear and asymptotic. These developmental language processes can also differ depending on how old the student is upon entering the United States and what the student’s original language is (Hakuta et al., 2000; Thompson, 2015a). Although researchers have used models to predict when a student will reclassify, these models do not incorporate English-language development (likely due to endogeneity concerns), opting instead to control for initial English proficiency (Slama, 2014; Thompson, 2015a; Umansky & Reardon, 2014). One could imagine that explicitly accounting for language development in models designed to predict reclassification might improve classification accuracy relative to models that do not.

The primary purpose of this article was to propose a multilevel SREM that shows whether heterogeneity in English-language development contributes to prediction of reclassification. These SREMs explicitly incorporate English-language development via a statistical model through a specification that incorporates its endogenous properties (Kalbfleisch & Prentice, 2011). The classification accuracy of these SREMs was then compared to that of more traditional discrete-time survival models used in the prior literature on reclassification (Slama, 2014; Thompson, 2015a; Umansky & Reardon, 2014). The final discrete-time survival model that controlled for whether the student was ever diagnosed with a disability and initial English proficiency status predicted the time of a student’s reclassification with 85% accuracy. The final SREM with disability status, and initial language status and linear growth random effects as predictors, by contrast, improved classification accuracy to 93%. While the predictive accuracy of the discrete-time survival model met most conventional criteria for diagnostic models (Steyerberg et al., 2010), classification rates based on the SREM were markedly better.

The superior predictive accuracy of the SREM was, in part, due to the fact that the growth submodel was able to track the longitudinal measures well. The SREM reclassification submodel utilized estimates that describe student-specific English development. Had the growth model fit the longitudinal data poorly, the estimates that characterize English-language development would likely have had little, or possibly detrimental, impact on predicting time-to-reclassification. That is, had the model-based estimates of English proficiency for a given student at a given grade been far off from the observed proficiency measures, their use as covariates in the reclassification submodel may not have had such a positive outcome. Therefore, although this article does not focus on the fitting of the growth submodel, one should realize the inherent connection between good model fit for the growth submodel and the classification accuracy of the survival submodel of the SREM.

Beyond predicting time of reclassification, SREMs likely have other potential uses in education, and for EL policy in particular. Specific to ELs, explicitly accounting for English-language development in models of reclassification could be useful when evaluating programs. For example, Umansky and Reardon (2014) used discrete-time survival models to show that ELs in dual language programs were reclassified at a slower pace than ELs in other programs, but that those same students had higher overall reclassification rates and English proficiency by the end of high school. Conducting a similar study using the SREM may provide additional insight. For example, the SREM would allow researchers to test if language development acts as a mediator between language program and time-to-reclassification.

There also may be benefits to using the SREM when modeling other educational outcomes. For instance, a vast literature describes early warning indicators that can be used to identify students at risk of dropping out and, ideally, intervene before the student gets off track (Balfanz, Herzog, & Mac Iver, 2007; Davis, Herzog, & Legters, 2013; Heppen & Therriault, 2008; Neild, Balfanz, & Herzog, 2007). One could imagine that SREMs incorporating survival models for dropout and developmental processes like growth in mathematics and reading might improve the accuracy with which dropout is classified, as well as help identify additional indicators that a student is not on track. Similarly, developmental processes for students with learning disabilities could be modeled in tandem with special education status to help understand those processes better. Although SREMs are still in their relative infancy, they show potential in helping to illuminate the complex processes that underlie educating students.

Limitations and Future Research

As discussed in the Data section, we used data from a large urban school district in Arizona collected between 2007 and 2012. Thus, one of the most significant limitations of this study is its generalizability. In particular, there are two broad threats to generalizability. First, the data were collected prior to the passage of ESSA when the No Child Left Behind (NCLB) Act of 2001 was still the law governing federal accountability related to ELs. Second, during this time, Arizona based reclassification entirely on whether students score above certain thresholds on their English language proficiency test. By contrast, states like California include educator and parent input in their reclassification decisions, as well as other criteria in some districts (Parrish et al., 2006). Arizona has also been the defendant in several lawsuits questioning the state’s reclassification policies, parts of which are still ongoing (Jimenez-Silva, Gomez, & Cisneros, 2014). These aspects of our sample naturally raise fundamental concerns about the generalizability of our findings.

Despite these concerns, there are several reasons the SREM is still likely valuable to educators interested in better understanding reclassification rates and timing. In terms of the policy context, while there are fundamental differences between how ESSA and NCLB treat ELs, one could argue the new federal law actually increased the emphasis on English-language development test scores and related growth trajectories. For example, whereas states needed to set English-language development standards in listening, speaking, reading, and writing under NCLB, ESSA goes one step further and requires that proficiency levels be defined in each of those subdomains (Council of Chief State School Officers, 2016). ESSA further requires states to report the number and percentage of ELs who attain language proficiency based on state English-language development standards and separately report the number and percentage of ELs who are reclassified based on their attainment of language proficiency (Council of Chief State School Officers, 2016).

Specific to state reclassification policy, there are also reasons that Arizona’s idiosyncrasies do not necessarily make the SREM irrelevant in other contexts. Under proposed ESSA plans, many states will still rely heavily if not entirely on cut scores on achievement and English language proficiency test scores when making reclassification determinations (Council of Chief State School Officers, 2016; Pompa & Villegas, 2017). Further, the lawsuits filed against Arizona did not challenge the general approach to reclassification; rather, they contended that the cut scores used were not rigorous enough to make the transition from EL status effective for students (Jimenez-Silva et al., 2014). Thus, while states may differ from Arizona on the number and rigor of the criteria used to reclassify students, the general processes governing, and core inputs to, reclassification are generally quite consistent across states, as well as pre- and postpassage of ESSA.

From a modeling standpoint, one of the main benefits of the SREM is its flexibility. For districts and states that use additional tests or other measured criteria that are time-varying and potentially endogenous, the SREM can incorporate that information, whereas models used in other studies typically cannot. For example, California also requires that ELs show proficiency in reading achievement. The SREM is flexible enough that a reading growth submodel could be fit to the reading test scores, whereby the reading random effects could also be incorporated into the reclassification submodel. Thus, theoretically, the SREM’s ability to address the endogeneity of time-varying covariates is a potential advantage, not a shortcoming, when considering how to apply the model in other contexts.

Ultimately, we cannot be certain how the SREM might perform in terms of classification accuracy in other districts or states, nor how it will perform when using criteria established post-ESSA. To address those questions, more research is needed that employs the SREM in a broader range of contexts. The purpose of the article is not to argue that SREMs will uniformly outperform other models or that they will be applicable in all situations. Rather, the study is meant to demonstrate the potential utility of incorporating developmental processes in models for event outcomes like reclassification that are often the focus of educational accountability systems.

Appendix

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Abedi (2008). Classification system for English language learners: Issues and recommendations. Educational Measurement: Issues and Practice, 27, 17–31. doi:10.1111/j.1745-3992.2008.00125.x

Allison, P. D. (1982). Discrete-time methods for the analysis of event histories. Sociological Methodology, 13, 61–98.

Angrist, J. D., & Pischke, J.-S. (2008). Mostly harmless econometrics: An empiricist’s companion. Princeton, NJ: Princeton University Press.

Bailey, A. L. (2007). The language demands of school: Putting academic English to the test. New Haven, CT: Yale University Press.

Balfanz, Herzog, & Mac Iver, D. J. (2007). Preventing student disengagement and keeping students on the graduation path in urban middle-grades schools: Early identification and effective interventions. Educational Psychologist, 42, 223–235.

Cook, Boals, & Lundberg (2011). Academic achievement for English learners: What can we reasonably expect? Phi Delta Kappan, 93, 66–69.

Cook, Linquanti, Chinen, & Jung (2012). National evaluation of title III implementation supplemental report: Exploring approaches to setting English language proficiency performance criteria and monitoring English learner progress. Washington, DC: American Institutes for Research.

Council of Chief State School Officers. (2016). Major provisions of Every Student Succeeds Act (ESSA) related to the education of English learners. Washington, DC: Author.

D’Agostino, R. B., Lee, M. L., Belanger, A. J., Cupples, I. A., Anderson, & Kannel, W. B. (1990). Relation of pooled logistic regression to the time dependent Cox regression analysis: The Framingham Heart Study. Statistics in Medicine, 9, 1501–1515.

Davis, Herzog, & Legters (2013). Organizing schools to address early warning indicators (EWIs): Common practices and challenges. Journal of Education for Students Placed at Risk, 18, 84–100. doi:10.1080/10824669.2013.745210

De Gruttola, & X. M. (1994). Modelling progression of CD4-lymphocyte count and its relationship to survival time. Biometrics, 50, 1003–1014.

Diggle, Zeger, S. L., Liang, & Heagerty (2002). Analysis of longitudinal data (2nd ed.). Oxford, England: Oxford University Press.

Feldman, B. J., & Rabe-Hesketh (2012). Modeling achievement trajectories when attrition is informative. Journal of Educational and Behavioral Statistics, 37, 703–736. doi:10.3102/1076998612458701

Gándara & Orfield (2010). A return to the Mexican room: The segregation of Arizona’s English learners. Los Angeles, CA: Civil Rights Project.

Gelman, Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, & Rubin, D. B. (2014). Bayesian data analysis (3rd ed.). Boca Raton, FL: Chapman & Hall/CRC. doi:10.1007/s13398-014-0173-7.2

Gelman, Hwang, & Vehtari (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24, 997–1016. doi:10.1007/s11222-013-9416-2

Gelman, Jakulin, Pittau, M. G., & Y. S. (2008). A weakly informative default prior distribution for logistic and other regression models. Annals of Applied Statistics, 2, 1360–1383. doi:10.1214/08-AOAS191

Gelman, Lee, & Guo (2015). Stan: A probabilistic programming language for Bayesian inference and optimization. Journal of Educational and Behavioral Statistics, 40, 530–543. doi:10.3102/1076998615606113

Guo, & Carlin, B. P. (2004). Separate and joint modeling of longitudinal and event time data using standard computer packages. The American Statistician, 58, 16–24.

Hakuta, Butler, Y. G., & Witt (2000). How long does it take English learners to attain proficiency? Davis, CA: The University of California Linguistic Minority Research Institute.

Harcourt. (2007). Arizona English Language Learner Assessment (AZELLA) technical manual. San Antonio, TX: Author.

Hedeker, Siddiqui, & F. B. (2000). Random-effects regression analysis of correlated grouped-time survival data. Statistical Methods in Medical Research, 9, 161–179. doi:10.1191/096228000667253473

Henderson, Diggle, & Dobson (2000). Joint modelling of longitudinal measurements and event time data. Biostatistics, 1, 465–480. doi:10.1093/biostatistics/1.4.465

Heppen, J. B., & Therriault, S. B. (2008). Developing early warning systems to identify potential high school dropouts (Issue Brief). Washington, DC: National High School Center.

Hoffman & Gelman (2014). The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15, 1593–1623.

Hopkins, Thompson, K. D., Linquanti, Hakuta, & August (2013). Fully accounting for English learner performance: A key issue in ESEA reauthorization. Educational Researcher, 42, 101–108. doi:10.3102/0013189X12471426

Jimenez-Silva, Gomez, & Cisneros (2014). Examining Arizona’s policy response post Flores v. Arizona in educating K–12 English language learners. Journal of Latinos and Education, 13, 181–195.

Kalbfleisch, J. D., & Prentice, R. L. (2011). The statistical analysis of failure time data (2nd ed.). Hoboken, NJ: Wiley-Interscience.

Kao, & Thompson, J. S. (2003). Racial and ethnic stratification in educational achievement and attainment. Annual Review of Sociology, 29, 417–442. doi:10.1146/annurev.soc.29.010202.100019

Kieffer, M. J., & Parker, C. E. (2016). Patterns of English learner student reclassification in New York City public schools (REL 2017–200). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Northeast & Islands. Retrieved from http://ies.ed.gov/ncee/edlabs

Lewandowski, Kurowicka, & Joe (2009). Generating random correlation matrices based on vines and extended onion method. Journal of Multivariate Analysis, 100, 1989–2001. doi:10.1016/j.jmva.2009.04.008

Linquanti, & Cook, G. H. (2013). Toward a “common definition of English learner.” Washington, DC: Council of Chief State School Officers.

Little, R. J. A. (1995). Modeling the drop-out mechanism in repeated-measures studies. Journal of the American Statistical Association, 90, 1112–1121.

Motamedi, J. G., Singh, & Thompson, K. D. (2016). English learner student characteristics and time to reclassification: An example from Washington state. REL 2016-128. Washington, DC: Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Northwest.

Muthén & Masyn (2005). Discrete-time survival mixture analysis. Journal of Educational and Behavioral Statistics, 30, 27–58.

Neild, R. C., Balfanz, & Herzog (2007). An early warning system. Educational Leadership, 65, 28–33.

Parrish, T. B., Perez, Merickel, & Linquanti (2006). Effects of the implementation of Proposition 227 on the education of English learners, K–12: Findings from a five year evaluation. Palo Alto, CA: American Institutes for Research and WestEd.

Pompa & Villegas (2017). Analyzing state ESSA plans for English learner accountability: A framework for community stakeholders. Washington, DC: Migration Policy Institute.

Prentice, R. L. (1982). Covariate measurement errors and parameter estimates in a failure time regression model. Biometrika, 69, 331–342.

Proust-Lima, Séne, Taylor, J. M., & Jacqmin-Gadda (2014). Joint latent class models for longitudinal and time-to-event data: A review. Statistical Methods in Medical Research, 23, 74–90. doi:10.1177/0962280212445839

Ramsey & O’Day (2010). Title III policy: State of the states. Washington, DC: U.S. Department of Education.

Rizopoulos & Lesaffre (2014). Introduction to the special issue on joint modelling techniques. Statistical Methods in Medical Research, 23, 3–10. doi:10.1177/0962280212445800

Robinson, J. P. (2011). Evaluating criteria for English learner reclassification: A causal-effects approach using a binding-score regression discontinuity design with instrumental variables. Educational Evaluation and Policy Analysis, 33, 267–292.

Robinson-Cimpian, J. P., & Thompson, K. D. (2016). The effects of changing test-based policies for reclassifying English learners. Journal of Policy Analysis and Management, 35, 279–305.

Scarcella (2003). Academic English: A conceptual framework (Technical Report 2003-1). Santa Barbara, CA: The University of California Linguistic Minority Research Institute.

Singer, J. D., & Willett, J. B. (1993). It’s about time: Using discrete-time survival analysis to study duration and the timing of events. Journal of Educational Statistics, 18, 155–195.

Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. New York, NY: Oxford University Press.

Slama, R. B. (2014). Investigating whether and when English learners are reclassified into mainstream classrooms in the United States: A discrete-time survival analysis. American Educational Research Journal, 51, 220–252. doi:10.3102/0002831214528277

Steyerberg, E. W., Vickers, A. J., Cook, N. R., Gerds, Gonen, Obuchowski, … Kattan, M. W. (2010). Assessing the performance of prediction models: A framework for traditional and novel measures. Epidemiology, 21, 128–138. doi:10.1097/EDE.0b013e3181c30fb2

Thompson, K. D. (2015a). English learners’ time to reclassification: An analysis. Educational Policy, 1–34. doi:10.1177/0895904815598394

Thompson, K. D. (2015b). Questioning the long-term English learner label: How categorization can blind us to students’ abilities. Teachers College Record, 117, 12.

Thum, Y. M., & Matta, T. H. (2015a). MAP college readiness benchmarks: A research brief. Portland, OR: NWEA.

Thum, Y. M., & Matta, T. H. (2015b). Predicting college readiness from interim assessment results: Growth modeling with selection. Paper presented at the Annual Meeting of the American Educational Research Association, Chicago, IL.

Tsiatis, A. A., & Davidian (2004). Joint modeling of longitudinal and time-to-event data: An overview. Statistica Sinica, 14, 809–834.

Tsiatis, A. A., Degruttola, & Wulfsohn, M. S. (1995). Modeling the relationship of survival to longitudinal data measured with error. Applications to survival and CD4 counts in patients with AIDS. Journal of the American Statistical Association, 90, 27–37.

Umansky, I. M., & Reardon, S. F. (2014). Reclassification patterns among Latino English learner students in bilingual, dual immersion, and English immersion classrooms. American Educational Research Journal, 51, 879–912. doi:10.3102/0002831214545110

Vehtari, Gelman, & Gabry (2016). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. arXiv. Retrieved from http://arxiv.org/abs/1507.04544

M. C., & Carroll, R. J. (1988). Estimation and comparison of changes in the presence of informative right censoring by modeling the censoring process. Biometrics, 44, 175–188. doi:10.2307/2531905

Wulfsohn, M. S., & Tsiatis, A. A. (1997). A joint model for survival and longitudinal data measured with error. Biometrics, 53, 330–339. doi:10.1111/j.1541-0420.2006.00719.x