Elementary School Interventions

Abstract

This study exploits a randomized trial of two light-touch elementary school interventions to estimate long-run impacts on postsecondary attendance and attainment. The first is a classroom management technique for developing behavioral skills in children. The second is a curricular intervention aimed at improving students’ core reading skills. We detect no average impact of either intervention on the likelihood of college enrollment or degree receipt, but find heterogeneous effects by student gender and initial level of academic achievement. Assignment to the behavioral intervention increases the likelihood of college attendance for females, especially at 2-year institutions, but has little impact on males. We find suggestive evidence that exposure to the behavioral intervention benefits low-performing students more than high-performers, whereas exposure to the curricular intervention influences college outcomes more for middle- to high-performing students.

Keywords

elementary school, Good Behavior Game, Mastery Learning, postsecondary

Policy interventions for children aim to improve long-term outcomes such as educational attainment, health, and earnings. Yet, mostly due to data and budget limitations, numerous evaluations of such programs assess their effectiveness by examining impacts on a variety of short-run outcomes such as standardized test scores. While short-run findings provide some insights concerning the education production function, researchers and policymakers cannot be certain that short-run effects will lead to lasting benefits along outcome domains that really matter for students’ future quality of life. If short-run treatment effects were permanent and fully predictive of long-run outcomes, there would be little need for long-term evaluations. Yet, work on the “fade out” of short-run intervention impacts (Currie, 2001; Currie & Thomas, 1995; Huston et al., 2011; Ludwig & Miller, 2007) as well as teacher-induced learning gains (Jacob, Lefgren, & Sims, 2010) illustrates that this is not the case. Instead, such work demonstrates that childhood educational interventions can produce long-term benefits without much impact in the near term on test scores (Belfield, Nores, Barnett, & Schweinhart, 2006; Deming, 2009; Krueger & Whitmore, 2001); similarly, findings of short-run test score improvements do not necessarily ensure positive effects along other outcome domains later in life (Deming, Hastings, Kane, & Staiger, 2011).

Recent work evaluating the long-run impacts of interventions that target students in early primary school grades has mainly focused on class size (Dynarski, Hyman, & Schanzenbach, 2011) or global classroom quality (Chetty et al., 2011). Yet, for many urban schools that are resource-constrained (whether by space, money, staff, or a combination of factors), implementing smaller classes, or other more intensive interventions, is not a feasible policy option. In such cases, we must look to evidence on programs or policies that can be implemented in classrooms or schools for a modest investment. While there is substantial evidence on the short- and long-run impacts of intensive preschool programs (Anderson, 2008; Camilli, Vargas, Ryan, & Barnett, 2010), we have much less wisdom about which programs and policies are likely to generate long-term benefits in the lives of traditionally disadvantaged students when implemented during the more traditional K–5 years of schooling. Such evidence is particularly important for policy since low-cost, light-touch interventions have the potential for large returns to incremental improvements in what goes on in a classroom, especially early in life.

This study provides just this sort of evidence on two light-touch elementary school interventions: the Good Behavior Game (GBG) and Mastery Learning (ML). The GBG intervention was a classroom management technique that targeted the development of behavioral skills in students necessary for success in school and beyond (e.g., self-control, attention regulation). The ML intervention was a curricular shift in how reading was taught to elementary school students that aimed to equip them with a more solid and functional understanding of core reading concepts. Both treatments were randomly assigned in a carefully controlled experimental design to students in 19 Baltimore City public schools during the mid-1980s.

We make two primary contributions to the growing literature that examines the long-run impacts of childhood educational interventions. First, we provide the first causal estimates of the long-term college-going effects of these two light-touch elementary school interventions. We focus on college attendance and degree receipt because these two long-run outcomes affect the earnings and general well-being of children later in life. Individuals with a college degree earn more than their high school-educated counterparts and are more likely to be employed, even during times of economic downturn (Bureau of Labor Statistics, 2011; Economic Policy Institute, 2007). We link baseline data on first-grade students in these schools to detailed information on their college-going behavior from the National Student Clearinghouse (NSC). This allows us to observe whether students affected by these two interventions ever enroll in college or earn a degree over more than two decades (i.e., up until most study children are in their early 30s). The positive effects of the interventions on short- and medium-run educational outcomes of these youth (e.g., Dolan et al., 1993; Kellam et al., 2008) suggest that important long-term effects on college outcomes may exist. Second, we probe potential mechanisms underlying any long-term effects by exploring the degree to which short-run impacts of these interventions on academic achievement and behavior are associated with any of the long-run postsecondary effects.

In the next section, we provide background on extant work exploring the impacts of educational interventions occurring early in children’s lives on a variety of outcomes, with an emphasis on studies that examine medium- to long-run impacts. We then describe the interventions of interest, the hypothesized mechanisms through which these interventions may impact long-run outcomes for children, and the various data sets we use, and we detail our empirical approach. The section “Results” presents the main findings, and we conclude in the section “Discussion and Conclusion.”

Background and Literature Review

Educational Interventions During Childhood and Long-Run Impacts

There is a substantial body of work addressing the short- and medium-run impacts of a variety of childhood educational interventions. These interventions span preschool-aged and elementary-school-aged children, with substantially more evidence accrued concerning preschool programs, policies, and initiatives. In both cases, such interventions target young children based on the logic that human capital interventions are likely to be particularly promising for disadvantaged children during the early years of life (Heckman, Krueger, & Friedman, 2003). Simply put, investments made at early points in a child’s life allow a longer time horizon over which benefits can accumulate.

Although policymakers and researchers have been able to improve their collective understanding of the long-run impacts of some childhood investments, programs, and policies, it is difficult to examine such long-term outcomes for a variety of reasons. First, long-term evaluations require that researchers follow study participants for an extended period of time after the intervention has ended, and this follow-up is often costly and difficult. Second, when long-run follow-up data are available, these data are frequently self-reported and/or suffer from substantial attrition over time.

The few recent studies that have looked at policies or programs during the K–5 time frame and been able to overcome these common limitations match students to third-party administrative data to estimate effects on long-run outcomes. Both studies examine class size. These analyses link information on students from the Tennessee STAR (Student/Teacher Achievement Ratio) experiment to college-going information (Dynarski et al., 2011) as well as labor market earnings (Chetty et al., 2011). Chetty and colleagues (2011) find that students who were randomly assigned to higher quality classrooms in Grades K–3 earn more, are more likely to attend college, save more for retirement, and live in better neighborhoods. Similarly, Dynarski et al. (2011) find that assignment to a small class increases the probability of attending college by 2.7 percentage points, with larger effects for Black students and those least likely to attend college. They also find that small classes increase the likelihood of earning a college degree by 1.6 percentage points and tend to push students toward majors in higher earning STEM (science, technology, engineering, and medicine) fields.

Other work evaluating light- to heavy-touch elementary school interventions has focused on short-run outcomes. For example, a carefully completed meta-analysis of “peer-assisted learning” (PAL) interventions in elementary schools concludes that such interventions are effective at raising short-run student academic performance (Rohrbeck, Ginsburg-Block, Fantuzzo, & Miller, 2003). PAL interventions focus on enhancing student learning through the incorporation of peer-mediated teaching strategies into elementary school curricula. This meta-analysis also concluded that PAL interventions were most effective with “vulnerable students, including students in the lower elementary school grades, minority students, students attending schools in urban settings, and possibly low-income students” (Rohrbeck et al., 2003, p. 240). This collection of work illustrates to policymakers that the integration of PAL-like strategies into elementary school curricula has the potential to raise the academic achievement of traditionally disadvantaged children. Yet, what this type of work cannot speak to is whether such interventions and short-run performance improvements translate into better outcomes later in life for similar student groups.

In contrast, researchers have accumulated a substantial body of evidence on the long-run benefits of preschool interventions and policies. For example, one of the largest government programs aimed at improving the life chances of young children is Head Start. This program was created in 1965 as part of President Johnson’s War on Poverty to provide a wide range of preschool, health, and other social services to poor children ages 3 through 5 (U.S. Department of Health and Human Services, 2010b). Head Start also provided childcare services of substantially better quality than what was traditionally available to low-income working parents (Currie, 2001). Evidence from the best studies on the effectiveness of Head Start¹ suggests positive long-term effects on earnings, criminal behavior, and health (Garces, Thomas, & Currie, 2002; Ludwig & Miller, 2007). This evidence further suggests that the largest benefits accrued to the most disadvantaged children (Deming, 2009).

A few of the most famous preschool programs were able to tie randomized designs to their implementation. Three preschool interventions have received wide attention in the literature: the Abecedarian Project (1970s in North Carolina), the Perry Preschool Program (1960s in Michigan), and the Early Training Project (1960s in Tennessee). In each of these interventions, children received several years of preschool education up until they began regular elementary school. The intensity and particular components varied by program: For example, in addition to a robust full 5-day per week preschool program, children in the treatment condition in the Perry Preschool experiment also received one 90-minute home visit per week during the academic year (Schweinhart et al., 2005). In evaluations of all three experiments, researchers identified positive impacts on short-run IQ test scores during the pre-kindergarten treatment period (Anderson, 2008). Yet, these IQ effects disappeared by the end of third grade (Duncan, Ludwig, & Magnuson, 2007). Although the positive impacts of the Abecedarian, Perry Preschool, and Early Training programs on some short-run outcomes (such as IQ and test scores) diminished over time, there is evidence that these programs increased high school graduation and self-reported college-going rates of females (Anderson, 2008) and had lasting effects on employment rates, earnings, and likelihood of arrest through age 40 (Duncan, Ludwig, et al., 2007).

Broadly, this collective body of work on preschool and elementary school interventions illustrates that intensive and often multifaceted interventions are capable of producing long-run improvements in children’s lives (at least for females). However, early childhood is a time of rapid changes in cognitive and behavioral abilities, so interventions implemented in pre-K environments may have different short- and long-run impacts than those conducted in K–5 settings. Our work explores the ability of less-intensive, potentially more scalable elementary school interventions to affect long-run outcomes for students in urban, traditionally disadvantaged schools. Below we describe each intervention in detail along with hypothesized mechanisms through which these types of interventions may affect long-run educational outcomes, and then discuss extant evidence on the short- to long-run impacts of these interventions on a range of student outcomes.

Interventions of Interest: “Good Behavior Game” and “Mastery Learning”

During the 1985–1986 and 1986–1987 school years, the Johns Hopkins Prevention Research Center (PRC) conducted a randomized field trial of two classroom-based, universal interventions. Teachers were trained on how to implement each intervention throughout the entire intervention period (Dolan, Ford, Newton, & Kellam, 1989; Eaton, Or, Ialongo, Storr, & Roth, 2011; Kellam et al., 2008). The interventions occurred at the classroom level and lasted for 2 years (first and second grade). Both interventions were carried out in close partnership with the Baltimore City Public School System (BCPSS).

The first intervention was a classroom management technique called the “Good Behavior Game” (GBG; Barrish, Saunders, & Wolf, 1969). This method provides teachers with a concrete way to manage inappropriate and disruptive classroom behaviors in the context of a game. Students are divided into teams, and each inappropriate or disruptive behavior by a student results in an increased chance of her team losing a privilege shared by all members of the team (e.g., extra recess time, first to line up for lunch, stickers, and/or any other classroom-level privileges). The GBG uses team competition along with peer influence and reinforcement procedures to effect change in student behavior (Tingstrom, Sterling-Turner, & Wilczynski, 2006).

The GBG is easy to use, time-efficient, and widely versatile (for a host of variations and adaptations, see Tingstrom et al., 2006). In the Baltimore intervention, teachers received 40 hours of training, most of which occurred during the beginning of the program, followed by supportive mentoring during the course of the first-grade school year (Kellam et al., 2008). A comparable amount of mentorship and attention was given to teachers in the control group, but without focusing on the GBG classroom behavior management techniques (Kellam et al., 2008). The GBG strategies were implemented by teachers and were aimed at “socializing children to the role of student and reducing aggressive, disruptive classroom behavior”—a risk factor associated with a host of negative adolescent and adult outcomes, including drug, alcohol, depression, and antisocial behavior disorders (Kellam et al., 2008, pp. S5–S6).²

The skills targeted by the GBG also link to more recent work on self-control, attention, and matters of “executive function” (e.g., Center on the Developing Child, 2011). One of the basic points of this emergent literature is that variation in kids’ abilities to focus and shift their attention, follow rules, self-regulate their feelings, and respond to other sets of demands is associated with a host of longer-run outcomes in adulthood (Shonkoff, 2012). Furthermore, since the biological processes through which executive function develops “begin in early infancy and continue into the early adult years,” (Shonkoff, 2012, p. 7202), there is scope for policy action that may affect the acquisition and development of such skills over a fairly long time horizon. While couched in different language, the GBG intervention conducted in Baltimore City Public Schools sought to target these types of behavioral skills in children (Kellam et al., 2008).

The second intervention consisted of a series of curricular changes called “Mastery Learning” (ML). ML is a teaching strategy based on the underlying theory and supporting research that, under appropriate instructional conditions, nearly all students can learn most of what they are taught (Block & Burns, 1976; Bloom, 1971). A teacher using an ML approach to instruction would begin by defining “mastery” in terms of the particular subject being taught (usually via a series of key objectives). He then would design a final, or “summative,” assessment based on these overarching objectives and define what level of understanding a student needs to meet to master the course material (Block & Burns, 1976, p. 7).

To achieve this goal, an ML approach to teaching would generally: (a) break the prespecified objectives into a series of smaller learning units; (b) teach each unit for mastery, using formative assessments to uncover gaps in student understanding; (c) use a flexible “correctives” process tailored to the specific weaknesses of students, as identified in the formative assessments; and (d) evaluate each student’s mastery of material over the course as a whole on the basis of achievement relative to the standard of mastery set up at the beginning, and not relative to the performance of other students (Block & Burns, 1976; Dolan et al., 1989).

To solidify ideas about how the treatment and control classrooms differed in teaching strategies and approaches, consider Figure 1 (adapted from the Mastery Learning Manual created by Baltimore City Public Schools [BCPS] in conjunction with the Prevention Research Center at JHU, Figure 2.1).

Figure 1.

Mastery learning (ML) versus traditional practice.

Traditional approaches hold time constant and allow mastery to vary, while ML approaches hold mastery constant and allow time to vary as necessary (Robinson, 1992). At the heart of the ML approach is the ability of the teacher to provide a wide variety of “correctives”—which offer students unable to master a particular concept after it is first taught the opportunity to circle back and attempt to learn it again via a different mode, presentation, or learning style.³ Students who have already demonstrated mastery for a particular subunit are free to engage in enrichment activities related to the unit’s concepts and/or serve as tutors for their classmates still progressing toward mastery (Block & Burns, 1976).⁴

Within this experiment, BCPS provided training to elementary school teachers randomized into the ML intervention over the course of the 2-year treatment period.⁵ They focused on implementing ML within the reading curriculum of young elementary school students, providing teachers with methodological training, curricular materials, and the flexibility to adjust the curriculum time with respect to student progress (Dolan et al., 1989).⁶ Therefore, the critical elements of this implementation of ML included a group-based approach to mastery, a flexible correctives process that often used peers, and material support for teachers. ML classes were instructed not to proceed to the next unit until the majority of students (80%) had fulfilled the vast majority (80%–85%) of learning objectives for the current unit (Dolan et al., 1989). In addition to the direct hypothesized impact of better content mastery on future academic outcomes for students, advocates of ML approaches also hypothesize that ML will lead to higher levels of student confidence and “grit,” with students more likely to want to take on higher units and more complex concepts, having mastered the basic underlying skills (Block & Burns, 1976; Davis & Sorrell, 1995). There is concern that ML approaches to instruction hinder the performance of gifted students, but little concrete evidence to support this concern. In fact, high-achieving girls in ML classrooms demonstrated short-run achievement improvements (Dolan et al., 1993). Furthermore, as high-achievers in ML classrooms often functioned as “tutors” for students struggling to master particular concepts, it is plausible that the act of explaining concepts to a peer enhanced the student-tutor’s own understanding of the material.
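The unit-advancement rule described above (roughly 80% of students meeting 80% to 85% of a unit's objectives) can be illustrated with a minimal sketch. The function names, the use of the lower bound (80%) of the objective range, and the treatment of both thresholds as inclusive are assumptions made for illustration; they are not drawn from the intervention manual.

```python
from typing import Sequence


def student_has_mastered(objectives_met: int, objectives_total: int,
                         objective_share: float = 0.80) -> bool:
    """True if the student met at least `objective_share` of the unit's objectives.

    The 0.80 default is the lower bound of the 80%-85% range (an assumption).
    """
    return objectives_met >= objective_share * objectives_total


def class_may_advance(objectives_met_by_student: Sequence[int],
                      objectives_total: int,
                      student_share: float = 0.80,
                      objective_share: float = 0.80) -> bool:
    """True if at least `student_share` of students have mastered the unit."""
    n_students = len(objectives_met_by_student)
    if n_students == 0:
        return False  # an empty class cannot demonstrate mastery
    n_mastered = sum(
        student_has_mastered(met, objectives_total, objective_share)
        for met in objectives_met_by_student
    )
    return n_mastered >= student_share * n_students


# Hypothetical class of 10 students on a unit with 10 objectives.
ready = class_may_advance([8] * 8 + [5] * 2, objectives_total=10)
not_ready = class_may_advance([8] * 7 + [5] * 3, objectives_total=10)
```

Under these assumptions, the first class advances because 8 of 10 students have met 8 of 10 objectives, while the second remains on the unit and would return to the correctives process.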

Effects of the GBG and ML Interventions

In Table 1, we present a summary of work that explores the short-run (and sometimes long-run) impacts of the GBG and ML interventions on a variety of outcomes. All of these studies use the Johns Hopkins data on Baltimore City Public School students. Here we summarize this collection of work in Baltimore and then discuss research on ML and GBG that has occurred in other locations and with different populations of K–12 students.

Table 1

Findings from Good Behavior Game (GBG) and Mastery Learning (ML) Interventions in Baltimore City Public Schools.

Outcome domain	Findings
	GBG	ML
Effects in elementary school: Dolan et al. (1993); Kellam, Rebok, Mayer, Ialongo, and Kalodner (1994)
Reading performance	No effects	Positive impacts on reading achievement in the spring of first grade—with suggestive evidence that low-achieving boys and high-achieving girls benefit the most
Aggressive or disruptive behavior	Reductions in aggressive behavior by end of first grade, with larger and more robust results for boys	No effects
Shy behavior	Reductions in shy behavior by the end of first grade	Reduction in shy behavior exhibited by girls by the end of first grade
Depressive symptoms	Not examined	Reductions in depressive symptoms by the end of first grade (among students who initially presented with depressive symptoms and improved their academic performance)
Effects in high school and early adulthood: Kellam et al. (2008); Petras et al. (2008); Poduska et al. (2008); Wilcox et al. (2008)
High school graduation	Some evidence of higher high school graduation rates among males; no impacts on females	Not examined
School-based service use^a	Males in GBG classrooms rated as highly aggressive or disruptive in fall of first grade by their teachers used school-based services at a lower rate than their counterparts in control classrooms	Not examined
Drug/alcohol dependence disorders	Substantial reduction in males’ drug abuse/dependence disorders—with the largest impacts for males who were rated as more aggressive/disruptive by their first-grade teachers	Not examined
Violent and criminal behavior	Reductions in the rates of antisocial personality disorder (ASPD) and violent and criminal behavior among males who were initially rated as highly aggressive/disruptive	Not examined
Smoking	Reduction in the probability of males smoking greater than 10 cigarettes per day, with the largest effects among boys with high initial levels of aggressive, disruptive behavior; no effects on girls	Not examined
Suicidal ideation	Reduction in risk for suicidal ideation by end of high school/early adulthood	No effects

Note. All short-run behavioral outcomes are measured by Teacher Observation of Classroom Adaptation—Revised (TOCA-R) scores in the spring of Grade 1; short-run reading performance is measured by scores on the California Achievement Test (CAT) in the spring of Grade 1; long-run outcomes come from follow-up surveys with participants (i.e., ages 19–21).

^a School-based services include “being placed in a special school or special classroom for problems with behavior, feelings, or drugs or alcohol; receiving special help in the regular classroom; and receiving other counseling or therapy in school” (Kellam et al., 2008, p. S32).

There are some short-run effects of the GBG on behavior (i.e., by the end of first grade), but these pale in comparison to the variety of long-run impacts researchers have uncovered. For example, prior studies found exposure to the GBG to reduce drug and alcohol dependence, antisocial personality disorder, use of school-based social and health services, as well as suicidal ideation (Kellam et al., 2008; Wilcox et al., 2008). These effects were largest for male students who were rated as highly aggressive or disruptive in the fall of first grade (Poduska et al., 2008). Largely absent from these long-run outcomes are educational measures. There is suggestive evidence that the GBG may have led to higher high school graduation rates among males—but this conclusion is based on self-reported follow-up survey data for only a portion of the baseline sample.⁷ The GBG had a long-term impact on a range of health-related outcomes (Petras et al., 2008), so it seemed worthwhile to explore whether it had any effect on long-run educational outcomes.

Implementation of a GBG intervention with older K–12 students and in a different state also found positive effects on behavior. Among New York City high school students, those exposed to the GBG experienced reductions in the rate of seat leaving, talking without permission, and aggression (Kleinman & Saigh, 2011). A randomized evaluation of a school-based preventive intervention for younger children in New York City elementary schools that was very similar to the GBG (i.e., it focused on social-emotional learning and literacy development) found short-run improvements across several domains including student self-reported levels of aggression and depression, as well as teacher reports of attention skills and socially competent behavior (Jones, Aber, & Brown, 2011).

In the Baltimore study, researchers found the ML intervention to increase the number of students making substantial gains in reading performance (i.e., 50 points on the California Achievement Test [CAT]) over the course of first grade by 36% (Crijnen, Feehan, & Kellam, 1998). There was suggestive heterogeneity by gender in these short-run effects: Female high-achievers benefited more from the ML intervention than their lower achieving counterparts, whereas male low-achievers benefited more than their higher achieving counterparts (Dolan et al., 1993). The long-run educational impacts of the ML intervention remain largely unexplored.

An early meta-analysis of studies that examined the efficacy of ML programs across a wide range of contexts found overwhelmingly positive impacts of ML on short-run student academic achievement, as well as on student attitudes toward academics (Kulik, Kulik, & Bangert-Drowns, 1990). These impacts were often on the order of .5 standard deviations, which is large compared with recent work on the impacts of additional instructional time, smaller class sizes, and grade retention for low-achieving students (i.e., between .10 and .15 standard deviations; Marcotte & Hansen, 2010). Yet, test score effects were consistently larger on “local” (i.e., teacher- or school-developed) tests than on standardized tests (Kulik et al., 1990, p. 277).⁸ Even if the positive short-run impacts of ML on test scores are modest, or differ depending on the nature of the tested material, there is substantial evidence to conclude that they do exist.

With this background on the state of knowledge surrounding both interventions, we now turn to describing the data we use to assess the long-run impacts of both interventions on a variety of postsecondary outcomes. We not only estimate average impacts but also explore heterogeneity across several subgroups of students suggested by past literature. We also investigate the degree to which short-run impacts on behavior and reading achievement translate into long-run college-going effects.

Data and Empirical Approach

We begin with student-level data from the randomized controlled trial (RCT) of the GBG and ML interventions that took place during the mid- to late-1980s in a group of Baltimore City public elementary schools. We then match on information describing the college-going behavior of students in the sample. Below we describe each data set in detail and our empirical approach to analyzing both interventions.

Johns Hopkins PRC Data

The experiment included a total of 2,311 first-graders across two cohorts (1985–1986 and 1986–1987). In each cohort, there were about 42 classrooms and 19 schools that participated. Ultimately, students were randomly assigned to individual classroom units, resulting in a total of 84 separate classroom units spread across the treatment and control conditions (Dolan et al., 1993). In total, there were 16 GBG classrooms, 18 ML classrooms, 26 internal control classrooms, and 24 external control classrooms (i.e., control classrooms in the 12 fully control school-by-cohort groups).

To randomize classrooms into treatment and control conditions, public elementary schools were first grouped into five socio-demographically distinct areas of Baltimore City. Within these geographic clusters, schools were then randomized to treatment (GBG, ML) or control, so that some schools were fully control schools with no intervention classrooms. All others were “intervention specific” (i.e., no school contained an ML and GBG classroom). Classrooms in schools in the treatment condition were then randomized to a particular treatment or the control condition⁹; students were randomly assigned to fill these classrooms; and first-grade teachers were randomly assigned to the classrooms of students (Dolan et al., 1993). When the children moved from first to second grade, they stayed with the same group of classroom peers and remained in the same intervention condition, but second-grade teachers were then randomly assigned to these groups of kids.¹⁰ The result is that students assigned to treatment (either GBG or ML) were exposed for a total of 2 years (first and second grade).
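The multistage assignment described above (schools randomized within geographic areas, then classrooms randomized within treatment schools) can be sketched as follows. This is an illustration under stated assumptions, not the procedure the trial actually used: the round-robin dealing of shuffled schools to conditions, the half-and-half split of classrooms within a treatment school, and all school and classroom labels are hypothetical.

```python
import random

CONDITIONS = ["GBG", "ML", "control"]


def assign_schools(schools_by_area, rng):
    """Within each area, shuffle the schools and deal them to conditions in turn."""
    assignment = {}
    for area, schools in schools_by_area.items():
        shuffled = list(schools)
        rng.shuffle(shuffled)
        for i, school in enumerate(shuffled):
            assignment[school] = CONDITIONS[i % len(CONDITIONS)]
    return assignment


def assign_classrooms(school_condition, classrooms, rng):
    """Assign a school's classrooms to its single intervention or to control."""
    if school_condition == "control":
        # Fully control schools contain no intervention classrooms.
        return {c: "external control" for c in classrooms}
    shuffled = list(classrooms)
    rng.shuffle(shuffled)
    n_treated = max(1, len(shuffled) // 2)  # assumed allocation ratio
    return {
        c: (school_condition if i < n_treated else "internal control")
        for i, c in enumerate(shuffled)
    }


# Hypothetical example: two areas with three schools each, three classrooms per school.
rng = random.Random(0)
schools_by_area = {"area_1": ["s1", "s2", "s3"], "area_2": ["s4", "s5", "s6"]}
school_arm = assign_schools(schools_by_area, rng)
classroom_arm = {
    school: assign_classrooms(arm, [f"{school}_c{k}" for k in range(1, 4)], rng)
    for school, arm in school_arm.items()
}
```

Because each treatment school receives only one intervention, no school in this sketch can contain both a GBG and an ML classroom, which mirrors the “intervention specific” property of the design.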

Baseline data include student race/ethnicity, gender, free and reduced meals (FARM) status, and cohort membership, along with information on assignment to the GBG treatment group, the ML treatment group, or the control group. We also use identifiers for the school and classroom in which each student was present during either the fall of 1985 (Cohort 1) or the fall of 1986 (Cohort 2).

In addition to these demographic characteristics of students, the data include students’ reading test scores on the CAT, and TOCA-R (Teacher Observation of Classroom Adaptation—Revised) scores for students in the fall and spring of Grade 1 and the spring of Grade 2.¹¹ TOCA-R scores are recorded on a 6-point frequency-based scale (Werthamer-Larsson, Kellam, & Wheeler, 1991) and are generated through a structured interview process with classroom teachers: A trained member of the intervention team

administers the survey [in a private place] . . . responds in a standardized way to issues the teacher initiates, and records the teacher’s ratings of the adequacy of each child’s performance on three basic tasks: social participation (the maladaptive form being shy behavior); accepting authority (the maladaptive form being aggressive/disruptive behavior); and concentration and being ready for work (the maladaptive form being inattention or having concentration problems). (Dolan et al., 1993, p. 325)

Higher ratings indicate greater manifestation of the maladaptive version of the behavior.¹² To give some sense of what types of attributes are combined to generate the TOCA-R behavioral scales, take “concentration problems” as an example. The component behaviors/skills on which teachers rated each student (along a 6-point scale) included completes assignments, concentrates, poor effort, works well alone, pays attention, learns up to ability, eager to learn, works hard, stays on task, mind wanders, and easily distracted.¹³ These values are then aggregated into one “concentration problems” TOCA-R score for each student. Changes in these types of skills and behaviors lead a student to improve (or not) along the “concentration problems” aggregate TOCA-R score.

To ease interpretation, we standardize all TOCA-R variables and use these normalized values whenever we include TOCA-R variables as covariates in our models.¹⁴ In addition to the three scales described above (aggressive/disruptive, concentration problems, and shy behavior), we also make use of one other rating: the teacher’s “global assessment” of each student’s overall academic performance. This variable also ranges from 1 to 6, but this time higher values are better (i.e., they indicate higher levels of overall academic performance). Of particular interest are students’ reading test scores and TOCA-R ratings in the fall of Grade 1, prior to the start of the GBG and ML interventions. These measures are used to explore the similarity of treatment and control groups at baseline and as additional covariates in our main models.
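The standardization step can be illustrated with a minimal sketch. The ratings below are hypothetical, and the choice of the population standard deviation over the full set of observed ratings is an assumption; the text does not state which formula or reference group was used.

```python
from statistics import mean, pstdev

def standardize(ratings):
    """Return z-scores for a list of ratings; missing values (None) stay None."""
    observed = [r for r in ratings if r is not None]
    mu = mean(observed)
    sigma = pstdev(observed)  # assumption: population SD over observed ratings
    return [None if r is None else (r - mu) / sigma for r in ratings]

# Hypothetical 6-point TOCA-R ratings for five students, one of them missing.
raw_ratings = [2.0, 3.5, None, 1.0, 5.5]
z_scores = standardize(raw_ratings)
```

After this transformation, a one-unit difference in a TOCA-R covariate corresponds to one standard deviation of the original rating.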

NSC Data

We submitted the population of 2,311 students to the National Student Clearinghouse (NSC) to obtain information about the college-going behavior of students in treatment and control groups. The NSC is a nonprofit organization that was founded to assist student loan companies in validating students’ college enrollment. Colleges submit enrollment data to NSC several times each academic year, indicating whether a student is enrolled and with what intensity (e.g., part-time or full-time). NSC also records degree completion and the field in which the degree is earned. The NSC matches students based on their name and date of birth.¹⁵ For every student appearing in the NSC database, we observe the name, state, and type (2/4-year, public/private) of the postsecondary institution in which she enrolled, the enrollment status (full-time/part-time), and start and end date of every enrollment spell; and if the student graduates, the date of graduation, college major, and degree earned.

The NSC has become an excellent source of rich data on the college-going experiences of students for researchers interested in higher education economics and policy, but it is relatively new to researchers. Therefore, it is important to point out a few key limitations inherent in using NSC data that may shape the types of implications we can draw from our analyses.

First, not all schools participate in NSC. Nationally, the organization estimates that it captures about 93% of undergraduate enrollment. The institutions covered by the NSC compare favorably to those in the Integrated Postsecondary Education Data System (IPEDS), a federally generated database that lists every college, university, and technical or vocational school that participates in the federal financial aid programs (e.g., Pell Grants and Stafford Loans), a total of about 6,700 institutions. Across a range of institutional measures, the NSC colleges appear similar to the IPEDS colleges, with one exception: NSC contains relatively few private, less-than-4-year colleges. These are primarily private trade schools such as automotive, technology, business, nursing, culinary arts, and beauty schools.¹⁶ If either the GBG or ML causes students who would otherwise not go to college to attend these types of colleges, we will underestimate the impacts of these interventions on our postsecondary outcomes of interest.

The estimated coverage rate for Maryland institutions of higher education during the primary years in which these students would have first attended college is reasonably high, beginning at about 74% of undergraduate enrollment in 1999 and rising to 90% or higher by 2002 (Dynarski, Hemelt, & Hyman, 2012). Coverage rates for neighboring states such as Virginia, Delaware, and West Virginia are similar. Furthermore, the coverage rates among public institutions and degree-granting institutions of all kinds are even higher (usually by about 2 to 4 percentage points) throughout this period (Dynarski et al., 2012).¹⁷

Second, students (and schools) can choose to “block” personal educational information from being released by the NSC under FERPA (Family Educational Rights and Privacy Act). For these students, we are unable to observe detailed information on where they attended college and whether they ever earned any type of degree, even though we can see the aggregate number of students who “matched.” For example, from our sample of 2,311 students, we are able to see detailed college-going and degree receipt information for 719 students (31.1%). Yet, in the aggregate report, NSC indicates that 881 students (38.1%) matched to their database—indicating some form of college enrollment. While we may slightly underestimate college attendance and/or degree attainment for this sample, we have no reason to believe that such FERPA-blocking is systematically correlated with the treatment status of students in our sample.¹⁸ Furthermore, our estimates compare favorably to national survey estimates of educational attainment for cohorts born in Maryland, and specifically for those in Baltimore City: According to the 2006 American Community Survey (ACS),¹⁹ among those born in Baltimore City in the same years as our PRC cohorts, 33.4% identify as having attained “some college,” and 9.4% have “a bachelor’s degree or higher.” Our NSC estimates of college attendance and attainment²⁰ are quite close to the ACS estimates.

Empirical Approach

The empirical basis for this study is an RCT, so we first explore baseline differences in a variety of covariates to assess the adequacy of the randomization process and to guide our decisions about appropriate covariates to include in our main analytic model. We then estimate (OLS) regression models of the following type:²¹

Y_{isc} = α + β_{1} GBG_{isc} + β_{2} ML_{isc} + β_{3} X_{isc} + δ_{c} + ε_{isc},   (1)

where i indexes students, s schools, and c denotes whether each student was a member of the first (1985–1986) or second (1986–1987) cohort. Y_isc is a dichotomous outcome indicating whether student i from school s in cohort c attended college or received a degree. GBG_isc and ML_isc are dummy variables denoting whether student i in school s and cohort c was randomly assigned to either the GBG intervention or the ML intervention²²; δ_c is a dummy variable denoting cohort membership, which captures any static unobserved differences between the two cohorts of students; ε_isc is a stochastic error term. Those not randomly assigned to either intervention comprise the control group. Then, β₁ and β₂ give the causal impact of the GBG and ML interventions on the outcomes of interest, respectively. These impacts are intent-to-treat (ITT) effects. We focus on ITT rather than treatment-on-the-treated (TOT) impacts as ITT effects are arguably more policy relevant. School administrators and policymakers interested in light-touch school- or classroom-level interventions are most concerned about the impacts on those to whom the treatment was assigned, as they do not have the ability to force students (or teachers) to be “fully” treated.²³

The data used in these analyses come from an experimental setting, so any additional covariates added to control for the characteristics of students, their families, their schools, or other factors thought to affect the outcomes of interest should not substantially alter the main estimates. Yet, adding covariates can improve the statistical precision of the point estimates of interest. Therefore, in our preferred specification, we include X_isc, a vector of student-level covariates including achievement (Grade 1 fall CAT reading test scores), four TOCA-R behavioral scores from the fall of Grade 1, demographic characteristics (race/ethnicity, gender), and an indicator for whether the student qualified for free/reduced-price meals (FARM) at baseline. In all models, we cluster standard errors at the classroom-school-cohort level to allow for arbitrary correlation of error terms across students within classroom-school-cohort groups. Models that instead cluster standard errors at the school-cohort level yield similar results.²⁴
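To make the estimation concrete, the following is a minimal pure-Python sketch of a linear probability model with cluster-robust standard errors. It is not the authors' code. The twelve-student data set is synthetic, the covariate vector X is omitted for brevity, the function names `solve` and `lpm_cluster` are hypothetical, and no finite-sample correction is applied to the sandwich variance (statistical packages typically apply one).

```python
from collections import defaultdict

def solve(a, b):
    """Solve a @ x = b (b is a matrix) by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    m = [row[:] + rhs[:] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]

def lpm_cluster(y, X, clusters):
    """OLS coefficients and cluster-robust (sandwich) standard errors."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    identity = [[float(i == j) for j in range(k)] for i in range(k)]
    xtx_inv = solve(xtx, identity)
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = [sum(xtx_inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * x for b, x in zip(beta, r)) for r, yi in zip(X, y)]
    # Sum the score contributions x_i * u_i within each cluster.
    scores = defaultdict(lambda: [0.0] * k)
    for r, u, g in zip(X, resid, clusters):
        for i in range(k):
            scores[g][i] += r[i] * u
    meat = [[sum(s[i] * s[j] for s in scores.values()) for j in range(k)]
            for i in range(k)]
    left = [[sum(xtx_inv[i][l] * meat[l][j] for l in range(k)) for j in range(k)]
            for i in range(k)]
    var = [[sum(left[i][l] * xtx_inv[l][j] for l in range(k)) for j in range(k)]
           for i in range(k)]
    return beta, [var[i][i] ** 0.5 for i in range(k)]

# Synthetic rows: (attended college, GBG, ML, second cohort, classroom id).
rows = [
    (1, 1, 0, 0, "a"), (0, 1, 0, 0, "a"), (1, 1, 0, 1, "b"), (1, 1, 0, 1, "b"),
    (0, 0, 1, 0, "c"), (1, 0, 1, 0, "c"), (0, 0, 1, 1, "d"), (0, 0, 1, 1, "d"),
    (0, 0, 0, 0, "e"), (1, 0, 0, 0, "e"), (0, 0, 0, 1, "f"), (0, 0, 0, 1, "f"),
]
y = [r[0] for r in rows]
X = [[1, r[1], r[2], r[3]] for r in rows]  # intercept, GBG, ML, cohort dummy
clusters = [r[4] for r in rows]
beta, se = lpm_cluster(y, X, clusters)
```

Because the synthetic design is balanced across cohorts, the GBG coefficient here equals the raw difference in attendance rates between GBG and control students (0.75 − 0.25 = 0.5). With real data and baseline covariates the two would generally differ.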

The randomization design followed in this experiment lends itself to a clear check on contamination concerns. One may worry that any treatment effects from a model that uses information on all students may be polluted by interactions among treatment and control teachers in treatment schools or by peer effects through student interactions. If so, estimated effects of these interventions on postsecondary outcomes would be attenuated toward zero. Spillover concerns may be especially relevant for the behavioral intervention as treatment teachers could easily share related (even if not identical) strategies with control teachers of the same grade, and kids whose behavioral skills improve may have positive impacts on their current (and future) peers. To test for spillover effects, we add two additional indicator variables to the main specification above, one denoting control students in GBG schools and another for control students in ML schools. Examining the coefficients on these dummies allows us to assess the degree of intervention-specific spillover.
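One way to construct the indicators for this spillover test is sketched below. The field names `school_arm` and `classroom_treated` are hypothetical, and the sketch assumes each school hosted at most one intervention, so that students in schools with no intervention form the omitted reference group.

```python
def spillover_dummies(school_arm, classroom_treated):
    """Build treatment and internal-control indicators for one student.

    school_arm: 'GBG', 'ML', or 'control' (hypothetical coding of school type).
    classroom_treated: True if the student's classroom received the intervention.
    """
    in_gbg_school = school_arm == "GBG"
    in_ml_school = school_arm == "ML"
    return {
        "gbg": int(in_gbg_school and classroom_treated),
        "ml": int(in_ml_school and classroom_treated),
        # Control classrooms located inside treatment schools.
        "gbg_internal_control": int(in_gbg_school and not classroom_treated),
        "ml_internal_control": int(in_ml_school and not classroom_treated),
    }

# A control-classroom student in a GBG school.
example = spillover_dummies("GBG", classroom_treated=False)
```

With all four indicators in the model, the coefficients on the internal-control dummies compare control students in treatment schools with students in schools that had no intervention.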

For our long-run outcomes of interest, we first estimate average impacts on attendance (at any college, 2-year, and 4-year institutions) and on attainment (any degree, associate’s degree, bachelor’s degree or more). Second, we explore potential heterogeneity in effects across two sets of student subgroups. We rely on prior research to help us prespecify the groups for whom we might expect to see different long-run effects of the GBG and ML interventions. Accordingly, we estimate effects by gender and by initial academic/reading achievement level. Finally, we conduct an exercise in which we attempt to understand how much of any long-run postsecondary effects of these interventions can be explained by short-run impacts on academic achievement and/or behavior.

Results

Descriptive Statistics and Threats to Internal Validity of Experimental Estimates

In Table 2, we compare mean characteristics of treatment and control groups of students at baseline. The intent of this table is to probe the adequacy of the randomization process. We test for differences in means across treatment and control groups using simple F-tests. The “adjusted” p values come from a simple regression of the variable of interest (e.g., female) on an indicator for ML (or GBG) and a dummy variable denoting cohort membership. Conditional on cohort, students should be randomly distributed between treatment and control conditions. We account for the nested structure of the randomization procedure by clustering standard errors at the classroom-school-cohort level.

Table 2

Comparisons of Mean Baseline Characteristics Across Treatment and Control Groups.

Variable	Full sample	ML	GBG	Control	Adjusted p value (ML)	Adjusted p value (GBG)
Demographics
Female	0.50	0.52	0.50	0.50	0.16	0.98
Black	0.66	0.61	0.77	0.64	0.83	0.24
White	0.33	0.38	0.21	0.35	0.81	0.20
Other	0.02	0.01	0.02	0.01	0.83	0.69
FARM	0.53	0.51	0.61	0.50	0.92	0.32
Reading achievement
CAT reading score, Grade 1 fall (standardized)	0.00 (1.00)	0.07 (1.00)	−0.02 (1.00)	−0.02 (1.00)	0.55	0.91
Missing CAT reading score, Grade 1 fall	0.02	0.01	0.02	0.02	0.36	0.98
Teacher-Rated Academic and Behavioral Measures (TOCA)
Global academic performance, Grade 1 fall	4.02 (1.32)	4.19 (1.26)	4.02 (1.43)	3.95 (1.31)	0.03	0.33
Aggressive/disruptive behavior, Grade 1 fall	1.85 (0.92)	1.71 (0.71)	1.96 (1.06)	1.86 (0.94)	0.12	0.41
Concentration problems, Grade 1 fall	2.99 (1.36)	2.82 (1.29)	2.99 (1.47)	3.06 (1.34)	0.05	0.64
Shy behavior, Grade 1 fall	2.71 (1.00)	2.69 (0.92)	2.46 (1.05)	2.80 (1.00)	0.50	0.10
Missing TOCA score, Grade 1 fall	0.11	0.13	0.06	0.12	0.79	0.04
Overall F-test of all observables	—	—	—	—	0.17	0.54
n	2,311	520	452	1,339	—	—

Note. Standard deviations appear in parentheses for nonbinary variables. Adjusted p values come from simple regressions of the variable of interest on an indicator for ML (or GBG) and a dummy variable denoting cohort membership. The p values for the overall F-test of all observables come from jointly testing the significance of the full set of baseline covariates in the context of one regression model, where the outcome is either an indicator for ML or GBG membership. Standard errors are always clustered at the classroom-school-cohort level. ML = Mastery Learning; GBG = Good Behavior Game; FARM = free and reduced meals; CAT = California Achievement Test.

We see that half the sample is female; about two thirds are Black and the other third White. Over half of the students in the full sample were eligible for free/reduced-price meals. Overall, there are very few significant differences in baseline demographic characteristics, achievement, and behavioral measures across treatment and control groups. Students in ML classrooms were rated a bit higher in terms of academic performance by their fall first-grade teachers, relative to students in control classrooms. Yet, the grand mean suggests that the magnitude of this difference is small. Furthermore, as a number of these individual variables are likely to be correlated with one another (e.g., race and FARM; aggressive/disruptive behavior and global academic performance, etc.), we perform a test that accounts for such correlations in a regression framework.²⁵ For neither ML nor the GBG do we find evidence of nonequivalent groups at baseline. Still, in all of our main models, we control for this set of demographic characteristics, CAT reading scores, and all four TOCA-R ratings from the fall of first grade.

Differential attrition and crossover throughout the treatment period (first and second grades) can also compromise the internal validity of experimental estimates. For example, if some of the students assigned to control classrooms for whom reading was a bit more of a struggle decided to switch classes or leave the school altogether, this would leave behind relatively more capable students in the control classrooms and cause us to potentially understate the full impact of the ML intervention. A similar story can be told for the GBG. Yet, in this study, we are able to observe long-run college-going outcomes from a source other than the students themselves. So, even if a student left her elementary school or exited the sample of Baltimore City Public Schools entirely, we will still capture outcome data for that student. Therefore, our ITT estimates will not suffer from bias attributable to differential attrition of treatment versus control students.²⁶

Long-Run Impacts on Postsecondary Attendance and Attainment

We now consider the impacts of the ML and GBG interventions on our postsecondary outcomes of interest. On average, 31% of our sample ever attended college and 11% ever earned a degree. Roughly equal proportions of students ever attended a 2-year college (19%) as ever attended a 4-year institution (20%). Of our sample, 8% earned a bachelor’s degree (or higher), while only about 2% earned an associate’s degree as their highest degree. As students can (and do) attend both a 4-year and a 2-year college, the proportions of students attending each of these types of college will not sum to the proportion ever attending any type of postsecondary institution. Students can also earn multiple degrees, though this is less common than attending multiple types of institutions.

Table 3 presents results from estimations of equation 1 on three outcomes of postsecondary attendance and three outcomes measuring degree attainment. In panel B, we add indicators for control students in treatment schools to assess spillover. In general, we see little evidence of spillover impacts of the ML intervention. For the GBG intervention, we see no evidence of spillover effects on college attendance, but noisy, suggestive evidence of a bit of spillover of GBG on the educational attainment of control students in GBG schools (specifically on ever earning a bachelor’s degree).

Table 3

Impacts of GBG and ML Interventions on Postsecondary Attendance and Attainment.

	Attendance			Attainment
	Attend any college ever	Attend 2-year college ever	Attend 4-year college ever	Earn any degree ever	Highest degree = associate’s	Highest degree = bachelor’s or more
Independent variable	(1)	(2)	(3)	(4)	(5)	(6)
A. No controls for internal control groups
GBG	.003 (.025)	.019 (.018)	.009 (.023)	.010 (.018)	.001 (.007)	.015 (.016)
ML	−.009 (.026)	.018 (.021)	−.009 (.021)	−.011 (.017)	−.003 (.008)	−.001 (.016)
B. Add controls for internal control groups
GBG	.004 (.032)	.017 (.019)	.013 (.029)	.026 (.025)	.003 (.008)	.024 (.023)
ML	−.008 (.033)	.017 (.022)	−.006 (.028)	.004 (.025)	−.001 (.009)	.007 (.024)
GBG—internal control	.008 (.037)	−.005 (.024)	.013 (.037)	.039 (.029)	−.002 (.013)	.037 (.028)
ML—internal control	−.005 (.030)	−.001 (.029)	−.000 (.027)	.018 (.026)	.008 (.011)	−.003 (.026)
Outcome mean for control students	.308	.189	.188	.110	.023	.079
n	2,311	2,311	2,311	2,311	2,311	2,311
R²	.091	.036	.100	.101	.013	.092

Note. All models are linear probability models estimated via OLS and control for the set of baseline achievement and demographic characteristics listed in Table 2. We account for students with missing data on FARM status (n = 6), reading test scores from the fall of Grade 1 (n = 43), and/or TOCA behavioral ratings in the fall of Grade 1 (n = 256) with dummy variables. Only 2 students are missing information on baseline reading test scores and baseline behavioral (TOCA) ratings. R² values correspond to the specifications estimated with indicators for internal control group membership. Standard errors clustered at the classroom-school-cohort level appear in parentheses. GBG = Good Behavior Game; ML = Mastery Learning; OLS = ordinary least squares; FARM = free and reduced meals; TOCA = Teacher Observation of Classroom Adaptation.

*p < .1. **p < .05. ***p < .01.

Across both panels, we see no statistically significant average impact of either ML or the GBG on the likelihood of college attendance or degree receipt. While our standard errors do not allow us to rule out small effects of either intervention on these postsecondary outcomes, most of the coefficients of interest are close to zero in magnitude. The few coefficients that may suggest positive (if noisy) intervention effects are for the impacts of GBG (and ML) on ever attending a 2-year institution and the impact of GBG on earning a bachelor’s degree. Yet, these are not statistically significant, and we conclude that neither intervention had an appreciable impact on average rates of college-going or degree receipt.

Heterogeneity in Effects Across Student Subgroups

Although we find little effect of the GBG and ML interventions on postsecondary attendance and attainment on average, these mean findings could obscure interesting and important heterogeneity in treatment impacts by subgroups of students. Prior work on the GBG and ML, as well as work examining other childhood programs, found clear differences in the effects of these programs on a host of outcomes by gender (Anderson, 2008; Dolan et al., 1993). Furthermore, the ML intervention in particular had differential short-run achievement effects by students’ initial level of academic preparedness (Dolan et al., 1993). Therefore, we investigate the impacts of both interventions separately for male and female students, and examine impacts by baseline student academic achievement level.

We turn to the subgroup-specific results presented in Tables 4 (by gender) and 5 (by achievement level).²⁷ For each subgroup, we estimate impacts of the GBG and ML on the same set of postsecondary outcomes as in Table 3 using specifications akin to equation 1.²⁸ To group students into “low-performing” and “high-performing” groups, we seek to use the maximum amount of reliable information related to student academic performance that exists in our data set. We also want to classify as many of the 2,311 students as possible, based on the available baseline achievement information. Fewer students are missing baseline reading test scores (n = 43) than behavioral ratings (n = 256). Therefore, we define “low-performing” students as those who have a baseline reading test score below the 33rd percentile. For those students with missing baseline test score information, we classify any student with a first-grade fall TOCA-R rating for “global academic achievement” that is below the 33rd percentile as “low-performing” (i.e., we split our sample into thirds).²⁹ We perform analogous steps to identify the “high-performing” students (i.e., those with a test score [or TOCA-R rating] above the 67th percentile). Those students remaining form our “middle-performing” group (i.e., between the 33rd and 67th percentiles). We examine the sensitivity of our findings to using baseline achievement halves and quartiles.³⁰ In both cases, the same patterns of findings emerge.
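The grouping rule described above can be sketched as a small function. The cutoff values in the example are hypothetical placeholders for the 33rd and 67th percentiles, and the treatment of students who sit exactly on a cutoff (assigned to the middle group) and of students missing both measures (left unclassified) are assumptions.

```python
def performance_group(reading, global_rating, reading_cuts, rating_cuts):
    """Assign a student to the 'low', 'middle', or 'high' baseline group.

    The reading score is used when available. Otherwise the teacher's global
    academic rating (higher is better) serves as the fallback measure.
    Each *_cuts argument is a (33rd percentile, 67th percentile) pair.
    """
    if reading is not None:
        score, (low_cut, high_cut) = reading, reading_cuts
    elif global_rating is not None:
        score, (low_cut, high_cut) = global_rating, rating_cuts
    else:
        return None  # assumption: no baseline information, not classified
    if score < low_cut:
        return "low"
    if score > high_cut:
        return "high"
    return "middle"

# Hypothetical cutoffs: standardized reading scores and 1-6 global ratings.
READING_CUTS = (-0.44, 0.44)
RATING_CUTS = (3.0, 5.0)
group = performance_group(None, 2.0, READING_CUTS, RATING_CUTS)
```

In this example the student has no reading score, so the global rating of 2.0 places her in the low-performing group.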

Table 4

Impacts of GBG and ML Interventions on Postsecondary Attendance and Attainment by Gender.

	Attendance			Attainment
	Attend any college ever	Attend 2-year college ever	Attend 4-year college ever	Earn any degree ever	Highest degree = Associate’s	Highest degree = Bachelor’s or more
Independent variable	(1)	(2)	(3)	(4)	(5)	(6)
Good Behavior Game
GBG female	.033 (.030)	.054** (.021)	.025 (.039)	.025 (.031)	.011 (.014)	.020 (.026)
GBG male	−.026 (.036)	−.016 (.027)	−.006 (.024)	−.005 (.017)	−.009 (.006)	.011 (.017)
Mastery Learning
ML female	.001 (.039)	.029 (.035)	−.001 (.032)	−.002 (.026)	−.008 (.012)	.013 (.025)
ML male	−.018 (.031)	.008 (.024)	−.018 (.023)	−.019 (.020)	.004 (.011)	−.015 (.017)
H₀: (GBG female) = (GBG male)	p = .18	p = .04	p = .51	p = .35	p = .21	p = .75
H₀: (ML female) = (ML male)	p = .70	p = .62	p = .64	p = .58	p = .46	p = .33
Outcome mean for female control students	.37	.23	.23	.14	.03	.10
Outcome mean for male control students	.25	.15	.15	.08	.02	.06
n	2,311	2,311	2,311	2,311	2,311	2,311
R²	.091	.037	.100	.100	.014	.090

Note. All models are linear probability models estimated via OLS and control for the set of baseline achievement and demographic characteristics listed in Table 2. We account for students with missing data on FARM status (n = 6), reading test scores from the fall of Grade 1 (n = 43), and/or TOCA behavioral ratings in the fall of Grade 1 (n = 256) with dummy variables. Only 2 students are missing information on baseline reading test scores and baseline behavioral (TOCA) ratings. Standard errors clustered at the classroom-school-cohort level appear in parentheses. GBG = Good Behavior Game; ML = Mastery Learning; OLS = ordinary least squares; FARM = free and reduced meals; TOCA = Teacher Observation of Classroom Adaptation.

*p < .1. **p < .05. ***p < .01.

Table 5

Impacts of GBG and ML Interventions on Postsecondary Attendance and Attainment by Initial Performance Level.

	Attendance			Attainment
	Attend any college ever	Attend 2-year college ever	Attend 4-year college ever	Earn any degree ever	Highest degree = Associate’s	Highest degree = Bachelor’s or more
Independent variable	(1)	(2)	(3)	(4)	(5)	(6)
Good Behavior Game
GBG bottom third	.066 (.048)	.070* (.036)	.034 (.034)	.013 (.026)	.007 (.016)	.007 (.019)
GBG middle third	−.046 (.043)	.016 (.029)	−.033 (.036)	−.024 (.027)	−.010 (.012)	−.009 (.023)
GBG top third	−.017 (.043)	−.039 (.036)	.030 (.041)	.049 (.044)	.005 (.015)	.057 (.038)
Mastery Learning
ML bottom third	−.051 (.031)	−.013 (.024)	−.021 (.029)	−.037* (.020)	−.016 (.010)	−.017 (.017)
ML middle third	−.027 (.044)	−.012 (.036)	.033 (.040)	.022 (.036)	.002 (.017)	.025 (.028)
ML top third	.046 (.046)	.076* (.039)	−.042 (.039)	−.016 (.035)	.005 (.016)	−.009 (.032)
H₀: (GBG bottom third) = (GBG top third)	p = .19	p = .03	p = .94	p = .51	p = .93	p = .27
H₀: (ML bottom third) = (ML top third)	p = .08	p = .06	p = .65	p = .61	p = .31	p = .84
Outcome mean for control students in the:
Bottom third	.20	.13	.09	.06	.02	.03
Middle third	.34	.21	.19	.09	.02	.06
Top third	.41	.23	.29	.19	.03	.15
n	2,311	2,311	2,311	2,311	2,311	2,311
R²	.095	.042	.102	.104	.014	.094

Note. All models are linear probability models estimated via OLS and control for the set of baseline achievement and demographic characteristics listed in Table 2. We account for students with missing data on FARM status (n = 6), reading test scores from the fall of Grade 1 (n = 43), and/or TOCA behavioral ratings in the fall of Grade 1 (n = 256) with dummy variables. Only two students are missing information on baseline reading test scores and baseline behavioral (TOCA) ratings. Standard errors clustered at the classroom-school-cohort level appear in parentheses. GBG = Good Behavior Game; ML = Mastery Learning; OLS = ordinary least squares; FARM = free and reduced meals; TOCA = Teacher Observation of Classroom Adaptation.

*p < .1. **p < .05. ***p < .01.

Our clearest finding is that exposure to the GBG intervention increases the likelihood of college attendance for females, especially at 2-year institutions (Table 4). Specifically, females in GBG classrooms are about 5 percentage points (i.e., about 23%) more likely to ever attend a 2-year college, relative to control female students. In contrast, we find no impact of the GBG or ML on the postsecondary attendance and attainment of male students.
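The relative effect quoted above follows from dividing the point estimate for females in Table 4, Column 2, by the corresponding outcome mean for female control students:

```latex
\frac{\hat{\beta}_{\text{GBG, female}}}{\bar{Y}_{\text{control, female}}}
  = \frac{0.054}{0.23} \approx 0.23
```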

When we examine impacts of the GBG and ML on students of varying levels of baseline academic achievement, our results are less precise and clear. Still, the estimates in Table 5 suggest that the GBG may have benefited low-performing students (in terms of college-going) more than middle- to high-performers; while the ML intervention appears to have benefited middle- to high-performers relatively more than low-performers. For example, we estimate that exposure to the GBG increases the likelihood that a low-performing student ever attends a 2-year college by about 7 percentage points. This point estimate is statistically different from the corresponding estimate for high-achievers (which itself is negative and insignificant). In contrast, the ML intervention increases the likelihood of college-going for relatively high-achieving students by about the same magnitude, but may lower the likelihood of degree receipt for low-achievers.

Taken together, these results highlight the fact that different types of elementary school interventions (i.e., behavioral vs. curricular) are likely to have differential effects on students with different levels of academic preparation at the time of the intervention. In our case, it seems that the behavioral intervention was relatively more successful with low-performing students, while the curricular intervention may have benefited more academically prepared students. This global pattern of findings makes sense if the types of abilities targeted by the behavioral intervention (i.e., attention, self-control) are prerequisites for benefiting from curricular changes and low-achieving students tend to lack more of these skills than their higher-achieving counterparts. While only speculative, these results lead to questions about sequencing of interventions of different types and of differential targeting of student groups with different interventions. These are interesting and important questions for future work that is able to compare the effects of different types of elementary school policies and programs.

These subgroup findings present a more nuanced picture of how the GBG and ML interventions affected the postsecondary educational experiences of students, but shed little light on the mechanisms that may be driving such effects. In an attempt to advance our understanding of potential mechanisms that may underlie these long-run impacts, we turn to an exploration of overlap between short-run impacts on achievement and behavior and long-run impacts on college-going.

Can Short-Term Impacts Predict Long-Term Effects?

Given the difficulty, in terms of financial and time costs, of obtaining reliable long-run information on outcomes of interest related to interventions early in children’s lives, policymakers often wonder whether long-run educational effects of interventions could simply have been predicted by focusing on short-run impacts, namely short-run test scores. Recent work by Duncan, Dowsett, et al. (2007) on the relationship between measures of school readiness and later academic performance finds a variety of measures of socio-emotional behaviors to be poor predictors of later achievement. Instead, the authors find that math and reading skills measured at school entry (i.e., usually the fall of kindergarten) are consistently and significantly predictive of levels of academic performance in later grades. The one exception to the inability of noncognitive measures to predict later academic performance was attention-related skills: Better attention skills at school entry were moderately predictive of higher levels of later academic achievement.

These recent findings are quite provocative—especially as behavioral measures are very time-intensive and often quite expensive for researchers to collect. Although the authors conducted a careful meta-analysis across six different longitudinal studies to draw the above conclusions, it is difficult to know whether their findings are due to the fact that socio-emotional measures are generally poor predictors of later academic outcomes, or whether the measures in these surveys are simply poor representations of the underlying constructs of interest. For example, the authors mention that a number of their socio-emotional measures are counts of student problems, not comprehensively developed indices. Therefore, this construction may limit the range of behaviors such variables can capture. Furthermore, their study focuses on “skills and behaviors that emerge at the time of school entry and not on the effects of socio-emotional behaviors that emerge after children enter school” (p. 1443). Perhaps the importance of such abilities grows as students master basic academic skills. Finally, Duncan, Dowsett, et al. (2007) limit their outcomes to math and reading test scores in later grades. Perhaps socio-emotional skills are more important for some types of academic outcomes than others.

As a way to explore such questions in the context of the ML and GBG interventions and postsecondary outcomes, we use information on reading test scores of the students in our sample from the spring of first and second grade in conjunction with the four TOCA-R behavioral ratings (aggressive/disruptive, concentration problems, shy behavior, and global academic performance). We focus on the subgroups for which we find positive and statistically significant effects of either intervention.

In Table 6, we attempt to explore the power of short-run test score and behavioral effects of ML and the GBG to “explain away” the long-run postsecondary impacts of these same interventions.³¹ For each outcome (and subgroup) we begin by re-estimating the main model that generates the long-run postsecondary impact of interest. For example, the top coefficient in Table 6, Column 1, is the point estimate on GBG for female students from Table 4, Column 2. First, we add to this model our measures of short-run reading effects (first- and second-grade CAT reading scores) and observe what happens to the coefficient of interest. Second, we remove the short-run test scores and instead add controls for the vector of behavioral effects during these same grades—again observing if, and by how much, the main intervention effect is attenuated. Finally, we estimate a model that includes the short-run test score and short-run behavioral measures. We follow this same process to relate short-run impacts of GBG and ML to college attendance for low- and high-achievers, respectively.
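The attenuation measure reported in the lower panel of Table 6 can be reproduced with a short calculation: the proportional drop in the intervention coefficient once the short-run measures are added. The sketch below uses the rounded coefficients printed in Table 6. Flooring the share at zero when the coefficient rises is an assumption inferred from the 0.0 entries in the table, and the function name is hypothetical.

```python
def share_explained(baseline_coef, adjusted_coef):
    """Percent of the baseline impact absorbed after adding short-run controls."""
    reduction = (baseline_coef - adjusted_coef) / baseline_coef * 100
    # Assumption: an increase in the coefficient is reported as 0% explained.
    return round(max(0.0, reduction), 1)

# Rounded coefficients from Table 6: preferred model, then models adding
# short-run reading scores, behavioral measures, and both sets together.
table6 = {
    "GBG, females": (0.054, 0.053, 0.053, 0.050),
    "GBG, low-achievers": (0.070, 0.072, 0.069, 0.071),
    "ML, high-achievers": (0.076, 0.068, 0.065, 0.057),
}
shares = {
    label: [share_explained(coefs[0], c) for c in coefs[1:]]
    for label, coefs in table6.items()
}
```

For females assigned to the GBG, for example, the coefficient falls from .054 to .050 when all short-run measures are included, so about 7.4% of the long-run impact is accounted for.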

Table 6

Short-Run Impacts and Long-Run Postsecondary Outcomes: Effects of Introducing Elementary School Measures.

Intervention	GBG		ML
Subgroup	Females	Low-achievers (bottom third)	High-achievers (top third)
Outcome	Attend 2-year college ever	Attend 2-year college ever	Attend 2-year college ever
Impact estimate	(1)	(2)	(3)
Preferred model (coefficient on ML or GBG)	.054** (.021)	.070* (.036)	.076* (.039)
Add SR reading test scores	.053** (.021)	.072* (.036)	.068* (.040)
Add SR behavioral measures	.053** (.022)	.069* (.037)	.065 (.040)
Add SR reading test score AND SR behavioral measures	.050** (.022)	.071* (.037)	.057 (.040)
Reduction in intervention impact
% explained by SR reading test scores	1.9	0.0	10.5
% explained by SR behavioral measures	1.9	1.4	14.5
% explained by all SR measures	7.4	0.0	25.0

Note. All models are linear probability models estimated via OLS and include all baseline student-level demographic, behavioral, and achievement controls listed in Table 2. GBG = Good Behavior Game; ML = Mastery Learning; SR reading test scores = standardized reading test scores in the spring of Grades 1 and 2; SR behavioral measures = standardized aggression/disruption, concentration problems, shy behavior, and global academic teacher ratings in spring of Grades 1 and 2; OLS = ordinary least squares.

*p < .1. **p < .05. ***p < .01.
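The bottom panel of Table 6 can be reproduced from the coefficients in the top panel. The sketch below is only an illustration of that arithmetic, not the authors' code: the names `TABLE6`, `percent_explained`, and `summarize` are hypothetical, and flooring the reduction at zero when a coefficient grows is an assumption inferred from the 0.0 entries in the table.

```python
# Coefficients copied from the top panel of Table 6.
# "preferred" is the baseline estimate; the other keys are the estimates
# after adding short-run (SR) reading scores, SR behavioral measures, or both.
TABLE6 = {
    "GBG, females": {
        "preferred": 0.054, "reading": 0.053, "behavior": 0.053, "both": 0.050,
    },
    "GBG, low-achievers": {
        "preferred": 0.070, "reading": 0.072, "behavior": 0.069, "both": 0.071,
    },
    "ML, high-achievers": {
        "preferred": 0.076, "reading": 0.068, "behavior": 0.065, "both": 0.057,
    },
}


def percent_explained(preferred, augmented):
    """Percentage drop in the intervention coefficient after adding controls.

    Assumption: when the coefficient rises instead of falling, the share
    explained is reported as zero (this matches the 0.0 entries in Table 6).
    """
    reduction = (preferred - augmented) / preferred * 100
    return round(max(reduction, 0.0), 1)


def summarize(table):
    """Return the percent explained for each subgroup and set of controls."""
    return {
        group: {
            key: percent_explained(coefs["preferred"], coefs[key])
            for key in ("reading", "behavior", "both")
        }
        for group, coefs in table.items()
    }


if __name__ == "__main__":
    for group, shares in summarize(TABLE6).items():
        print(group, shares)
```

Under these assumptions the computed shares agree with the percentages reported in Table 6 for all three subgroups.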

Several conclusions and questions emerge from the results in Table 6. First, we are able to explain a greater share of the long-run impact of ML on college attendance for high-achievers than for any other subgroup-outcome combination. Yet, even after accounting for short-run impacts on reading test scores and a variety of behavioral measures, we capture only about 25% of the long-run effect on college attendance for this group. Second, these same short-run measures do even less well in capturing the long-run postsecondary effects of the GBG intervention: together they account for about 7% of the effect for females and essentially none of the effect for low-achievers. This may suggest that the long-run impacts of the GBG intervention operate largely through pathways other than short-run test score effects and/or short-run changes in ostensibly observable behavioral skills. This finding underscores the need for more work, especially in experimental settings, that seeks to understand the relationship between short-run effects along a variety of dimensions and long-term outcomes for children. Because we are limited to behavioral ratings and test scores from the spring of first and second grade, we wonder whether effects that develop immediately after the treatment period (i.e., third grade through middle school) would increase the ability of these test score (and/or behavioral) measures to “explain away” the long-run intervention impacts on college-going and degree receipt. Given that past work on the GBG has found a variety of short- and medium-run impacts on measures of physical and mental health, as well as on violent and criminal behavior, we also wonder whether the more salient mechanisms underlying these long-run college effects operate through such outcomes, especially for low-achieving students.

Overall, we find that measures of socio-emotional and behavioral skills as well as short-run test scores are moderately predictive of long-run academic outcomes. Yet this conclusion is limited to relatively high-achieving urban children. Interestingly, in their paper relating small classes to postsecondary outcomes using a similar analytic approach, Dynarski et al. (2011) found that contemporaneous impacts of small class sizes on K–3 test scores almost fully explain the long-run college impacts they observe. These two sets of findings are not necessarily contradictory. While both look at similar postsecondary outcomes, the interventions of focus are quite different. Perhaps assignment to a small class affects later outcomes largely through academic skills, and the tests that STAR students took did a good job of capturing those skills. Our interventions were a bit more targeted, and it is reasonable to expect them to affect students along different dimensions. So, the causal pathways may differ, the testing instruments differ, and the student populations differ. Yet, together, these results do suggest that we need to develop a better understanding of how different sets of short-run outcomes relate to specific long-term impacts, under what conditions, and for which groups of students.

Discussion and Conclusion

In this article, we investigate the long-run impacts of two widely applicable light-touch childhood educational interventions—one behavioral (GBG) and one curricular (ML)—on postsecondary enrollment and degree completion. While we find no average impact of either intervention on the likelihood of college attendance or degree receipt, this obscures interesting and important heterogeneity in treatment effects across different student subgroups.

We find that assignment to a GBG classroom increases the likelihood of ever attending college for female students, with this increase concentrated along the 2-year sector. Specifically, exposure to the GBG intervention increases the likelihood that a female student ever attends a 2-year college by 5.4 percentage points. We find no evidence that either intervention benefits male students in terms of our long-run postsecondary outcomes of interest.

We find suggestive evidence that exposure to the GBG benefits low-performing students relatively more than high-performers and that exposure to the ML intervention influences college outcomes more for middle- to high-performing students than for low-performers. Collectively, these results underscore the need for future work to better understand (and estimate) the effects of various types of elementary school interventions (e.g., curricular, behavioral) on different subgroups of students. Our findings suggest that heterogeneity in effects of interventions of different types is especially relevant for students with different levels of preintervention academic preparation.

When we map contemporaneous measures of reading achievement and behavior onto these long-run effects, we find that short-run measures of behavioral (and global academic) skills and test scores have fairly little utility in accounting for the long-run postsecondary effects we uncover. Understanding differences in the evolution of treatment effects for different subgroups of students is clearly another important avenue for future research.

Under ideal circumstances, we would have detailed cost data for both interventions. Unfortunately, discussions with those in charge of the intervention implementation and data collection revealed that no official cost data exist. Even without reliable cost information, we can still explore the relative cost-effectiveness of these light-touch interventions. We focus on the intervention for which we observe the clearest statistical evidence of effects for a subgroup of students—the GBG—and roughly compare the cost-effectiveness of this intervention with reducing elementary school class sizes (the only other K–5 randomized intervention to examine long-run postsecondary impacts).

We parameterize the benefits associated with the effects of each intervention using information on the lifetime earnings associated with various levels of postsecondary education. While there is a clear earnings premium to associate’s and bachelor’s degrees, there is also evidence that attending college for some period of time is associated with greater lifetime earnings relative to earning a high school diploma and to not completing high school (Carnevale, Rose, & Cheah, 2011). The difference in estimated lifetime earnings between individuals with a high school diploma and those who attend some college (but receive no degree) is US$260,000 in 2012 dollars (Carnevale et al., 2011).³² This premium is even greater when high school dropouts are the comparison group: US$614,000. If we take the earnings premium to “some college” over a high school diploma for females and adjust this figure to its present discounted value following the suggested approach in Carnevale et al. (2011, p. 22), we arrive at a premium of about US$137,000. If we do the same for the female-specific premium relative to high school dropout, the figure is US$346,000.

In the absence of cost information on the GBG, we can take our preferred point estimate for females (i.e., an increase of about 5 percentage points or .13 standard deviations in the likelihood of attending a 2-year college) and solve for the “break-even” cost. In this case, the GBG would need to cost about US$3,700 (= US$137,000 × (.054 / 2)) or less per student³³ for the anticipated net benefit per female student to be zero or greater (based on a female-specific premium relative to a high school graduate, the most conservative estimate). In our discussion with Hopkins researchers, all estimates of the total direct costs of the GBG were well below this threshold. Even if the GBG were to cost US$3,000 per student (i.e., a very high estimate for such a light-touch intervention), this would imply a cost of about US$111,000 per female student caused to attend college who would not otherwise do so (= US$3,000 / (.054 / 2)). Compare this with the cost per student induced into college by reducing elementary school class sizes: US$443,000 (based on Dynarski et al., 2011).
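The break-even arithmetic in this paragraph can be written out explicitly. The following is a minimal sketch under stated assumptions rather than the authors' calculation: the function names are hypothetical, and reading the division by 2 as "roughly half of the students exposed to the GBG are female" is an assumption, because the accompanying note is not reproduced in this section.

```python
# Inputs taken from the text.
PREMIUM_FEMALE_VS_HS = 137_000   # discounted "some college" premium for females (US$)
EFFECT_FEMALE = 0.054            # GBG effect on 2-year attendance for females
# Assumption: the "/ 2" in the text reflects that about half of all
# students who receive the GBG are female and thus stand to benefit.
SHARE_FEMALE = 0.5


def break_even_cost(premium, effect, share_affected):
    """Highest per-student cost at which expected benefits still cover costs."""
    return premium * effect * share_affected


def cost_per_induced_student(cost_per_student, effect, share_affected):
    """Program cost per additional student induced to attend college."""
    return cost_per_student / (effect * share_affected)


if __name__ == "__main__":
    threshold = break_even_cost(PREMIUM_FEMALE_VS_HS, EFFECT_FEMALE, SHARE_FEMALE)
    per_induced = cost_per_induced_student(3_000, EFFECT_FEMALE, SHARE_FEMALE)
    print(f"Break-even cost per student: US${threshold:,.0f}")
    print(f"Cost per induced attendee at US$3,000 per student: US${per_induced:,.0f}")
```

With these inputs the break-even cost is about US$3,700 and the cost per induced attendee is about US$111,000, in line with the figures quoted above.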

Clearly there are important differences between the effects of a global class size reduction and a light-touch intervention like the GBG. For one, the long-run impact of a smaller elementary school class on college-going was an overall effect, not an impact isolated to a particular subgroup of students. However, our main goal here is to illustrate the potential cost-effectiveness of early light-touch interventions like the GBG (at least where the outcome of interest is college attendance), especially if such interventions could be targeted at groups most likely to benefit. The success of these types of interventions is likely to vary with the fidelity of implementation (e.g., trainer quality, engagement level of teachers), and the specific student population under study. This underscores the need for additional research to replicate and challenge the long-run findings we uncover for the GBG and ML interventions—in different settings, with different students, and over different time horizons.

Footnotes

Acknowledgements

The authors thank Larry Dolan and Lisa Ulmer for taking time to talk with us about details of intervention implementation and particular measures in the Johns Hopkins Prevention Research Center (PRC) data. We are also grateful for helpful comments and suggestions from Sue Dynarski, Maria Fitzpatrick, Nick Ialongo, Brian Jacob, David Salkever, and Kevin Stange; seminar participants at Johns Hopkins University, University of Michigan, and the Association for Education Finance and Policy (AEFP) spring meetings in Boston, MA; as well as three anonymous referees. This work would not have been possible without the support and collaboration of the Baltimore City Public Schools, and the parents, children, teachers, principals, school psychologists, and social workers who participated. Of course, any errors and all opinions are those of the authors.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research has been supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305B110001 to the University of Michigan. The opinions expressed are those of the authors and do not represent views of the Institute of Education Sciences or the U.S. Department of Education. This work has also been supported by the National Institute of Mental Health grants NIMH 5 PO MH38725, Epidemiologic Prevention Center for Early Risk Behaviors, Sheppard G. Kellam, P.I.; R01 MH42968, Periodic Followup of Two Preventive Trials, Sheppard G. Kellam, P.I.; R01 MH 4296806A2, Development & Malleability from Childhood to Adulthood, Sheppard G. Kellam, P.I.; and National Institute of Drug Abuse awards RO1 DA09592, Transitions to Adulthood, James C. Anthony, P.I.; and R01 DA009897, Risks for Transitions in Drug Use in Urban Adults, William W. Eaton, P.I. Other principal collaborators include Nicholas Ialongo, Lisa Werthamer, Hendricks Brown, Lawrence Dolan, and Jeanne Poduska. Hemelt also gratefully acknowledges financial support from the W.E. Upjohn Institute for Employment Research through Grant 12-137-06.

Notes

Author Biographies

STEVEN W. HEMELT is Postdoctoral Research Fellow, Gerald R. Ford School of Public Policy, University of Michigan, 735 South State Street, Ann Arbor, MI 48109; hemelts@umich.edu. His research focuses on the economics of education, education policy, and program evaluation.

KIMBERLY B. ROTH is Research Associate, Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, 624 N. Broadway, Baltimore, MD 21205. Her research involves the epidemiology of mental and behavioral disorders, with specific interests in mood disorders, suicide, and Hispanic and low-resource populations.

WILLIAM W. EATON is Professor, Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, 624 N. Broadway, Baltimore, MD 21205. The bulk of his research is grounded in psychiatric epidemiology, with a specific focus on schizophrenia as well as on common mental disorders such as major depressive disorder and anxiety disorders.

References

Anderson, M. L. (2008). Multiple inference and gender differences in the effects of early intervention: A reevaluation of the Abecedarian, Perry Preschool, and early training projects. Journal of the American Statistical Association, 103, 1481–1495.

Barrish, H. H., Saunders, & Wolf, M. M. (1969). Good behavior game: Effects of individual contingencies for group consequences on disruptive behavior in a classroom. Journal of Applied Behavior Analysis, 2, 119–124.

Belfield, C. R., Nores, Barnett, & Schweinhart (2006). The High/Scope Perry Preschool Program: Cost benefit analysis using data from the age-40 follow-up. Journal of Human Resources, 41, 162–190.

Block, J. H., & Burns, R. B. (1976). Mastery Learning. Review of Research in Education, 4, 3–49.

Bloom, B. S. (1971). Mastery learning. In J. H. Block (Ed.), Mastery Learning: Theory and practice (pp. 47–63). New York, NY: Holt, Rinehart & Winston.

Bradshaw, C. P., Zmuda, J. H., Kellam, S. G., & Ialongo, N. S. (2009). Longitudinal impact of two universal preventive interventions in first grade on educational outcomes in high school. Journal of Educational Psychology, 101, 926–937.

Bureau of Labor Statistics. (2011). Retrieved from http://www.bls.gov/emp/ep_chart_001.htm

Camilli, Vargas, Ryan, & Barnett, W. S. (2010). Meta-analysis of the effects of early education interventions on cognitive and social development. Teachers College Record, 112, 579–620.

Carnevale, A. P., Rose, S. J., & Cheah (2011). The college payoff: Education, occupations, lifetime earnings. Washington, DC: Center on Education and the Workforce, Georgetown University. Retrieved from http://cew.georgetown.edu/collegepayoff/

Center on the Developing Child at Harvard University. (2011). Building the brain’s “air traffic control” system: How early experiences shape the development of executive function (Working Paper No. 11). Retrieved from http://www.developingchild.harvard.edu

Chetty, Friedman, J. N., Hilger, Saez, Schanzenbach, D. W., & Yagan (2011). How does your kindergarten classroom affect your earnings? Evidence from Project STAR. Quarterly Journal of Economics, 126, 1593–1660.

Crijnen, A. A. M., Feehan, & Kellam, S. G. (1998). The course and malleability of reading achievement in elementary school: The application of growth curve modeling in the evaluation of a Mastery Learning intervention. Learning and Individual Differences, 10, 137–157.

Currie (2001). Early childhood education programs. Journal of Economic Perspectives, 15, 213–238.

Currie & Neidell (2005). Getting inside the “Black box” of Head Start quality: What matters and what doesn’t. Economics of Education Review, 26, 83–99.

Currie & Thomas (1995). Does Head Start make a difference? American Economic Review, 85, 341–364.

Davis & Sorrell (1995). Mastery Learning in public schools: Educational psychology interactive. Valdosta, GA: Valdosta State University.

Deming, D. J. (2009). Early childhood intervention and life-cycle skill development: Evidence from Head Start. American Economic Journal: Applied Economics, 1, 111–134.

Deming, D. J., Hastings, J. S., Kane, T. J., & Staiger, D. O. (2011). School choice, school quality and postsecondary attainment (NBER Working Paper Series No. 17438). Cambridge, MA: National Bureau of Economic Research. Retrieved from http://www.nber.org/papers/w17438.pdf

Dolan, L. J., Ford, Newton, & Kellam, S. G. (1989). The Mastery Learning manual. Unpublished manuscript.

Dolan, L. J., Kellam, S. G., Brown, C. H., Werthamer-Larsson, Rebok, G. W., Mayer, L. S., . . . Wheeler (1993). The short-term impact of two classroom-based preventive interventions on aggressive and shy behaviors and poor achievement. Journal of Applied Developmental Psychology, 14, 317–345.

Duncan, G. J., Dowsett, C. J., Claessens, Magnuson, Huston, A. C., Klebanov, . . . Duckworth (2007). School readiness and later achievement. Developmental Psychology, 43, 1428–1446.

Duncan, G. J., Ludwig, & Magnuson, K. A. (2007). Reducing poverty through preschool interventions. Future of Children, 17, 143–160.

Dynarski, Hemelt, S. W., & Hyman (2012). Data watch: Using national student clearinghouse data to track postsecondary outcomes. Unpublished manuscript.

Dynarski, Hyman, & Schanzenbach, D. W. (2011). Experimental evidence on the effect of childhood investments on postsecondary attainment and degree completion (NBER Working Paper Series No. 17533). Cambridge, MA: National Bureau of Economic Research. Retrieved from http://www.nber.org/papers/w17533

Eaton, W. W., Ialongo, Storr, & Roth, K. B. (2011). Prevention research center cohort 1 and 2 data user’s guide. Unpublished manuscript.

Economic Policy Institute. (2007). State of working America data. Retrieved from http://www.epi.org/resources/research_data/state_of_working_america_data/

Garces, Thomas, & Currie (2002). Longer-term effects of Head Start. American Economic Review, 92, 999–1012.

Heckman, J. J., Krueger, A. B., & Friedman, B. M. (2003). Human capital policy. In B. M. Friedman (Ed.), Inequality in America: What role for human capital policies? Cambridge, MA: MIT Press.

Huston, A. C., Gupta, A. E., Walker, J. T., Dowsett, C. J., Epps, S. R., Imes, A. E., & McLoyd, V. C. (2011). The long-term effects on children and adolescents of a policy providing work supports for low-income parents. Journal of Policy Analysis and Management, 30, 729–754.

Jacob, B. A., Lefgren, & Sims, D. P. (2010). The persistence of teacher-induced learning gains. Journal of Human Resources, 45, 915–943.

Jones, S. M., Aber, J. L., & Brown, J. L. (2011). Two-year impacts of a universal school-based social-emotional and literacy intervention: An experiment in translational developmental research. Child Development, 82, 533–554.

Kellam, S. G., Brown, C. H., Poduska, J. M., Ialongo, N. S., Wang, Toyinbo, . . . Wilcox, H. C. (2008). Effects of a universal classroom behavior management program in first and second grades on young adult behavioral, psychiatric, and social outcomes. Drug and Alcohol Dependence, 95(Suppl. 1), S5–S28.

Kellam, S. G., Rebok, G. W., Mayer, L. S., Ialongo, & Kalodner, C. R. (1994). Depressive symptoms over first grade and their response to a developmental epidemiologically based preventive trial aimed at improving achievement. Development and Psychopathology, 6, 463–481.

Kleinman, K. E., & Saigh, P. A. (2011). The effects of the Good Behavior Game on the conduct of regular education New York City high school students. Behavior Modification, 35, 95–105.

Krueger, A. B., & Whitmore, D. M. (2001). The effect of attending a small class in the early grades on college-test taking and middle school test results: Evidence from Project STAR. Economic Journal, 111, 1–28.

Kulik, C. C., Kulik, J. A., & Bangert-Drowns, R. L. (1990). Effectiveness of Mastery Learning programs: A meta-analysis. Review of Educational Research, 60, 265–299.

Ludwig & Miller, D. L. (2007). Does Head Start improve children’s life chances? Evidence from a regression discontinuity design. Quarterly Journal of Economics, 122, 159–208.

Marcotte, D. E., & Hansen (2010). Time for school? Education Next, 10, 52–59.

National Student Clearinghouse. (2012). Impact of directory information blocks on StudentTracker Results. NSC Working Paper. Retrieved from http://research.studentclearinghouse.org/files/NSC_Directory_Block_Rates.pdf

Petras, Kellam, S. G., Brown, C. H., Muthén, B. O., Ialongo, N. S., & Poduska, J. M. (2008). Developmental epidemiological courses leading to antisocial personality disorder and violent and criminal behavior: Effects by young adulthood of a universal preventive intervention in first- and second-grade classrooms. Drug and Alcohol Dependence, 95(Suppl. 1), S45–S59.

Poduska, J. M., Kellam, S. G., Wang, Brown, C. H., Ialongo, N. S., & Toyinbo (2008). Impact of the Good Behavior Game, a universal classroom-based behavior intervention, on young adult service use for problems with emotions, behavior, or drugs or alcohol. Drug and Alcohol Dependence, 95(Suppl. 1), S29–S44.

Robinson (1992). Mastery Learning in public schools: Some areas of restructuring. Education, 113, 121–126.

Rohrbeck, C. A., Ginsburg-Block, M. D., Fantuzzo, J. W., & Miller, T. R. (2003). Peer-assisted learning interventions with elementary school students: A meta-analytic review. Journal of Educational Psychology, 95, 240–257.

Schweinhart, L. J., Montie, Xiang, Barnett, W. S., Belfield, C. R., & Nores (2005). Lifetime effects: The High/Scope Perry Preschool Study through age 40. Ypsilanti, MI: HighScope Press.

Shonkoff, J. P. (2012). Leveraging the biology of adversity to address the roots of disparities in health and development. Proceedings of the National Academy of Sciences of the United States of America, 109(Suppl. 2), 17302–17307.

Tingstrom, D. H., Sterling-Turner, H. E., & Wilczynski, S. M. (2006). The Good Behavior Game: 1969–2002. Behavior Modification, 30, 225–253.

U.S. Department of Health and Human Services. (2010a). Head Start impact study final report (Government). Washington, DC: Administration for Children and Families, Office of Planning, Research and Evaluation.

U.S. Department of Health and Human Services. (2010b). Head Start program fact sheet. Retrieved from http://eclkc.ohs.acf.hhs.gov/hslc/mr/factsheets/fHeadStartProgr.htm

Werthamer-Larsson, Kellam, & Wheeler (1991). Effect of first-grade classroom environment on shy behavior, aggressive behavior, and concentration problems. American Journal of Community Psychology, 19, 585–602.

Wilcox, H. C., Kellam, S. G., Brown, C. H., Poduska, J. M., Ialongo, N. S., Wang, & Anthony, J. C. (2008). The impact of two universal randomized first- and second-grade classroom interventions on young adult suicide ideation and attempts. Drug and Alcohol Dependence, 95(Suppl. 1), S60–S73.