Abstract
We use data from one of the few states where information on curriculum adoptions is available—Indiana—to empirically evaluate differences in performance across three elementary-mathematics curricula. The three curricula that we evaluate were popular nationally during the time of our study, and two of the three remain popular today. We find large differences in effectiveness between the curricula, most notably between the two that held the largest market shares in Indiana. Both are best characterized as traditional in pedagogy. We also show that the publisher of the least-effective curriculum did not lose market share in Indiana in the following adoption cycle; one explanation is that educational decision makers lack information about differences in curricular effectiveness.
I. Introduction
According to a 2002 survey sponsored by the National Education Association and the Association of American Publishers, 80% of teachers use textbooks in the classroom and over half of students’ in-class instructional time involves textbook use (Finn, 2004). 1 Braswell et al. (2001) report that 56% of fourth graders do math problems from their textbooks every day. Given the central role that curriculum materials play in the education production process, it stands to reason that differences across curricula in terms of content, organization, and pedagogy can lead to differences in student achievement. This sentiment is echoed in a recent research brief from the National Council of Teachers of Mathematics (NCTM, 2009), which notes that selecting a math curriculum is “one of the most critical decisions educational leaders make” (p. 1).
The curriculum market is diverse—in the case of elementary mathematics, for example, the What Works Clearinghouse (WWC) has identified over 70 different curriculum options. 2 But there are few rigorous, empirical evaluations of curricular effectiveness; the research literature is surprisingly thin. One reason is that most state education agencies do not provide information about which curricula are used in which schools and districts. In fact, many states do not collect centralized data at all. The lack of data prevents empirical analyses, and as a result, there is little in the way of reliable evidence on curricular effectiveness (Slavin & Lake, 2008; WWC, 2007). This limits the ability of educational administrators to make informed curriculum-adoption decisions.
This study makes two contributions to the research literature on curricular effectiveness. First, we use data from one of two states that track curriculum adoptions over time (Indiana) to estimate differences in effectiveness between three elementary-mathematics curricula. Each of the curricula had large, national market shares during the time of our study (1998–2004), and two of the three have large market shares today. The three curricula differ in organization and pedagogy and share similarities with other curricula that we do not evaluate directly. A notable and ongoing disagreement in the literature is between advocates of “traditional” and “reform” approaches to mathematics instruction. A key insight from our analysis is that there can be large differences in effectiveness between curricula that share the same pedagogical approach.
A second contribution of our study relates to the larger issue that the research literature in this area is so thin. There are too many curriculum options within any given subject-grade group, including elementary mathematics, for a single study to cover them all. Moreover, a single study cannot replicate the variety of educational environments in which curricula are used, which is important given that curricula may perform differently in different contexts. But a series of independent evaluations from multiple contexts, taken together, could provide valuable information about the effectiveness of the various curricular alternatives. With this in mind, we provide extensive technical details regarding our evaluation so that it can be used as a resource for future, similar studies (some of these details are provided in the online appendix at http://epa.sagepub.com). Every evaluation environment will be different, but empirical studies along the lines of what we present here are likely to be feasible in many states. Furthermore, they would be relatively inexpensive to perform. If state education agencies would simply begin collecting data on curriculum adoptions and make these data available, studies could be produced that would arm decision makers with valuable information on this important topic.
We highlight two key findings from this particular evaluation. First, we identify statistically significant and meaningful differences in curriculum performance as measured by school-level test scores on the Indiana state test (ISTEP). We find the most substantial differences between two curricula that use the same pedagogical approach (traditional) but differ in other respects. Although much attention is devoted to the debate over traditional- versus reform-based mathematics instruction, our findings suggest that other differences in curriculum design are substantively important. A second key result is that the publisher of the curriculum we found to be least effective did not lose market share in the following adoption cycle in Indiana. There are several potential explanations for this result. Perhaps the most compelling is that decision makers have virtually no information about which curricula are most effective. 3
II. Background
To the best of our knowledge, only two states—Indiana and Florida—make current and historical information on curriculum adoptions publicly available. Many states do not track curriculum adoptions at all, making it impossible to perform empirical analyses that can inform decision makers. This is an issue that can be easily remedied moving forward, and we argue that it should be remedied; however, the current data infrastructure in most states makes large-scale empirical investigations of curricular effectiveness infeasible.
In the present study, we use the Indiana data to estimate relative curriculum effects for the three most commonly adopted elementary-mathematics curricula in the state during the 1998–2004 curriculum-adoption cycle. 4 These three curricula—Saxon Math (Saxon), Silver-Burdett Ginn (SBG) Mathematics, and Scott Foresman–Addison Wesley (SFAW)—accounted for 86% of all curriculum adoptions in Indiana during our study. All three were popular outside of Indiana as well and were used in other states, including California, Florida, Louisiana, Tennessee, and Texas (Educational Marketer, 1998a, 1998b, 1999a, 1999b). And two of the three remain popular today. The exception is SBG, which was bought by the publisher of SFAW and ultimately discontinued.
Curriculum Descriptions
The three curricula share similarities with other curricula that are widely circulated. For instance, Saxon and SBG both are best described as “traditional” in pedagogy. Both emphasize teacher-led instruction where students receive step-by-step guidance for problem solving and are drilled in implementation. Singapore Math, a curriculum that is gaining popularity in schools across the United States, takes a similar pedagogical approach (WWC, 2007). Alternatively, SFAW is best characterized as a blend of “traditional” and “reform” instruction. Reform-based curricula emphasize student inquiry, real-world applications of problems, and the use of visual aids for understanding. 5 Recent research suggests that reform-based instruction can be highly effective (e.g., Riordan & Noyce, 2001; WWC, 2007), and SFAW shares similarities with popular reform-based curricula including Everyday Mathematics (WWC, 2007).
There are many other differences between the curricula beyond the dimension of pedagogy, which we highlight in comparative reviews below. Our reviews draw on information from the WWC, the publishers themselves, several research studies, and a curriculum advocacy group (Mathematically Correct). 6 They reveal four main differences between the curricula: (a) Saxon presents related material in incremental units (distributed approach), whereas SBG and SFAW present related material in self-contained units (massed approach); (b) SBG structures lessons by interweaving examples and student practice; (c) the “reform” elements of SFAW include an emphasis on real-world examples and conceptual understanding before technical details; and (d) Saxon does not cover some higher order topics covered by the other curricula.
Saxon Math
The WWC (2007) describes instruction to students in Saxon Math as “incremental and explicit” and based on “teacher-directed conversations.” Slavin and Lake (2008) similarly describe Saxon Math as “traditional” and “algorithmically focused.” Saxon provides teachers with scripts for each lesson, and teachers are directed to structure daily lessons in three parts. First, teachers review prior concepts with students, usually through an interactive activity. Next they introduce new concepts and teach students exact methods for solving problems. Finally, students practice solving problems in class. Students are assigned homework to be completed individually, and assessments are given every five lessons (Agodini, Harris, Atkins-Burnett, Heaviside, & Novak, 2010; Slavin & Lake, 2008). Continued practice and review is a key aspect of Saxon (Bolser & Gilman, 2003).
A feature of the Saxon curriculum that differentiates it from the other two curricula is its use of the distributed approach to presenting related material (Houghton Mifflin Harcourt Publishers, 2008; WWC, 2007). That is, for a given topic, instruction and assessment are distributed throughout the academic year in incremental phases, rather than in a single setting. For instance, when students are taught how to tell time in Grade 2, they are first taught how to tell time to the hour, then move on to another subject, return to time and learn half-hour increments, move on, learn 5-minute increments, move on again, and finally learn how to tell time to the minute. 7 This is in stark contrast to how SBG and SFAW teach time: both use a massed approach where all concepts related to time are taught without interruption in a self-contained unit (Ellis, 2006).
SBG Mathematics
SBG is also best classified as a traditional mathematics curriculum. A 1999 review from Mathematically Correct describes SBG as providing material to students in a structured way, similarly to Saxon. 8 Teachers first introduce a topic to the class and students participate in small-group or whole-class activities on the topic. Students are then tested using book problems. Teachers reassess student understanding with another activity, followed by student practice.
While Saxon and SBG fall on the same end of the traditional/reform spectrum, three notable differences became apparent during our review. First, as noted above, SBG uses a massed approach to instruction, whereas Saxon uses a distributed approach. Second, SBG focuses more on group work and interweaves class or small group activities with individual practice. This is in contrast to presenting all examples upfront and then having students practice afterward. 9 Finally, SBG presents higher order material for some topics that is not presented in Saxon. As an example, the Grade 2 SBG curriculum covers addition and subtraction for three-digit numbers, whereas Saxon covers addition and subtraction only up to two-digit numbers. 10
SFAW Math
SFAW offers a blend between the traditional and reform approaches to mathematics instruction. A traditional feature of the curriculum is that it encourages students to practice, although there are no “drills” per se. Instead, teachers are directed to structure lessons in a check-learn-check-practice format. First, teachers check student knowledge about a particular concept, then introduce new concepts, then check their understanding, and then students practice problems from the text. The problems are designed to be real-world oriented in the reform-based mold. The organization of SFAW also highlights the “reform” aspect of the curriculum. For example, when covering one-digit addition and subtraction, SFAW first devotes an entire unit to conceptual understanding and recognizing patterns, and then lays out strategies for problem solving, whereas SBG and Saxon teach problem-solving strategies upfront.
The WWC report also indicates that the SFAW curriculum is well-designed for use by students of differing ability levels. Correspondingly, it covers higher order topics not covered by Saxon, similarly to SBG. A final notable feature of SFAW is that it uses a variety of different instructional materials, including transparencies, workbooks, and technology (Agodini et al., 2010; Resendez & Manley, 2005; WWC, 2007). 11
Curriculum Selection Process in Indiana
Curriculum adoptions occur annually in Indiana and rotate in 6-year cycles by subject. For example, Indiana’s districts adopted new math curricula in 1998, 2004, and 2010. Similarly, recent reading adoptions occurred in 1994, 2000, and 2006. We focus our evaluation on the math-curriculum adoption that occurred in 1998 and on adoptions in Grades 1, 2, and 3.
The adoption process has centralized and decentralized components. It begins in July of the year prior to the new adoption (for the cycle where the new curricula were first used in the fall of 1998, this was July 1997). First, there is a 4-month review of the curriculum options by an official Textbook Advisory Committee (TAC) at the Indiana Department of Education (DOE) (state level). By October, the TAC compiles a list of approved curricula and distributes this list to school districts. At this point, the review process becomes decentralized and varies from district to district, but a common approach is for a district to form a committee of administrators, parents, and teachers to review the material and recommend a curriculum. The general public is also typically given an opportunity to comment. Overall, the district portion of the review process lasts for roughly 9 months and involves many individuals in different capacities. Each district makes a final decision in the summer before the new curricula are used in classrooms. 12
At the conclusion of the process, districts make one of three decisions. First, and most commonly, they choose to adopt one or more of the state-approved curricula. Second, they may apply to use alternative curricula that are not on the list, but this rarely happens in practice (e.g., no more than 1 out of the roughly 300 districts chooses this option in any grade in our data). Third, districts can apply for “continued use” where they quite literally continue to use the old textbooks from the prior adoption cycle. Over 98% of the districts in Indiana adopted new math curricula from the approved list during the 1998 adoption cycle.
III. Data
We construct a 17-year data panel of schools and districts for our analysis. The data include information about curriculum adoptions along with detailed school- and district-level information on student achievement, attendance, enrollment, demographics, and financing. We perform our primary analysis at the school level.
Our data panel starts with the 1991–1992 school year and ends in 2007–2008. The curricula of interest were first used in schools in the fall of 1998 and were replaced in the fall of 2004. We observe seven cohorts of Grade 3 students who were never exposed to the curricula during the pre period (1991–1992 through 1997–1998), one cohort that was exposed in Grade 3 only (1998–1999), one cohort that was exposed in Grades 2 and 3 only (1999–2000), four cohorts that used the curricula in all three grades and were thus “fully exposed” (2000–2001 through 2003–2004), one cohort that was exposed in Grades 1 and 2 only (2004–2005), one cohort that was exposed in Grade 1 only (2005–2006), and two cohorts in the post period (2006–2007 and 2007–2008) that were never exposed. The key cohorts of interest are the cohorts that were directly exposed to the curricula that we evaluate. We use the unexposed cohorts to perform falsification tests, which allow us to investigate the extent to which our primary findings are likely to be biased (see Section VII).
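The cohort accounting above can be checked with a short sketch. It assumes only what the paragraph states (curricula in use from the 1998–1999 through the 2003–2004 school years, testing in Grade 3); the function and variable names are hypothetical and are not taken from the study.

```python
# Illustrative helper; names are hypothetical, not from the study's own code.
FIRST_YEAR = 1998  # fall of the first school year the curricula were used (1998-1999)
LAST_YEAR = 2003   # fall of the last school year the curricula were used (2003-2004)


def exposed_grades(grade3_fall_year):
    """Return the grades (1-3) in which a Grade 3 cohort used the curricula.

    `grade3_fall_year` is the fall of the school year in which the cohort
    was tested in Grade 3 (e.g., 1998 for 1998-1999).
    """
    grades = []
    for grade in (1, 2, 3):
        # A cohort in Grade 3 in year t was in grade g in year t - (3 - g).
        year_in_grade = grade3_fall_year - (3 - grade)
        if FIRST_YEAR <= year_in_grade <= LAST_YEAR:
            grades.append(grade)
    return grades


def classify(grade3_fall_year):
    """Label a cohort as pre, post, partially exposed, or fully exposed."""
    grades = exposed_grades(grade3_fall_year)
    if not grades:
        return "pre" if grade3_fall_year < FIRST_YEAR else "post"
    return "fully exposed" if grades == [1, 2, 3] else "partially exposed"


# Tally the 17 cohorts in the panel (1991-1992 through 2007-2008).
counts = {}
for year in range(1991, 2008):
    label = classify(year)
    counts[label] = counts.get(label, 0) + 1
```

Under these assumptions the tally gives seven pre-period cohorts, four partially exposed cohorts, four fully exposed cohorts, and two post-period cohorts, which matches the description in the text.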
Our measure of achievement is the Indiana Statewide Testing for Educational Progress (ISTEP) exam. The ISTEP is a standards-based, criterion-referenced test administered in math and language arts. During most of our data panel, it was administered in Grades 3, 6, 8, and 10 (more recently it has been given annually in Grades 3–8). The math ISTEP assesses student skills in the following areas: number sense, computation, algebra, geometry, measurement, and problem solving. Student scores on the ISTEP are reported in scale scores, and the tests are constructed to measure student knowledge of the core concepts and practices outlined in the Indiana DOE standards. Given that the DOE standards for mathematics are a major factor in the curriculum selection process and that the ISTEP is designed to test mastery of these standards, the test should be well suited to evaluate the relative effectiveness of the three curricula. 13
ISTEP scores are first available for analysis in Grade 3, and Grade 3 scores are a function of the curricula to which students are exposed in earlier grades as well. Therefore, our estimates are best viewed as characterizing the impacts of sequences of curriculum treatments. To allow for cleanly identified effects, we exclude districts that adopted more than one curriculum across Grades 1–3. To illustrate the assignment problem in such circumstances, consider a district that adopted Saxon in Grade 1 and SBG in Grades 2 and 3. In identifying the effect of Saxon relative to SBG, schools in this district are not well defined as either treatments or controls. We refer to districts that used the same curriculum in all three grades as “uniform curriculum adopters.” Restricting our analysis to these districts reduces our district sample size by 8% and our school sample size by 7% (see Appendix Table C.1). After restricting our sample, we are left with data from 213 districts and 716 schools. By a large margin, this makes our study the largest curriculum evaluation of which we are aware. 14
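The sample restriction can be illustrated with a small sketch. The district identifiers and adoption records below are made up for illustration, and the helper name is hypothetical; the sketch only shows the rule of keeping districts that used a single curriculum across Grades 1–3.

```python
# Toy adoption records; district IDs and adoptions are invented for illustration.
adoptions = {
    "district_A": {1: "Saxon", 2: "Saxon", 3: "Saxon"},
    "district_B": {1: "Saxon", 2: "SBG", 3: "SBG"},  # mixed sequence: excluded
    "district_C": {1: "SFAW", 2: "SFAW", 3: "SFAW"},
}


def uniform_curriculum(by_grade):
    """Return the curriculum if one curriculum is used in Grades 1-3, else None."""
    curricula = {by_grade[grade] for grade in (1, 2, 3)}
    return next(iter(curricula)) if len(curricula) == 1 else None


# Keep only "uniform curriculum adopters" and record their single curriculum.
uniform_adopters = {
    district: uniform_curriculum(by_grade)
    for district, by_grade in adoptions.items()
    if uniform_curriculum(by_grade) is not None
}
```

In this toy example, district_B is dropped because its schools cannot be cleanly assigned to either Saxon or SBG.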
In Table 1, we report pre-adoption (1997) differences in means across the schools and districts that adopted different curricula. There are only small differences in test scores and attendance across curriculum adopters. There are larger differences in school demographics, district size, and to some extent, median household income. But even in some cases where the differences are statistically significant, they are substantively small. Overall, the descriptive statistics in Table 1 are encouraging because the modest differences across curriculum adopters imply considerable overlap in the distributions of characteristics across treatment groups. This is a key condition for the successful implementation of our empirical strategy, which we discuss further in Section IV.
Table 1. Average Characteristics of Schools and Districts, by Adopted Curriculum (1997 Values)
Note. SBG = Silver-Burdett Ginn; SFAW = Scott Foresman–Addison Wesley. The propensity-score specification also uses italicized information from 1998; differences in means for these years are not reported for brevity. Table markers indicate statistically significant differences at the 10% level between Saxon and SBG adopters, between SBG and SFAW adopters, and between Saxon and SFAW adopters.
A final data issue relates to the long duration of our study. Specifically, over the 17 years of our data panel, the composition of schools in Indiana changed to some degree (due to school closings). This issue will almost surely come up in other large-scale curriculum evaluations given the typically long duration of implementation. 15 The key issue is whether changes in the composition of schools are correlated with curriculum adoptions—if they are, they can introduce bias into the evaluation. We discuss this issue in detail in the appendix—we find no evidence to suggest that compositional changes in our sample over time bias our findings.
IV. Empirical Strategy
School-Level Matching Estimators
We use school-level matching estimators to estimate the curriculum effects. Matching is an increasingly common empirical technique, and the conditions under which matching will identify causal treatment effects have been well-documented (Heckman, Ichimura, & Todd, 1997; Rosenbaum & Rubin, 1983). The key benefits of matching relative to simple regression analysis are (a) matching imposes weaker functional form restrictions and (b) matching resolves any “extrapolation” problems that may arise in regression analysis by limiting the influence of noncomparable treatment and control units in the data (Black & Smith, 2004).
Briefly, the key assumption under which matching will return causal estimates of treatment effects is the conditional independence assumption (CIA). The CIA requires potential outcomes to be independent of curriculum choice conditional on observables. Denoting potential outcomes by {Y0, Y1, …, YK}, curriculum treatments by D ∈ {0, 1, …, K}, and X as a vector of (pre-treatment) observable school- and district-level information, the CIA is written as:
{Y0, Y1, …, YK} ⊥ D | X.
Conditional independence will not be satisfied if there is unobserved information that influences both treatment and outcomes. For example, if districts have access to information that is unobserved to the researcher, Z, such that P(D = k | X,Z) ≠ P(D = k | X), and the additional information in Z influences outcomes, matching estimates will be biased.
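We match schools using an estimated propensity score (Rosenbaum & Rubin, 1983). Defining Pj as the probability of choosing curriculum j, we match schools in the comparison between curricula j and m by the conditional probability of choosing j given that either j or m is chosen, Pj / (Pj + Pm), following Lechner (2002).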
We estimate average treatment effects (ATEs) for the three curricula using the pairwise-comparison approach suggested by Lechner (2002). For example, for a comparison between curricula j and m, where Yj and Ym are outcomes for treated and control schools, respectively, we estimate ATEj,m ≡ E(Yj − Ym | D ∈ {j, m}). We use kernel and local-linear-regression matching estimators (with the Epanechnikov kernel), which construct the match for each “treated” school using a weighted average of “control” schools, and vice versa. Prior research suggests that kernel matching should perform well in our context (Frölich, 2004). 16 We estimate ATEj,m by:
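One form of the estimator that is consistent with the term definitions given for equation (2) is sketched here. The symbol for the estimate and the exact index sets are assumptions made for this sketch; it is a reconstruction from the surrounding definitions rather than the authors' typeset equation.

```latex
\[
\hat{\theta}_{j,m} \;=\; \frac{1}{N_S}\left[
  \sum_{j \in S_p}\Big( Y_j - \sum_{m \in I_{0j}} W(j,m)\, Y_m \Big)
  \;+\;
  \sum_{m \in S_p}\Big( \sum_{j \in I_{0m}} W(m,j)\, Y_j - Y_m \Big)
\right] \tag{2}
\]
```

The first sum compares each school that chose j with a weighted average of nearby schools that chose m; the second sum does the reverse for each school that chose m, so both terms measure the outcome under j minus the outcome under m.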
In equation (2), NS is the number of schools using j or m on the common support, Sp. I0j indicates the schools that chose m in the neighborhood of observation j, and I0m indicates the schools that chose j in the neighborhood of observation m. Neighborhoods are defined by a fixed bandwidth parameter obtained via conventional cross-validation (see Appendix A for details). W(j,m) and W(m,j) weight each comparison school outcome depending on its distance, in terms of estimated propensity scores, from the observation of interest. We omit a more detailed discussion of the matching estimators for brevity. More information about these and other matching techniques can be found in Heckman et al. (1997) and Mueser, Troske, and Gorislavsky (2007). 17
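To make the weighting concrete, the following is a minimal sketch of a pairwise kernel-matching ATE with an Epanechnikov kernel. It is an illustration under stated assumptions, not the estimation code used in the study: the function names, the toy propensity scores and outcomes, and the fixed bandwidth of 0.1 are all hypothetical. The bandwidth here is not chosen by cross-validation, and the local-linear variant is not shown.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kernel_match(p_own, p_other, y_other, bandwidth):
    """Kernel-weighted average outcome of comparison schools near p_own.

    Returns None when no comparison school receives positive weight,
    i.e., the observation is treated as off the common support.
    """
    weights = [epanechnikov((p - p_own) / bandwidth) for p in p_other]
    total = sum(weights)
    if total == 0.0:
        return None
    return sum(w * y for w, y in zip(weights, y_other)) / total


def kernel_ate(p_j, y_j, p_m, y_m, bandwidth):
    """Pairwise ATE of curriculum j relative to m via kernel matching."""
    diffs = []
    for p, y in zip(p_j, y_j):  # schools that chose j: Y_j minus matched Y_m
        match = kernel_match(p, p_m, y_m, bandwidth)
        if match is not None:
            diffs.append(y - match)
    for p, y in zip(p_m, y_m):  # schools that chose m: matched Y_j minus Y_m
        match = kernel_match(p, p_j, y_j, bandwidth)
        if match is not None:
            diffs.append(match - y)
    if not diffs:
        raise ValueError("no schools on the common support")
    return sum(diffs) / len(diffs)


# Toy data (hypothetical): pairwise propensity scores and standardized outcomes.
p_j, y_j = [0.4, 0.6], [0.5, 0.7]
p_m, y_m = [0.4, 0.6], [0.3, 0.5]
ate = kernel_ate(p_j, y_j, p_m, y_m, bandwidth=0.1)
```

In this toy example each school has exactly one comparison school inside the bandwidth, and every matched difference equals 0.2, so the estimate is approximately 0.2. Schools with no comparison school inside the bandwidth are dropped, which mirrors the common-support restriction described in the text.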
In addition to ATEs, average treatment-on-the-treated effects (ATTs) may also be of interest. ATTs can provide important information if the curricula differentially affect different subgroups of schools. For example, consider a case where the estimated ATE is zero (θj,m = 0). This could occur even if schools that chose j were better off for having chosen j, and schools that chose m were also better off for having chosen m. We allow for differential curriculum effects by estimating ATTs for all of the comparisons in both directions (i.e., ATTj,m and ATTm,j). We briefly discuss our findings in Section VI, but in general, we gain little additional insight from the ATTs.
Finally, it is important to emphasize what a “curriculum effect” means in the context of our study. Of course, differences in content, pedagogy, and presentation will be reflected in our estimates, but so will other systematic differences in implementation across curricula. For example, if one curriculum is more amenable to teacher implementation, say by offering a more detailed teaching guide or providing more publisher support, our estimates will reflect this difference. As another example, Agodini et al. (2010) report that the average teacher using SFAW spends 4.8 hours per week on mathematics instruction, whereas for Saxon the average teacher spends 6.1 hours. Our estimates will capture differences along these lines as well.
One way to describe our estimates is that they capture the “total treatment effects” of the curricula on mathematics instruction. In many circumstances, this is desirable, but in some cases, it may not be. For example, if more time on math instruction reduces time for other subjects, then there could be adverse consequences that would be missed by our estimates. In practice, we find little evidence to suggest that there are spillover effects, at least on reading scores, but conceptually it is important to recognize that our estimates will embody all of the systematic differences in math instruction that come with the adoption of one of these curricula.
Are Schools an Appropriate Unit of Analysis?
We perform our analysis at the school level throughout, despite the fact that the official curriculum orders come from districts. There are several benefits to our school-level approach over the district-level alternative. First, the sample of schools is much larger than the sample of districts. Matching is often described as a “data hungry” procedure, and a key benefit of the larger sample of schools is that it facilitates better matches (Zhao, 2004). An analogous district-level analysis could be performed in principle, and in fact, we do verify our findings are qualitatively similar if we match at the district level instead, but the school-level approach should result in higher quality matches and is conceptually preferred for this reason.
Second, the CIA requires that we condition on all of the factors that determine curriculum selection and outcomes. As discussed above, the curriculum-selection process is complicated—the fact that districts mechanically place the orders does not mean that schools are not involved in the process. Performing our analysis at the school level allows us to directly control for school- and district-specific features, whereas it is not possible to do the reverse—for example, when we match districts, it is not straightforward to control for disaggregated characteristics of schools.
Third, it is conceptually plausible that administrators focus on raising school-level achievement. Clearly, school-level administrators will have this focus, but district-level administrators may also evaluate district performance on a building-by-building basis. If nothing else, our focus on school-level performance is consistent with recent accountability targets at all levels (local, state, and federal). 18
Noting these benefits of the school-level approach, it is still important to acknowledge the role that districts play in the adoption process. Our matching procedure accounts for this role by matching schools in terms of district similarity as well (see Section V). In addition, the fact that schools within a district all move together creates a clustering structure in the data that cannot be ignored. Accordingly, we cluster our standard errors at the district level throughout the analysis.
The school-level data are the most disaggregated data available in Indiana, and for the above reasons, we argue that schools are the best units of analysis for our study. Still, the school-level variables are aggregated up from individual students, and the issue of aggregation bias merits attention. We test for the importance of aggregation bias in our study indirectly using the falsification exercise in Section VII. There, we show that our findings are not driven by aggregation bias. 19 A related and more general issue concerns cross-level inference about the efficacy of educational interventions (Burstein, 1980). In our study, the key concern is that curriculum effects may be different at different levels of analysis. For example, it would be inadvisable to use our estimates to draw inferences about student-level curriculum effects. As a practical matter, our main findings can be replicated if we aggregate up to the district level, but the same may not occur if we could disaggregate down to the student level (although it is not clear that student-level inference would be conceptually desirable for curriculum interventions). 20
V. The Propensity Score
Specification
We use a multinomial probit (MNP) to estimate the propensity scores for schools based on pre-adoption characteristics. Noting that the curricula of interest were first used in schools in the fall of 1998, the MNP includes information from the 1996–1997 and 1997–1998 school years. At the school level, we include controls for enrollment, demographics (race, free lunch status, and language status) and outcomes (Grade 3 test scores in math and language arts, and attendance) from 1996–1997, and analogous controls for enrollment and demographics from 1997–1998. At the district level, we include enrollment, outcome, and finance controls from 1996–1997, and enrollment and finance controls from 1997–1998. We also use district-level zip codes to assign Year 2000 Census measures of local-area socioeconomic status to each school; namely, median household income and the share of adults without a high-school diploma. We treat these variables as fixed-area characteristics. The list of covariates from the MNP is shown in Table 1.
The propensity score model was constructed to include the relevant information available to schools and districts at the time of the adoption. For instance, we control for 1996–1997 school and district test scores to account for pretreatment differences in achievement. But because the adoption decision was made by the summer of 1998, it is unlikely that decision makers had access to spring 1998 test scores, and consequently we do not include these scores in the model (similarly, we omit annual attendance from 1997–1998). The model also reflects the variety of potential actors involved in the adoption process. In addition to including school- and district-level controls, for example, the Census controls are included in acknowledgement of the role the local community can play in the adoption process (see Section II).
Although it is impossible to verify that the matching procedure includes all relevant factors, two pieces of evidence suggest that matching performs reasonably well. First, our findings are not qualitatively sensitive to reasonable adjustments to the MNP, including the addition of the 1997–1998 outcome variables or the addition of more years of lagged test scores. Second, we perform falsification tests where we estimate curriculum “effects” for students who were not actually exposed to the curricula (see Section VII). If unobserved factors that are otherwise unaccounted for in our models are driving our findings, we would anticipate estimating nonzero curriculum “effects” for the cohorts of unexposed students. The falsification tests provide no evidence to suggest that our primary estimates are biased by unobservables.
Balancing
In each comparison, we match treated and control schools based on the pairwise propensity scores and test for covariate balance. Balancing tests are motivated by Rosenbaum and Rubin (1983) and determine whether X ⊥ D | P(D = K | X), a necessary condition if the propensity score is to be used to match schools. 21 Although achieving covariate balance is important for any matching analysis that relies on a propensity score, there is no clearly preferred test for balance. Furthermore, in some cases, different balancing tests return different results (Smith & Todd, 2005). Given this limitation, we consider two different tests. The first is a regression-based test suggested by Smith and Todd (2005) that we perform separately for each pairwise comparison and for each covariate in each year. In the comparison between curricula j and m, we estimate:
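A regression of the following form is consistent with the description of equation (3): a quartic in the pairwise propensity score, the treatment indicator, and their interactions, with β5–β9 as the treatment-related coefficients. The error term and the ordering of the terms are assumptions of this sketch, which is a reconstruction rather than the authors' typeset equation.

```latex
\[
X_k = \beta_0 + \beta_1 \rho_{jm} + \beta_2 \rho_{jm}^2 + \beta_3 \rho_{jm}^3 + \beta_4 \rho_{jm}^4
    + \beta_5 D + \beta_6 D\rho_{jm} + \beta_7 D\rho_{jm}^2 + \beta_8 D\rho_{jm}^3 + \beta_9 D\rho_{jm}^4
    + \varepsilon \tag{3}
\]
```

Under this form, the joint test of β5 = β6 = β7 = β8 = β9 = 0 asks whether treatment status helps predict the covariate once the propensity score has been flexibly controlled for.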
In equation (3), Xk represents a covariate from the propensity-score specification, ρjm is the estimated pairwise propensity score, and D indicates treatment. We test whether the coefficients β5–β9 are jointly equal to zero in each regression—that is, we test whether treatment predicts the Xs conditional on a quartic of the propensity score.
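Written out, a specification consistent with this description is sketched below. The intercept, the error term ε_k, and the ordering of terms are assumptions inferred from the coefficient labels β5–β9 and the quartic in the propensity score:

```latex
X_k = \beta_0 + \beta_1 \rho_{jm} + \beta_2 \rho_{jm}^2 + \beta_3 \rho_{jm}^3 + \beta_4 \rho_{jm}^4
      + \beta_5 D + \beta_6 D \rho_{jm} + \beta_7 D \rho_{jm}^2 + \beta_8 D \rho_{jm}^3 + \beta_9 D \rho_{jm}^4
      + \varepsilon_k
```

Under this form, the joint null hypothesis β5 = β6 = β7 = β8 = β9 = 0 says that neither the treatment indicator nor its interactions with the propensity-score polynomial help predict the covariate.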
The second test measures the absolute standardized difference in observables after matching and was originally suggested by Rosenbaum and Rubin (1985). The formula for the absolute standardized difference for covariate Xk is given by:
The numerator in equation (4) is analogous to the formula for our matching estimators in equation (2), where we replace Y with Xk and take the absolute value (note the denominator is calculated using the full sample). A weakness of using standardized differences is that there is not a clear rule by which to judge the results, although Rosenbaum and Rubin (1985) suggest that a value of 20 is large.
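The calculation described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name, the `weights` matrix (standing in for the matching weights of equation (2)), the scaling by 100, and the pooled-variance denominator in the style of Rosenbaum and Rubin (1985) are all assumptions. For simplicity the same arrays feed both the numerator and the denominator, whereas the text computes the denominator on the full sample.

```python
import numpy as np

def abs_standardized_difference(x_treated, x_control, weights):
    """Absolute standardized difference (in percent) for one covariate.

    x_treated: shape (n_t,), covariate values for treated schools
    x_control: shape (n_c,), covariate values for control schools
    weights:   shape (n_t, n_c), hypothetical matching weights; each row
               sums to 1 and builds the matched comparison for one treated unit
    """
    x_treated = np.asarray(x_treated, dtype=float)
    x_control = np.asarray(x_control, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Numerator: mean gap between each treated unit and its weighted match.
    matched = weights @ x_control
    numerator = abs(np.mean(x_treated - matched))

    # Denominator: square root of the average of the two sample variances.
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2.0)
    return 100.0 * numerator / pooled_sd
```

On this scale, the Rosenbaum and Rubin (1985) benchmark of 20 corresponds to a post-matching gap of one fifth of a pooled standard deviation.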
Our MNP uses 32 school- and district-level covariates. Table 2 reports summary results from the balancing tests by comparison and year. From the regression tests, we report the number of covariates where the F test rejects the null hypothesis at the 5% or 10% level and the average p value across all F tests. We also report the average absolute standardized difference across all covariates. 22 Table 2 shows that our comparison between SBG and Saxon is particularly well balanced. For our other comparisons, the covariates are less balanced, although it is not clear that the levels of imbalance are cause for concern. For example, although the average absolute standardized difference is larger in the comparisons involving SFAW, the averages in Table 2 are quite reasonable relative to those reported in other methodologically similar studies.
Table 2. Balancing Details for the 32 Covariates Included in the Multinomial Probit Specification
SBG = Silver-Burdett Ginn; SFAW = Scott Foresman–Addison Wesley. Columns in italics are for years that are contiguous to the years from which the matching criteria are drawn. Results reported using the samples of treatments and controls that are on the common support in each year for the kernel-matching estimators. The numbers of covariates that fail the balancing tests at the 5% level are a subset of those that fail at the 10% level.
We also calculate the divergence between the densities of the estimated propensity scores for treated and control units. Intuitively, density divergence will affect the precision of the estimates and can be generally informative about the extent to which the data environment is favorable for matching (the key issue being overlap in the distributions of observables). Similarly to the balancing tests, our analysis of density divergence suggests that the data conditions are most favorable in our comparison of SBG and Saxon. See Appendix A for details.
VI. Results
Table 3 presents the estimated curriculum effects for all Grade 3 cohorts who were ever exposed to the curricula of interest. Each cohort is labeled according to the year of its spring test (e.g., the 1998–1999 cohort is labeled “1999”). All of the estimates are standardized using the distribution of student-level test scores. 23 In addition to the matching estimators, we also report OLS estimates where we regress test score outcomes on the covariates used in the propensity score model and indicator variables for curriculum adoptions, retaining the pairwise comparisons. The standard errors for the matching and OLS estimates are clustered at the district level and the matching-estimator standard errors are bootstrapped with 250 repetitions. 24
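The estimation and inference steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the bandwidth of 0.1, and the rule that drops treated units with no control inside the kernel window are hypothetical, the propensity score is held fixed across bootstrap draws rather than re-estimated, and the Epanechnikov kernel is borrowed from the falsification tests in Section VII.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on |u| <= 1, zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def kernel_matching_att(y, d, pscore, bandwidth=0.1):
    """Mean gap between treated outcomes and kernel-weighted control outcomes."""
    y_t, p_t = y[d == 1], pscore[d == 1]
    y_c, p_c = y[d == 0], pscore[d == 0]
    # Kernel weight of every control unit for every treated unit.
    k = epanechnikov((p_t[:, None] - p_c[None, :]) / bandwidth)
    total = k.sum(axis=1)
    on_support = total > 0  # drop treated units with no nearby control
    if not on_support.any():
        return np.nan
    counterfactual = (k[on_support] @ y_c) / total[on_support]
    return float(np.mean(y_t[on_support] - counterfactual))

def cluster_bootstrap_se(y, d, pscore, district, reps=250, seed=0):
    """Resample whole districts with replacement and re-estimate each time."""
    rng = np.random.default_rng(seed)
    ids = np.unique(district)
    rows_by_district = {g: np.flatnonzero(district == g) for g in ids}
    draws = []
    for _ in range(reps):
        sampled = rng.choice(ids, size=ids.size, replace=True)
        rows = np.concatenate([rows_by_district[g] for g in sampled])
        draws.append(kernel_matching_att(y[rows], d[rows], pscore[rows]))
    # Standard deviation of the bootstrap estimates, ignoring empty draws.
    return float(np.nanstd(draws, ddof=1))
```

Resampling districts rather than schools keeps all schools from a district together in each draw, which is what allows the bootstrap to reflect within-district correlation in outcomes.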
Table 3. Estimates of Math Curricular Effectiveness on Grade 3 Math Test Scores for Partially and Fully Exposed Grade 3 Cohorts, All Comparisons
SBG = Silver-Burdett Ginn; SFAW = Scott Foresman–Addison Wesley; OLS = Ordinary Least Squares; LLR = Local Linear Regression. Bolded columns are for the fully-exposed cohorts. Matching estimators impose the common support restriction. Standard errors in parentheses are clustered at the district level for all estimates, and bootstrapped using 250 repetitions for the matching estimators. N(Saxon) refers to the number of schools in our sample that use Saxon, and similarly for N(SBG) and N(SFAW).
**p ≤ .01. *p ≤ .05. †p ≤ .10.
Focusing first on our largest comparison between SBG and Saxon, and the estimates for the fully exposed cohorts (2001–2004), we find that SBG meaningfully outperformed Saxon. By averaging the kernel-matching estimates across these cohorts, we estimate that among the sample of schools that chose SBG or Saxon, the average effect of using SBG was roughly .13 standard deviations of the test. Our estimates are also consistent with SFAW outperforming Saxon. There, we estimate an average effect of .06 standard deviations, although only two of the four estimates are statistically significant and the estimate from 2004 is particularly small. Our results also suggest, at least weakly, that SBG outperformed SFAW, although the estimates are noisy enough that we cannot draw strong inference from this latter comparison.
We also briefly consider the possibility that treatment effects depend on treatment status. In unreported results, we estimate ATTs for each comparison and in each direction. In our comparison between SBG and Saxon, the treatment effects do not depend on treatment status, and similarly for our comparison between SFAW and SBG (although again, these estimates are noisy). Only in our comparison between SFAW and Saxon do we find any evidence of differential effects—Saxon appears to perform less poorly relative to SFAW at schools that actually chose Saxon. Nonetheless, even our estimates of ATTSaxon,SFAW suggest that schools that chose Saxon would have been better off had they instead chosen SFAW.
The magnitudes of the curriculum effects are economically meaningful, particularly when weighed against the marginal costs of choosing one curriculum over another. For instance, Fryer and Levitt (2006) show that between Grades 1 and 3, the black-white achievement gap grows at a rate of approximately .10 standard deviations per year. Contrasting this estimate with the results from our most compelling comparison suggests that choosing SBG over Saxon has an effect that is equivalent to roughly 1 year’s worth of expansion of the black-white achievement gap. 25 Given that curricula tend to be similarly priced (the texts from Saxon, SBG, and SFAW, averaged over Grades 1–3, were $23.08, $24.80, and $25.34, respectively), selecting a better curriculum appears to be a cost-effective way to improve student achievement. 26
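The comparison above reduces to simple arithmetic. As a rough sketch that takes the two point estimates at face value and ignores their sampling error:

```latex
\frac{0.13\ \text{SD (SBG vs. Saxon)}}{0.10\ \text{SD per year (gap growth, Grades 1--3)}} = 1.3\ \text{years},
\qquad
\$24.80 - \$23.08 = \$1.72\ \text{per text (SBG vs. Saxon)}
```

The price gap between the most and least expensive of the three texts is similarly small: $25.34 − $23.08 = $2.26 per text.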
Next, we turn to the partially exposed cohorts. One common theme is that the point estimates for the 2005 and 2006 cohorts are generally larger than for the 1999 and 2000 cohorts. An explanation is that there are familiarity issues related to curriculum implementation. For example, students who used the curricula when the curricula were first introduced may have had a different experience than the students who used the curricula toward the end of the adoption cycle. 27 The curricula may also have had legacy effects on instruction, which would have affected the 2005 and 2006 cohorts even as they transitioned out of the adoption cycle.
An issue with interpreting the estimates from the partially exposed cohorts is that these students were also exposed to other curricula in other adoption cycles, and this may attenuate the estimates. The degree of attenuation will depend on the extent to which curriculum quality is correlated across adoption cycles for treatment and control schools. We explore this issue to the extent possible in Table 4, where we compare curriculum adoptions for Grades 1–3 in the 2004 adoption cycle across uniform adopters from 1998 (we do not observe math curriculum adoptions prior to 1998; therefore, we cannot examine across-cycle adoptions in earlier periods).
Table 4. Average 2004 Curriculum Adoptions in Math by District for the Four Most Common Curricula From the 2004 Adoption Cycle
SFAW = Scott Foresman–Addison Wesley; SBG = Silver-Burdett Ginn. N indicates the number of districts where we observe a 2004 math-curriculum adoption and at least one Grade 3 math test score between 1998 and 2008. The “other” category includes all districts that did not adopt any of the “big three” curricula in any grade during the 1998 adoption cycle. Districts that adopted at least one of the big three curricula nonuniformly during the 1998 adoption cycle are included only in the “all” category.
Table 4 shows adoption shares in 2004 for the four most popular curricula from that adoption cycle. Saxon adopters in 1998 were much more likely to adopt Saxon in 2004, but adopters of the other two curricula are dispersed across alternative options. Without knowing the respective qualities of the different curricula adopted outside of the 1998 adoption cycle, it is difficult to form expectations based on the patterns in Table 4. Ultimately, given the potential for attenuation in the estimates for the partially exposed cohorts and the sizes of our standard errors, we cannot make strong inference about partial-exposure curriculum effects.
Another interesting aspect of Table 4 is that it shows the changing market shares of curriculum publishers over time in Indiana. Saxon, despite its relative underperformance in our analysis, maintained its near 50% market share in 2004. Although we found SBG was the most effective curriculum during the 1998 adoption cycle, it did not appear in 2004. The publisher of SBG was bought by Pearson Publishing, and Pearson phased out SBG in favor of SFAW, which it also publishes. SFAW’s market share fell from roughly 15% to 9%.
Overall, our most reliable estimates come from the four fully exposed cohorts. In our most compelling comparison, we find that SBG outperformed Saxon by a substantial margin. Both of these curricula share the same basic pedagogical approach (traditional). With researchers and policymakers placing so much emphasis on differences between the traditional and reform pedagogies, our findings serve as a reminder that other differences should not be overlooked. Our analysis also suggests that SFAW somewhat outperformed Saxon; and if anything, SBG outperformed SFAW, although inference from the latter comparison is clouded by statistical imprecision. Finally, we show that Saxon’s market share did not diminish in the next adoption cycle, despite our finding of relative underperformance during the 1998 cycle. One explanation is that educational administrators do not have reliable evidence on curricular effectiveness. 28
VII. Falsification Tests
Matching estimators will not return causal estimates if conditional independence is violated, and there are a number of ways that this could occur in our study. For example, there could be systematic differences in teacher or administrator quality across different curriculum adopters. If these differences are correlated with the curriculum adoptions and student achievement but poorly proxied for by the controls in the propensity-score model, they could introduce bias. Or if there are differences across adopters with respect to mathematics instruction—perhaps in districts’ general commitments to mathematics—that are not driven by the curricula themselves, this could bias our estimates. A third possibility is that curriculum adoptions in other subjects (like reading) may be correlated with math adoptions and math achievement. 29 Any of these factors, or many others, could potentially bias our findings.
While it is impossible to exhaustively consider all possible sources of bias, we provide evidence about the general reliability of our findings using two types of falsification tests. First, we estimate curriculum effects on test scores for cohorts of students who never used the curricula of interest. The logic of these tests can be illustrated with an example. Suppose that there are unobserved differences across adopters in terms of teacher quality and that these differences affect student achievement but are not well proxied for by any of our conditioning variables. This would lead to bias in the estimated curriculum effects. But the bias should not be unique to the years in which the curricula were actually used in schools. For example, if schools that chose SBG also have stronger teachers, then the effects of these teachers should be visible before the curricula that we evaluate were ever adopted. The confounding factor (teacher quality) will manifest itself in the form of nonzero curriculum “effects” even for cohorts of students who never used the curricula that we evaluate. In contrast, curriculum “effect” estimates that are close to zero for unexposed cohorts would suggest that it is the curricula themselves, and not other differences across curriculum adopters, that are driving our results.
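The logic of this first test can be illustrated with a small simulation. It is a stylized sketch, not a reproduction of the analysis: the adoption rule, the effect of teacher quality on scores (0.3), the noise levels, and the true curriculum effect (0.13) are all hypothetical, and a raw difference in means stands in for the matching estimator.

```python
import numpy as np

def simulate_gaps(confounded, true_effect=0.13, n_schools=5000, seed=0):
    """Return (placebo_gap, exposed_gap) between adopters and non-adopters."""
    rng = np.random.default_rng(seed)
    quality = rng.normal(size=n_schools)  # unobserved teacher quality

    # Hypothetical adoption rule: depends on quality only if confounded.
    latent = (0.8 * quality if confounded else 0.0) + rng.normal(size=n_schools)
    adopt = latent > 0

    noise_pre = rng.normal(scale=0.5, size=n_schools)
    noise_post = rng.normal(scale=0.5, size=n_schools)

    # Unexposed cohort: quality matters, the curriculum does not.
    y_unexposed = 0.3 * quality + noise_pre
    # Exposed cohort: quality matters and the curriculum has a true effect.
    y_exposed = true_effect * adopt + 0.3 * quality + noise_post

    placebo_gap = y_unexposed[adopt].mean() - y_unexposed[~adopt].mean()
    exposed_gap = y_exposed[adopt].mean() - y_exposed[~adopt].mean()
    return placebo_gap, exposed_gap

# With selection on quality, the placebo gap moves away from zero;
# without it, the placebo gap stays near zero.
placebo_confounded, _ = simulate_gaps(confounded=True)
placebo_clean, exposed_clean = simulate_gaps(confounded=False)
```

In the confounded scenario the gap appears even for the cohort that never used the curriculum, which is the pattern the falsification tests are designed to detect.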
We also provide a second set of falsification tests by estimating math-curriculum effects on reading achievement for students who were and were not exposed to the curricula of interest. For the out-of-cycle cohorts, we again expect to estimate “effects” that are statistically indistinguishable from zero if our main findings are unbiased. For students who actually used the math curricula of interest, timing does not rule out the possibility of causal spillover effects on reading scores. However, at most we would expect only small spillover effects. 30
We first estimate curriculum “effects” on math test scores for cohorts of Grade 3 students from 1992 through 1996, and 2007 and 2008. For brevity, we only report estimates using kernel matching (Epanechnikov kernel). The results are reported in Table 5, with the most convincing estimates coming from the 1992–1996 cohorts that passed through schools prior to the 1998 adoption cycle. All of the falsification estimates are small and statistically indistinguishable from zero, with the exception of the SFAW–Saxon comparison in 1992. Furthermore, the precision of the estimates is very similar to the precision of our main estimates in Table 3. These results do not provide any indication that our primary findings are biased by unobserved selection.
Table 5. Falsification Estimates of Math Curricular Effectiveness, Estimated Using Math Test Scores for Grade 3 Cohorts Who Were Never Exposed to the Curricula of Interest, All Comparisons
SFAW = Scott Foresman–Addison Wesley; SBG = Silver-Burdett Ginn. Matching estimators impose the common support restriction. Standard errors in parentheses are clustered at the district level and bootstrapped using 250 repetitions.
**p ≤ .01. *p ≤ .05.
The 2007 and 2008 cohorts in Table 5 were not exposed to the curricula of interest either; however, their outcomes are observed after the adoption cycle we study. This leaves open the possibility of nonzero treatment effects, which limits inference to some degree. But even so, none of the estimates from 2007 or 2008 are statistically significant.
In Table 6, we estimate math-curriculum effects on reading scores for all cohorts. Students in the cohorts from 1992 through 1996, and 2007 and 2008, were never exposed to the curricula of interest. The other cohorts were exposed, and it is unclear a priori whether we should expect any across-subject spillover effects. Although we do not have a strong prior about whether math curricula affect reading outcomes, one straightforward expectation is that their effects on math test scores should be larger than their effects on reading test scores. The results in Table 6 confirm this basic intuition: The point estimates are generally small, and only one is statistically significant (in the comparison between SBG and Saxon in 2002). 31
Table 6. Estimates of Math Curricular Effectiveness, Estimated Using Reading Test Scores for All Grade 3 Cohorts, All Comparisons
SFAW = Scott Foresman–Addison Wesley; SBG = Silver-Burdett Ginn. Bolded columns are the fully-exposed cohorts. Matching estimators impose the common support restriction. Standard errors in parentheses are clustered at the district level and bootstrapped using 250 repetitions. †p ≤ .10
Finally, note that all of our falsification estimates use data aggregated to the same level as in our main analysis (schools). If our main results were subject to aggregation bias, this same bias should be reflected in the falsification estimates as well. We find no evidence of this, suggesting that our findings are not driven by aggregation bias. 32
VIII. Conclusion
We use a unique administrative data panel from Indiana to compare the effectiveness of three elementary-mathematics curricula. We measure curriculum effects using test scores on the Indiana state test (the ISTEP). Our results indicate that there are substantial differences in effectiveness across the three curricula and, in particular, between the two curricula that held the largest market shares in the state during our study.
Our research makes two contributions to the literature. First, we show that important differences in curricular effectiveness can exist between curricula that share the same pedagogical approach. Specifically, we find that during the 1998–2004 adoption cycle in Indiana, the SBG curriculum meaningfully outperformed the Saxon curriculum, and both curricula are best characterized as traditional in pedagogy. This suggests that other differences between these curricula, and other curricula more generally, are important determinants of achievement and merit attention from researchers and policymakers.
Our study also provides a template for how similar studies in other states could be performed. The thinness of the empirical literature on curricular effectiveness is striking, and the most prominent obstacle in the way of producing more studies is the lack of data. Currently, Indiana is one of only two states of which we are aware that collect curriculum-adoption information and make it available, and many states do not collect these data at all. Such data would be cheap and easy to collect, particularly compared to other data elements in many state longitudinal systems, and could be used to learn much about this important educational resource.
One advantage of having more studies on curricular effectiveness is that they can be used to examine how different curricula perform in different contexts. Unlike some other educational interventions, curricula are used by virtually all students in all schools. This implies considerable heterogeneity in the contexts in which curricula are used, both in terms of the actors involved (students and teachers) and potentially the objectives of the intervention (e.g., differences in standards across states or school districts). The right curriculum in one circumstance may not be right in another. As a specific example, Agodini et al. (2010) find that Saxon outperforms SFAW, while our findings suggest the opposite (weakly). An important contextual difference is that Agodini et al. analyze schools where students are significantly more disadvantaged than students in the typical Indiana school. 33 It may be inadvisable for a disadvantaged school district to choose SFAW (or perhaps SBG) based on our study, or alternatively, for an advantaged district to choose Saxon based on the Agodini et al. study. 34 But both studies provide valuable information for administrators in the right context. More generally, the current sparseness of the literature makes it difficult for educational administrators to make informed curriculum-adoption decisions. But if the literature were expanded, patterns would emerge across multiple studies that would allow us to determine which curricula are most effective in which circumstances. 35
Footnotes
Acknowledgements
We thank Emek Basker, Julie Cullen, Gordon Dahl, Barry Hirsch, Josh Kinsler, David Mandy, Peter Mueser, Rusty Tchernis, many seminar and conference participants, and three anonymous reviewers for useful comments and suggestions. We also thank Karen Lane and Molly Chamberlin at the Indiana Department of Education for help with data. This work was not funded or influenced by any outside entity.
The Authors
RACHANA BHATT is an assistant professor of economics at Georgia State University. Her research examines the economics of education and school-based policy interventions.
CORY KOEDEL is an assistant professor in the Department of Economics at the University of Missouri.