Abstract
Due to evidence linking education and development, funding has been invested in interventions relevant to getting youth into school and keeping them there. This article reports on a systematic review of impact studies of these school enrollment interventions. Reports were identified through electronic searches of bibliographic databases and other methods. To be eligible, studies (1) assessed impact on primary or secondary school enrollment outcomes; (2) used a rigorous design; (3) were conducted in a low- or middle-income nation; (4) included at least one quantifiable measure of enrollment or related outcomes; (5) were available before December 2009; and (6) included data on participants post-1990. A coding instrument extracted data on study characteristics from each report. Standardized mean difference effect sizes were computed for the first effect reported. The sample includes 73 evaluations. The average effect size was positive across all outcomes. However, the results varied. Studies that focused on building new schools and other infrastructure interventions reported the largest average effects.
Introduction
Education is critical to economic development and social welfare in developing nations. The second Millennium Development Goal adopted by world leaders in 2000 calls for universal primary education for all boys and girls, and the third calls for the elimination of gender disparities in education. Prioritizing education in such a way has several rationales. For one, investments in education are believed to yield returns in poverty reduction, improved health outcomes, and economic growth (Hannum & Buchmann, 2004; Herz & Sperling, 2003; United Nations Educational, Scientific, and Cultural Organization [UNESCO], 2007). In addition, greater access to education can lead to increased political participation and more equitable sharing of economic and political power (Birdsall, 1999). Education for girls is considered particularly critical, as improvements in the infant mortality rate, child nutrition, and school enrollment are closely associated with higher education among mothers (Birdsall, Levine, & Ibrahim, 2005; Herz & Sperling, 2003). Yet, more than 100 million primary school-aged children are not in school and, of those who are, many (49% in Africa, for example) do not complete primary school (Birdsall et al., 2005).
Low educational attainment—or the inability of students to complete their primary and secondary school education—in the developing world is the combined result of children who do not enroll, do not progress, or drop out (World Bank, 2004). Children may not enroll or complete their schooling for a number of reasons, often due to economic or structural barriers. For example, in some countries, such as India, Mali, and Burkina Faso, school enrollment is low due to the cost of schooling (both direct and opportunity costs), poor school infrastructure, teacher shortages, and safety and sanitation problems (Birdsall et al., 2005). In other areas, such as several Latin American countries, enrollment may be nearly universal but retention and completion may be low for a myriad of reasons, including those mentioned earlier as well as poor health of students or members of their households (Glewwe & Miguel, 2008; UNESCO, 2007), teacher absenteeism or malfeasance (World Bank, 2004), and curricula that do not match students’ needs (Glewwe, Kremer, & Moulin, forthcoming). Value systems held within the country may also diminish the importance of enrolling children (particularly girls) in school (e.g., Academy of Educational Development Global Center, 2010; Brembeck, 1962), or parents may prioritize their children working to earn much needed immediate funds rather than attending school (Hillman & Jenker, 2004).
Furthermore, developing nations face significant school enrollment and completion disparities between segments of the population, such as between lower and upper income households, boys and girls, urban and rural dwellers, and combinations of these factors (Birdsall et al., 2005). For example, in India the gap in enrollment between boys and girls from the richest households is only 2.5%, whereas the gender disparity for children from the poorest households is 24% (Filmer, 1999 as cited in Birdsall et al., 2005). In many African nations, rural rates of enrollment lag far behind the very modest national rates, particularly for rural girls, whose rate of enrollment is less than 15% in several countries (Birdsall et al., 2005). In addition, ethnolinguistic diversity, disabilities, and conflict situations in fragile states create further barriers to school participation in developing nations (Birdsall et al., 2005).
Governments in developing nations are, to varying degrees, making efforts to increase enrollment and equity, in part due to compelling evidence linking expanded education systems to socioeconomic development that highlights the importance of policies to offset inequality in access to schooling (Fiszbein & Schady, 2009; UNESCO, 2007). Building new schools to increase ease of access in remote areas is one intervention used in developing nations (e.g., Filmer, 2004). Other efforts include improving school infrastructure and safety and abolishing school fees as well as implementing targeted policies to reach the most marginalized children. Such policies include school feeding programs, flexible schooling models for working children, school-based health interventions, and various types of financial subsidies and conditional cash transfer systems. For example, several Latin American governments and nongovernmental partners have experimented with programs that transfer money directly to disadvantaged households—such as in rural, indigenous, migrant, or slum communities—in exchange for children’s school enrollment and attendance (Fiszbein & Schady, 2009; UNESCO, 2007). In Asia, such stipend programs encourage the transition of girls to secondary school (UNESCO, 2007).
In addition to expanding access to schooling and increasing enrollment and persistence, measuring learning achievement is an essential, although methodologically challenging, part of improving education in the developing world (Birdsall et al., 2005). Enrollment and persistence data are not necessarily good predictors of learning outcomes (Birdsall et al., 2005). That is, it is not enough merely to fill school spaces; children must also learn if economic and social priorities are to be achieved. Increasing enrollment and educational quality should go hand in hand, as poor children drop out with greater frequency when the quality of schooling is low (Birdsall et al., 2005). Many interventions, including those mentioned earlier, as well as teacher training and incentives, textbook provision, and health interventions such as deworming and providing nutritional supplements, are undertaken with the goals of improving both enrollment and learning. In this review, data on achievement outcomes (e.g., test scores) were also collected and analyzed, even if the primary goal of the intervention was to increase student enrollment.
Considerable funding for initiatives to improve school enrollment has brought with it a concomitant increase in accountability. Donor agencies and governments want to know whether the funds they have put toward such programs are having positive impact. This is not the only question they are asking of evaluation, but it is an important one. There is some frustration about the lack of knowledge about impact, as expressed in the 2006 Center for Global Development report:¹
“After decades in which development agencies have dispersed billions of dollars for social programs, and developing country governments and nongovernmental organizations (NGOs) have spent hundreds of billions more, it is deeply disappointing to recognize that we know relatively little about the net impact of most of these social programs.”
In recent years, there has been something akin to a “randomized revolution” in the developing world, as donors and governments are increasingly asking for impact evaluations that provide more credible estimates of effect (e.g., Duflo & Kremer, 2005; Kremer & Holla, 2009; Newman, Rawlings, & Gertler, 1994). Evaluations of some of these recent policies and programs to increase school enrollment and persistence in developing nations include a number of randomized field trials and rigorous quasi-experimental studies.
Given the importance of education, particularly to outcomes in the most economically challenged nations, the number of interventions that have been implemented to address education in developing nations, and the increase in relevant controlled impact evaluations, the need for a systematic review seems clear. To our knowledge, a systematic review of randomized controlled trials (RCTs) and quasi-experiments of strategies in developing nations to get children into school (enrollment) and keep them there (attendance, persistence, continuation) has not yet been reported although more focused reviews on narrow interventions have been reported (e.g., school fees, by Morgan, Petrosino, & Fronius, 2010; and teacher attendance by Cueto, Guerrero, Leon, & Sigamuru, 2010). By systematically gathering and analyzing rigorous research about the program effects of primary and secondary school enrollment and completion policies, our review aims to provide evidence to inform the next wave of funding, intervention, and evaluation efforts in this area.
Methodology
Research Questions
Two questions are addressed by this review: (1) What are the effects of interventions implemented in developing countries on measures of students’ enrollment, attendance, graduation, and progression? (2) Within those studies that report the effects of an intervention on measures of students’ enrollment, attendance, graduation or progression, what are the ancillary effects on learning outcomes as measured by students’ test scores, grades, and other achievement measures?
Eligibility Criteria
For this project, only evaluation studies that had the following characteristics were included:
1. Assessed the impact of an intervention that included primary or secondary school outcomes (corresponding to kindergarten–12th grade in the U.S. context or approximately age 5–18) relevant to the main research question;
2. Used a randomized controlled trial, or a quasi-experiment with evidence of baseline control on a main outcome.
Our review includes evaluations that randomly assigned entities (at any level) to intervention or control conditions. We also included regression discontinuity designs in which a predefined cutoff score determined program eligibility and program impact was then estimated around the cutoff score. Because randomization or regression discontinuity is not always possible for evaluations of certain policies or programs, we also included quasi-experimental designs (QEDs) that employed controls for baseline or pretest differences on a main outcome. We based this decision on prior research examining the alignment of estimates from quasi-experiments with those from randomized experiments, which found that controls for baseline differences via matching or other processes were most important for achieving closer approximations to estimates from RCTs (e.g., Bloom, Michalopoulos, Hill, & Lei, 2002; Glazerman, Levy, & Myers, 2003).
Although randomized experiments are considered superior to quasi-experiments in terms of controlling all observable and unobservable confounders, there is conflicting literature on whether the estimates from QEDs approximate those from randomized experiments (e.g., Oliver et al., 2008). We include both types of studies and examine study design as a moderator in our later analyses.
3. Were conducted in a country classified as a “low- or middle-income nation” by the World Bank at the time the intervention being studied was implemented.
The World Bank determines low- and middle-income nation status by calculating the gross national income per capita, that is, the average citizen’s income. As of 2008, 151 nations were included in these categories. The categories are low income (US$975 or less); lower middle income (US$976–US$3,855); and upper middle income (US$3,856–US$11,905). Low- and middle-income nations are often referred to as “developing economies” and overlap considerably with the United Nations listing of “developing nations.”
We used the World Bank listing for the time period closest to the start of the intervention. For the most part, there was very little fluctuation in the list, with the exception of nations that were newly created.
4. Included at least one quantifiable main outcome measure of school enrollment, attendance, dropout or progression.
We had to be able to compute an effect size from the data reported in the evaluation or be able to acquire it from the principal investigators. One important question is whether interventions that enroll or otherwise bring more children to school have any impact on learning (e.g., because more students now strain existing resources). Thus, as a supplemental measure, we also collected information on learning outcomes (e.g., grades, test scores). However, if the report did not include one of the main outcomes of getting children into school and keeping them there (enrollment, attendance, graduation, and progression), it was not included. Thus, the data on learning outcomes should not be viewed as representative of all educational interventions in developing nations; they pertain only to the programs and policies that included one of the main outcomes discussed earlier.
5. Published or made available before December 2009, without regard to language or publication type.
We searched for trials published up to and including December 2009. We attempted to find English and non-English studies and included published and unpublished studies (e.g., from conference papers, dissertations, technical reports). We also had some reports translated into English so that we could review them.
6. Included data on participants from 1990 or beyond.
To be as relevant as possible to current policy contexts, we focused on studies that included data on participants from 1990 or later. Some published articles used data from large-scale administrative data sets generated decades earlier (e.g., Cutler, Fung, Kremer, Singhal, & Vogl, 2009); these were not included in our sample.
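The income thresholds quoted under criterion 3 can be expressed as a small lookup. The sketch below is illustrative only: it hard-codes the 2008 cutoffs given in the text, whereas the review used the World Bank listing closest to each intervention's start date, and the function names and the "high income" label for ineligible countries are assumptions introduced here.

```python
# Illustrative sketch: GNI-per-capita cutoffs (US$) as quoted in the text for 2008.
# The review itself used the listing closest to each intervention's start date.
LOW_MAX = 975
LOWER_MIDDLE_MAX = 3855
UPPER_MIDDLE_MAX = 11905


def classify_income(gni_per_capita):
    """Return the income tier for a GNI per capita value in US$."""
    if gni_per_capita < 0:
        raise ValueError("GNI per capita cannot be negative")
    if gni_per_capita <= LOW_MAX:
        return "low income"
    if gni_per_capita <= LOWER_MIDDLE_MAX:
        return "lower middle income"
    if gni_per_capita <= UPPER_MIDDLE_MAX:
        return "upper middle income"
    # Label assumed here; the text only names the three eligible tiers.
    return "high income"


def is_eligible_setting(gni_per_capita):
    """Criterion 3: the country must be low or middle income."""
    return classify_income(gni_per_capita) != "high income"
```

For example, under these cutoffs a value of 976 falls in the lower middle-income tier, and any value above 11,905 would make a study ineligible on criterion 3.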
Search Strategy
Our goal for the literature search was to identify relevant reports in both published and unpublished literature. To accomplish our search, we used five major strategies:
Electronic searches of bibliographic databases: Appendix A provides a complete list of databases that were searched. Note that many of these databases include the fugitive or gray literature (e.g., Education Resources Information Center).
Hand searches of relevant journals: Because electronic searches can miss relevant studies, we “hand searched” (i.e., visually inspected the table of contents and the articles) five journals: Economic Development and Cultural Change, International Journal of Educational Development, Journal of Development Economics, World Bank Research Observer, and the World Bank Economic Review. These journals had been identified in our early searches as being the most prolific in publishing evaluative studies relevant to this review.
Citation chasing: The reference section of every retrieved report (whether eligible or not) was checked to determine whether any possible eligible evaluations were listed.
Contacting the “Informal College” of Researchers in this area: We identified the lead authors of such studies or relevant documents (e.g., reviews, nonevaluative studies), identified their e-mail addresses from a Google search, and e-mailed them query letters.
Internet searches and specialized holdings: We also used the “advanced search” options in Google and Google Scholar for broad searches of the World Wide Web. This was supplemented by specialized searches of specific websites that could reference relevant holdings, such as the Center for Population Development and Activities (2001), the Massachusetts Institute of Technology’s Poverty Action Lab, Yale University’s Innovations for Poverty Center, and the National Bureau of Economic Research.
Key Word Strategies for Bibliographic Databases
Our search strategies were of two major types. First, and for most databases, we developed a long list of key words to identify three major study eligibility criteria: (1) key words relevant to developing nations; (2) key words relevant to the outcomes of enrollment, dropout, persistence, and so on; and (3) key words relevant to experimental and quasi-experimental evaluations. These were used successfully in most databases; in a few instances, however, the yield was still so large that we instituted a fourth criterion: (4) key words relevant to youth. The second search strategy focused on databases that did not permit complex searches. In these, we searched by using one or a few key words at a time. The Campbell Collaboration review includes details on how each database was searched (see Petrosino et al., 2012, Appendix 2).
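The block structure described above (synonyms combined within a criterion, criteria combined with one another) can be sketched as a simple Boolean query builder. This is a minimal illustration: the key words below are hypothetical placeholders rather than the review's actual search terms, and real databases differ in their query syntax.

```python
def or_block(terms):
    """Join the synonyms for one criterion with OR, quoting multiword phrases."""
    quoted = ['"{}"'.format(t) if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(*blocks):
    """Combine the criterion blocks with AND so every criterion must match."""
    return " AND ".join(or_block(block) for block in blocks if block)


# Hypothetical key words for illustration only.
SETTING = ["developing countries", "low income"]
OUTCOME = ["enrollment", "dropout", "persistence"]
DESIGN = ["randomized", "quasi-experiment"]
YOUTH = ["child", "youth", "student"]

# Three-criterion search used for most databases.
query = build_query(SETTING, OUTCOME, DESIGN)

# Narrower search adding the fourth (youth) criterion when the yield is too large.
narrow_query = build_query(SETTING, OUTCOME, DESIGN, YOUTH)
```

Adding the fourth block can only shrink the result set, which is why it is a natural lever when the three-block search returns too many citations.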
Retrieving and Final Screening of Studies
The search identified a large number of citations and abstracts. Many citations were easily excluded because they were not relevant to the proposed review. One of the WestEd coauthors (Petrosino, Morgan, or Fronius) reviewed citations and determined whether the cited study should proceed to a second screening, that is, was a potentially relevant study. If so, the full-text documents of those potentially eligible studies were retrieved. In most cases, two independent persons reviewed a study for eligibility at the final screening stage, except when the lead author identified the study and confirmed its eligibility.
Extracting Information From Each Study
We designed a coding instrument to guide us in extracting information from each study (see Appendix 5 in Petrosino et al., 2012, for the full coding instrument). The instrument contains items that describe the characteristics of the researcher (e.g., field or discipline), the publication (i.e., type of document and year published), the setting or context (country and classification of economy), the evaluation design (whether RCT or QED), methodological quality (i.e., how the study handled selection bias, the degree of attrition, and any program implementation compromises), the treatment condition, the control or comparison group, the participants (e.g., grade), and the outcomes (i.e., on enrollment and learning outcomes). Except for the coding reliability check discussed next, one person coded each study.
Coding Reliability
To ensure that we achieved good coding reliability, the first three coauthors read and recorded information from a random sample of reports (12 reports, or 17% of the final sample). We assessed coding reliability (i.e., interrater agreement) by using the percentage of agreement for each item, rather than reporting a global interrater reliability statistic. Items with lower rates of agreement (less than 80%) were investigated to determine the source of conflict. The authors held discussions to resolve disagreements and discuss coded items. Differences among authors primarily stemmed from the varying levels of detail provided in study reports to respond to open-ended items. None of the items that are analyzed in this review had a rate of agreement lower than 80%.
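The item-level agreement check can be sketched as follows. The text does not say how agreement among three coders was scored, so this sketch assumes a report counts as "agreed" on an item only when all coders recorded the same value; the function names and the example data are hypothetical.

```python
def percent_agreement(codes_by_rater):
    """Percentage of reports on which every rater recorded the same code.

    codes_by_rater: one list per rater, each holding that rater's code for
    every report, in the same report order.
    """
    n_reports = len(codes_by_rater[0])
    agreed = sum(1 for codes in zip(*codes_by_rater) if len(set(codes)) == 1)
    return 100.0 * agreed / n_reports


def flag_low_agreement(ratings, threshold=80.0):
    """Return per-item agreement rates and the items falling below threshold."""
    rates = {item: percent_agreement(r) for item, r in ratings.items()}
    flagged = [item for item, rate in rates.items() if rate < threshold]
    return rates, flagged


# Hypothetical codes from three raters on four reports.
ratings = {
    "design": [
        ["RCT", "RCT", "QED", "RCT"],
        ["RCT", "RCT", "QED", "RCT"],
        ["RCT", "RCT", "QED", "RCT"],
    ],
    "attrition": [
        ["low", "high", "low", "low"],
        ["low", "low", "low", "low"],
        ["low", "high", "low", "moderate"],
    ],
}
rates, flagged = flag_low_agreement(ratings)
```

In this made-up example the "attrition" item would be flagged for discussion because the coders agree on only two of the four reports.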
Criteria for Determination of Independent Findings
Our criteria for handling possible statistical dependencies were as follows:
One effect size per analysis: Each study is represented by a single effect size in each analysis to prevent the analysis from being compromised by nonindependence (multiple effect sizes from one study). For this review, to maintain just one effect size per analysis, we kept the four major outcomes distinct. That is, we analyzed the main outcomes of enrollment, attendance, dropout, and progression separately. We kept separate our analyses of the supplemental learning outcomes. Few studies included more than one follow-up time interval. Consequently, we only report “first effects” and do not examine effects at additional follow-ups. When studies included three or more groups in the design (e.g., multiple treatment groups), we only computed effect sizes for the treatment group that represented the strongest contrast with the control condition. In every case, this was the group that received the most intensive intervention, that is, the most treatment components.
Our unit of analysis was the evaluation study, not the evaluation report: One issue we encountered in this review was the sheer number of reports and reanalyses using the same sample of data. In some instances, the same investigators, or other investigators obtaining the study data, published multiple articles on the effects of an intervention. Our unit of analysis is the individual evaluation study and not the individual research article.
An evaluation study was considered distinct if it used a different sample: Investigators sometimes published multiple articles or reports on the same intervention using different samples. Different study samples were coded as separate studies even if the same general intervention was being investigated. For example, Kremer, Moulin, and Namunyu’s (2003) study of decentralization occurred in two Kenyan cities (Busia and Teso). Because there were separate random assignments and separate samples, they are considered two different studies for purposes of this review. Although there may be a slight dependence in the estimates from the same multisite study, the normal handling in meta-analysis is to treat these as separate and independent studies (Lipsey & Wilson, 2001).
The primary analysis or study design was the “most rigorous,” or one that provided the most controls: If the same sample was being used in multiple designs, we focused on and coded the “strongest” or “most rigorous” design. For example, investigators may have conducted analyses using regression discontinuity design, regression controlled analyses, and difference-in-difference methods. To avoid dependencies in our analyses, only one of these designs should contribute to estimates of program impact. In this case, the strongest design methodologically was used. This was also true in those instances in which investigators reported intent to treat (ITT) analyses when using a randomized controlled trial but also reported instrumental variables (with randomization as the instrument) or other treatment on treated analyses. We always selected the more conservative ITT estimate.
Investigators sometimes reported on the results of multiple estimation models. We selected the model that included the most “controls” to compute the effect size estimates because they theoretically reduced statistical “noise” that may have come from chance fluctuations or randomization violations (in the case of well-implemented experiments) or uncontrolled variables (in the case of quasi-experiments).
Overall versus subgroup effects: A few studies only reported outcomes by specific subgroups such as gender (male/female), school level (primary/secondary), type of geographic area (rural/urban), or grade level (first–eighth grades). In those few instances in which this occurred, we averaged effects over the included grades or across both boys and girls to obtain an overall effect for the intervention.
Individual-level effects where possible: Some studies reported analyses at multiple levels (e.g., Schultz, 2004), that is, for schools or communities, households, and students. Our rule was to compute effect sizes for the analyses at the individual level, unless such data were not available in the original reports. In the latter instance, we computed effect sizes at the larger aggregate level (e.g., school) but conducted a post hoc methodological analysis to compare differences in effect sizes when computing them based on using the sample sizes of individual students or from the larger aggregate samples.
Statistical Procedures and Conventions
Standardized mean differences (Cohen’s d) were used as the effect size metric for all the main and supplemental outcomes of interest; this metric is appropriate for measuring group differences in mean levels of continuously measured outcomes (Lipsey & Wilson, 2001). All effect sizes were coded so that positive effect sizes represented better outcomes (e.g., higher enrollment, lower dropout). Standardized mean difference effect sizes were calculated as

\[ d = \frac{\bar{X}_T - \bar{X}_C}{s_p} \]

where the numerator is the difference in group means for the intervention (T) and control (C) groups, and the denominator is the pooled standard deviation for those groups. The variance of the standardized mean difference effect size was calculated as:

\[ v_d = \frac{n_T + n_C}{n_T \, n_C} + \frac{d^2}{2\,(n_T + n_C)} \]

where n_T and n_C are the sample sizes of the intervention and control groups.
Effect sizes and variances were calculated using David Wilson’s online effect size calculator (http://www.campbellcollaboration.org/resources/effect_size_input.php). For example, one common transformation procedure we used was the logit transformation of binary proportions to standardized mean differences (Cohen’s d). The formula computes odds ratios for these data and then estimates d as follows:

\[ d = \ln\!\left(\frac{A\,D}{B\,C}\right) \cdot \frac{\sqrt{3}}{\pi} \]

where A and B are the counts of successes and failures in the treatment group and C and D are the corresponding counts of successes and failures in the comparison group.²
Because many of the included econometric studies used complex statistical models that adjusted for baseline and other covariates, the variances were rescaled when possible using the procedures outlined in Wilson (2011).
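The effect size computations described in this section can be sketched in a few lines. This is a minimal illustration using the standard formulas from Lipsey and Wilson (2001) and the logit conversion of an odds ratio to d; it is not the code behind the online calculator, the function names are introduced here, and it omits the variance rescaling for covariate-adjusted models.

```python
import math


def pooled_sd(sd_t, n_t, sd_c, n_c):
    """Pooled standard deviation of the treatment and control groups."""
    return math.sqrt(
        ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    )


def cohens_d(mean_t, mean_c, sd_pooled):
    """Standardized mean difference: treatment minus control."""
    return (mean_t - mean_c) / sd_pooled


def variance_of_d(d, n_t, n_c):
    """Sampling variance of d for two independent groups."""
    return (n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c))


def d_from_counts(success_t, failure_t, success_c, failure_c):
    """Logit method: convert a 2x2 table of counts to d via the odds ratio."""
    if min(success_t, failure_t, success_c, failure_c) <= 0:
        raise ValueError("all four cells must be positive for the logit method")
    odds_ratio = (success_t * failure_c) / (failure_t * success_c)
    return math.log(odds_ratio) * math.sqrt(3) / math.pi
```

As a usage example, a treatment group with 80 of 100 children enrolled compared with 50 of 100 in the comparison group gives an odds ratio of 4 and, under the logit method, a d of roughly 0.76.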
The data were entered into the Comprehensive Meta-Analysis (CMA) Version 2 software program. We used CMA algorithms to statistically combine results from the evaluations. Forest plots generated by Stata Version 12.1 are used to display the results from the effect sizes, including the effect size and 95% confidence intervals (CIs). Because of the presumed heterogeneity in the true effects across interventions, samples, countries, and outcomes, we used random effects models in our statistical analyses.
We report overall effects across all interventions on the four major outcomes (enrollment, attendance, dropout, and progression) and on the four types of learning outcomes (math, language, standardized assessment scores, and other achievement measures).
We then descriptively examine a number of moderators. These moderators are approached and interpreted descriptively rather than statistically, as they are often based on small numbers of studies (the “small cell” problem), and such analyses can be significant by chance if large numbers of variables are considered (the “capitalizing on chance” problem). Our analyses examined:
Broad intervention type: Grouping interventions this way can be risky, as some interventions can be classified into more than one group. However, these groupings can be very persuasive to readers about which bundles of interventions “work” (e.g., Greenleaf & Petrosino, 2008).
Specific intervention type: An important policy question is whether developing nations and donor agencies are getting more “bang for the buck” using one particular approach or another. We compared the average effect sizes between the discrete types of interventions.
World Bank classification of economies: We examined effect sizes by the World Bank’s three tiers of developing economies (low income, lower middle income, and upper middle income).
World Bank classification of developing regions: We examined effect sizes by the World Bank classification of developing nations into six different regions of the world: East Asia and the Pacific, Europe and Central Asia, Latin America and the Caribbean, Middle East and North Africa, South Asia, and sub-Saharan Africa. Note that our review did not identify any eligible studies from the Middle East and North Africa.
Type of evaluation design: This review examined the average effect size for the 52 randomized experiments and compared it to the average effect size for the 21 quasi-experiments in the sample.
Whether the intervention specifically targeted females or not: Many interventions are specifically designed to increase female school enrollment. We examined effect sizes for the eight interventions that had exclusively female samples and compared them to those of the vast majority of studies, which focused on both boys and girls.
Whether the study included outcomes for primary or secondary school students: We report effects for interventions that focused on primary schools, secondary schools, or included both types of schools.
We also report results for three different analyses related to methodology. The first was to examine how the effect sizes reported in the studies varied across dimensions of study quality (as rated “low,” “moderate,” or “high” by coders). The second was to determine the impact of our decision to use the individual sample sizes from the studies instead of the aggregate cluster sample sizes. Finally, we examined the potential bias due to unpublished studies.
Results
Pipeline of Studies
One hundred sixteen randomized experiments and quasi-experiments met our initial screening for eligibility. All of these were coded for study characteristics but not for effect sizes. We then conducted a second screening to again ensure that there was evidence of baseline control and that the original studies reported on participant data from 1990 or after. This left 81 eligible studies. Six studies, as mentioned earlier, did not report outcome data that we could use to create a quantifiable effect size, and three studies used overlapping samples of schools and students (we retained the study that tested the strongest treatment and removed the other two from our review). This left 73 total studies in our final sample.
Descriptive Statistics
The sample of studies was, as expected, very diverse. Studies were conducted in 27 different nations, with Kenya (N = 12), India (N = 9), Bangladesh (N = 6), Colombia (N = 5), and Jamaica (N = 5) the most common. Not surprisingly, most studies were conducted in the poorest developing nations: 51% in low-income countries (LICs) and 34% in lower middle-income countries (LMICs). All of the nations identified earlier except Colombia (which is classified as an upper middle-income country [UMIC]) fall into those two classifications.
Approximately 38 substantively different interventions were tested across these 73 studies; broadly, conditional cash transfers (N = 13), funding or grants to communities (N = 5), school breakfasts or lunches (N = 5), and remedial education or tutoring (N = 5) were the most common. Most of these programs targeted primary school-aged children (N = 44, 60%), with 10 focusing exclusively on secondary school-aged children (14%). A minority of studies involved interventions that included both primary and secondary students (N = 19, 26%). Nearly 9 of 10 studies (N = 65, 89%) included both boys and girls in the intervention; the remainder focused exclusively on girls (N = 8, 11%).
Of the studies, 52 (71%) used randomization to assign participants to groups and 21 (29%) used quasi-experimental procedures. Studies were published from 1995 to 2009, and as Figure 1 indicates, there has been a large increase in the number of eligible studies since the early 2000s.

Figure 1. Number of included studies by year of publication.
Studies assigned either individuals or larger aggregate units to treatment and control conditions. Most common were studies that assigned schools to treatment or control conditions (N = 31, 43%), followed by the assignment of individuals to treatment and control conditions (N = 14, 19%). Some studies assigned villages to treatment and control conditions (N = 13, 18%). Most of the studies included just one intervention and one control or comparison group (68%), but 14 (19%) used three or four groups in their designs.
Most relevant impact studies identified in this review were conducted by economists (82%). Indeed, only 21 (29%) studies were published in academic journals or books; the majority (N = 52, 71%) of reports were working papers or reports by international organizations such as The World Bank.
Most authors concluded that the intervention they studied had a positive impact on one of the main outcomes (N = 42, 57%), but a large minority indicated no impact (N = 21, 29%) and the remainder reported mixed results (N = 10, 14%).
Broad Intervention Types and Program Theory
Appendix B provides brief descriptions of the five major types of interventions and the underlying rationale for why these bundles of interventions were expected to influence a main outcome of interest (e.g., Rogers, Petrosino, Huebner, & Hacsi, 2000). Unfortunately, most reports did not include an explicit program theory. Therefore, we provide an implicit program theory based on the information in the report. This was a strategy used by Dutch researchers in examining police practices (e.g., van der Knaap, Leeuw, Bogaerts, & Nijssen, 2008). These five broad intervention groups are economic (N = 26); educational programs and practices (N = 19); health care and nutrition (N = 14); building schools and infrastructure improvements (N = 7); and providing information or training (N = 7).
Meta-Analysis
Using inverse-variance random-effects weights, we estimated the overall mean effect size d across studies separately for the different types of outcomes. Standardized mean differences (Cohen’s d) are scaled in the analyses as positive if there was a positive impact for the intervention (e.g., if enrollment increased or dropout decreased), negative, such as −.10, if there was a negative impact (e.g., an increase in dropout or decrease in enrollment), and 0 if the outcome was identical for the treatment group and the control group (e.g., 95% enrollment rate in both groups). Generally speaking, an effect size estimate of .10 reflects a 1/10 standard deviation improvement for treatment participants compared to control participants.
Note that in the text that follows, where we discuss each of the analyses, we round effect sizes to two decimal places. Also note that for each analysis of overall intervention effects, we present heterogeneity data, including the I², the τ² (between studies), and the Q-test, all indicators of how well the mean effect represents the sample of studies in the analysis.
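The pooling and heterogeneity statistics described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the DerSimonian-Laird estimator for τ² (the text does not name the estimator), and the function names and example data are hypothetical.

```python
import math

def i_squared(q, df):
    """I-squared: percentage of total variation attributable to heterogeneity."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0

def random_effects_summary(effects, variances):
    """Pool effect sizes with inverse-variance random-effects weights.

    Assumption: tau^2 is estimated with the DerSimonian-Laird method.
    """
    k = len(effects)
    # Fixed-effect (inverse-variance) weights and mean, needed for Q.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed_mean = sum(wi * d for wi, d in zip(w, effects)) / sum_w

    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (d - fixed_mean) ** 2 for wi, d in zip(w, effects))
    df = k - 1

    # DerSimonian-Laird between-study variance, truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)

    # Random-effects weights add tau^2 to each study's sampling variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    mean = sum(wi * d for wi, d in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return {
        "mean": mean,
        "ci": (mean - 1.96 * se, mean + 1.96 * se),
        "q": q,
        "df": df,
        "i2": i_squared(q, df),
        "tau2": tau2,
    }

# Hypothetical effect sizes and sampling variances, not taken from the review.
example = random_effects_summary(
    [0.10, 0.30, 0.20, -0.05],
    [0.004, 0.010, 0.006, 0.008],
)
print(example)
```

As a consistency check, `i_squared(875.94, 33)` gives approximately 96.23, which matches the I² reported for the enrollment analysis in this article.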
Overall Intervention Effects, Main Analyses of Enrollment, and Other Outcomes
In this section, we report the average effects of the interventions on main enrollment and other relevant outcomes (attendance, dropout, and progression). In Figure 2, the results are presented for 34 studies that measured the effect of an intervention on an enrollment outcome. As mentioned earlier, this analysis includes only the first effect reported in the study; as indicated, the first follow-up measure ranged from 4 to 216 months. Collectively, the average treatment effect was positive (d = .18; 95% CI [.13, .24]) and ranged from −.14 to .82. Only five of the studies reported a negative effect on enrollment (to the left of zero). As expected, heterogeneity statistics indicate substantial variability across effect sizes (Q = 875.94, df = 33, p < .001; I² = 96.23; τ² = .02).

Figure 2. Average effects on enrollment (n = 34).
There were 33 studies that included at least one quantifiable measure of attendance, measured on a wide range of time intervals, ranging from 1.2 to 41 months. Figure 3 presents the effect sizes for these 33 studies. Similar to Figure 2 and the enrollment results, the overall effect was positive (d = .15, 95% CI [.10, .20]), ranging from −.20 to .74. Four studies reported negative results on attendance (to the left of zero). There was considerable heterogeneity in the effect sizes (Q = 341.64, df = 32, p < .001; I² = 90.34; τ² = .01).

Figure 3. Average effects on school attendance (n = 33).
There were 18 studies that included at least one quantifiable measure of school dropout in their analyses. Timing of follow-up for the dropout outcomes varied greatly across these 18 studies, ranging from 7 to 144 months. Figure 4 presents the effect sizes for these studies. Compared to enrollment and attendance, the overall effect was positive but smaller (d = .05, 95% CI [.01, .09]), ranging from −.17 to .74. Three studies reported negative effects, that is, an increase in school dropout (to the left of zero). Again, there was considerable heterogeneity in the dropout effect sizes (Q = 61.08, df = 17, p < .001; I² = 71.17; τ² = .003).

Figure 4. Average effects on dropout (n = 18).
There were 15 studies that included at least one quantifiable measure of progression in their analyses. Follow-up of outcomes for progression varied greatly across these studies, ranging from 7 to 60 months. Figure 5 presents the effect sizes for these studies. The overall effect was positive and similar to those reported for enrollment and attendance (d = .13, 95% CI [.08, .18]), ranging from −.01 to .69. Only one study reported a negative effect (to the left of zero) on progression in school, but again there was considerable heterogeneity in the effect sizes (Q = 77.66, df = 14, p < .001; I² = 80.46; τ² = .01).

Figure 5. Average effects on progression in school (n = 15).
Overall Intervention Effects, Supplemental Analyses of Learning Outcomes
We also conducted meta-analyses of four distinct learning outcomes: math achievement, language achievement, standardized achievement tests, and other achievement measures (e.g., grades). Figure 6 presents the average effects for 25 studies that examined the impact of an intervention on math achievement. The follow-up interval ranged from 1 to 144 months. The average effect of these interventions was positive (d = .16, 95% CI [.10, .23]), ranging from −.32 to .62. Six studies reported negative effects on math achievement (left of the zero), and there was heterogeneity in these math achievement effect sizes (Q = 273.50, df = 24, p < .001; I² = 91.22; τ² = .02).

Figure 6. Supplemental effects on math achievement (n = 25).
Figure 7 presents the average effects for 25 studies that examined the effect of an intervention on language achievement. Such outcomes included tests on native language performance and, in some instances, scores on English language tests. These outcomes also included tests of reading or other comprehension measures. The follow-up interval again had a wide range, from 1 to 144 months. As with the math achievement results, the average effect of these interventions was positive (d = .18, 95% CI [.12, .25]). Effect sizes ranged from −.09 to .66. Five studies reported negative effects for an intervention on language achievement (left of the zero), and there was considerable heterogeneity in the language achievement effect sizes (Q = 325.60, df = 24, p < .001; I² = 92.62; τ² = .02).

Figure 7. Supplemental effects on language achievement (n = 25).
Figure 8 presents the average effects for 10 studies that examined the impact of an intervention on standardized achievement tests. These test scores comprised national or district tests and tended to include a range of subjects. The follow-up interval was narrower than in prior analyses, ranging from 12 to 24 months. The average effect of the interventions on standardized achievement tests was positive but about one third the size of prior effects on math- and language-specific tests (d = .06, 95% CI [−.02, .14]; Q = 17.37, df = 9, p = .043; I² = 48.19; τ² = .01). Effect sizes ranged from −.13 to .31. Three studies reported negative effects for an intervention on standardized achievement tests (left of the zero).

Figure 8. Supplemental effects on standardized achievement test scores (n = 10).
Figure 9 presents the average effects for five studies that examined the impact of an intervention on other achievement. This outcome included self-reported grades or achievement measures. Similar to the standardized achievement test analysis, the follow-up interval was narrower than in prior analyses, ranging from 12 to 24 months. The average effect of the interventions on other achievement measures was positive but the smallest of all analyses reported so far (d = .05, 95% CI [−.09, .19]; Q = 16.49, df = 4, p = .002; I² = 75.68; τ² = .01). Effect sizes ranged from −.07 to .36. One study reported negative effects for an intervention on other achievement (left of the zero).

Figure 9. Supplemental effects on other measures of achievement.
Moderating Variables
Given the large number of studies considered, and the diversity of interventions, samples, and settings, it is very likely that the effects varied across these different dimensions. This assumption was supported by the heterogeneity statistics presented in each of the prior analyses. In this section, we present analyses for seven moderating variables used to further explore and attempt to explain some of the observed heterogeneity in the effect sizes.
For moderator analyses, we relied on the 59 studies that reported enrollment or attendance outcomes (for the 8 studies that reported both, a mean effect size was computed). The weighted mean effect sizes for these two outcome categories were similar (d = .18 for enrollment, d = .15 for attendance) and there was no evidence that these two means were significantly different from each other (Q between = 3.97, p = .14).
Because not all included studies reported enrollment and attendance, and the means for enrollment and attendance were statistically similar, we combined the outcomes where needed to create a larger group of studies for moderator analysis, still retaining only one effect size per study for analysis. For each analysis, we report heterogeneity statistics testing for differences between groups (Q between) but caution that even within subgroups, there was large variation across studies in the types of interventions, participants, countries, designs, and other details.
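To make the between-group heterogeneity test concrete, the sketch below computes Q between from subgroup means and their standard errors and refers it to a chi-square distribution with (number of groups − 1) degrees of freedom. It is an illustration under stated assumptions rather than the authors' procedure: the function names and the example subgroup values are hypothetical, and the chi-square tail probability uses closed-form expressions for integer degrees of freedom so that only the standard library is needed.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over i < df/2 of (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: two-sided normal tail plus a finite correction series.
    tail = math.erfc(math.sqrt(half))
    term, total = math.sqrt(x), 0.0
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= x / (2 * j - 1)
        total += term
    return tail + math.sqrt(2.0 / math.pi) * math.exp(-half) * total

def q_between(means, std_errors):
    """Weighted sum of squared deviations of subgroup means from their pooled mean."""
    weights = [1.0 / se ** 2 for se in std_errors]
    grand = sum(w * m for w, m in zip(weights, means)) / sum(weights)
    q = sum(w * (m - grand) ** 2 for w, m in zip(weights, means))
    df = len(means) - 1
    return q, df, chi2_sf(q, df)

# Hypothetical subgroup means and standard errors, for illustration only.
q, df, p = q_between([0.18, 0.15], [0.030, 0.025])
print(f"Q between = {q:.2f}, df = {df}, p = {p:.2f}")
```

The tail function reproduces several p values reported in the moderator analyses that follow, for example p ≈ .65 for Q between = .87 with df = 2 and p ≈ .92 for Q between = .01 with df = 1.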
Broad intervention type
For this analysis, we examined the five broad intervention categories discussed earlier. Figure 10 presents the results. The largest effects are reported across the five studies in the new schools/infrastructure group (d = .44, 95% CI [.40, .47]). The average effect for this group was significantly larger than that for the next largest category, health care/nutrition (d = .23, 95% CI [.11, .36]). The smallest effects were reported for those interventions in the providing information/training group (d = .06, 95% CI [−.09, .05]) and the educational practices/programs group (d = .06, 95% CI [.01, .10]). Note that interventions classified as economic, health care/nutrition, or providing information/training all had overlapping CIs. The heterogeneity test of between-group differences confirmed that these effects varied across intervention types, as there was more heterogeneity between groups than would be expected by chance (Q between = 195.25, df = 4, p < .001).

Figure 10. Effect size by broad intervention type (n = 59 studies).
Specific intervention type
Another important policy and practice question is whether effect sizes varied across more specific intervention types. Figure 11 presents the results across 31 interventions. Most results were positive in direction. The five interventions reporting the largest average effects were asthma/epilepsy treatment (d = .74), early intervention (d = .61), malaria prevention (d = .59), road improvement (d = .50), and building new schools (d = .47). In this analysis, only 4 of the 31 specific interventions reported average negative effects (to the left of zero). These were providing returns on education (d = −.20), community participation and empowerment (d = −.09), family planning (d = −.01), and microfinance (d = −.02). A heterogeneity test of between-group differences confirmed that these effects varied across specific intervention types, as there was more heterogeneity between groups than would be expected by chance (Q between = 646.26, df = 30, p < .001). Again, these results must be interpreted cautiously, given that many of these subgroups included only a single study.

Figure 11. Effect size by specific intervention type (n = 59 studies).
World Bank classification of economies
In this analysis, we examined the average effect for interventions implemented in the three types of developing nations, as defined by the World Bank classification of economies (LIC, LMIC, and UMIC). The average effects for the 30 studies in LICs and the 21 studies in LMICs were both .16; studies in UMICs had a mean effect size of .10. The CIs for all three categories overlapped, and there was no evidence of significant differences across the three categories.
World Bank classification of developing regions
The World Bank also groups nations by developing regions of the world. We used that grouping as a moderating variable to examine the effects for different regions. Figure 12 presents the results. The largest effects are shown for the two studies in Europe and Central Asia (d = .58) and the three studies in East Asia and the Pacific (d = .36, 95% CI [.25, .48]). Overlapping CIs and a test of between-group differences indicated these two groupings were statistically different from each other (Q between = 19.33, df = 4, p < .001).

Figure 12. Effect size by World Bank classification of developing regions (n = 59 studies).
Type of evaluation design
As discussed earlier, our sample consisted largely of RCTs, likely reflecting our stringent eligibility criteria and screening of QEDs. In this analysis, we compare the average effect size for RCTs versus QEDs. The average effects for the different study designs were identical (d = .16). A test of between-group heterogeneity confirmed there was no evidence of a difference in mean effect sizes for RCTs versus QEDs (Q between = .01, df = 1, p = .92).
Whether intervention specifically targeted females or not
In recent years, there has been a strong emphasis by donor agencies and the governments of developing nations on specifically targeting females for educational initiatives. In this review, eight studies tested interventions that specifically targeted girls (although some may have examined spillover effects on boys), including six that were scholarship/fellowship programs. The average effect for girl-focused interventions was slightly larger in those 8 studies than for the 51 evaluations that included both boys and girls (d = .18 vs. d = .15), but there was no evidence of a difference between the two groups (Q between = .19, df = 1, p = .66).
Whether the study included outcomes for primary or secondary school students
Another aspect of the wide diversity of the included studies is that some target primary school students, some target secondary students, and others include outcomes for students at both school levels. The average effect size for studies including only primary student outcomes was .14, and that for studies including only secondary student outcomes was .19. The average effect size for those interventions that included both types of students was .17. There was no evidence of a significant difference between the three types of studies (Q between = .87, df = 2, p = .65).
Methodological Quality Checks
Evaluation quality scores
We also examined whether a rating of “moderate” or “strong” threat to the study’s conclusions on the four methodological items influenced the average effect size across the studies. For example, a study that had none of the 4 items rated as a moderate or strong threat to validity received a “0.” Likewise, a study that received a rating of a moderate or strong threat on all 4 of the methodological items was scored a “4.” For this analysis, as with the moderating variable analyses, the 59 studies that provided enrollment or attendance outcomes were included (the 8 studies that reported both outcomes were averaged). For more details on the evaluation quality scores, please consult Petrosino et al. (2012).
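The scoring rule described above amounts to counting how many of the four methodological items were rated as a moderate or strong threat. A minimal sketch follows; the rating labels and the function name are hypothetical stand-ins, since the coding instrument itself is documented in Petrosino et al. (2012) rather than here.

```python
# Ratings treated as a threat to the study's conclusions (labels are assumed).
THREAT_LEVELS = {"moderate", "strong"}

def method_quality_score(item_ratings):
    """Count the methodological items rated as a moderate or strong threat."""
    if len(item_ratings) != 4:
        raise ValueError("expected ratings for exactly four methodological items")
    return sum(1 for rating in item_ratings if rating in THREAT_LEVELS)

# Hypothetical ratings for three studies.
print(method_quality_score(["low", "low", "low", "low"]))               # no threats
print(method_quality_score(["low", "moderate", "strong", "low"]))       # two threats
print(method_quality_score(["strong", "strong", "moderate", "strong"])) # four threats
```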
Figure 13 presents the results of this analysis. The methodological problems in the large majority of studies in this moderating analysis were rated as presenting little or no threat to study conclusions (N = 57, 97%). Only two (3%) studies had two or three methodological problems rated as moderate or strong threats to study conclusions, scoring 2 or 3 on the Method Quality score. These findings are likely due to the especially strong designs of the RCTs and QEDs screened into this review. The average effect size for studies that scored “0” was nearly identical (d = .16) to that for studies that scored a “1” (d = .17). The larger average effect (d = .18) for studies scoring a “3” was based on only one study, as was the negative result for studies scoring a “2” (d = −.02). CIs overlapped for all categories except for the study that scored a “2.” Heterogeneity statistics indicate that at least two levels of this variable (as it was coded here) were significantly different (Q between = 23.79, df = 3, p < .001), which is clearly driven by the mean negative effect size for the studies with poorer methodological quality ratings.

Figure 13. Effect size by methodological quality rating (n = 59 studies).
Using individual versus aggregate sample sizes to compute effect sizes
As mentioned previously, when possible, we used sample sizes for the individual students in the studies rather than the sample sizes for the aggregate units that were randomly or quasi-experimentally assigned to conditions. So, for example, if a study randomly assigned 10 villages each to treatment and control conditions, and then reported analyses on enrollment using individual sample sizes, we used those individual sample sizes to compute effect sizes. All such studies took clustering into account when analyzing at the individual level, so no corrections for lack of clustering were applied to these data.
We conducted one post hoc methodological check to see how different the average effect sizes were when using aggregate units of assignment versus using the sample sizes of individual students in the studies. We compared 12 (16% of the total review sample) studies in which the effect sizes for both aggregate and individual sample sizes were computed. As the table indicates, the differences in the average effects and variances, at least in this analysis, were not substantial (d = .16 when using individual sample sizes, and d = .20 when using aggregate sample sizes). If this analysis holds true across all studies, the estimates of both effect size and variance using individual sample sizes would represent an underestimate for the studies compared to using aggregate sample sizes. It should be cautioned that these are unweighted effect sizes.
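The choice between individual and aggregate sample sizes matters mainly for the variance attached to each effect size. As a hedged illustration, the sketch below uses the standard large-sample variance of a standardized mean difference; the text does not state the exact formula the review used, and the figure of 50 students per village is hypothetical, building on the 10-villages-per-arm example above.

```python
def variance_of_d(d, n_treatment, n_control):
    """Large-sample sampling variance of a standardized mean difference."""
    n_total = n_treatment + n_control
    return n_total / (n_treatment * n_control) + d ** 2 / (2.0 * n_total)

d = 0.16

# Aggregate units: 10 villages assigned to each condition.
var_aggregate = variance_of_d(d, 10, 10)

# Individual units: assume 50 students per village (hypothetical), so 500 per arm.
var_individual = variance_of_d(d, 500, 500)

print(f"variance using aggregate N:  {var_aggregate:.5f}")
print(f"variance using individual N: {var_individual:.5f}")
```

Because the variance shrinks as the sample size grows, the same effect size carries far more weight in an inverse-variance analysis when individual sample sizes are used.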
Publication bias
Publication bias refers to bias that can occur if unpublished or difficult to locate studies produce systematically different results than those reported in published or easy to locate studies. Presumably, smaller sample size studies are less likely to be published than those with larger sample sizes, as are studies with null (i.e., nonsignificant) or negative findings.
To assess the possibility of publication bias, we visually examined a funnel plot and conducted an Egger regression test for funnel plot asymmetry (Egger, Smith, Schneider, & Minder, 1997). The funnel plot (Figure 14) was relatively symmetric, due in large part to the fact that very few studies had large standard errors (regardless of the direction or magnitude of the effect size). However, there were no studies in the meta-analysis with large standard errors and null/negative effect sizes. Results from the Egger and colleagues’ (1997) test for funnel plot asymmetry indicated a significant positive association between the effect size and standard error (b = 1.89, p = .003, 95% CI [.68, 3.11]), indicating possible evidence of publication bias. Although there was some evidence of asymmetry in the funnel plot, we conclude that any possible bias is unlikely to have had an appreciable effect on the substantive conclusions of the meta-analysis. Most of the studies included in the meta-analysis were reported in published and unpublished formats, and there simply are not many small sample size studies in this field of study (regardless of effect direction/magnitude). Thus, it is likely that any observed asymmetry in the funnel plot is due to this literature being dominated by large-scale trials, rather than publication bias per se.

Figure 14. Funnel plot of standard error by standardized difference in means (n = 73 studies).
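For readers unfamiliar with the asymmetry test, the following sketch shows the classic Egger formulation: each study's standardized effect (d divided by its standard error) is regressed on its precision (1 divided by the standard error), and an intercept far from zero signals funnel plot asymmetry. Whether the review used exactly this variant is an assumption, the data are hypothetical, and the p value is omitted because it requires a t distribution that the standard library does not provide.

```python
import math

def egger_test(effects, std_errors):
    """Egger regression: standardized effect on precision, by ordinary least squares."""
    k = len(effects)
    if k < 3:
        raise ValueError("need at least three studies")
    x = [1.0 / se for se in std_errors]               # precision
    y = [d / se for d, se in zip(effects, std_errors)]  # standardized effect
    mean_x = sum(x) / k
    mean_y = sum(y) / k
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual variance with k - 2 degrees of freedom.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (k - 2)
    se_intercept = math.sqrt(sigma2 * (1.0 / k + mean_x ** 2 / sxx))
    t_stat = intercept / se_intercept if se_intercept > 0 else float("nan")
    return {"intercept": intercept, "slope": slope,
            "se_intercept": se_intercept, "t": t_stat, "df": k - 2}

# Hypothetical data in which smaller studies (larger SE) show larger effects.
effects = [0.12, 0.15, 0.22, 0.35, 0.50]
std_errors = [0.03, 0.05, 0.08, 0.12, 0.20]
print(egger_test(effects, std_errors))
```

In this formulation the intercept plays the same role as the coefficient on the standard error in a precision-weighted regression of effect size on standard error, which is how the association is described in the text above.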
Conclusions
In this review, we identified 73 experimental and quasi-experimental studies that examined the impact of an intervention on at least one main outcome of enrollment, attendance, dropout, and progression in a developing nation. We also examined the effects of these interventions on supplemental learning outcomes of math achievement, language achievement, standardized achievement tests, and other achievement. In sum, average effects across the four main outcomes were all positive and statistically significant, although effects on enrollment, attendance, and progression were larger than those on dropout. Results indicated positive and statistically significant effects for two of the four supplemental outcomes (math and language), but there was no evidence of effects on standardized test scores or other achievement outcomes. Although most of the findings indicated beneficial intervention effects on the outcomes of interest, these effects were relatively small in magnitude. The average intervention effect on enrollment (d = .18) is equivalent, all things being equal, to an approximately 9% improvement in enrollment in the intervention group. Thus, most of these effects are equivalent to 3–9% improvements in the intervention versus control groups.
Although we did not explicitly look at gains or losses in supplemental outcomes compared to main outcomes (i.e., did increases in enrollment result in decreases in test scores), the average effect sizes for the supplemental outcomes of language and math were as large as or larger than those for the main outcomes of enrollment, attendance, dropout, and progression. This provides some early evidence, at least, that increases in enrollment and attendance in the schools studied did not necessarily overwhelm resources and swamp the quality of teaching or inhibit learning. Nonetheless, there was no evidence of effects on other standardized assessment scores or achievement measures, although these analyses were based on fewer studies and may have been underpowered to detect such effects.
The average effects reported in this review should be tempered by noting the diversity of studies, samples, countries, interventions, and measures in the 73 studies synthesized here. Figures 10 and 11 presented the average effects for five broad groups and for 31 specific types of interventions evaluated in this sample of studies. They present some early indications, comparatively, about the effects of these interventions on the outcomes summarized here. Several cautions are in order, however. First, in many specific intervention categories, only one or two studies have been reported. Second, our analyses focused only on main outcomes of enrollment, attendance, dropout, and progression (and then examined supplemental outcomes of learning); there may be other very important outcomes for child employment, health, and other school outcomes (e.g., teacher attendance and efficacy) that we do not summarize here.
With those caveats aside, results from moderator analyses indicated variability in effects across different broad categories of interventions. Namely, interventions that are assumed to have a more direct pathway to addressing the underlying barrier to influencing school enrollment, attendance, dropout, or progression outcomes seem to be more effective. Programs grouped in this review as “new schools/infrastructure” had the largest average effects (d = .44; see Figure 10). These interventions dealt with improving school or community infrastructures in an attempt to boost enrollment, such as building new schools, repairing schools, and improving roads/access to schools. Other interventions grouped in the economic, health care/nutrition, or providing information/training categories showed smaller effects and were not statistically different from each other. To be fair, the direct goals of interventions in the categories such as providing information/training or educational programs/practices may have been improving school management or addressing student learning deficiencies; outcomes such as attendance and progression may have been more distal to program goals and thus just a few of many outcomes the investigators in those studies collected and reported on to assess impact. Despite the variability of effects across intervention types, it should be noted that no data were provided here about the possible additive effects of providing multimodal interventions.
Apart from the average effects for broad interventions, we also examined six other moderating variables. There was no evidence that World Bank classification of economy, school level, research design, or gender of participants moderated the effects of intervention on enrollment outcomes. However, there were differences in intervention effects according to the World Bank classification of developing regions. Results indicated that studies conducted in Europe and Central Asia and East Asia and the Pacific had the largest average effects, whereas those in sub-Saharan Africa and Southeast Asia exhibited the smallest (but nonetheless statistically significant) effects. It should be cautioned, however, that the two regions with the largest effects yielded the fewest studies in this analysis (2 and 3, respectively).
Implications for Practice and Policy
One key question is what policy and practice decision makers can take from these results. For one, the results can provide some early data on “best bets.” In other words, the average effect of the interventions represented here will likely hover around a 9% increase in student enrollment, a 7–8% increase in student attendance (or decrease in absenteeism), a 2–3% decrease in student school dropout, and a 6–7% increase in students making adequate progress in their education (all things being equal and making assumptions about the baseline). Likewise, the evidence indicated 9% increases in math and language achievement, and negligible improvement in standardized achievement tests and other achievement outcomes.
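The text does not state how effect sizes were translated into percentage improvements. One conversion that reproduces the figures quoted above, offered here as an assumption rather than the authors' documented method, is to convert d to a correlation r (assuming equal group sizes) and read r as a difference in success rates, in the manner of a binomial effect size display. The function names are hypothetical.

```python
import math

def d_to_r(d):
    """Convert a standardized mean difference to a correlation (equal group sizes)."""
    return d / math.sqrt(d * d + 4.0)

def percentage_point_gain(d):
    """Approximate difference in success rates, in percentage points."""
    return 100.0 * d_to_r(d)

# Average effects reported in the main analyses.
for outcome, d in [("enrollment", 0.18), ("attendance", 0.15),
                   ("dropout", 0.05), ("progression", 0.13)]:
    print(f"{outcome}: d = {d:.2f} -> about {percentage_point_gain(d):.1f} points")
```

Under this conversion, d = .18 corresponds to roughly 9 points, d = .15 to between 7 and 8, d = .05 to between 2 and 3, and d = .13 to between 6 and 7, in line with the ranges given in the text.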
Whether these improvements are sizable enough to warrant further investments depends on more information than is currently available here, including costs of interventions and alternatives. Although we coded information on economic data when available (i.e., when it was contained in the main evaluation report or in another publication on the study), it was reported too infrequently and in such varying levels of detail that we could not synthesize it in any defensible fashion. Policy makers and practitioners should therefore consider the costs associated with implementing different types of interventions relative to the potential gains they can expect in educational outcomes. Other important considerations would be the effect of these interventions on other individual outcomes such as morbidity and mortality as well as larger community-level outcomes that may be improved from certain types of infrastructure-building interventions. Of course, decision making about investments in education also depends on the scope of the problem in the jurisdiction; for example, health care/nutrition treatments would only be feasible in those regions in which illness or nutritional deficiency is a significant barrier to student attendance. Researchers and practitioners interested in implementing such interventions must therefore be attuned to the needs of the community.
Another issue for practitioners to consider is the extent of the enrollment problem in the region, nation, and within the country itself. Enrollment rates vary by level of school (primary and secondary) and for different regions of the world. Furthermore, enrollment rates can greatly vary within a specific country, particularly by gender and socioeconomic status (Birdsall et al., 2005). The average improvement noted earlier has to be considered in light of these regional and sociodemographic differences in baseline enrollment rates. A 3% improvement might be considered more modest in a nation in which only 50% of youth enroll in primary school, as opposed to a nation in which 90% enroll.
Appendix A
Appendix B
Theory of Change for Broad Intervention Groups.
| Broad Intervention Group | Underlying Barrier to School Participation | Type of Interventions | Key Mechanism | Primary Outcomes |
|---|---|---|---|---|
| Economic | Costs of going to school; child needed at home or to work to supplement income; financial benefits of education not recognized | Conditional and unconditional cash and/or food transfers; fee reduction or elimination; vouchers; providing uniforms; microfinance loans; fellowships and scholarships | Removal of fiscal barrier to student participation; economic benefit is incentive to send youth to school; economic benefit outweighs youth income from working | Increases in enrollment, attendance, and progression; decrease in school dropout |
| Educational programs and practices | Student deficiencies or poor school quality inhibit student engagement and learning | Teacher incentives; teacher training; textbooks, flip charts; improving school management; school funding; dropout prevention program; English language teaching technology; computers; remedial tutoring | Improving school quality and/or addressing student deficiencies will lead to greater student engagement; parents will see more benefit to sending youths to school | Increases in attendance and progression; decreased dropout |
| Health care and nutrition | Illness or nutritional deficiency keeps youth out of school | Vitamin A; malaria prevention/treatment; preschool nutritional supplement; school meals; menstrual cups; deworming; asthma/epilepsy treatment | Reducing illness and improving nutrition means youth are healthy enough to come to school; parents and youth have incentive for youth to go to school to receive meal (school breakfast/lunch) | Improved attendance and progression; reduced dropout |
| Building schools and infrastructure improvements | Schools are too far away or in disrepair, discouraging youth participation; parents afraid to send youth to school for safety reasons; inability to access markets provides economic barrier to household wealth | Building new schools in communities; providing infrastructure funding for new roads or school repairs | Schools close enough for students to attend easily; new roads improve parent access to markets and ability to get work and overcome fiscal barriers (see economic above); school conditions improve and parents are more inclined to send youth | Increased enrollment, attendance, and progression; decreased dropout |
| Providing information or training | Parents/youth unaware of benefits of youth attending school; parents unaware of school performance; pregnancy inhibits female completion of school; communities disorganized and unable to hold schools accountable for education of youth | Providing livelihood skills; family planning; parent training; providing information on perceived benefits; providing report cards; organizing and empowering communities | Females who are empowered and not pregnant will be more likely to pursue education and employment; parents and youth that understand benefits to education may be more likely to ensure school participation; communities that work together well can hold schools accountable; parents and schools that receive information on school performance will be better able to push for quality improvements | Increased enrollment, attendance, and progression; decreased dropout |
Acknowledgments
This article is a revised version of a Campbell Collaboration review published in 2012. The authors thank 3ie for funding this project and for the assistance of Hugh Waddington, Howard White, Arun Virk, Anita Sircar, and Ami Bhavsar. We also thank Sandra Wilson of the C2 Education Group, Terri Pigott of the C2 Methods Group, and three anonymous peer reviewers for their comments on the C2 review. We thank Susan Mundry and Janet Phlegar of WestEd for their support. David Wilson of George Mason University served as a methodological consultant and was invaluable in helping us decipher study designs and compute effect sizes. We appreciate the assistance of staff at Bridgewater State University and the University of Pennsylvania in acquiring full-text documents. Many persons helped us find studies or responded to inquiries about their own work including Sajeda Amin, M. Caridad Araujo, Niaz Asadullah, Felipe Barrera-Osorio, Jere Behrman, Fernando Borraz, Eliana A. Cardoso, Mark Clayton, Jishnu Das, Esther Duflo, David K. Evans, Tazeem Fasih, Alison Girdwood, Steve Glazerman, Paul Glewwe, Peter Jay Glick, Tricia Gonwa, Louise Grogan, Jyotsna Jalan, Michael Kremer, Frans Leeuw, Dan Levy, John Maluccio, Reynaldo Martorell, Richard Murname, Muthoni Ngatia, Peter F. Orazem, Carolyn Petrosino, Laura Rawlings, Yasuyuki Sawada, Kartini Shastry, Andre Portela Sousa, Jos Vaessen, Emiliana Vegas, and Nick York. We also thank the following persons for their assistance or comments on various aspects of the research: Betsy Becker, Arild Bjørndal, Michael Borenstein, Suzanne Burns, Mary Cazabon, James Derzon, Mark Dynarski, Sarah Guckenburg, Sue Henderson, Gary Henry, Priya Joshi, Julia Lavenberg, Mark Lipsey, Dan Mello, Eamonn Noonan, Chad Nye, Pamela Serozynski, Will Shadish, and Eliza Spang.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research and/or authorship of this article: The International Initiative for Impact Evaluation (3ie) funded this project. In addition, WestEd provided some in-kind support.
