Abstract
Randomised Control Trials (RCTs) are both widely used in development, reaching hundreds of millions through RCT-informed policy, and highly regarded, receiving a Nobel Prize in Economics. Proponents, largely academic economists, position RCTs as a scientific and ideologically neutral way to get to the heart of ‘what works’ in development. However, this new and radical micro-experimental approach to poverty reduction has sat uneasily within the broader development sphere, including among critical geographers. We contribute to this debate by empirically examining the practical manifestation of the methodology within ‘PNPM Generasi’ – an innovative Indonesian cash transfer program evaluated through the largest RCT to date. Drawing primarily on field interviews, our examination finds three departures from the gold standard proclamations of the RCT methodology in practice. First, we find that randomising aid delivery breached ethical guidelines and compromised the effectiveness of the program being studied, largely as a result of RCTs being run in secret from local communities and politicians. Second, we find applying the RCT methodology on the ground was technically complex and financially costly, not only requiring an excessive use of scarce program funds, but also undermining the validity of the evaluation results. Finally, we find RCT results were not widely used in practice. We demonstrate these findings are not restricted to our specific case, but rather, reflect systemic deficiencies with RCT evaluations. Our results suggest that RCTs may have value if applied judiciously, and as part of multi-pronged approaches, but we caution against their growing monopoly influence on poverty reduction.
Introduction
The 2019 Nobel Prize in Economics 1 was awarded to a trio of researchers, Abhijit Banerjee, Esther Duflo and Michael Kremer from the Abdul Latif Jameel Poverty Action Lab (JPAL), for their work popularising the use of Randomised Control Trials (RCTs) in anti-poverty policy. The prize represented the latest achievement for a movement, led by JPAL, which has transformed RCTs from a niche methodology, regarded with suspicion (J.L.P, 2013), to the way to ‘do development’ (Webber and Prouse, 2018). Proponents of RCTs argue against traditional approaches to development, which focus on ‘big questions’ like the root cause of poverty and the place of free markets (Banerjee and Duflo, 2011: 3). These questions, they argue, are too complex, contested and difficult to measure. Instead, they advocate a shift to ‘small questions’ that can be definitively answered. The answers to these small questions, they argue, gradually build a bigger picture of ‘what works’ and, through this, more effectively reduce poverty. Central to this approach is the use of RCTs to rigorously – that is quantitatively and causally – identify what is effective in reducing poverty.
RCTs adapt the experimental design of randomised drug trials to estimate the effectiveness of anti-poverty programs (Duflo et al., 2008). For example, to evaluate the effectiveness of a Conditional Cash Transfer (CCT) scheme (like Generasi) in addressing malnutrition outcomes researchers would randomly allocate the scheme to some villages, the ‘treatment’ group, but not to others, the ‘control’ group. Researchers can then measure the difference in malnutrition outcomes between the groups to elicit the exact causal effect of the program on malnutrition. 2
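The logic of this design can be sketched in a few lines of code. The example below is a minimal illustration only, not the Generasi evaluators' method: the village names, the outcome values and the equal split between groups are all invented for the purpose of showing how random allocation followed by a difference in group means yields an effect estimate.

```python
import random
from statistics import mean


def assign_villages(village_ids, seed=0):
    """Randomly split villages into equally sized treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(village_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return set(shuffled[:half]), set(shuffled[half:])


def difference_in_means(outcomes, treatment, control):
    """Estimate the average effect as mean(treatment) minus mean(control)."""
    treated = [outcomes[v] for v in treatment]
    untreated = [outcomes[v] for v in control]
    return mean(treated) - mean(untreated)


# Hypothetical example: eight invented villages.
villages = [f"village_{i}" for i in range(8)]
treatment, control = assign_villages(villages, seed=42)

# Invented malnutrition rates (%): treated villages are set 5 points lower.
outcomes = {v: (25.0 if v in treatment else 30.0) for v in villages}

estimate = difference_in_means(outcomes, treatment, control)
```

In this stylised case the estimate recovers the invented five-point reduction exactly, because the outcomes contain no variation other than the treatment itself. Real survey data would add sampling variation around any such estimate.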
The expert member of the Nobel Committee cited the enormous potential of this “scientific approach” to poverty reduction in justifying the Academy’s choice (Svensson, 2019). RCTs now play a central role in the fight against poverty: in total, more than 400 million people have been reached by programs that were scaled up after being evaluated by researchers affiliated with JPAL 3 (Royal Swedish Academy of Sciences, 2019: 32). As a result, the effective allocation of hundreds of billions of dollars for some of the world’s most disadvantaged peoples is increasingly reliant on the robustness and appropriateness of RCTs as a policy evaluation tool.
This paper empirically tests this robustness, in practice, through a case study of ‘Generasi’, 4 an innovative Indonesian CCT scheme evaluated with the world’s largest RCT. Our analysis draws on field interviews with high level actors, combined with a deep dive into the Generasi grey literature, RCT data and administrative documentation. We follow the full lifespan of the RCT methodology, examining: the process of implementing the Generasi RCT, the consequences of experimentation, and, how the RCT’s results were used for implementing poverty reduction policies. Through this exercise we seek to understand how RCTs operate in practice, what they entail on the ground and, in doing so, contrast the practicalities of RCT evaluations with narratives of their clinical, definitive nature in the literature. Our contribution is to show the complexities and contingencies of implementing RCTs in practice and the implications of experimentation for poverty reduction. While many of the shortcomings of the RCT methodology we identify echo the concerns of the critical literature, there is a dearth of empirical evidence and dedicated case studies to demonstrate how they manifest in practice.
We find, first, that incorporating an RCT to evaluate the Generasi program was financially onerous and technically challenging. However, even with unprecedented access to resources, program evaluators struggled to achieve the theoretical rigour that RCTs promise. Second, randomising aid delivery resulted in a number of negative consequences for RCT communities. These include denying approximately one million Indonesians access to an anti-poverty program that was known to be effective, and reducing the effectiveness of the program being studied. Despite this, and third, the RCT results were not widely used in practice due to the blunt information they generated and the dynamic nature of anti-poverty policy making. This was juxtaposed with the utility of substantially cheaper and less invasive qualitative evaluations. Not only, therefore, might RCTs not actually be the ‘gold standard’ for evaluating development policies, as they have come to be known (Stevano, 2020), but they may actively undermine – structurally and practically – the pursuit of poverty reduction.
Following an overview of our methodology and a brief literature review, we discuss our key empirical findings in three subsequent sections. We conclude by drawing on our findings to propose recommendations for future evaluation programs, and comment on the place of RCTs within the broader context of poverty reduction.
Methodology
Case study: The RCT evaluation of Generasi program
The Generasi program was first implemented in 2007 by President Susilo Bambang Yudhoyono as part of a national push to consolidate the country’s fractured poverty reduction approach, which in 2005 consisted of 52 poverty reduction programs, driven by 27 ministries (Friedman, 2014: 1). The Government of Indonesia [GoI] implemented Program Nasional Pemberdayaan Masyarakat 5 (PNPM) to coordinate these disparate and overlapping, yet siloed, poverty reduction efforts under one umbrella. Generasi 6 was one of 12 ‘sub-programs’ coordinated under this umbrella, focusing on improving rural child and maternal health and education through improved utilisation of existing frontline services (e.g. increasing school enrolment). It did this through the provision of innovative Conditional Cash Transfer (CCT) grants to rural communities.
The design of these grants built on the Kecamatan Development Project (KDP) – an existing program that provided rural communities with unconditional cash grants, which were then allocated by locally elected management teams. GoI found these teams mostly allocated KDP grants towards physical infrastructure, which they perceived to be ‘safe choices’, at the expense of human capital (e.g. health and education). Generasi sought to address this imbalance by specifying that funds must be used for activities targeting child and maternal health and education, and incentivised effective allocation through the addition of a conditional ‘performance bonus’.
In the style of traditional CCTs, this bonus made cash grants conditional on villages completing certain development activities (see, for instance, Peck and Theodore, 2010 for a critical assessment of CCTs). However, where Generasi differed was the competitive nature of the performance bonus. Under this new program design, villages were allotted an annual cash grant adjusted for population and poverty level; that is, more populous and poorer villages received larger grants. Villages directly received 80% of grant money, while 20% was siphoned into a ‘performance bonus pool’, shared by all Generasi villages in a kecamatan (sub-district). Each village competed for a share of this pool based on their relative performance in improving 12 indicators of infant mortality, child malnutrition, maternal mortality, and educational learning quality. In broad terms, the more villages were able to improve the 12 specified indicators, relative to other Generasi villages in the same subdistrict, the higher the share of the performance bonus pool they received. 7 The prospect of a higher performance bonus, then, would hypothetically provide an incentive for villages to effectively allocate money towards improving the specific indicators Generasi was targeting.
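The broad structure of this mechanism can be sketched as follows. The 80/20 split is taken from the description above, but the rule for dividing the pool is an assumption: the sketch shares the pool in proportion to a hypothetical non-negative performance score, which is one simple way of rewarding relative performance and is not necessarily the formula Generasi used. The village names, grant amounts and scores are invented.

```python
def allocate_grants(base_grants, scores, bonus_share=0.20):
    """Split each village grant into a direct part and a pooled bonus part.

    base_grants: dict of village -> annual grant (any currency unit)
    scores: dict of village -> non-negative performance score (hypothetical)
    Returns (total received per village, size of the bonus pool).
    """
    direct = {v: g * (1 - bonus_share) for v, g in base_grants.items()}
    pool = sum(g * bonus_share for g in base_grants.values())
    total_score = sum(scores.values())
    if total_score == 0:
        # Assumption: with no measured improvement, split the pool evenly.
        bonus = {v: pool / len(base_grants) for v in base_grants}
    else:
        # Assumption: pool shared in proportion to relative performance.
        bonus = {v: pool * scores[v] / total_score for v in base_grants}
    received = {v: direct[v] + bonus[v] for v in base_grants}
    return received, pool


# Hypothetical subdistrict with three villages.
base_grants = {"A": 100.0, "B": 100.0, "C": 200.0}
scores = {"A": 3, "B": 1, "C": 4}
received, pool = allocate_grants(base_grants, scores)
```

Under these invented numbers, village A ends up with more than its base grant and village B with less, while the subdistrict total is unchanged. The bonus redistributes money between villages rather than adding to it.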
This performance bonus mechanism was devised by researchers affiliated with JPAL and was the first scheme of its kind in 2007 (Olken et al., 2014b: 5). As such, Generasi presented an opportunity to test the hypothesised benefits of incorporating performance bonuses into cash grants. To this end, JPAL affiliated researchers worked with the GoI and World Bank to integrate a large scale RCT into the implementation of Generasi. The RCT sought to answer two questions: How effective were cash grants in improving the program’s target indicators? And, did adding a performance bonus improve the effectiveness of grants?
The RCT commenced in 2007, encompassing 3100 villages in five provinces 8 across Indonesia. The evaluation study noted that “to the best of our knowledge this represents one of the largest randomized social experiments conducted in the world to date” (Olken et al., 2014b), and we have not encountered an RCT with more participants in our review of the literature. 9 Entire kecamatan were randomly allocated either a Generasi grant (treatment group) or no grant (control group). The treatment group was further randomly allocated either the incentivised or standard (non-incentivised) version of the cash grant. In total, the evaluation covered approximately 1.8 million people in 2100 villages in the two treatment groups and another ∼900,000 people in 1000 villages in the control group.
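A two-stage allocation of this kind, with whole kecamatan as the unit of randomisation, might look like the sketch below. It is illustrative only. The number of kecamatan, the size of the control group and the equal split between the two grant versions are assumptions, and any stratification the evaluators may have used is not modelled.

```python
import random


def randomise_kecamatan(kecamatan_ids, n_control, seed=0):
    """Assign whole kecamatan to control or to one of two treatment arms.

    Stage 1: draw n_control kecamatan at random as the control group.
    Stage 2: split the remainder evenly between incentivised and
    non-incentivised grants (the even split is an assumption).
    """
    rng = random.Random(seed)
    shuffled = list(kecamatan_ids)
    rng.shuffle(shuffled)
    control = shuffled[:n_control]
    treated = shuffled[n_control:]
    half = len(treated) // 2
    return {
        "control": control,
        "incentivised": treated[:half],
        "non_incentivised": treated[half:],
    }


# Hypothetical example: 30 invented kecamatan, 10 of them held as controls.
kecamatan = [f"kecamatan_{i}" for i in range(30)]
arms = randomise_kecamatan(kecamatan, n_control=10, seed=7)
```

Because allocation happens at the kecamatan level, every village within a kecamatan shares the same status, which is why entire subdistricts, rather than individual villages, went without grants.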
The RCT evaluation initially comprised three survey rounds over five years: a Baseline Survey (2007), a Mid-Term Impact Survey (2009) and a Final Impact Survey (2011). Evaluators in each round conducted extensive field surveys in RCT villages to measure changes in the indicators of child and maternal health and education Generasi was targeting. 10 These surveys concluded that Generasi was effective, positively impacting all targeted indicators, particularly malnutrition (World Bank, 2011). The 2011 evaluation also concluded the incentivised grants were more effective than standard grants. Citing these positive RCT results, the GoI gradually scaled up the Generasi program to 499 kecamatan in 11 provinces by 2017.
However, the original treatment and control areas were left unchanged during this scale up – that is, Generasi was expanded to new provinces rather than existing control kecamatan. This maintained the original experimental design and allowed researchers to conduct another wave of RCT surveys in 2016–17 for a “virtually unprecedented” nine year long-term evaluation (World Bank, 2018b: 12). The results of this long-term evaluation were published in 2018, and were less positive, indicating that the differences between Generasi and control areas did not persist over the long term (World Bank, 2018b). Generasi has since concluded and, at the time of writing, the GoI is using the evaluation to determine which aspects of the program should be integrated into Village Law, a nationwide decentralisation effort currently underway.
In addition to the RCT, three qualitative evaluations of Generasi were conducted. The first published in 2011 by the Indonesian SMERU research institute conducted a comparative examination of Generasi and ‘Program Keluarga Harapan’, another CCT program targeting human capital in urban areas. Two further qualitative evaluations were published by the World Bank and University of Auckland affiliated academics, examining Generasi’s causal mechanisms (Grayman et al., 2014, 2018). These three qualitative evaluations aimed to provide more nuanced examination of the processes and impacts of Generasi, as opposed to the RCT's focus on the effectiveness of Generasi as a whole. However, they received substantially smaller budgets and were published separately to the main RCT evaluation.
Methodological approach
Our analysis draws on key-informant interviews with Generasi actors, a review of the extensive Generasi grey literature and a deep dive into administrative documents generated by Generasi’s bureaucratic organs. Fieldwork was conducted in mid-2019, when the findings of the Generasi evaluation were being integrated into Indonesia’s broader anti-poverty framework, presenting a valuable opportunity to study an RCT at the moment it was informing policy.
We conducted semi-structured interviews with eight key-informants from six Generasi organisations based in Jakarta. This sample of interviewees represents all major implementing, governmental and civil society institutions in the Generasi evaluation and encompasses the entire life cycle of the RCT, from design through implementation to application. Interviews were recorded and transcribed, then coded to identify emergent themes. To maintain the anonymity of interviewees, we do not provide a list of interviewees or organisations. This very strict anonymity was crucial to obtaining robust interview data. The Generasi RCT involved the allocation of large sums of money, and the interaction of many politically vested actors. As such, many questions regarding the evaluation were politically sensitive. Interviewees indicated that without complete anonymity, they would not be able to participate in the research or provide frank responses. However, we do provide contextual information when drawing on interview data.
The political sensitivity of the program represented a methodological limitation as participants had an incentive to ‘pass the buck’, claim ignorance or present their experiences in a manner which positively reflected their host institution and themselves. We were able to mitigate this to some extent by asking open ended questions about the same aspects of Generasi across interviews and then triangulating these responses. Through this, we were able to ‘fill in the blanks’ and get a sense of key events from different positions. The political sensitivity of the program also presented a surprising methodological boon. The tendency to ‘pass the buck’ meant we were able to get good insight into what the problems in the program were, even if we were not able to precisely understand the dynamics of decision-making at the root of these problems.
We further enhanced the robustness of our analysis by marrying interview data with information from secondary sources. This included the extensive official Generasi grey literature, which detailed the specific technical details of the evaluation and program, and the Generasi RCT dataset (Olken et al., 2014a), which published the raw evaluation survey results (and STATA code used for data analysis) under an open data initiative by JPAL. Crucially, we also accessed publicly available administrative documents generated by the PNPM Support Facility. 11 These documents included meeting minutes, funding requests and financial reporting generated as a by-product of the administration of Generasi. A deep dive into these documents provided an invaluable resource for us to situate interview responses since our status as outside observers meant we were unable to conduct participant observation of the RCT itself.
RCTs, in theory
The ascendancy of RCTs in development has been accompanied by a spirited debate regarding their use in the academic literature. Centring around the work of JPAL affiliated economists, Abhijit Banerjee, Esther Duflo and Michael Kremer, the orthodox economic literature generally positions RCTs as the ‘gold standard’ of evaluation methodologies, placing their results at the top of the hierarchy of evidence to guide development policy. This includes Banerjee and Duflo’s (2011) popular book Poor Economics, which presents RCTs as a way to “radically rethink” the fight against poverty through a scientific focus on ‘what works’. The work of JPAL and its affiliates has inspired a vibrant literature that uses experiments to study poverty, comprising 902 RCT evaluations in 2017 alone (Donovan, 2018: 385). Their use is likely to continue to grow given the credibility and publicity associated with a Nobel prize.
An emergent and multidisciplinary heterodox literature is more critical of RCTs and their ‘gold standard’ status. Within economics itself, methodological shortcomings have been identified. Angus Deaton, also a Nobel winner, and Nancy Cartwright (2018) list a number of statistical misunderstandings regarding the capabilities of RCTs, such as beliefs that RCTs are intrinsically precise and unbiased. Perhaps the most prominent critique in this literature is external validity – that is, how well RCT results transfer over spaces and scales (Donovan, 2018: 45). Others criticise the limited use of RCT results in addressing current knowledge gaps in economics and the rigidity of RCT designs (Ravallion, 2018). This critical economic literature provides a thorough analysis of the technical contradictions of the RCT methodology, but fails to examine their consequences for the practical conduct of poverty reduction. As such, this paper extends these critiques by providing empirical evidence regarding how (or if) the theoretical shortcomings raised in the economic literature arose in Generasi and what impact this had for the program.
Within the critical social sciences, researchers have also engaged with the rise of RCTs. This literature generally argues against the decontextualisation of development problems that comes with ‘thinking small’ and the use of RCTs to further ideologically driven policies under the guise of scientific objectivity (Donovan, 2018; Souza Leão and Eyal, 2019; Woolcock, 2009). Our analysis builds on these insights by exploring the relational nature of poverty reduction and the ethical and political implications 12 of randomised evaluations.
In doing so, we follow a nascent economic geography literature that seeks to place RCTs into their political economic contexts. For example, Christian Berndt and colleagues have examined the role of RCTs in the neoliberalisation and marketisation of development (Berndt, 2015; Berndt and Boeckler, 2016; Berndt and Wirth, 2019). Berndt argues that the reframing of development into small technical questions ascribes the structural failings of the “market” to “market subjects” (Berndt, 2015: 569) and focuses development policies towards correcting supposed ‘irrational’ behaviours through paternalistic development policy. Webber and Prouse also analyse RCTs vis-à-vis neoliberalisation but instead find the methodology itself “somewhat more ambivalent” (2018: 49). In response they advocate disentangling the RCT methodology from the ends to which it is employed. Berndt (2015) also notes a need for more empirical research, as we attempt here, on how RCTs play out on the ground (see also Strauss, 2008; Webber, 2015).
RCTs also form a constituent part of what Peck and Theodore (2015) diagnose as ‘fast policy’ – that is the “social practices and infrastructures that enable the complex folding of policy lessons derived from one place into reformed and transformed arrangements elsewhere” (p. xvii). Peck and Theodore briefly discuss RCTs in their examination of PROGRESA/Oportunidades, 13 observing they acted as a “fast policy accelerant” (p. 228) by providing an air of objectivity to an inherently ideological program (p. 167; see also Webber and Prouse 2018). However, while Peck and Theodore only deal with RCTs tangentially, we make them our empirical focus. In addition, while Peck and Theodore – among many others – focus on the processes that enable the proliferation of fast policy, this paper focuses on the consequences of fast policy. That is, where Peck and Theodore may have focused on how an RCT was integrated into Generasi, here, we want to understand the effect of integrating an RCT into Generasi.
Implementing RCTs: Think small, spend big
We begin our empirical findings with an examination of the process of implementing the RCT methodology in practice. We find that implementation of the RCT was a mammoth undertaking, requiring substantial financial and technical resources. However, despite these resources, evaluators still struggled to satisfy the onerous demands of the RCT methodology, and this limited the rigour and validity of the RCT results. We note these issues are not specific to Generasi, but rather, are a function of the size and complexity of RCT evaluations.
The size and complexity of the Generasi RCT came across in interviews, eliciting epithets like “huge”, “elaborate” and “crazy”. The high financial cost this entailed was a consistent theme across interviews. An individual involved in the collection and analysis of data throughout the lifespan of the evaluation commented: “I think it’s quite unique, right, like this large RCT is … it shouldn’t be this large because it’s very expensive in terms of the funds that are needed to conduct this level of survey. We should aim for smaller and more effective size. We [should] still keep Impact Evaluation but not of this size. I think it’s enormous. We don’t endorse this big work in every evaluation – it will end up that you have to put a lot of money into evaluation [rather than poverty alleviation].”
Despite this substantial budget, the long-term evaluation of Generasi ultimately struggled to capture the causal impact of the program. Though not explicitly stated, this is evident in a close reading of the main result of the long-term evaluation: “Since 2009, the overall health and education environment in Generasi IE [Impact Evaluation] districts has improved dramatically, even in control areas” (World Bank, 2018b: 17). As one interviewee translates, this meant “the difference between the treatment and the control is now not big enough that [the RCT evaluator] can see a difference”. The Generasi RCT then struggled to detect the difference in outcomes between the treatment and control groups required to estimate the causal effect of the program.
This was a result of the small absolute and relative value of the causal effects the long term RCT was attempting to measure. The absolute value of Generasi grants decreased significantly from a high of IDR300 million (AU$30,000) per village in 2009 to approximately IDR75 million (AU$7,500) per village between 2010 and 2016, the period the long-term evaluation studied (World Bank, 2018b: 17). This meant the RCT was attempting to capture the impact of AU$2-3 per person, per year 14 on broad health and education indicators like malnutrition. Even with the resources and expertise available, isolating such a minor causal relationship proved a difficult task for evaluators.
The relative value of these grants was also minor compared to other development activities occurring in Generasi villages. This resulted in the presence of many ‘confounding variables’. 15 Generasi was the smallest of the 12 sub-programs in the broader PNPM umbrella program. While none of the other 11 programs directly targeted demand for frontline services, as Generasi did, they had strong indirect effects on Generasi outcomes. One such program, PNPM Rural, allocated villages AU$33 per person, per year (an order of magnitude higher than Generasi) for infrastructure such as clean water systems and public toilets (PNPM Support Facility, 2012: 16). 16 This infrastructure spending indirectly influenced the health and education outcomes Generasi was attempting to measure. For example, a road financed by PNPM Rural grants meant “villagers in Kalibentak, Blitar [village in East Java] are now better connected to the outside world, enabling them to access schools, markets, hospitals and other basic facilities” (ibid, p. 22). These activities acted as statistical ‘noise’ for evaluators attempting to isolate the causal effect of Generasi on health and education.
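Why such noise matters can be illustrated with a standard textbook approximation of the minimum detectable effect (MDE) for a two-group comparison of means. The sketch below is a simplification under stated assumptions: it treats observations as independent, whereas Generasi was randomised by kecamatan, which would make the true MDE larger. The standard deviations and sample size are invented and are not drawn from the Generasi data.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(sigma, n_per_group, alpha=0.05, power=0.80):
    """Smallest true difference in means detectable at the given power.

    Uses the usual normal approximation for two equally sized groups:
    MDE = (z_{1 - alpha/2} + z_{power}) * sigma * sqrt(2 / n).
    Clustering is ignored (a simplifying assumption).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    standard_error = sigma * sqrt(2 / n_per_group)
    return (z_alpha + z_power) * standard_error


# Hypothetical outcome spread before and after other programs add variation.
mde_low_noise = minimum_detectable_effect(sigma=10.0, n_per_group=1000)
mde_high_noise = minimum_detectable_effect(sigma=20.0, n_per_group=1000)
```

In this approximation, doubling the spread of outcomes doubles the smallest effect the study can reliably detect. A small grant whose effect sits below that threshold would register as no difference between treatment and control, even if the effect were real.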
This ‘noise’ was amplified by the onerous data collection requirements of the RCT methodology. The evaluation relied on field survey data for its analysis. One interviewee, who had significant experience with survey research in rural Indonesia, raised doubts about the veracity of data coming from these surveys: “Have you ever observed any of these questionnaires implemented in practice? I think it’s a bit of a fantasy to expect good data collection when the questionnaire is so detailed and so long and so – think about it for a second from the respondent’s point of view – seemingly repetitive. Because the questions are going for these fine shades of difference. … I know what it’s like when people are confronted with long questionnaires. I see people’s eyes glaze over, I see interest get lost, I see people giving answers that sound like saying something because something needs to be said.”
Analysing the large quantity of data coming out of these surveys, while maintaining the rigour required of the RCT methodology, also proved a challenging task. The quantitative analysis of the RCT results, for example, involved tens of thousands of lines of intricate statistical coding. 18 An interviewee provided one example of survey data being miscoded when merging survey answers across waves, resulting in evaluators finding causal impacts in the data which were not truly present in the field. 19 Left unchecked, this would have significantly skewed the estimate of causal effect and could have resulted in the RCT misreporting Generasi’s effectiveness. The interviewee's account suggested this error was detected by chance, rather than by any regular review process. Here we see one way in which seeking increased statistical rigour (through larger sample sizes and longer study lengths) can actually compromise the overall rigour of the evaluation's conclusions.
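The interview account does not describe the exact nature of the miscoding, so the following is only a generic illustration of the kind of routine check that can surface merge problems across survey waves. The record layout, the `id` field and the function name are hypothetical and are not taken from the published Generasi code.

```python
def merge_waves(baseline, followup):
    """Join two survey waves on respondent ID and report unmatched IDs.

    baseline, followup: lists of dicts, each expected to carry a unique 'id'.
    Raises ValueError on duplicate IDs instead of silently overwriting rows.
    """
    def index_by_id(rows, wave_name):
        indexed = {}
        for row in rows:
            if row["id"] in indexed:
                raise ValueError(f"duplicate id {row['id']!r} in {wave_name}")
            indexed[row["id"]] = row
        return indexed

    base = index_by_id(baseline, "baseline")
    follow = index_by_id(followup, "followup")
    matched = {i: (base[i], follow[i]) for i in base.keys() & follow.keys()}
    only_baseline = sorted(base.keys() - follow.keys())
    only_followup = sorted(follow.keys() - base.keys())
    return matched, only_baseline, only_followup


# Hypothetical records with an invented outcome field.
wave_1 = [{"id": 1, "enrolled": 0}, {"id": 2, "enrolled": 1}, {"id": 3, "enrolled": 0}]
wave_2 = [{"id": 2, "enrolled": 1}, {"id": 3, "enrolled": 1}, {"id": 4, "enrolled": 1}]
matched, only_baseline, only_followup = merge_waves(wave_1, wave_2)
```

Reporting unmatched and duplicated identifiers as part of every merge makes such errors visible as a matter of routine, rather than leaving their detection to chance.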
Despite devoting substantial economic skill and financial largesse to the Generasi RCT, the difficulties of implementing the evaluation in practice undermined the findings of the study. Amidst the everyday realities of survey implementation, coding challenges, confounding variables, and the expense of alleviating and measuring poverty in thousands of Indonesian villages, the RCT evaluation struggled to produce significant results. Despite unprecedented access to resources, the highly trained and distinguished evaluators were not able to recreate the theoretical rigour of the RCT methodology. But the issues with implementation illuminated in this specific case do not appear unique; indeed, the ‘noise’ of everyday life exists in all RCT evaluations. These findings highlight the tradeoff between statistical power and the practical disadvantages of implementing large and long RCTs.
Consequences of experimentation: The global south as laboratory
The practical implementation of RCTs challenges the supposed superiority of their findings. That the results were so limited and questionable, in turn, raises questions about the political and ethical implications of experimentation. As such, in this section we seek to understand what it means to turn the Global South into a ‘laboratory’ for policy experimentation. 20 We find the RCT evaluation of the Generasi program breached ethical principles used to guide randomised experimentation and reduced the efficacy of the program being studied. We draw on meta-studies of RCTs to demonstrate these represent systemic issues with how RCT evaluations are conducted, beyond our singular case study.
Proponents of RCTs respond to the inherent ethical concerns of randomly allocating aid to one group, but not another (Barrett and Carter, 2010), with the argument that “until we know policies work, we should not assume that the group being treated is better off” (Kaufman, 2016). This justification is modelled on the ethical principle of ‘equipoise’ in clinical randomised research, which requires “a state of genuine uncertainty on the part of the investigator regarding the comparative therapeutic merits of each arm in a trial. Should the investigator discover that one treatment is of superior merit, he or she is ethically obliged to offer [the control group] that treatment” (Freedman, 1987: 141; see also Rayzberg, 2019). In the development context, this is supplemented by an argument that existing shortages of resources often mean there are “natural ways to create a control group” (Kaufman, 2016). As one interviewee involved in the funding of Generasi described, “we’ll never have the resources to work in every single subdistrict. … [An RCT is] just another way of making a geographical decision [regarding where to allocate funds] and then learning something while you’re doing it”.
However, in the case of the Generasi RCT, the principles of both equipoise and ‘existing shortages’ were violated. For at least the period 2011–2017, implementers had both a genuine belief that Generasi was effective and the resources to provide Generasi to control villages. With respect to the principle of equipoise, the Generasi RCT consisted of two phases: a pilot evaluation conducted in the period 2007–2010 and a long-term evaluation in 2017. The results of the pilot evaluation concluded the program was effective: “After 30 months of program implementation, Generasi had a statistically significant positive impact on average across the 12 indicators it was designed to address” (World Bank, 2011: 4). As such, from at least 2011 (when these initial evaluations were completed), evaluators had statistical evidence, let alone genuine belief, the intervention was more effective than the control. Yet the experiment continued, with the program withheld from control villages, until the last wave of the RCT evaluation in 2017.
With respect to the existing shortages principle, a scale up proposal generated by Generasi’s bureaucratic organ – the PNPM Support Facility [PSF] – shows there were resources available to treat control villages from at least 2011 onwards: “With evidence that PNPM Generasi is improving priority health and education indicators, the GoI and PSF have committed to scaling up the program from 2010 onwards. A projected US$105 million will be provided through the PNPM Support Facility trust fund for this purpose” (PNPM Support Facility, 2011). These funds were not used to provide the program to control areas, however, but to expand the program to new provinces: by 2018 Generasi’s coverage had expanded from five to eleven provinces (Figure 1). In this case, the evidence gained from the experiment was used to secure funding to expand the program, but the control villages in the experiment were excluded from these funds.

Figure 1. Generasi grant coverage. Red areas represent provinces which were involved in the RCT evaluation. Blue areas represent provinces involved in the scale up of the Generasi program which did not participate in the RCT (PNPM Support Facility, 2016).
Moreover, because the experiment continued after evaluators were reasonably certain the program was effective, and when they had funds to treat the control group, the effectiveness of the program itself was compromised. This was particularly evident in the case of malnutrition, the area which saw the most significant gains during the Generasi pilot. As described in the World Bank (2011: 3) evaluation report: “[the] reduction in malnutrition was strongest in areas with a higher malnutrition rate prior to project implementation, most notably in the Nusa Tenggara Timur (NTT) Province”. This is unsurprising: NTT Province has the highest stunting rate across Indonesia. However, during the expansionary phase, the program was not rolled out into control areas in NTT, but instead to seven provinces with significantly lower rates of stunting (Titaley et al., 2019: 4). Under the randomised allocation mechanism, in which the treatment and control groups are functionally identical, evaluators could have been reasonably certain that control villages in NTT would have experienced the same strong improvements in malnutrition that had occurred in treatment villages – stronger than in the seven new provinces. This was also true for the program overall. The Generasi RCT was run in areas with particularly low baseline health and education indicators and the RCT evaluation found the program “had the greatest impact in [these] areas” (World Bank, 2011: 4). Sustaining the treatment and control distinctions, as such, directly reduced the effectiveness of the cash grants being studied.
The decision to exclude control villages from the scale up was overwhelmingly recognised as erroneous by interviewees. Actors close to the evaluation described: “I think it’s actually very sad, when we heard for the first time, everyone was very sad that the control area didn’t get the treatment”. Another was more damning: “they just left them for 10 years, not receiving the program while they’re politically eligible as citizens of Indonesia – that is not fair”. While interviewees recognised excluding control communities was a mistake, they were less definitive on why control communities were excluded. Interviewees tended to shift blame from their institution to others, or deny knowledge of decision-making. Interviewees suggested that the decision to not extend the treatment to control groups was most likely due to an oversight by project implementers rather than a conscious decision to maintain the randomised allocation for the longer-term study 21: “I think [Organisation A] also thought you should treat the control area. But I think it was just missed by everybody. … I think we found it was because [Organisation B] didn’t check about it”. Nonetheless, the evaluators did take advantage, after the fact, of the program expanding in a way that maintained the treatment/control distinction.
This oversight can be traced, at least partially, to the fact that the RCT was run without the knowledge of RCT communities and their elected local representatives. When asked if control communities were aware of the randomised allocation, a field evaluator responded:
No [they were not aware of this experiment] and we didn’t tell them that we visit their village because they [did] receive or [did] not receive the grant. I think that is also risky for us if we tell them we come to their village because they did not receive the program while the other village is receiving [the grant]. … I think, the government and project implementers intentionally hid that information from many people. Not just the local people. Because they will face so much trouble if they open up that information. Let alone to the politicians; this [would] be a good source of political movement [for local politicians to campaign on].
A qualitative evaluation of the program records at least one instance of this occurring. Grayman et al. (2018: 49) quote an informant in a control subdistrict: “Looking at the neighbouring subdistrict get support, such as posyandu [local health clinic] from Generasi, communities here also demanded the same thing from their village heads, who then pass[ed] it to the head of the subdistrict and the community empowerment agency”. As a result of this advocacy, the control subdistrict in question received additional financial assistance. Enhancing political accountability, through this mechanism, could have prevented the failure at the heart of the program expansion.
Running experiments without the consent of local communities is common practice. A meta-review of RCTs found only 46% of studies mention whether participants were aware they were part of a study, “78% of authors do not discuss informed consent, 12% state that participants were intentionally left ignorant, and 10% indicate informed consent for some sort of study. No study indicated whether participants were explicitly aware they were being experimented upon” (Peters et al., 2016). Hoffmann (2020) further disaggregates this data by region and finds participant awareness is discussed in 65% of experiments conducted in Europe and the United States, compared with only 34% of experiments conducted in Africa, Asia and Latin America. This is a troubling difference in ethical standard regarding informed consent considering Hoffmann also finds 85% of RCT evaluations were authored by researchers based in the Global North.
This secrecy, then, represents a systemic issue with the way RCTs are implemented. While the Generasi program was rooted in the model of ‘community driven development’, 22 the Generasi evaluation ultimately excluded control communities and their elected representatives from key development decisions. This lay at the heart of many of the issues raised in this section: excluding local communities from the decision-making process in Generasi meant key decisions were instead relegated to the foreign technocrats implementing the RCT. These foreign technocrats erroneously excluded close to one million people in control communities from Generasi funds and, in doing so, ultimately did not look out for the best interests of the mostly poor and rural Indonesian citizens they were studying.
Using RCT results: As good as gold?
We conclude our empirical findings with an examination of the practical application of RCT results in poverty policy. We find several factors constrained the usefulness of the Generasi RCT results in practice. The narrow, and isolated, scope of the RCT results did not provide the information policymakers required. Because the RCT prioritised scientific rigour and objectivity, it was rigid, restrictive and unable to adapt to a changing policy environment. Again, we argue these reflect inherent issues with RCTs. This, in turn, raises questions about the supposed ‘gold standard’ superiority of RCTs for testing policy effectiveness.
The Generasi program, including its RCT evaluation, wrapped up in 2018. The results of the evaluation are currently being used to inform Indonesia’s broader anti-poverty strategy – primarily the rollout of ‘Village Law’, a massive decentralisation initiative which provides direct cash transfers to all Indonesian villages. In a manner similar to Generasi, the allocation of these grants is managed by recipient villages. The effectiveness of the Generasi program on stunting is also being studied to inform NatStrat 23 and INEY, 24 two key pieces of Indonesia’s anti-stunting strategy.
However, interviewees noted that, despite the expense and expertise afforded to Generasi evaluators, the results have not been particularly useful for this task. The most salient factor constraining the policy applicability of the evaluation was the so-called ‘black box problem’ (Webber and Prouse, 2018). The results of the Generasi RCT evaluation examined if Generasi worked, but did not provide any information on how it worked:
When I came in and was [using the results to design policy], the question I was looking at was not should we scale up, because that wasn’t feasible. The question that I had was which parts of Generasi are important, which parts of Generasi should we try and institutionalise and replicate? [But] the RCT was centred around this question of: ‘is this program good or not?’ I think in hindsight it would have been interesting to tweak the project to understand what parts of it were the most effective and worth [transferring to other projects] rather than just answering the yes/no question on the project.
This issue arose because the Generasi RCT only measured the Average Treatment Effect (ATE) of the intervention – i.e. the difference in average outcomes between the treatment and control groups. The ATE provides policymakers with an estimate of the effectiveness of a program as a whole. This is useful when a policy is being applied wholesale: “[for the scale up, the RCT] was the right thing to do, the question [we were asking then] was: should we scale up this pilot? The RCT could answer this question. … [And] the positive results from the first round of evaluation were critical to getting the donors to agree to put in another 120 mil [sic] or so”. For the scale-up, the ATE was able to provide evidence on the efficacy of Generasi and thereby attract funding from donor partners for rolling out the program more broadly. However, as critics have long noted, the ATE only measures effect rather than cause (Heckman and Smith, 1995) and this is of limited use when formulating future policy. Anti-poverty policies are rarely transferred wholesale in practice and are more often mélanges of best practice models gathered by policymakers and then adapted and mixed to reflect the specific context in which they are applied (Webber, 2015: 44). For this purpose, policymakers are more interested in knowing which components of a program are effective, and the causal mechanisms behind this effect. The ATE of the Generasi program, as noted by the interviewee, was not useful for this purpose.
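The ATE described above can be made concrete with a minimal sketch. The outcome scores below are hypothetical values chosen for illustration, not Generasi data, and the function name is our own. The point is that the estimate collapses the whole program into a single number, which is why it says little about which components drove the effect.

```python
from statistics import mean


def average_treatment_effect(treated, control):
    """Difference in mean outcomes between treatment and control groups."""
    return mean(treated) - mean(control)


# Hypothetical village-level outcome scores (illustrative only).
treated_outcomes = [0.62, 0.58, 0.71, 0.65]
control_outcomes = [0.55, 0.54, 0.60, 0.59]

ate = average_treatment_effect(treated_outcomes, control_outcomes)
print(f"Estimated ATE: {ate:.2f}")
```

Under these assumed values the treatment group averages 0.64 and the control group 0.57, so the estimated ATE is 0.07. Nothing in that figure distinguishes the contribution of, say, the grants from that of the targets.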
In contrast, multiple interviewees noted that it was ultimately the qualitative evaluations of the program which were able to get into the ‘black box’ and provide the sort of rich and contextually specific information required for policymaking. An interviewee provided one example:
In the original [RCT] study we had the locations that had the grants and then we had locations that didn’t. Initially the performance grant locations performed better [than the control areas] but by the second round they had converged. … What the [2013] qualitative [evaluation] made clear [was that] the villages themselves didn’t understand the bonus. The incentive of getting the grant wasn’t the main driver of why the performance grant mechanism worked. It was more the targets helping villages prioritise and perform better in order to reach the targets, and the fact they were being compared to other villages … Based on those findings when we scaled up [another program], we didn’t advocate for implementing the performance grant. … We focused on non-financial incentives.
The success of the qualitative report was set amongst a broader support for methodological eclecticism in evaluation from Generasi actors. Interviewees consistently expressed the sentiment that RCTs were only “one tool in a toolkit”, which could not provide answers for every development question. This included a senior executive at JPAL, who noted:
Not everything can be answered by randomisation, but if it can answer [a development question], it’s a powerful tool. … If the RCT will not be powerful, we are also open to [partner] governments [that an RCT may not be appropriate]. Sometimes governments, after their [RCT] training, they are very excited to do RCTs, they want to randomise everything. Then we say you can’t randomise everything. … Randomised evaluation is the second thing because it is [just] a methodology. What we want to actually promote is rigorous research.
Interviewees also questioned the length of time taken to conduct RCT evaluations and the rigidity of their research questions:
You design it, and now you have to stick with it over all these years and you really can’t introduce too many changes otherwise you mess up the control that you’re trying to measure over time. But when you start realising something else might be going on that we didn’t anticipate or something that we can’t even capture in our survey, the RCT becomes very brittle. It can’t help you anymore.
Despite initial enthusiasm, those involved in Generasi were surprisingly vocal in their scepticism of RCTs. One interviewee who worked in managing poverty reduction programs felt this reflected a broader shift in development: “For a while there was a period where people thought it was a gold standard. Now I feel like perceptions [are] starting to change a little a bit on RCTs”. When asked to speculate on the factors behind this change, the interviewee listed many of the issues discussed throughout our analysis:
They don’t always answer the question that policymakers have. They take a long time. They’re very rigorous but on a narrow set of questions. Sometimes it’s not what the policymakers are looking for and, particularly given the time taken, if it was a question we wanted [answered] at the design stage, by the time you get the results … the policy landscape has changed; it becomes a bit outdated. Also, in general in Indonesia … [policymakers] know what the problem is, [but] they want to know how to fix it. It is more and more the government’s own spending and systems [budgets and apparatuses] rather than these [foreign donor funded] projects [like Generasi]. I think, in my view, RCTs – for someone … [who] wants to know ‘how’ – they’re not necessarily the right tool for answering those questions.
Think big
This paper reveals a number of serious shortfalls with the application of the RCT methodology; namely, that it was financially costly and ethically ambiguous, yet provided comparatively little benefit to inform poverty reduction. At a minimum, these findings show that RCTs should be used more judiciously, and alongside other evaluation methodologies, rather than as a ‘gold standard’ to which all development evaluations should aspire. More broadly, these findings show that RCTs are not benign hypothesis testing instruments, but rather, ‘operative environments’ (Petryna, 2010: 57) which factually alter the pursuit of poverty reduction. We see in our examination of Generasi that the entanglement of evaluation and implementation negatively altered the very anti-poverty program researchers were aiming to improve, without yielding significant policy-applicable insights.
If RCTs are to remain a central tool of development evaluation, our findings show at least three ways they can be improved. First, evaluators should place a greater emphasis on mixed methods approaches when designing impact evaluations. This could be done, for example, by using an initial qualitative evaluation to hypothesise potential causal pathways, and then using an RCT to test these different hypotheses. Second, evaluators should try, as far as possible, to reduce the scale, and therefore cost and complexity, of RCT evaluations. Stratifying the randomisation of aid delivery around characteristics of interest to evaluators presents one avenue to reduce scale while mitigating reductions in statistical power (Ravallion, 2018: 17). Third, evaluators should not run RCTs in secret from local communities and politicians. This could introduce practical difficulties as control communities may push back against participating in evaluations while receiving no benefit. However, these could be mitigated by utilising a ‘phase in/pipeline’ design (e.g. Miguel and Kremer, 2004), where those in the control group are prioritised as more funds become available.
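As a rough illustration of the stratified randomisation suggested above, the sketch below splits villages into treatment and control within each stratum rather than across the whole sample. The village identifiers, the stratum labels and the function name are hypothetical, and an even split within each stratum is an assumption made for simplicity; it is not a description of how the Generasi allocation was carried out.

```python
import random
from collections import defaultdict


def stratified_assignment(stratum_by_village, seed=0):
    """Randomly split the villages of each stratum into treatment and control."""
    rng = random.Random(seed)  # fixed seed so the allocation is reproducible

    # Group villages by the characteristic of interest (the stratum).
    strata = defaultdict(list)
    for village, stratum in stratum_by_village.items():
        strata[stratum].append(village)

    assignment = {}
    for stratum in sorted(strata):
        members = sorted(strata[stratum])
        rng.shuffle(members)
        half = len(members) // 2
        for village in members[:half]:
            assignment[village] = "treatment"
        # In a phase-in design these villages would receive the program later.
        for village in members[half:]:
            assignment[village] = "control"
    return assignment


# Hypothetical villages, stratified by baseline stunting level.
villages = {
    "V1": "high", "V2": "high", "V3": "high", "V4": "high",
    "V5": "low", "V6": "low", "V7": "low", "V8": "low",
}
assignment = stratified_assignment(villages)
print(assignment)
```

Because the split happens inside each stratum, both arms contain the same number of high-stunting and low-stunting villages by construction, which is what allows a smaller sample to retain balance on the characteristic of interest.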
Beyond improving the manifestations of RCTs in practice – an important exercise due to the power and influence this methodology holds in development economics and policy – there remain broader implications of RCTs holding monopoly influence on poverty reduction approaches. Most saliently, the ‘think small’ approach advocated by proponents of RCTs marks a retreat from addressing the broader entrenched and interconnected problems that lie at the heart of long-term poverty reduction (Chang, 2013; Chernomas and Hudson, 2019).
In the Indonesian context, thinking small ignores the structural and systemic causes of poverty. These include those that can be traced back to the extractive colonial systems of the Dutch East India Company – Vereenigde Oostindische Compagnie (VOC). The VOC, the world’s first multinational corporation, extracted incredible wealth through the export of natural resources (primarily spices) from nominally independent Indonesian client states (Fathimah, 2018). This wealth fuelled a ‘golden age’ in the Netherlands, while most of the Indonesian colonial subjects remained in abject poverty (Oostindie, 2003). Though colonialism is ostensibly gone, the current global economic system reproduces many of its extractive characteristics. RCTs, meanwhile, provide little insight into these macro-level drivers of uneven development.
For example, resources and goods flow from the Global South, while the ultimate benefit from these resources is still largely concentrated in a few financial centres in the Global North (Sheppard, 2012). This is seen in Indonesian trade today, for example in palm oil exported by foreign multinationals (Warburton, 2017). A focus on RCTs and aid effectiveness cannot theorise arrangements that would provide the mostly poor communities which resource and manufacture objects of western consumption with a liveable share of the wealth from their economic activity.
By conducting this resource extraction and heavy manufacturing in the Global South, globalisation also exports the environmental degradation resulting from western consumption to poor communities (Jorgenson et al., 2009). In Indonesia, the production of palm oil has caused widespread air and water pollution, deforestation and biodiversity loss (Hayashi, 2007; Wilcove and Koh, 2010). The response of RCT proponents has been to randomly evaluate which policies are most effective at incentivising Indonesian palm oil farmers to plant native trees (Rudolf et al., 2018). In this manner, RCTs shift the point of failure from structural issues to the behaviour of individuals in the Global South (Berndt, 2015).
An overreliance on RCTs to guide poverty reduction then risks neglecting the root causes of geographically uneven development. Replacing these systems and structures is essential to effectively and justly reduce poverty over the long term. As Banerjee and Duflo note when advocating for small thinking, it is difficult to answer these ‘big idea’ questions definitively (Banerjee and Duflo, 2011: 3). Nonetheless, addressing a big problem like poverty necessitates big thinking and big ideas, messy and difficult though these big answers may be.
Footnotes
Acknowledgements
We would like to thank the Department of Geography at Universitas Indonesia for their invaluable help with fieldwork, especially Nurul Sri Rahatiningtyas, who went above and beyond to assist. Thanks also to the generous reviewers (including for the new title). We would also like to express our deep gratitude to all interview respondents for their participation, time and insights.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the University of Sydney.
