Abstract
Background:
Even a well-designed randomized control trial (RCT) can produce ambiguous results. This article highlights a case in which full-sample results from a large-scale RCT in the United Kingdom differ from results for a subsample of survey respondents.
Objectives:
Our objective is to ascertain the source of the discrepancy in inferences across data sources and, in doing so, to highlight important threats to the reliability of the causal conclusions derived from even the strongest research designs.
Research design:
The study analyzes administrative data to shed light on the source of the differences between the estimates. We explore the extent to which heterogeneous treatment impacts and survey nonresponse might explain these differences. We suggest checks that assess the external validity of survey-measured impacts, which in turn provide an opportunity to test the effectiveness of different weighting schemes in removing bias. The subjects were 6,787 individuals who participated in a large-scale social policy experiment.
Results:
Our results are not definitive but suggest that nonresponse bias is the main source of the inconsistent findings.
Conclusions:
The results caution against overconfidence in drawing conclusions from RCTs and highlight the need for great care to be taken in data collection and analysis. Particularly, given the modest size of impacts expected in most RCTs, small discrepancies in data sources can alter the results. Survey data remain important as a source of information on outcomes not recorded in administrative data. However, linking survey and administrative data is strongly recommended whenever possible.
The primary virtue of random assignment as a method of evaluating public policy interventions is that it allows for causal inferences owing to the strong internal validity of the design. Because the treatment and control groups are defined at random, statistical equivalence in average characteristics—both measured and unmeasured—is ensured by the design. As several commentators have pointed out, however, issues in data collection can undermine the ability of a randomized control trial (RCT) to estimate the true impact of an intervention (see Barnow & Greenberg, 2014). Data collection problems are exacerbated by the fact that the impacts an RCT is designed to detect are sometimes small, and relatively small levels of bias introduced by data collection practices can alter the overall conclusions.
There are two reasons why impact estimates based on survey data may differ from those based on administrative data. The first, which is considered in more detail in this issue by Moore (in press), is that the two data sources may differ in what they measure. This may be because they capture qualitatively different concepts or because they differ in the imperfections with which they measure the same concept. For example, when considering earnings impacts, survey and administrative data may differ in the range of employment types for which earnings are recorded. The second reason is that survey respondents may differ in some way from the full sample. This article focuses on this second reason. Noncomparability in characteristics across samples is a more fundamental issue than noncomparability of outcomes recorded by administrative data and survey data, in the sense that it raises potential concerns over the causal basis of the RCT. If the RCT can be viewed as supporting causal inference among the survey-respondent subgroup for administrative outcomes, it gives more confidence when estimating impacts on outcomes that differ in their definition from those in the administrative data or, more generally, do not exist in the administrative data.
When considering differences between estimates based on the respondent sample and those based on the full sample we explore two potential explanations. The first is that there are no treatment-control differences in survey response, but that respondents are more likely than the full sample to have those characteristics associated with a higher impact. In this case, estimates based on survey respondents can sometimes still be viewed as causal, just for a particular subgroup, so the difference from the full sample estimates is explained by treatment effect heterogeneity. The second explanation is that there may be treatment-control differences in survey response due to unobserved characteristics. If these unobserved characteristics are correlated with outcomes, they will influence the estimated impact, which can no longer be regarded as causal. This is the case of differential selection into the sample of survey respondents.
The main contribution of this article is to investigate the potential of administrative data for testing the validity of estimates based on survey data. Where administrative data contain a primary outcome and can be linked to survey data, the estimated impact on the primary outcome for survey respondents can be compared to the full sample estimate of the primary outcome. If these estimates agree, or perhaps can be reconciled through appropriate weighting, we can be more confident that nonresponse does not undermine the ability of survey data to provide causal impact estimates for other outcomes.
The analysis in this article is based on the United Kingdom (UK) Employment Retention and Advancement (hereafter, ERA) Demonstration. ERA was the largest social trial of its kind in Britain. The last wave of survey data, which was intended to be a main source of data for the final evaluation, produced impact estimates that were substantially larger than those obtained using administrative data for the full sample. This article provides a detailed account of the administrative and survey data used in the ERA evaluation and shows how the estimated impacts on earnings for survey respondents are much higher than those for the full sample when earnings measures taken from the administrative data are used to derive the estimated impacts for both samples. We consider the question of whether these higher impacts are due to treatment effect heterogeneity or to differential selection bias and, while not conclusive, suggest that the latter is the more likely explanation. As a broader point, our analysis shows that, had the survey been relied upon as the main source of data, our understanding of the effectiveness of UK ERA would have been substantially overstated, assuming the administrative data fully cover the entire experimental sample and have accurate data. It is hoped that by exposing vulnerabilities in data collection the article will highlight a number of issues to guard against in collecting data for future RCTs.
The ERA Evaluation Design
ERA tested the effectiveness of a new method of improving the labor market prospects of low-income people relying on various government cash transfers. Hendra et al. (2011) provide a full account of ERA and the context within which the evaluation took place. Here, we outline the main features of ERA.
ERA operated in six regions of Britain from 2003 through 2007. For this article, we focus on one of the three target groups of ERA, namely out of work single parents on welfare 1 who volunteered for the New Deal for Lone Parents (NDLP) welfare-to-work program. 2
Under ERA, individuals received preemployment welfare-to-work assistance from Jobcentre Plus, the public employment service in the UK. The design of ERA allowed for a 9-month preemployment period. Those who found work became eligible for postemployment services. These included a combination of (caseworker-provided) advice and financial incentives to remain employed and advance in work. Participants who entered and remained in full-time work received substantial cash bonuses (covering up to 24 months of employment), help paying for training courses, and cash rewards for completing training while employed. Under ERA, caseworkers had access to a fund to help avert minor financial emergencies that threatened to prevent a participant from continuing to work. All support under ERA lasted a maximum of 33 months after randomization.
Since there was a limited number of available slots, ERA was implemented as an RCT demonstration, meaning that individuals who volunteered for the program were assigned at random—regardless of their background characteristics—to a treatment group that was enrolled in ERA or to a control group that was not enrolled in ERA. The control group continued to receive the standard NDLP services as well as any other services normally available to them. Individuals were recruited when they came into Jobcentre Plus offices. Caseworkers recorded basic demographic information and informed individuals of the possible advantages of participating in the ERA program. The caseworkers then invited them to enter the demonstration “lottery,” told them they had a 50% chance of being selected for the program and asked them to sign an informed consent form.
Enrollment of families into the experiment lasted a little over a year. Using the background information collected just prior to randomization, 3 the characteristics of the treatment and control groups can be compared in order to assess how well randomization worked. The first two columns of Table 1 relate to the full experimental sample (subsequent columns will be discussed later). From these columns, it is clear that randomization succeeded in creating two groups that, within sampling variability, are observationally equivalent. The assumption then is that they are also likely to be similar with regard to unobserved characteristics, allowing differences between the ERA group and the control group to be viewed as unbiased estimates of the causal impact of ERA eligibility.
Table 1. Descriptive Statistics for Full Sample and Wave 3 Fielded and Respondent Samples.
Note. ERA = Employment Retention and Advancement.
As described in Hendra et al. (2011), the evaluation used outcomes taken from both administrative data and survey data. In that report, however, the survey results were deemphasized because the impacts for the survey sample were significantly stronger than the impacts on the same outcomes for the full sample. These divergent findings motivate this article and make it appropriate to consider the details of the survey in order to understand selection into the respondent sample.
The Office for National Statistics (ONS) carried out the survey, using administrative records of benefit receipt to help update survey contact information (Ashton & Portanti, 2011). One concern might be that the treatment itself influenced the probability of response to the survey, for instance, if greater contact with bonus recipients resulted in their records being more up-to-date. However, it is not clear how often bonus payment information was used in practice to update contact records. 4 Another possibility is that the financial incentives had a “priming effect” in which an increased likelihood to receive payments for meeting one condition (e.g., working stably) makes one more likely to seek an incentive for another behavior (e.g., filling out a survey). 5 Although speculative, this pattern of higher than expected survey responses among recipients of program-related financial incentives has been seen in other studies. 6 A related possibility is that recipients of bonuses felt a sense of obligation to the program which increased their propensity to respond to the survey.
Method
This article uses simple estimation approaches to explore its key questions. These are briefly summarized in this section. In addition, the data are described and their strengths and weaknesses are critically assessed.
Estimation Approach
We ran a series of regression analyses using the administrative data to estimate the extent to which differences in impacts on the primary earnings outcomes for the Wave 3 (60-month) survey-respondent sample and a random sample drawn from administrative records (the “fielded” sample, defined below) can be attributed to several possible sources. Impact models were run for both groups. These models had the following specification:
$$Y_i = \alpha + \beta P_i + \delta X_i + \varepsilon_i \tag{1}$$
where $Y_i$ is the administrative data outcome measure for sample member $i$, $P_i = 1$ for treatment group members and 0 for control group members, $X_i$ is a set of background characteristics for sample member $i$, $\varepsilon_i$ is a random error term for sample member $i$, $\beta$ is the estimate of the impact of the program on the average value of the outcome, $\alpha$ is the intercept of the regression, and $\delta$ is the set of regression coefficients for the background characteristics. 7 Several logistic regressions were also run to predict treatment status or survey response status. These regressions had the following specifications:
In Equation 3, $R_i$ is a dummy variable which indicates survey response status: $R_i = 1$ for respondents and 0 for nonrespondents. We also used these regressions to create inverse probability weights discussed later in this article.
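A standard logit form consistent with the description of Equations 2 and 3 is sketched here. The coefficient symbols $\gamma_0$, $\gamma$, $\theta_0$, $\theta$, and $\lambda$ are assumptions introduced for illustration, as is the inclusion of $P_i$ in the response model (suggested by the indicator of research group included in the response regression reported later); they are not the authors' notation.

$$\Pr(P_i = 1 \mid X_i) = \frac{\exp(\gamma_0 + \gamma X_i)}{1 + \exp(\gamma_0 + \gamma X_i)} \tag{2}$$

$$\Pr(R_i = 1 \mid X_i, P_i) = \frac{\exp(\theta_0 + \theta X_i + \lambda P_i)}{1 + \exp(\theta_0 + \theta X_i + \lambda P_i)} \tag{3}$$

Under this form, an inverse probability weight for respondent $i$ based on background characteristics would be $w_i = 1 / \widehat{\Pr}(R_i = 1 \mid X_i)$, where the fitted probability comes from a response model of this kind.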
Data Sources
The ERA evaluation reported in Hendra et al. (2011) used both administrative records and survey data. Administrative data originated from the UK Department for Work and Pensions’ (DWP’s) Work and Pensions Longitudinal Study (WPLS) database. This database has grown in importance as a resource for program evaluations (see Dorsett, Smeaton, & Speckesser, 2013, for an example of another application to the case of a labor market experiment). It provides information on welfare spells (durations and amounts), employment spells, and tax year earnings. 8 A key advantage of these data relative to survey data is that they are available for the full experimental sample.
Within the WPLS, administrative data on welfare receipt and amounts are taken from DWP’s payment records and are generally regarded as accurate and reliable. Administrative records on employment and earnings in the WPLS originate from the UK tax department (Her Majesty’s Revenue and Customs [HMRC]) and are derived from three forms: P14—employers submit this form at the end of each tax year, showing earnings and taxes for each employee. The form covers both employees still with the employer and those who left during the tax year. P45—This form has multiple parts. Employers are required to submit one part to HMRC when an employee leaves. The form gives details of the leaving date, earnings in the tax year, and the amount of income tax deducted from earnings. The departing employee keeps other parts of the form and must give it to his or her next employer. P46—Employees without a P45 (perhaps because they have not had a previous job or because they are starting a second job) are required to submit a P46 to HMRC. The P46 also provides HMRC with the date of starting employment.
The employment and earnings data require quite substantial cleaning before they are suitable for analysis. For instance, precise start or end dates of employment spells are not always available. Where it is known that a job started (or ended) in a given tax year, but not the precise date, this is recorded on the system as 6 (or 5) April, the first (or last) day of the UK tax year. To improve upon this, part of the data processing for the official evaluation (Hendra et al., 2011) was to randomly impute such dates within the relevant tax year. In fact, the range for the imputed dates was further narrowed by other available data, such as the file date and the dates of benefit spells. Imputation was used extensively since about one fifth of all employment spells were missing start dates.
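The imputation step described above can be sketched as follows. This is a minimal illustration rather than the evaluation's actual procedure: it assumes a uniform draw over the admissible dates, and the function names and the `earliest`/`latest` bounds (standing in for information such as benefit spell dates or the file date) are hypothetical.

```python
import random
from datetime import date, timedelta


def tax_year_bounds(d):
    """Return (first_day, last_day) of the UK tax year containing date d."""
    start_year = d.year if (d.month, d.day) >= (4, 6) else d.year - 1
    return date(start_year, 4, 6), date(start_year + 1, 4, 5)


def impute_date(recorded, rng, earliest=None, latest=None):
    """Replace a placeholder date with a uniform random draw.

    The draw covers the tax year containing `recorded`, optionally narrowed
    by `earliest`/`latest` bounds taken from other records (hypothetical
    inputs, e.g. a benefit spell end date or a file date).
    """
    low, high = tax_year_bounds(recorded)
    if earliest is not None:
        low = max(low, earliest)
    if latest is not None:
        high = min(high, latest)
    if low > high:
        raise ValueError("bounds leave no admissible date")
    return low + timedelta(days=rng.randrange((high - low).days + 1))


rng = random.Random(0)  # fixed seed so the imputation is reproducible
placeholder_start = date(2007, 4, 6)  # start date known only to the tax year
imputed = impute_date(placeholder_start, rng, earliest=date(2007, 9, 1))
```

Narrowing the range with other records matters because a purely uniform draw over the tax year could place a job start inside a period when the person is known to have been on benefit.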
In addition, there may be inconsistencies arising from forms not being submitted or being incorrectly completed. When individuals change employer or hold multiple jobs simultaneously, there is scope for disagreement in recorded dates or earnings. Furthermore, submission of forms is only required for employees earning above the tax threshold. Despite this, some employers will submit forms for all workers, regardless of their pay, perhaps because batch processing of forms for their higher earning workers means that it is more efficient to treat lower earning workers the same way. In addition, these forms do not capture self-employment and self-employed earnings. The same applies of course to informal work.
In addition to the administrative data, ONS carried out a survey approximately 12 months after the individual’s date of random assignment, again at their 24-month anniversary, and finally at their 60-month anniversary. The survey was administered by phone or in person to slightly less than half of the sample of those treatment and control group members randomized between December 2003 and November 2004. 9 The key advantage of the survey data over administrative data is that they provide information that was tailored to the case of ERA. They provide much richer data than the administrative records and allow individuals’ experiences with ERA to be assessed as well as key outcome information not otherwise observed—wages, hours of work, type of job, and so on.
Administrative Records and Surveys: Advantages and Disadvantages
Large RCTs of public policy interventions often rely heavily on administrative records to quantify the difference that a policy or program makes on key outcomes such as earnings, test scores, or public assistance receipt (see Riccio et al., 2013, for a typical example). The strengths of administrative data are well-known and include wide coverage, no recall bias, and low marginal costs of data collection.
A disadvantage of administrative records is that they typically do not cover all jobs or public assistance. For example, in the U.S. context, state records will not have information on what happens outside the state, and employment records will not have information on jobs in the informal sector (Kornfeld & Bloom, 1999). While the conventional wisdom suggests that undercoverage in administrative records should be equivalent across study groups in an RCT, it is easy to imagine cases where undercoverage can interact with intervention strategies to produce bias (Barnow & Greenberg, 2014; Yang & Hendra, in press).
Surveys are also an important data source for many RCTs because they provide information that administrative records and other data fail to capture. Without survey data, it would be difficult to quantify program dosage, or the extent to which a person actually engaged with a program. 10 These data also provide valuable insight on certain behaviors, beliefs, program experiences, participant or household characteristics, and other issues that may influence outcomes observed in administrative records. In addition, summarized earnings data from administrative records can be better understood with survey data, which provide information about work schedules, rates of pay, and job changes. Finally, in many domains, administrative records are not available and some evaluations have to depend almost completely on surveys (see Lundquist et al., 2014; Banerjee, Duflo, Glennerster, & Kinnan, 2015, as examples of studies in which only survey data were available for key outcomes).
Unlike administrative records, which are obtained for the full study sample, survey data are relatively expensive to collect. Typically, surveys attempt to collect information from a subset of the full sample, often with the expectation that they will represent the full sample. When a survey fails to be representative (through nonresponse, for example), it is considered biased. Traditionally, the main safeguard against survey bias has been to obtain a high survey response rate. Recent work has shown, however, that a high response rate is no guarantee of survey data quality, and it is not hard to find examples of surveys with high response rates that are afflicted by nonresponse bias (e.g., Nuñez, Verma, & Yang, 2015). Several studies have shown that nonresponse bias does not vary substantially with response rates (Groves, 2006; Groves & Peytcheva, 2008). Internal research covering 16 surveys, conducted as part of the U.S. Employment Retention and Advancement study, found no correspondence between survey response rates and survey response bias. The implication of these findings is that high survey response rates do not guarantee that survey-based results will generalize to the full sample.
A nonrepresentative survey sample presents an issue of external validity. With an RCT, if nonresponse affects the treatment and control groups equally, the resulting estimates can still be regarded as causal. The issue in that case is that the results apply to the selected sample of respondents and may not hold for the full population. A more serious issue arises when there is a difference in the response behavior of the treatment and control groups. This differential response behavior can result in treatment group respondents having different characteristics from control group respondents, so that treatment-control differences in outcomes can no longer confidently be attributed to the program. In other words, differential nonresponse has the potential to undermine the internal validity of a randomized trial.
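The distinction between the two cases can be made concrete with a small worked example. The sketch below uses entirely hypothetical numbers (two types of sample member, their control outcomes, impacts, and response probabilities) and computes the expected respondent-sample estimate analytically rather than by simulation.

```python
def respondent_estimate(groups, resp_treat, resp_ctrl):
    """Expected treatment-control difference in means among respondents.

    groups: list of (population_share, control_outcome, impact) per type.
    resp_treat / resp_ctrl: response probability per type in each arm.
    """
    def respondent_mean(outcomes, resp):
        total_weight = sum(share * r for (share, _, _), r in zip(groups, resp))
        return sum(share * r * y for (share, _, _), r, y
                   in zip(groups, resp, outcomes)) / total_weight

    treated = [y0 + impact for _, y0, impact in groups]
    control = [y0 for _, y0, _ in groups]
    return (respondent_mean(treated, resp_treat)
            - respondent_mean(control, resp_ctrl))


# Two hypothetical types, half the sample each: (share, control outcome, impact).
groups = [(0.5, 4000.0, 0.0), (0.5, 8000.0, 600.0)]
full_sample_impact = sum(share * impact for share, _, impact in groups)

# Case 1: response differs by type but not by arm (heterogeneity only).
same_by_arm = respondent_estimate(groups, [0.5, 0.8], [0.5, 0.8])

# Case 2: the high-earning type responds more often when treated.
differential = respondent_estimate(groups, [0.5, 0.8], [0.5, 0.6])
```

In the first case the respondent estimate differs from the full-sample impact but still equals the average impact among respondents, so it remains causal for that subgroup. In the second case the estimate exceeds the impact of every type in the sample, which no form of treatment effect heterogeneity could produce; the excess is selection bias.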
Results
Estimated Impacts Using Administrative Data
As described earlier, the main focus of this article is on the extent to which estimated impacts on outcomes from the same data source differ between the subsample of survey respondents and the full experimental sample. To examine the effects of using different samples, we must consider outcomes from the administrative data because this is the only source available for both survey respondents and nonrespondents.
Table 2 shows impacts on earnings as recorded in the administrative data for the 2007–2008 tax year and the 2008–2009 tax year. The “fielded sample”—that is, those individuals for whom a survey was attempted—shows an impact of £343 which is not statistically significant. The fielded sample was drawn from those randomized between December 2003 and November 2004, while the intake period for the experiment ran from October 2003 to December 2004. Also, the sampling fraction varied by region, resulting in the fielded sample having a different geographic distribution from the full sample. For these reasons, we would not expect the fielded sample results to necessarily agree with the full sample results (also reported in Table 2).
Table 2. The Estimated Impact of ERA for Different Samples, Using Administrative Records on Earnings.
Note. t statistics in parentheses. Asterisks indicate statistical significance of the estimates: * significant at the 90% level, ** significant at the 95% level, *** significant at the 99% level. Estimates control for region, cohort, sex, age, qualifications, number of months employed in the 3 years before randomization, number of months on welfare in the 2 years before randomization and whether their youngest child is under the age of 5 at randomization. ERA = Employment Retention and Advancement.
The impact for the respondent sample is much larger (£623 for Wave 3 respondents) and is statistically significant. 11 To address the question of whether the estimate for the respondent sample differs significantly from that of nonrespondents, we augmented the regression for the fielded sample to include one dummy variable indicating response at Wave 3 and another dummy constructed as the interaction of the response dummy with the ERA dummy. The interaction term had a p value of .097, indicating that the estimated impact for respondents differs from the impact for nonrespondents at the 10% significance level. For 2008–2009 earnings, the impact (£320) is again higher than the impact estimated for the fielded sample (£40) but is not statistically significant. Furthermore, estimating the augmented regression again indicates that the difference in impacts between the respondent and nonrespondent sample is not statistically significant at the 10% level (p value of .134).
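The augmented regression can be sketched as follows. This is a minimal illustration under stated assumptions: the earnings figures are hypothetical, the baseline covariates are omitted, and no standard errors or p values are computed, so it shows only how the interaction coefficient captures the difference in impacts. The `ols` helper is written for this example and is not taken from the evaluation.

```python
def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    # Build the augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


# Hypothetical records: (ERA dummy P, Wave 3 response dummy R, earnings).
data = [(0, 0, 3000.0), (0, 0, 5000.0), (1, 0, 4000.0), (1, 0, 4200.0),
        (0, 1, 5000.0), (0, 1, 7000.0), (1, 1, 6500.0), (1, 1, 6700.0)]

# Columns: intercept, ERA dummy, response dummy, ERA x response interaction.
design = [[1.0, p, r, p * r] for p, r, _ in data]
earnings = [e for _, _, e in data]

intercept, impact_nonresp, resp_gap, impact_diff = ols(design, earnings)
```

With this parameterization, the coefficient on the ERA dummy is the impact among nonrespondents, and the interaction coefficient is the amount by which the respondent impact exceeds it. Testing whether the interaction differs from zero is what the reported p values address.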
The Nature of Nonresponse in the Survey Data Collected for the Evaluation
The main potential problems with survey data are that nonresponse can harm external validity and treatment-control differences in nonresponse can harm internal validity. Table 3 shows that there were 6,787 individuals in the full sample. Of those, 2,995 were selected to be in the fielded sample. Interviews were achieved for 87%, 77%, and 62% of the fielded sample in Waves 1, 2, and 3, respectively (i.e., 1, 2, and 5 years postrandomization). We note that response rates are higher among the ERA group than the control group and return to this point below.
Table 3. Survey Response Rates.
Note. ERA = Employment Retention and Advancement.
As already noted, overall sample nonresponse can harm external validity. In other words, the achieved survey sample may not be representative of the population from which it is drawn. Unless we make the assumption that impacts do not vary across individuals, survey nonresponse means that ERA may affect respondents differently from the full sample. Table 1 provides an indication of the extent to which the background characteristics (observed at the time of randomization) of the fielded sample differ from those of the full sample. For the reasons already given above, we see that there are differences in the geographic distribution of individuals and also in the distribution of randomization timings. In other regards, the fielded sample looks rather similar to the full sample. Table 1 also shows the background characteristics of those individuals who responded to the Wave 3 survey (including both treatment and control group members). In the absence of systematic differences in response, Wave 3 respondents should resemble the fielded sample.
Table 4 highlights the differences between survey respondents and nonrespondents. Since some of the characteristics seemingly influencing response may be correlated (for example, education and weekly earnings), logistic regression is used to determine which differ across respondents and nonrespondents while taking other characteristics into account. Table 4 shows the results of regressing an indicator of response status on the characteristics shown in Table 1, as well as an indicator of research group, in order to better understand the process affecting response. The “odds ratio” column captures the association of each characteristic with the odds of responding to the survey; asterisks denote the significance level of these relationships.
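As a reminder of how an odds ratio is read, the sketch below computes an unadjusted odds ratio from a two-by-two table of response by group. The counts are hypothetical, and the odds ratios in a multivariable logistic regression additionally adjust for the other characteristics in the model.

```python
def odds_ratio(resp_a, nonresp_a, resp_b, nonresp_b):
    """Odds of responding in group A relative to the odds in group B."""
    return (resp_a / nonresp_a) / (resp_b / nonresp_b)


# Hypothetical counts: group A responds at 65%, group B at 60%.
unadjusted_or = odds_ratio(650, 350, 600, 400)
rate_ratio = (650 / 1000) / (600 / 1000)
```

An odds ratio above 1 indicates higher odds of response in group A. It is not the same quantity as the ratio of response rates, and the two diverge further as response rates move away from zero.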
Table 4. Descriptive Statistics for Wave 3 (60-Month) Respondent and Nonrespondent Samples, and Odds Ratios From a Logistic Regression of Survey Response.
Note. Asterisks indicate statistical significance of the estimates: *Significant at the 90% level, **Significant at the 95% level, ***Significant at the 99% level. ERA = Employment Retention and Advancement.
Survey respondents differ from nonrespondents in several characteristics. Those who, at baseline, were from Wales, those who were unmarried and living alone, and those with no qualifications were less likely to respond. These differences suggest that the survey sample may not be representative of the fielded sample. Selective response due to a nonrepresentative survey sample can result in different impacts if the sample that is more likely to respond has a different pattern of impacts compared to the full sample. Impacts from the selected sample may still be internally valid and therefore provide valid causal estimates but, with treatment effect heterogeneity, these impacts will not generalize to the full fielded sample.
More worrisome is the possibility that attrition results in estimated impacts that are no longer internally valid. Attrition bias arises when treatment group respondents differ from control group respondents with regard to unobserved characteristics correlated with outcomes. This includes the possibility that the program itself may influence survey response (for reasons described earlier in this article). Table 4 indeed shows that those in the treatment group are more likely to respond than those in the control group. 12 While not a sufficient condition for internal validity to be undermined—it is straightforward to see that a randomly lower response rate among the control group will not bias impact estimates—it raises a note of caution. Thus, if nothing else, differential nonresponse serves to reduce the credibility, or face validity, of the experimental design.
A common practice when assessing whether survey nonresponse may have introduced bias is to compare baseline characteristics of respondents in the treatment and control groups. Just as differential nonresponse is not a sufficient condition for bias, neither is balanced response a sufficient condition for unbiasedness (Barnow & Greenberg, 2015). In other words, having equivalent response rates by research group does not preclude the possibility of compositional differences under the surface. Again, Table 1 is informative and suggests, at first glance, that treatment and control group respondents are similar. To explore treatment-control comparability, we estimated a logistic regression to determine the extent to which baseline characteristics could predict whether a respondent was a member of the treatment group (among Wave 3 respondents only). Table 5 shows that none of the baseline characteristics is statistically significant as a predictor. In other words, among the survey-respondent sample, there are no differences between treatment and control group respondents in these background characteristics.
Table 5. Baseline Characteristics as a Predictor of Treatment Status, Among Wave 3 Survey Respondents.
Note. Asterisks indicate statistical significance of the estimates: *Significant at the 90% level, **Significant at the 95% level, ***Significant at the 99% level. ERA = Employment Retention and Advancement.
Attempts to Reconcile Earnings Impacts Estimated on the Respondent Sample With Those Estimated on the Fielded Sample
Tests of response bias often focus on baseline data. However, when suitable administrative data are available, it is also informative to investigate differences in outcomes subsequent to baseline. One pattern of note in ERA was that treatment group members who worked stably were disproportionately likely to respond to the survey compared to stably employed control group members. This differential survey response is consistent with the fact that respondents tended to have higher earnings than nonrespondents. The extent of this differential survey response varied across treatment status; in the control group, mean income among nonrespondents was 84% of the mean income for respondents while, in the treatment group, this fell to 72% (these percentages were stable across both 2007–2008 and 2008–2009 earnings). Furthermore, even when administrative data are not available for both research groups, it may be possible to explore how response to the survey is correlated with program take-up among the treatment group. With ERA, treatment group respondents were over 7 percentage points more likely to receive the work retention bonus compared to treatment group nonrespondents.
Reflecting these findings about differences in employment stability and bonus receipt outcomes across samples, we explored several reweighting strategies intended to bring the survey-respondent sample earnings impact estimate into alignment with the fielded sample impact estimate. The results of these efforts are summarized in Table 6. The first two rows show again, for convenience, the estimated impacts for the fielded and respondent survey samples. The next row gives the results of a conventional weighting strategy using weights based on the inverse of the probability of responding conditional on a set of background characteristics. This conventional weighting strategy had little effect on aligning earnings impacts across the samples (the estimated impact is £546). It is not surprising that this approach was ineffective given the finding (discussed earlier) that there was no observable bias based on background characteristics. Nonetheless, it is still noteworthy that the common approach of using weights defined on the basis of background characteristics does little to bring the respondent sample impact estimates closer to the fielded sample impact estimates.
Table 6. Exploring Weighting Approaches to Reconcile 2007–2008 Earnings Impacts Across Fielded and Respondent Wave 3 (60-Month) Samples.
Note. Asterisks indicate statistical significance of the estimates: *Significant at the 90% level. **Significant at the 95% level. ***Significant at the 99% level. Estimates control for region, cohort, sex, age, qualifications, number of months employed in the 3 years before randomization, number of months on welfare in the 2 years before randomization and whether their youngest child is under the age of 5 at randomization. ERA = Employment Retention and Advancement.
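The conventional weighting strategy described above can be illustrated with a minimal sketch. It assumes the response probability is estimated as the response rate within cells defined by baseline characteristics, which is a simplification: the article does not state how the probabilities were estimated. The function name, the `qualified` field, and the toy sample are all hypothetical.

```python
from collections import defaultdict


def inverse_response_weights(records, cell_keys):
    """Weight each respondent by 1 / (response rate in their cell).

    Cells are defined by the baseline characteristics named in cell_keys.
    Nonrespondents get weight 0.0 because they contribute no survey outcome.
    """
    fielded = defaultdict(int)
    responded = defaultdict(int)
    for rec in records:
        cell = tuple(rec[k] for k in cell_keys)
        fielded[cell] += 1
        if rec["responded"]:
            responded[cell] += 1

    weights = []
    for rec in records:
        cell = tuple(rec[k] for k in cell_keys)
        if rec["responded"]:
            # Inverse of the estimated response probability for this cell.
            weights.append(fielded[cell] / responded[cell])
        else:
            weights.append(0.0)
    return weights


# Hypothetical fielded sample (numbers invented for illustration):
# 4 people with qualifications, of whom 2 respond,
# and 6 people without, of whom 2 respond.
sample = (
    [{"qualified": True, "responded": True}] * 2
    + [{"qualified": True, "responded": False}] * 2
    + [{"qualified": False, "responded": True}] * 2
    + [{"qualified": False, "responded": False}] * 4
)

weights = inverse_response_weights(sample, cell_keys=["qualified"])
```

Each respondent stands in for the fielded members of their cell, so the weights sum to the fielded sample size. Weights of this kind can only correct for nonresponse that is related to the characteristics used to define the cells.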
We also attempted nonexperimental weighting strategies that control for postrandomization outcomes. Analyses of RCTs are usually careful to condition only on prerandomization treatment-control differences since randomization itself ensures that unobserved characteristics balance postrandomization. Controlling for postrandomization outcomes runs counter to prerandomization control and represents a significant departure from standard practice. However, as illustrated in this article, nonresponse can undermine the statistical properties of an RCT to the extent that the basis for causal interpretation of estimated impacts is eroded. In such a scenario, it may be appropriate to consider weighting individuals according to their outcomes. In the case of ERA, such an approach is successful in reconciling the impacts estimated for respondents with those for the fielded sample.
As discussed above, survey respondents were disproportionately more likely to receive the employment retention bonus. Weighting based on a combination of baseline characteristics and bonus receipt rates brought the survey-respondent sample impact estimate into approximate alignment with the fielded sample impact estimate (£394 compared to £343). It was also noted above that treatment group members who worked stably were more likely to respond to the survey compared to control group members who worked stably. Weighting based on employment stability also brought the impact estimate for the survey-respondent sample much closer to the impact estimate for the fielded sample (£334 compared to £343). 13
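The logic of weighting on a postrandomization outcome can be sketched with a toy example. Everything here is an assumption made for illustration: the data are invented, the cells are defined only by research group and a hypothetical `stable` employment flag, and earnings are held constant within each cell so that reweighting recovers the fielded impact exactly. With real data the alignment would be approximate at best, as in the estimates reported above.

```python
from collections import defaultdict


def impact(records, weights=None):
    """Treatment-control difference in (weighted) mean earnings."""
    if weights is None:
        weights = [1.0] * len(records)
    total = defaultdict(float)
    weight_sum = defaultdict(float)
    for rec, w in zip(records, weights):
        total[rec["treatment"]] += w * rec["earnings"]
        weight_sum[rec["treatment"]] += w
    return total[1] / weight_sum[1] - total[0] / weight_sum[0]


def outcome_cell_weights(fielded, respondents):
    """Inverse response rate within research group by stability cells."""
    n_fielded = defaultdict(int)
    n_responded = defaultdict(int)
    for rec in fielded:
        n_fielded[(rec["treatment"], rec["stable"])] += 1
    for rec in respondents:
        n_responded[(rec["treatment"], rec["stable"])] += 1
    return [
        n_fielded[(rec["treatment"], rec["stable"])]
        / n_responded[(rec["treatment"], rec["stable"])]
        for rec in respondents
    ]


def make_cell(treatment, stable, n_fielded, n_responded):
    """Build one invented cell; the first n_responded members respond."""
    earnings = 10.0 if stable else 2.0
    return [
        {"treatment": treatment, "stable": stable,
         "earnings": earnings, "responded": i < n_responded}
        for i in range(n_fielded)
    ]


# Stably employed treatment group members respond at the highest rate,
# mirroring the direction of the pattern described in the text.
fielded = (
    make_cell(1, True, 5, 4) + make_cell(1, False, 5, 2)
    + make_cell(0, True, 4, 2) + make_cell(0, False, 6, 3)
)
respondents = [rec for rec in fielded if rec["responded"]]

fielded_impact = impact(fielded)
respondent_impact = impact(respondents)
reweighted_impact = impact(
    respondents, outcome_cell_weights(fielded, respondents)
)
```

In this toy sample the unweighted respondent impact is well above the fielded impact, and the cell weights bring the two back into line. The weights require the postrandomization outcome to be observed for the whole fielded sample, which is what the administrative data provide.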
Conclusion
To summarize the evidence presented in this article, UK ERA was a well-executed social experiment and there is every indication that two statistically equivalent groups were obtained from the random assignment procedure. A survey carried out 5 years postrandomization achieved a response rate of 62%; 64% for the treatment group and 60% for the control group. As documented in Hendra et al. (2011), the estimated earnings impact using survey data for the survey-respondent sample was greater than the estimated earnings impact for the full sample using administrative data, raising the question of how to interpret this difference. Access to administrative data allows an assessment of the degree and nature of nonresponse bias that would not otherwise be possible and we have explored this in this article. Had the survey been the only source of earnings data, the estimated impacts would have overstated the effectiveness of the program, assuming the administrative data represent the truth about the sample. 14
Also important is that the nonresponse appears to bias the estimated earnings impact despite the treatment and control groups in the respondent sample being similar with regard to observed characteristics. Demonstrating such similarity is often used as part of the evidence to argue that impacts estimated on respondent subsamples retain their causal interpretation. However, the results in this article demonstrate that this is not a sufficient condition, and they raise the possibility that postrandom assignment factors, including differential access to the research groups or even aspects of the intervention, can cause bias that would not be evident from examining characteristics at baseline.
It might still be the case that the impacts estimated for the respondent sample are causal, but that impacts are heterogeneous in the population and are different for respondents compared to nonrespondents. In line with this, the results confirm that respondents and nonrespondents have different characteristics. If impact heterogeneity were the sole reason for the difference in estimated impacts, one would expect that reweighting the respondent sample to resemble the full sample using background characteristics would bring estimated impacts closer. Attempts to do this were unsuccessful, so we conclude that respondents differ from nonrespondents in some unobserved way. 15 It is still conceivable that the impacts estimated for the respondent sample are causal. This would rely on there being an unobserved characteristic that was positively correlated with both survey response and impacts. However, an alternative possibility is that the estimated impact for the respondent sample is no longer causal due to unobserved treatment-control differences. We have no obvious way of distinguishing between these two scenarios. However, the fact that we have no strong theory to suggest the former is the case leads us to regard the latter as being the more likely explanation.
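The reasoning behind the reweighting check can be written out as a short sketch. The notation is introduced here and is not the article's: $X$ denotes observed background characteristics, $R = 1$ indicates survey response, $\Delta(x)$ is the impact for individuals with $X = x$, and $\Delta_F$ and $\Delta_R$ are the fielded and respondent sample impacts. The key assumption is that, conditional on $X$, the treatment-control contrast among respondents still equals $\Delta(x)$.

```latex
\begin{align*}
\Delta_{F} &= \sum_{x} P(X = x)\,\Delta(x), \\
\Delta_{R} &= \sum_{x} P(X = x \mid R = 1)\,\Delta(x), \\
w(x) &= \frac{P(X = x)}{P(X = x \mid R = 1)}
      = \frac{P(R = 1)}{P(R = 1 \mid X = x)}, \\
\sum_{x} P(X = x \mid R = 1)\,w(x)\,\Delta(x) &= \Delta_{F}.
\end{align*}
```

Under that assumption, weights proportional to the inverse of the response probability given $X$ would recover $\Delta_F$ from the respondent sample. That such weights did not close the gap suggests the assumption fails, which is consistent with respondents differing from nonrespondents in some unobserved way.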
Our findings highlight the usefulness of administrative data for exploring the reliability of impacts estimated using survey data. Of course, for outcomes available in administrative data, there may be no need to rely on the survey-respondent subgroup. However, the real value of such an exploration derives from its implications for the analysis of outcomes not present in administrative data. An important reason for carrying out surveys is to collect information that is not available elsewhere. If respondent sample results can be satisfactorily reconciled with full sample results for an outcome available in the administrative data, we can be more confident that impacts on outcomes only available in the survey data can be credibly estimated. 16 As an aside, we note that the issues discussed here are not unique to experiments. It is, however, true that experiments make the potential problems more visible.
Recent methodological developments provide some hope of dealing with possibly biasing nonresponse. A simulation study by Puma, Olsen, Bell, and Price (2009) shows that implementing multiple imputation to fill in missing survey data using information from administrative records can substantially address missing survey data problems. In addition, development of estimators suited to the case of nonrandom subsamples remains a live issue in econometric theory (d’Haultfoeuille, 2010; Ramalho & Smith, 2013). Furthermore, improvements in survey data weighting, notably through the use of so-called survey paradata, have shown some progress in improving alignment across the data sources. 17
While these approaches might help, they inevitably complicate the estimation of impacts and reduce the transparency that is an attractive feature of RCTs. Indeed, some approaches rely on assumptions that imply it is no longer appropriate to regard the resulting estimates as truly experimental. A pragmatic alternative is to employ an ensemble strategy: use multiple data sources to estimate impacts, and aim for a clear understanding of how well any particular data source can capture the true impact.
Footnotes
Acknowledgments
We are grateful to the editors of this special edition, Burt Barnow and David Greenberg, to the editor, Jacob Klerman, and to three anonymous reviewers for their helpful comments.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Dorsett acknowledges support from the Economic and Social Research Council (Grant Number ES/J003581/1).
