Abstract
Background:
Impact evaluations draw their data from two sources, namely, surveys conducted for the evaluation or administrative data collected for other purposes. Both types of data have been used to estimate program impacts. This is an introductory essay to a Special Issue entitled “Do the Estimated Effects of Social Programs Depend on the Source of Data Used to Measure Them? Survey Data Versus Administrative Data.” In addition to this essay, the Special Issue contains six articles, which appear in Volume 42, Issue 5–6 (October–December 2018) and in this issue (Volume 43, Issue 5 (October 2019)) of Evaluation Review.
Objective:
To describe and summarize each of the six papers and draw lessons from them. The papers investigate the relative strengths and weaknesses of survey and administrative data for estimating the impacts of policy interventions.
Results:
This essay first describes a simple model of the mechanisms that can cause impacts estimated with survey data to differ from those estimated with administrative data. It then describes and summarizes each of the papers appearing in this Special Issue and uses the model described to interpret the findings when it is applicable. The final section draws general lessons from the papers.
Conclusions:
The decision on whether to use survey or administrative data to estimate program impacts can be highly consequential because the estimates can differ considerably. All the papers in this Special Issue point to the importance of using both survey data and administrative data whenever possible.
The six papers in this Special Issue of Evaluation Review all investigate the relative strengths and weaknesses of survey and administrative data for estimating the impacts of policy interventions. 1 The papers vary along several dimensions. For example, five involve policy interventions tested by random assignment experiments, while one is a quasi-experiment. The interventions themselves are varied: the formal education system, incentives for encouraging formal education, training programs that take place after formal education is complete, and financial incentives and services for entering and retaining employment. The interventions take place in three countries (the United States, the United Kingdom [UK], and Canada), and the survey and administrative data were collected in those same countries. Perhaps most importantly, the papers differ in how they use the two data sources. Four of the studies directly compare impact estimates produced by survey data with those produced by administrative data, examining whether the impacts differ and why they may differ. One uses survey data to mimic administrative data and thereby determine whether the two data sources will produce similar results. One uses administrative data to examine the importance of obtaining a high survey response rate in conducting impact analyses.
Our interest in the relative merits of using survey and administrative data for impact analysis stems from our work on a recent article published in this journal (Barnow & Greenberg, 2015). That paper investigated differences in estimates of impacts on earnings when both survey and administrative data are used to estimate program impacts in random assignment experiments. In doing this, we developed a simple model of the mechanisms that can cause impacts estimated with survey data to differ from those estimated with administrative data. Then, drawing parameters from previous experiments that used data from both sources, we simulated the circumstances that would cause the differences in impact estimates. In this essay, we use this model when it is applicable to help interpret the findings of the six papers in this Special Issue.
The following section presents a summary of this model. The next section briefly describes and summarizes each of the papers appearing in this Special Issue and uses the model described below to interpret the findings when it is applicable. The final section draws general lessons from the papers.
Our Model
The Barnow–Greenberg model assumes that earnings and employment tend to be overstated in surveys and understated in administrative data. It also assumes that employment and earnings tend to be higher among survey respondents than among nonrespondents. These assumptions are consistent with the evidence we present in Barnow and Greenberg (2015) for low-income persons, who were the target population of the experiments we examined there. As will be seen, however, they do not necessarily hold for other target populations or other outcomes, although this does not invalidate the usefulness of the model. As shown below, our model can be summarized in two equations, one focusing on reporting errors that result in bias and the other on survey nonresponse bias. 2 Assuming the absence of survey nonresponse bias, the absolute difference between a survey-based and an administrative-based impact is:
where T is the treatment group, C is the control or comparison group,
Assuming the absence of reporting bias, the absolute difference between the survey- and the administrative-based impact, which is also the nonresponse bias in the survey data, is equal to
where Rj is the survey response rate and fj is the ratio from the administrative data of the value of Ej for nonresponders to the value of Ej for responders. If RT = RC = Rj and fT = fC = fj, then Equation 2 reduces to
Viewing Each Paper Through the Lens of the Model
Table 1 presents an overview of each paper by indicating the types of policy interventions that were evaluated, the outcome variables of interest, the population targeted by the interventions, the administrative and survey data used in the evaluations of the interventions, and the key findings. Each paper is discussed below. Four papers that compare impact estimates produced by survey data with those produced by administrative data are discussed first, followed by a paper that uses administrative data to determine the importance of achieving high survey response rates, and finally by a paper that uses national survey data to examine potential weaknesses in using state administrative data to estimate the returns to higher education.
Table 1. Overview of the Papers.
Note. UI = unemployment insurance; ITA = Individual Training Account; WIA = Workforce Investment Act.
All dollar figures are US dollars.
Yang and Hendra
Yang and Hendra (2018) compare survey-based impact estimates with administrative data impact estimates, relying on administrative data from state unemployment insurance (UI) systems for two experimentally tested programs: Family Rewards and Work Rewards. Both programs provided financial incentives for work and for participating in other activities. Yang and Hendra focus mainly on impacts on employment, although they briefly examine earnings impacts as well. Although the survey-based impacts are positive and statistically significant for both programs, the UI data-based impacts are much smaller and statistically insignificant. As shown in Table 1 of Barnow and Greenberg (2015), similar patterns in other welfare-to-work experiments are fairly common.
Yang and Hendra investigate the reasons for these differences, which differ between the two programs. Family Rewards mostly increased employment in jobs that are not covered by UI wage record data, particularly because the program encouraged participants to become self-employed. Income from self-employment is not captured by UI wage record data, which points to the importance of conducting surveys when employment resulting from an intervention is likely to be disproportionately in jobs not covered by the available administrative data. 4 The UI-based impact estimates for Work Rewards were also subject to some bias due to noncoverage of self-employment, although this was less important than in Family Rewards. In addition, Yang and Hendra find that the survey-based estimates were subject to nonresponse bias. Importantly, they find that survey response was highly correlated with the likelihood of receiving a financial award.
As suggested by the estimates appearing in Table 2, these findings are consistent with the model presented in the first section of this article. On the one hand, the response rate (R) is high for the Family Rewards program, and more importantly, the ratio of employment for nonrespondents to respondents (f) is around 1, suggesting that nonresponse bias is unlikely to be important. Furthermore, while R and f are both unbalanced, this imbalance is in opposite directions and similar in magnitude, and thus to some extent offsetting. On the other hand, both R and f are well below 1 in the case of the Work Rewards program. Moreover, f is considerably larger for controls than for the treatment group, causing nonresponse bias to be larger. However, this is offset, but only partially, by the slightly higher R for the treatment group than for the control group. These differences between the treatment and the control groups possibly occur because survey response was correlated with receiving a financial award and receiving an award required working. Thus, responders in the treatment group are likely to include proportionately more workers than nonresponders in the treatment group.
Table 2. Key Parameters From the Administrative and Survey Data for Yang and Hendra (2018).a
Note. T = treatment group; C = control group.
aThe estimates in the first three rows are based on only survey respondents. The estimates in the last two rows are based on the full sample.
Dorsett, Hendra, and Robins
The UK Employment Retention and Advancement (ERA) Demonstration was run on a pilot basis in six locations within the UK and was evaluated as a random assignment experiment using both survey data and administrative data obtained from national government tax records. The program provided members of the target group financial incentives for work and education and also provided casework intended to help participants obtain jobs and advance in them. The administrative-based earnings impacts for the full sample were much smaller and less often statistically significant than the survey-based earnings impacts for survey respondents. Focusing on one of the three groups targeted by the experiment, participants in the UK’s New Deal for Lone Parents, the paper by Dorsett, Hendra, and Robins (2018) examines the source of this difference in some detail, concluding that it is most likely attributable to nonresponse bias.
This conclusion is consistent with the Barnow–Greenberg model.
Dorsett, Hendra, and Robins attempted to eliminate nonresponse bias by weighting the observations. They found that using inverse probability of response weights does not correct the problem, but that weighting by post-random assignment (post-RA) outcomes performs better. They note, however, that this approach risks introducing new biases because the weights are based on endogenous variables.
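The idea behind inverse probability of response weighting can be illustrated with a minimal sketch. It is not the procedure used by Dorsett, Hendra, and Robins; it assumes a simple weighting-class approach in which the response probability is estimated as the observed response rate within each cell defined by research group and a baseline stratum. The field names (`group`, `stratum`, `responded`, `survey_earnings`) and the data are hypothetical.

```python
from collections import defaultdict

def response_weights(records):
    """Give each respondent a weight of 1 / (response rate in its cell).

    A cell is (research group, baseline stratum). Nonrespondents get weight 0.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [respondents, total]
    for r in records:
        cell = (r["group"], r["stratum"])
        counts[cell][1] += 1
        if r["responded"]:
            counts[cell][0] += 1
    weights = []
    for r in records:
        n_resp, n_total = counts[(r["group"], r["stratum"])]
        weights.append(n_total / n_resp if r["responded"] else 0.0)
    return weights

def weighted_impact(records, weights, outcome="survey_earnings"):
    """Weighted treatment-control difference in mean outcomes among respondents."""
    def weighted_mean(group):
        pairs = [(w, r[outcome]) for r, w in zip(records, weights)
                 if r["group"] == group and w > 0]
        return sum(w * y for w, y in pairs) / sum(w for w, _ in pairs)
    return weighted_mean("T") - weighted_mean("C")

# Hypothetical data: survey earnings are unobserved (None) for nonrespondents.
records = [
    {"group": "T", "stratum": "low",  "responded": True,  "survey_earnings": 100},
    {"group": "T", "stratum": "low",  "responded": False, "survey_earnings": None},
    {"group": "T", "stratum": "high", "responded": True,  "survey_earnings": 300},
    {"group": "T", "stratum": "high", "responded": True,  "survey_earnings": 320},
    {"group": "C", "stratum": "low",  "responded": True,  "survey_earnings": 80},
    {"group": "C", "stratum": "low",  "responded": True,  "survey_earnings": 120},
    {"group": "C", "stratum": "high", "responded": True,  "survey_earnings": 280},
    {"group": "C", "stratum": "high", "responded": False, "survey_earnings": None},
]
weights = response_weights(records)
impact = weighted_impact(records, weights)
```

Weights of this kind can only correct for nonresponse that is explained by the variables defining the cells, which is one reason such an adjustment may fail to remove the bias.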
Moore, Perez-Johnson, and Santillano
Individual Training Accounts (ITAs) were used to deliver training vouchers to program participants under the Workforce Investment Act (WIA, which has been replaced by the Workforce Innovation and Opportunity Act). To assess alternative versions of the voucher, the U.S. Department of Labor funded a random assignment evaluation, which was conducted by Mathematica Policy Research. One version, the “Status Quo,” approximated the ITA operations at the time of the evaluation and thus serves as the control situation, while another structure for the vouchers, “structured choice,” required additional counseling, allowed counselors to constrain choices as to types of training undertaken, and doubled the potential maximum size of the voucher. 7 The impact on earnings of the structured choice voucher versus the status quo voucher was estimated with both UI data and survey data. The paper by Moore, Perez-Johnson, and Santillano (2018) attempts to determine the factors that cause the survey data-based impact estimate, which is statistically significant, to be much larger than the administrative data–based impact estimate, which is not statistically significant. In addition, they examine why the level of reported earnings differs between the two data sources. In this summary, we focus on their findings concerning impacts.
Moore, Perez-Johnson, and Santillano first make a case that survey nonresponse bias explains little of the difference between the survey- and administrative-based impact estimates. Two facts support this: The response rate was identical for the two voucher groups at 68%, and the ratio of nonrespondent earnings to respondent earnings (f) is actually slightly higher for the structured choice group than for the status quo group (.75 vs. .72), which reduces nonresponse bias a bit. Because the values of R and f are well below unity, there would be some balanced nonresponse bias, but there would be no unbalanced nonresponse bias and, as previously indicated, the latter is potentially much more important than the former. Given the apparent unimportance of nonresponse bias, most of the analysis abstracts from it by examining only those observations that responded to the survey.
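The role of R and f in this argument can be illustrated with a short calculation. The sketch below uses only the definitions of R and f: If respondents have mean earnings m and nonrespondents have mean earnings f × m, the full-sample mean is m × [R + (1 − R) × f], so the respondent mean exceeds the full-sample mean by the factor 1 / [R + (1 − R) × f]. This is an accounting identity offered for illustration, not a reproduction of Equation 2, and the function name is our own.

```python
def respondent_to_full_ratio(R, f):
    """Ratio of the respondent mean to the full-sample mean.

    R is the response rate; f is the ratio of nonrespondent mean earnings
    to respondent mean earnings, both measured without reporting error.
    """
    return 1.0 / (R + (1.0 - R) * f)

R = 0.68  # response rate reported for both voucher groups
ratio_structured_choice = respondent_to_full_ratio(R, 0.75)
ratio_status_quo = respondent_to_full_ratio(R, 0.72)

print(f"structured choice: {ratio_structured_choice:.3f}")
print(f"status quo:        {ratio_status_quo:.3f}")
```

Under these values, respondent earnings overstate full-sample earnings by roughly 9% to 10% in both groups, and slightly less in the structured choice group, which is why the small difference in f works against an upward bias in the respondent-based impact.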
Presumably, if the difference in impact estimates from the two data sources is not due to nonresponse bias, it results from reporting errors. That these cause bias is implied by the fact that the
To investigate reporting bias, Moore, Perez-Johnson, and Santillano develop a decomposition framework that allows them to attribute differences between the data sources either to differences in whether individuals are reported as employed or, if they are reported as employed in both data sets, to differences in their reported earnings. A key finding from this framework is that 63% of the difference in impacts between the two data sources is attributable to differences in reported earnings among the employed, with the remainder due to differences in reported employment rates. This suggests that jobs that are not covered by the UI system and interstate mobility, which causes individuals with jobs to be missed in the UI data, are perhaps less important defects in UI data than is sometimes suspected. Indeed, further investigation by the authors found that only some, and far from all, of the difference in impacts due to differences in reported employment rates can be attributed to these factors. Some (but again far from all) of the difference in impacts attributable to differences in reported earnings among the employed was attributed by the authors to individuals having multiple jobs with at least one job not covered by the UI data. Moore, Perez-Johnson, and Santillano also present evidence that some survey respondents overstated their earnings or some employers understated the earnings of some workers in their reports to the UI system. They had no way of telling, however, whether respondent overreporting or employer underreporting was more important. Most underreporting of earnings by employers, though, appears to result from employers’ failure to acknowledge the presence of certain employees, often because they treat them as independent contractors when they are arguably employees, rather than from understating the earnings of those they do acknowledge (Abraham, Haltiwanger, Sandusky, & Spletzer, 2013; Blakemore, Burgess, Low, & St. Louis, 1996; Hotz & Scholz, 2001). Thus, employer underreporting would mainly contribute to differences in employment rates between the two data sources.
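A decomposition of this general kind can be sketched as a simple accounting split. The code below is not the authors’ framework. It assumes that a person counts as employed in a data source when earnings in that source are positive, assigns each person’s survey-minus-administrative earnings gap to an "earnings" component if the person is employed in both sources and to an "employment" component otherwise, and then differences the group averages. Field names and data are hypothetical.

```python
def decompose_impact_gap(records):
    """Split (survey impact - admin impact) into two additive components.

    Returns (employment_component, earnings_component).
    """
    per_group = {}
    for group in ("T", "C"):
        members = [r for r in records if r["group"] == group]
        n = len(members)
        earnings_part = 0.0    # gaps for people employed in both sources
        employment_part = 0.0  # gaps for people whose status differs
        for r in members:
            gap = r["survey"] - r["admin"]
            if r["survey"] > 0 and r["admin"] > 0:
                earnings_part += gap
            else:
                employment_part += gap
        per_group[group] = (employment_part / n, earnings_part / n)
    employment = per_group["T"][0] - per_group["C"][0]
    earnings = per_group["T"][1] - per_group["C"][1]
    return employment, earnings

def impact(records, source):
    """Treatment-control difference in mean earnings for one data source."""
    def mean(group):
        vals = [r[source] for r in records if r["group"] == group]
        return sum(vals) / len(vals)
    return mean("T") - mean("C")

# Hypothetical survey respondents with earnings from both sources.
records = [
    {"group": "T", "survey": 500, "admin": 400},
    {"group": "T", "survey": 300, "admin": 0},    # job missing from admin data
    {"group": "T", "survey": 0,   "admin": 0},
    {"group": "C", "survey": 450, "admin": 420},
    {"group": "C", "survey": 0,   "admin": 0},
    {"group": "C", "survey": 200, "admin": 200},
]
employment, earnings = decompose_impact_gap(records)
gap = impact(records, "survey") - impact(records, "admin")
```

By construction the two components sum to the gap between the survey-based and administrative-based impacts, so each can be expressed as a share of the total.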
To the extent that the authors succeed in explaining the causes of the difference in impact estimates between the two data sources, most of the difference seems attributable to what is reported as earnings in the UI data (e.g., the exclusion of self-employment, out-of-state earnings, and under-the-table earnings). However, none of these explanations suggest why the
Ford, Grékou, Kwakye, and Hui
The paper by Ford, Grékou, Kwakye, and Hui (2018) on the Canadian Future to Discover Project differs in several important respects from the three papers summarized above. First, because it is concerned with the treatment in a randomized experiment that attempted to encourage postsecondary enrollment among high school students in New Brunswick, the outcomes of interest are not employment and earnings. Instead, they are ever enrolled in a 4-year college or university, ever enrolled in college (in the United States, this would be a community college or proprietary school), and ever enrolled in postsecondary education (PSE)—that is, a university, college, vocational institute, or apprenticeship program. Three treatments were tested: Explore Your Horizons (EYH), Learning Accounts (LA), and a combination of the two (EYH + LA). EYH provided programming during Grades 10–12 that was intended to facilitate student development of plans for PSE, while LA consisted of financial subsidies covering PSE expenses.
Second, instead of examining why the administrative- and survey-based impact estimates differ as the authors of the three papers summarized above do, Ford et al. focus instead on whether the impacts differ in the first place, concluding that they are “relatively robust to the data sources chosen.” This conclusion is based in part on t tests used to determine whether differences between impacts estimated with the two data sources are statistically significant. This issue was especially complex to investigate within the context of the evaluation of the Canadian Future to Discover Project because a large number of impacts were estimated due to there being three treatments, three outcome variables, and five subgroups, with each subgroup further divided between the two major linguistic groups in New Brunswick (Francophones and Anglophones).
Third, with both the administrative data and the survey data, all individuals in the experimental sample were counted as either having enrolled or not having enrolled. If the survey data either showed not enrolled or were missing, the coding was “not enrolled.” If the administrative data were missing, the coding was also “not enrolled.” Although missing administrative data are generally treated in this way in evaluations, survey nonrespondents who are missing are much more commonly dropped from the sample in estimating survey-based impacts. That is the source of nonresponse bias. Because no observations are dropped due to not responding to the survey, there can be no measured nonresponse bias; but just as with administrative data, treating missing observations as not enrolled can cause reporting bias. 8
Thus, enrollment rates will likely be understated in both data sources. With the survey data, this will be due to counting all nonrespondents, who comprised 32% of the sample, as not enrolled. 9 The administrative data, which were provided by the Maritime Provinces Higher Education Commission, the New Brunswick Community College, and the New Brunswick College of Craft and Design, covered only Canada’s three Maritime Provinces (New Brunswick, Nova Scotia, and Prince Edward Island) and did not cover students who only attended private career colleges or vocational institutes. 10 Because all survey nonrespondents are treated as not enrolled, some erroneously, nonresponse causes the survey-based impacts to be biased downward to the extent nonrespondents did actually enroll, just as noncoverage causes the administrative-based impacts to be biased downward to the extent those who were not covered enrolled. Because the bias is in the same direction for both data sources, this may help account for the finding that the differences between the two data sources were not statistically significant. The finding might have been different had Ford et al. followed the more conventional procedure of dropping survey nonrespondents in computing the survey-based impacts, in which case survey nonresponse bias would have occurred. Nonresponse bias would cause the survey-based impact estimates to be biased upward if the true impacts were larger for respondents than for nonrespondents, as has occurred in other randomized experiments. 11 Most studies that follow the conventional approach do seem to find substantial differences between survey- and administrative-based impacts (Barnow & Greenberg, 2015).
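The downward bias from coding nonrespondents as not enrolled can be illustrated with a short calculation. The sketch below assumes, purely for illustration, that nonrespondents enroll at the same rate as respondents and that the response rate is 68% in both research groups (consistent with nonrespondents comprising 32% of the sample). The enrollment rates of 50% and 40% are hypothetical and are not taken from Ford et al.

```python
def full_sample_rate(p_respondents, p_nonrespondents, response_rate):
    """True enrollment rate for the full sample."""
    return (response_rate * p_respondents
            + (1.0 - response_rate) * p_nonrespondents)

def coded_rate(p_respondents, response_rate):
    """Measured rate when every nonrespondent is coded as not enrolled."""
    return response_rate * p_respondents

R = 0.68                  # response rate, assumed equal in both groups
p_T, p_C = 0.50, 0.40     # hypothetical enrollment rates among respondents

# Assumption: nonrespondents enroll at the same rate as respondents.
true_impact = full_sample_rate(p_T, p_T, R) - full_sample_rate(p_C, p_C, R)
coded_impact = coded_rate(p_T, R) - coded_rate(p_C, R)

print(f"true impact:  {true_impact:.3f}")
print(f"coded impact: {coded_impact:.3f}")
```

Under these assumptions the measured impact is the true impact multiplied by the response rate, so it is attenuated but keeps its sign. Noncoverage in administrative data works the same way when the share of enrollees who are covered is similar in the two research groups.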
Table 3 shows the
Table 3. O/U Ratio (Computed for the Overall Sample) for Ford et al. (2018).
Note. T = treatment group; C = control group; PSE = postsecondary education; EYH = Explore Your Horizons; LA = Learning Accounts.
The fact that the
The estimates in the Ford et al. paper are generally consistent with these observations. For example, Table A.1 in the paper, which combines the five subgroups, presents 36 impacts (two data sources, three treatments, three outcome measures, and two linguistic groups). The administrative-based impacts are larger than the survey-based impacts for the EYH treatment in five of six instances, while exactly the opposite is true for the LA treatment. The situation is somewhat more mixed for the EYH + LA treatment, but again closer to the LA treatment (four of six instances).
Thus, there does appear to be a pattern to the impacts, one that can be explained reasonably well by the Barnow–Greenberg model. However, the impacts based on the two data sources are similar enough that they probably would not cause contrary conclusions concerning the effectiveness of the tested treatments. To illustrate, we compare the pairs of impacts for university enrollment. The impact of the EYH treatment for Francophones is over 10 percentage points, but only about 1 percentage point for Anglophones. 13 In both cases, however, the difference between the survey-based and administrative-based impacts is less than 1 percentage point and statistically insignificant. The survey-based impact of the LA treatment is 5.23 percentage points and the administrative-based impact is 7.66 percentage points for Francophones, while the impacts for Anglophones are −3.35 and 0.02 percentage points, respectively. Again, the differences between the impacts produced by the two data sources are statistically insignificant. The survey-based impact of the EYH + LA treatment for Francophones is 7.66 percentage points, and the administrative-based impact is 7.69 percentage points, while the impacts for Anglophones are 2.36 and 3.79 percentage points, respectively, and the differences in impacts are statistically insignificant.
Hendra and Hill
This paper and the following one by Scott-Clayton and Wen (2019) differ from those summarized above because they do not compare impacts estimated with survey data to those estimated with administrative data. Instead, the Hendra and Hill (2019) paper uses administrative data from state UI wage records to examine potential nonresponse bias in survey data, while the Scott-Clayton and Wen paper uses survey data to investigate several limitations of state administrative data for estimating returns to higher education.
Using UI data on earnings for 13 tested interventions targeted at disadvantaged persons from the multisite random assignment U.S. ERA evaluation, Hendra and Hill examine the relationship between survey response rates and survey nonresponse bias. They find that survey nonresponse bias at most only weakly declines as response rates increase, a finding that is consistent with earlier research by Groves (2006) and Groves and Peytcheva (2008). They also find that the balance in baseline characteristics between the treatment and control groups only weakly improves as the response rate increases. These results are encouraging in terms of internal validity for survey-based studies in which response rates are low. They also suggest that funding increases intended to increase response rates may not be an appropriate allocation of scarce research funds.
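The logic of using administrative records to gauge survey nonresponse bias can be sketched as follows. Because administrative earnings exist for everyone in the sample, the impact can be computed twice from the same administrative measure, once for survey respondents only and once for the full sample, and the difference reflects nonresponse rather than reporting differences. This is a minimal illustration with hypothetical field names and data, not Hendra and Hill’s estimation procedure.

```python
def admin_impact(records, respondents_only=False):
    """Treatment-control difference in mean administrative earnings."""
    def mean_earnings(group):
        vals = [r["ui_earnings"] for r in records
                if r["group"] == group
                and (r["responded"] or not respondents_only)]
        return sum(vals) / len(vals)
    return mean_earnings("T") - mean_earnings("C")

def admin_based_nonresponse_bias(records):
    """Respondent-only impact minus full-sample impact, both from admin data."""
    return admin_impact(records, respondents_only=True) - admin_impact(records)

# Hypothetical sample: UI earnings are observed whether or not a person
# responded to the survey.
records = [
    {"group": "T", "responded": True,  "ui_earnings": 1200},
    {"group": "T", "responded": False, "ui_earnings": 800},
    {"group": "T", "responded": True,  "ui_earnings": 1000},
    {"group": "C", "responded": True,  "ui_earnings": 900},
    {"group": "C", "responded": True,  "ui_earnings": 700},
    {"group": "C", "responded": False, "ui_earnings": 800},
]
bias = admin_based_nonresponse_bias(records)
```

In this hypothetical sample the respondent-only impact is 300 and the full-sample impact is 200, so the estimated nonresponse bias is 100.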
It is important to stress that there is considerable variation across the tested treatments in the relation between nonresponse bias and survey response rates (see figure 2 in Hendra and Hill): In some cases, nonresponse bias declines as response rates increase; in other cases, nonresponse bias increases; and in still others, there is no discernible relationship. Taking the 13 interventions as a whole, however, Hendra and Hill find little change in nonresponse bias as the response rate increases. Still, it is important to keep in mind that the response rate may matter in evaluating a specific policy intervention.
To discuss the Hendra and Hill finding further, we turn to Equation 2, which for convenience is repeated below:
Because it is assumed under Equation 2 that there are no reporting errors,
Equation 2 implies that survey nonresponse bias decreases as R, the response rate, increases. However, simulations by Barnow and Greenberg (2015) imply that this decrease is quite modest so long as fT and fC are within the bounds found in previous random assignment evaluations (i.e., over 0.7) and fT does not differ greatly from fC (see figure 2 in Barnow & Greenberg). This is quite possible for many experiments, especially if financial incentives are not involved. 14 Thus, as a practical matter, response rates may not be strongly related to nonresponse bias in the absence of factors, such as financial incentives, that cause fC to exceed fT.
As previously mentioned, fC is especially likely to exceed fT when positive financial incentives are tested and this lack of balance will cause nonresponse bias. When fT < fC and the difference is substantial, the simulations in Barnow and Greenberg (2015) imply that there can be a rather strong relation between response rates and nonresponse bias. Interestingly, the treatments tested in four of the U.S. ERA sites incorporated financial incentives, and Hendra and Hill’s figure 2 indicates that there is a negative relationship between the response rate and nonresponse bias for three of these (Chicago, Fort Worth, and Corpus Christi), although not for the fourth (Houston). This is possibly because the financial incentives were found to have had little impact in Houston but had substantial impacts in the other three sites (Hendra et al., 2010).
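The contrast between balanced and unbalanced nonresponse can be illustrated with a small simulation. The sketch below is not the simulation in Barnow and Greenberg (2015). It assumes no reporting error, equal response rates in the two groups, values of fT and fC that stay fixed as R changes, and hypothetical full-sample mean earnings of 11,000 for the treatment group and 10,000 for controls. The respondent mean is obtained from the full-sample mean through the identity respondent mean = full-sample mean / [R + (1 − R) × f].

```python
def nonresponse_bias(E_T, E_C, R_T, R_C, f_T, f_C):
    """Respondent-only impact minus full-sample impact (no reporting error).

    E_T and E_C are full-sample means; R is the response rate; f is the
    ratio of nonrespondent to respondent mean earnings.
    """
    respondents_T = E_T / (R_T + (1.0 - R_T) * f_T)
    respondents_C = E_C / (R_C + (1.0 - R_C) * f_C)
    return (respondents_T - respondents_C) - (E_T - E_C)

def bias_by_response_rate(E_T, E_C, f_T, f_C, rates):
    """Bias at each response rate, holding f_T and f_C fixed (an assumption)."""
    return {R: nonresponse_bias(E_T, E_C, R, R, f_T, f_C) for R in rates}

rates = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
balanced = bias_by_response_rate(11000, 10000, 0.8, 0.8, rates)    # fT = fC
unbalanced = bias_by_response_rate(11000, 10000, 0.6, 0.9, rates)  # fT < fC

for R in rates:
    print(f"R={R:.1f}  balanced={balanced[R]:8.1f}  unbalanced={unbalanced[R]:8.1f}")
```

With these hypothetical values the bias under balanced nonresponse is small at every response rate, while the bias under unbalanced nonresponse is large at low response rates and falls steeply as R rises. The sketch holds f fixed, which is precisely the assumption that the hypotheses discussed in this section relax.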
Given Hendra and Hill’s key finding that, in general, survey nonresponse bias in impact estimates does not diminish as survey response rates rise, we consider several hypotheses that, if valid, could keep nonresponse bias from falling as the response rate increases. 15
1. The ratio of nonrespondent earnings to respondent earnings (f) declines sufficiently as the response rate increases to more than offset the increase in R. As Equation 2 implies, this would cause nonresponse bias to increase as the response rate increases. This seems to be a plausible possibility because it is likely that as the response rate increases, the average earnings of those who remain in the nonrespondent group decline. This would occur if persons with higher earnings are more likely to be responders than those with lower earnings. Those individuals without earnings, some of whom are highly mobile, are especially likely to remain nonrespondents.
2. The ratio of nonrespondent earnings to respondent earnings is smaller for members of the treatment group than for members of the control group (i.e., fT < fC), but this gap shrinks as the response rate increases. According to Equation 2, the wider the gap between fT and fC, the greater the nonresponse bias ceteris paribus. The validity of this hypothesis is unclear, but it is possible that the gap between fT and fC could diminish in accordance with the hypothesis, for example, if individuals who were positively affected by a treatment are the most likely persons in the treatment group to respond to a survey when the response rate is relatively low, but those in the treatment group who were unaffected or even negatively affected only respond when the response rate is high. In this case, fT may shrink by less than fC falls as the response rate increases. This argument is rather tenuous, however.
3. The response rate is larger for the treatment group than for controls (i.e., RT > RC), but the gap narrows as the response rate increases. Equation 2 implies that a smaller gap will result in a larger nonresponse bias, everything else equal. A narrowing gap seems plausible as the response rate increases because if RT > RC, there will be fewer persons to draw from in the treatment group than in the control group in order to increase the rate. Indeed, the response rates for the two groups must converge as the overall response rate approaches 100%.
In summary, under certain conditions, it appears theoretically possible for there to be little relation, or even a positive relation, between nonresponse bias in impact estimates and response rates. The absence of a strong negative relation appears most likely when survey response is close to being balanced but can occur even when there is unbalanced nonresponse.
Scott-Clayton and Wen
Numerous studies have estimated the returns to higher education. One source of data that have been used for this purpose is single-state administrative files that link postsecondary enrollees at state colleges with their earnings after they leave college. These data are subject to several important limitations, however. For example, the earnings of college graduates typically cannot be compared to those of individuals who do not enroll in college, but only to college dropouts; there is limited information in state administrative records about family background and precollege ability; and the postcollege earnings of persons who work at out-of-state jobs are usually not included in the state administrative data.
Scott-Clayton and Wen (2019) cleverly use survey data (the National Longitudinal Survey of Youth) to mimic these limitations to determine their importance. In doing this, the authors also produce a “best estimate” of the dollar returns to college. The study examines the dollar returns to certificates, some college but no degree, associate degrees, and BAs. For brevity in this summary, we focus on the last of these outcomes.
Scott-Clayton and Wen’s key findings concerning the limitations of using state administrative data to estimate the return on a BA are summarized in Table 4. As discussed below, this table indicates how the dollar return estimates vary as the basis of comparison changes. The estimates in the table are only for persons with positive earnings in 2010 who were not enrolled in education that year or afterward.
Table 4. The Returns to College in 2015 Dollars (Conditional on Positive Earnings) for Scott-Clayton and Wen (2019).
aIndividuals with some education beyond high school are used as the comparison group.
Note. All dollar figures are US dollars.
The estimate that is based on all the relevant information in the survey is US$17,942. This estimate is for 2010 but is adjusted to 2015 dollars. As the Xs in Table 4 indicate, the estimate controls for family background and student ability, uses high school graduates as the comparison group, and relies on observations in all the states. Each of the three remaining estimates in Table 4, which are described below, comes closer to the information that is available in state administrative data. They should each be compared to the $17,942 estimate.
Because the $17,942 estimate is not based on data from a random assignment experiment, as is common in studies of the returns to graduating from college, the comparison group consisted of persons with lower educational achievement—specifically, individuals with a BA are compared to those with a high school education but no PSE. Scott-Clayton and Wen investigated two issues concerning such comparison groups when relying on state administrative data instead of survey data.
First, individuals with different levels of educational achievement are unlikely to have the same ability. The survey data used by Scott-Clayton and Wen contain measures that allow researchers to control to some extent for differences in family background and ability (e.g., household size and net worth, high school grade point average, and Armed Services Vocational Aptitude Battery scores), but state administrative records usually do not. Without these controls, but using high school graduates as the comparison group and data from all states, the estimated return on a BA increases to $24,355.
Second, analyses using state administrative files are typically limited to individuals who receive some education beyond high school. Without the controls on family background and ability and using the earnings of college noncompleters (instead of high school graduates) as the comparison group, but continuing to use data from all the states, the estimated return on a BA is $20,519. Thus, as Scott-Clayton and Wen point out, the absence of controls for family background and student ability and the lack of data on high school graduates tend to have offsetting effects on estimates of the returns to college.
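The offsetting effects can be checked by differencing the three conditional estimates quoted above. The short Python sketch below is only an illustration of that arithmetic; the variable names are ours, not Scott-Clayton and Wen's, and the step-by-step decomposition is an assumption about how to read the figures rather than a calculation taken from their paper.

```python
# Estimated returns to a BA (2015 dollars, conditional on positive earnings),
# as quoted in the text from Scott-Clayton and Wen's Table 4.
full_information = 17_942           # controls, high school graduates as comparison
no_controls = 24_355                # family background and ability controls dropped
no_controls_noncompleters = 20_519  # comparison group also switched to noncompleters

# Change from dropping the controls (comparison group held fixed).
effect_of_dropping_controls = no_controls - full_information
# Change from switching the comparison group (controls already dropped).
effect_of_comparison_group = no_controls_noncompleters - no_controls
# Net change relative to the full-information estimate.
net_change = no_controls_noncompleters - full_information

print(effect_of_dropping_controls)  # 6413
print(effect_of_comparison_group)   # -3836
print(net_change)                   # 2577
```

Dropping the controls raises the estimate by $6,413, and switching to college noncompleters lowers it by $3,836, so the two limitations together leave the estimate $2,577 above the full-information figure.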
Another important limitation of single-state administrative records on PSE is that they do not contain earnings information on individuals who have left the state. As previously mentioned, the administrative data used in the evaluation of the Canadian Future to Discover experiment are subject to a similar problem, as are (American) state UI data when they are used to evaluate training programs. In estimating the returns to college attainment, there are two possible approaches for dealing with this issue.
Scott-Clayton and Wen call the first of these approaches, which is sometimes used in estimating the returns to education, the “naive approach.” Under this approach, it is simply assumed that all missing earnings are zero. Without the controls on family background and ability and using college noncompleters as the comparison group, but not conditioning on employment, the estimated return on a BA is $10,649 when the naive approach is taken. This is roughly half of the $20,939 estimate that is also not conditional on employment but that controls for family background and student ability, uses high school graduates as the comparison group, and does not set the earnings of out-of-state workers to zero. Unfortunately, the latter estimate, which can be considered Scott-Clayton and Wen’s best estimate of the total effect on earnings of receiving a BA, cannot be obtained from state administrative data, but only from survey data.
Scott-Clayton and Wen suggest that a “superior approach” to the naive approach is to calculate the returns for only those individuals known to have positive earnings and hence living within the state. They suggest that even though the resulting estimate necessarily ignores both nonworkers and out-of-state workers (who potentially have higher earnings than in-state workers), it is nonetheless the best that can realistically be obtained from state administrative data. Without the controls on family background and ability, conditional on employment and thus living within the state of college attendance, and using college noncompleters as the comparison group, the estimated return on a BA is $15,129 when the superior approach is used. This estimate is well above the naive estimate and reasonably close to the $17,942 estimate that uses full information from the survey and is conditional on employment.
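The difference between the two approaches can be made concrete with a toy calculation. The Python sketch below uses entirely hypothetical records (the data, the function names `naive_return` and `conditional_return`, and the assumption that BA holders are more likely to be missing from the state file are ours, for illustration only); it is not Scott-Clayton and Wen's estimation procedure, which also involves regression adjustment.

```python
from statistics import mean

# Hypothetical records: (has_ba, in_state_earnings). None means the person has
# no earnings record in the state file (a nonworker or an out-of-state worker).
records = [
    (True, 52_000), (True, None), (True, 48_000), (True, None),
    (False, 30_000), (False, 34_000), (False, None), (False, 32_000),
]

def naive_return(rows):
    """Naive approach: treat every missing earnings record as zero earnings."""
    ba = [0 if e is None else e for has_ba, e in rows if has_ba]
    non = [0 if e is None else e for has_ba, e in rows if not has_ba]
    return mean(ba) - mean(non)

def conditional_return(rows):
    """Conditional approach: keep only people with positive in-state earnings."""
    ba = [e for has_ba, e in rows if has_ba and e is not None and e > 0]
    non = [e for has_ba, e in rows if not has_ba and e is not None and e > 0]
    return mean(ba) - mean(non)

naive = naive_return(records)
conditional = conditional_return(records)
print(naive)        # 1000
print(conditional)  # 18000
```

In this made-up sample, half of the BA holders are missing from the state file, so coding them as zero earners pulls the naive difference far below the difference among observed in-state workers.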
Interestingly, the papers summarized earlier all adopted the naive approach with administrative data. When a key objective of a program is to change status (e.g., increasing employment or educational enrollment) and the outcome of interest is not available for some individuals (e.g., because they work or attend school out-of-state), there is little alternative but to use this approach. When the key program goal is to increase employment, it would be better to use the National Directory of New Hires (NDNH) or Social Security data instead of state UI records because they provide information on the earnings of out-of-state workers.
Although state administrative data on college enrollees, like state UI records, do not contain the earnings of out-of-state workers, it is still useful to estimate the returns to college attainment in terms of conditional earnings because whether college enrollment increases earnings among workers is of considerable interest. Use of state administrative data means, however, that estimated returns to education must be limited to conditional earnings. Consequently, any positive effects of college attendance on working will be missed. Based on full information from the survey, for example, Scott-Clayton and Wen find that the unconditional earnings return to a BA is about $3,000 higher than the corresponding conditional estimate ($20,939 vs. $17,942) and almost $6,000 higher than the best estimate that can be obtained from state administrative data ($20,939 vs. $15,129).
Lessons From the Papers
The papers in this Special Issue vary greatly in terms of geographic setting, outcomes examined, and types of administrative and survey data used. Thus, it is not surprising that the lessons from the papers vary. This section summarizes the key lessons from the papers.
Survey Nonresponse Bias
It is well known that surveys miss individuals, often because they cannot be located or refuse to be interviewed. 16 Nonresponse is related to various socioeconomic characteristics such as age, gender, marital status, household structure, education, and income (Groves and Couper, 2001). Nonresponse is especially likely if the evaluation follow-up period is lengthy. If nonresponse is disproportionately high among nonworkers or workers in the sample with relatively low earnings, which is likely, this will bias mean survey-reported earnings upward. Survey response rates that differ systematically by program subgroups or between the treatment and control/comparison groups can lead to biased impact estimates. 17 As previously discussed, this is particularly likely if there is reason to expect a relation between response and treatment (e.g., when a treatment incorporates a financial incentive) and response rates are consequently unbalanced. Several techniques are available to correct for survey nonresponse bias. Groves (2006, p. 653) notes that the literature recommends approaches including weighting, calibration, and propensity models, but he cautions that all these approaches require assumptions that are generally untestable.
Some diagnostic tests can be conducted if only survey data are available. Those tests include comparing the respondents’ and nonrespondents’ characteristics at baseline. This information can then be used in weighting the observations to minimize survey nonresponse bias. However, if only survey data are available, it typically will not be apparent whether nonresponse bias even exists.
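As a rough illustration of how baseline information can be used to weight respondents, the Python sketch below applies a simple cell-based nonresponse adjustment: each respondent is weighted by the inverse of the response rate in his or her baseline cell. The sample, the cell labels, and the function names are hypothetical, and this is only one of many possible weighting schemes, not the method used in any of the papers. The adjustment removes bias only under the untestable assumption that, within a cell, respondents and nonrespondents have the same expected outcome.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample: (baseline_cell, responded, survey_earnings).
# survey_earnings is None for nonrespondents.
sample = [
    ("employed_at_baseline", True, 30_000),
    ("employed_at_baseline", True, 34_000),
    ("employed_at_baseline", True, 32_000),
    ("employed_at_baseline", False, None),
    ("not_employed_at_baseline", True, 10_000),
    ("not_employed_at_baseline", False, None),
    ("not_employed_at_baseline", False, None),
    ("not_employed_at_baseline", False, None),
]

def response_rates(rows):
    """Share of each baseline cell that responded to the survey."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [respondents, total]
    for cell, responded, _ in rows:
        counts[cell][1] += 1
        if responded:
            counts[cell][0] += 1
    return {cell: r / n for cell, (r, n) in counts.items()}

def weighted_mean(rows):
    """Mean earnings of respondents, weighted by 1 / cell response rate.

    Assumes every cell has at least one respondent.
    """
    rates = response_rates(rows)
    num = den = 0.0
    for cell, responded, earnings in rows:
        if responded:
            weight = 1.0 / rates[cell]
            num += weight * earnings
            den += weight
    return num / den

unweighted = mean(e for _, responded, e in sample if responded)
weighted = weighted_mean(sample)
print(unweighted)        # 26500
print(round(weighted))   # 21000
```

Because the low-response cell also has low earnings in this made-up sample, the unweighted respondent mean overstates earnings, and the weighted mean moves it downward.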
Groves (2006) notes other diagnostic tests that can also be performed, such as comparing respondent means and variances to population data or another high-quality sample. Although census data or aggregate administrative data can be used for this purpose, having administrative data on individuals in addition to survey data permits much better tests. One can then see whether estimated impacts are the same for survey respondents and nonrespondents by using the administrative data. One can further use administrative data on individuals to determine whether schemes such as weighting, calibration, and propensity matching successfully correct for nonresponse bias.
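The individual-level diagnostic described above can be sketched in a few lines of Python. The data and names below are hypothetical; the point is only that, when administrative earnings are available for everyone, the same impact calculation can be run separately for survey respondents, nonrespondents, and the full sample. A real analysis would also report standard errors and test whether the subgroup impacts differ.

```python
from statistics import mean

# Hypothetical sample: (treated, responded_to_survey, admin_earnings).
# Administrative earnings are observed for respondents and nonrespondents alike.
rows = [
    (True, True, 24_000), (True, True, 26_000),
    (True, False, 10_000), (True, False, 12_000),
    (False, True, 20_000), (False, True, 22_000),
    (False, False, 9_000), (False, False, 11_000),
]

def impact(sample):
    """Difference in mean administrative earnings, treatment minus control."""
    treatment = [y for treated, _, y in sample if treated]
    control = [y for treated, _, y in sample if not treated]
    return mean(treatment) - mean(control)

full_sample_impact = impact(rows)
respondent_impact = impact([r for r in rows if r[1]])
nonrespondent_impact = impact([r for r in rows if not r[1]])

print(full_sample_impact)    # 2500
print(respondent_impact)     # 4000
print(nonrespondent_impact)  # 1000
```

In this made-up example, the impact among respondents is larger than the impact in the full sample, which is the pattern that would signal nonresponse bias in a survey-based estimate.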
Several of the articles in this Special Issue used administrative data to explore whether the survey data had survey nonresponse bias. 18 Dorsett et al. found, for example, that nonresponse bias, especially of the unbalanced kind, rather than reporting bias, appears to be the major source of the difference in impacts between the survey and administrative data in the UK ERA experiment. This seems to be the case even though the observed characteristics of the treatment and control groups in the respondent sample were similar. Yang and Hendra, who evaluated two similar programs in New York City, found that survey nonresponse bias was important in the Work Rewards evaluation but not for the evaluation of Family Rewards.
Hendra and Hill found that response rates are unrelated to bias for their evaluation. This important finding suggests that substantial resources could be saved in studies using surveys if high response rates are not chased. They further suggest that, when it is present, nonresponse bias can be monitored and then adjusted for as the survey is being run. Hendra and Hill’s finding matches the conclusions in Groves (2006) and the meta-analysis of Groves and Peytcheva (2008). As all these studies note, additional evidence is clearly needed.
Improved weighting strategies and imputation for missing observations are two approaches that have been suggested in the literature for coping with nonresponse bias (Puma, Olsen, Bell, & Price, 2009). Weighting was attempted by Dorsett et al., but they concluded that weighting was not effective in removing nonresponse bias. 19 This was perhaps due to unobserved characteristics that influenced decisions on whether to respond. Yang and Hendra mention that the authors of the original evaluation of the programs they analyze (Riccio et al., 2013) tried several weighting schemes to deal with nonresponse bias but were unsuccessful.
Survey Response Bias
Survey measurement error can potentially lead to bias in studies using survey data. Note that survey respondents can provide erroneous responses either because they do not understand the questions or because they want the interviewer to perceive them as a “good” person; the latter phenomenon is referred to as “social desirability bias.”
Mathiowetz, Brown, and Bound (2001, p. 186) review the literature on survey measurement errors for the low-income population, concluding that the literature is “limited.” The papers in the Special Issue either do not pay much attention to survey measurement errors or note that they do not appear to be a major problem; for example, Moore et al. examine whether the difference between administrative and survey data increases when the length of the recall period is extended, and they find no evidence that recall error increases as the period is increased.
Noncoverage in Administrative Data
There is a tendency for analysts to assume that administrative data are complete and accurate. This need not be the case. Several of the articles in the Special Issue relied on earnings data collected by state agencies to determine eligibility and the level of benefits for UI, often referred to as UI wage records. As Hotz and Scholz (2001, p. 292) point out, UI records typically miss certain types of earnings: “State UI systems typically do not cover the employment of self-employed persons, most independent contractors, military personnel, federal government workers, railroad employees, some part-time employees of nonprofit institutions, employees of religious orders, and some students employed by their schools.” Hotz and Scholz (p. 303) estimate that these gaps in coverage account for at least 13% of all jobs. Wallace and Haveman (2007, p. 738) indicate that 9% of the workers in Wisconsin are not covered by UI wage records. In addition, UI wage record data miss informal, “off the books” earnings, where the employer does not report the earnings (or pay the appropriate taxes) to the government. Because such earnings are illegal, one can argue that they should not be counted in impact evaluations or cost-benefit analyses, but it can also be argued that the earnings do provide benefits to those who receive them.
Finally, most state UI systems miss individuals who work in a different state (e.g., see the Moore, Perez-Johnson, & Santillano paper in this volume). As Moore et al. mention, this becomes a more serious problem the longer the follow-up period because an increasing number of people move. 20 As Scott-Clayton and Wen point out in their paper for this volume, state administrative records on college enrollees similarly miss individuals who move after completing their education and thereby work in a different state than the college or university they attended. They find in their analysis that such persons have higher earnings than those who work in their state of origin, leading to systematic underestimates of the impact of college on earnings when estimated with state administrative data. In a somewhat analogous situation, the Ford et al. study sought to see whether the Future to Discover Project in Canada led to increased enrollment in higher education. Unfortunately, the administrative data they used only captured enrollment in colleges in the Maritime Provinces where the intervention took place. Thus, the administrative data failed to capture enrollments of students in other parts of Canada or in other countries.
Fortunately, in the case of state UI wage record data, there are alternatives that avoid incomplete coverage issues. The Office of Child Support Enforcement in the U.S. Department of Health and Human Services compiles quarterly earnings data based on UI wage records from all states, along with records for members of the military and federal workers, to produce the NDNH. The law establishing the NDNH indicates that its primary purpose is to assist states in setting appropriate child support orders, but the statute permits the data to be made available for research in some circumstances. In recent years, after the papers in this volume were written, access to NDNH data for research and evaluations has increased. NDNH data can overcome coverage issues due to individuals working in a different state, the military, or the federal government. NDNH data do not capture self-employment earnings; Internal Revenue Service (IRS) or Social Security data are the best sources of administrative data for the self-employed, 21 but data from these sources are infrequently available for program evaluations.
An exception is found in a recent study by Mastri, Rotz, and Hanno (2018). As part of a random assignment evaluation of programs operating under the WIA, these authors were able to compare impacts estimated with administrative data from both the NDNH and IRS with those estimated from survey data. 22 Consistent with several of the papers included in the Special Issue, the researchers found that while both WIA intensive services alone and intensive services combined with training produced positive impacts on earnings, these impacts tended to be larger and more often statistically significant when estimated with survey data than with administrative data (Table III.2 and Figures III.6 and III.7 of Mastri, Rotz, and Hanno, 2018). Importantly, because the IRS data included earnings on jobs that were not covered by the NDNH data, the IRS data produced impacts that were closer to those based on the survey data than those resulting from the NDNH data, especially when the IRS data included earnings from independent contracting. Using the NDNH data, Mastri, Rotz, and Hanno also found that WIA-funded training increased the likelihood of out-of-state employment, a result suggesting that state UI data are subject to unbalanced reporting errors when used to evaluate training programs. They also found evidence that their survey data were subject to serious recall errors, especially as the follow-up period became longer.
Importance of Financial Incentives and Other Program Features
The structure of the program being evaluated can sometimes lead to biases in the survey or administrative data. Yang and Hendra’s paper uncovered some interesting potential biases due to program structure. The Family Rewards program encouraged participants to become self-employed, and because self-employment income is not included in the UI wage record data they used, these data understated the earnings of the treatment group. This led Yang and Hendra to point out the importance of conducting surveys when employment from an intervention is likely to be disproportionately in jobs not covered by administrative data.
The Family Rewards and Work Rewards programs, the two programs Yang and Hendra examine, included financial rewards for employment. They found that receipt of a program financial reward was positively correlated with survey response, thereby leading to higher response rates for successful members of the treatment group. The Ford et al. paper also provides evidence of higher survey responses when financial incentives are provided. Dorsett et al. also note in their paper that the payments made to the treatment group may have led to unbalanced nonresponse between the treatment and control groups. Thus, it is particularly important to have administrative data when financial incentives are part of an intervention because unbalanced nonresponse bias is especially likely.
The Importance of Including Data From Multiple Sources
All the papers in this Special Issue point to the importance of using both survey data and administrative data whenever possible. Administrative data are often limited in content and may not include all the outcomes and characteristics desired or cover the time frame of interest. Hendra and Hill show that low response rates do not necessarily lead to nonresponse bias, and administrative data can be used to diagnose and possibly correct nonresponse bias; without the high cost of attempting to reach nonrespondents, the cost of survey data is much more reasonable. Survey data can demonstrate shortcomings in administrative data due to lack of coverage, as illustrated by Yang and Hendra, and administrative data can be helpful in detecting measurement problems and nonresponse bias in surveys. 23 Indeed, Dorsett et al. suggest that the impacts of the UK ERA on earnings might have been substantially overstated if only survey data were used to estimate them. Although in some instances one data source may be clearly superior to the other, including both survey data and administrative data in an evaluation enables the evaluator to spot potential biases in each source and may offer solutions to overcome the biases.
However, the cost of developing and administering a survey generally dwarfs the cost of administrative data. In studies where the expected impact is fairly small, such as providing a single dose of job search assistance to UI claimants, the costs of fielding a large enough sample could run into the millions, whereas administrative data usually have a very low cost if they are already being collected for other purposes. Thus, evaluators need to pay careful attention to determine whether the extra costs of a survey are warranted. For example, Yang and Hendra knew that the Family Rewards program encouraged participants to start their own businesses, so relying on UI administrative data alone was likely to produce biased estimates due to coverage issues.
Differences Between Outcomes and Impacts
The Ford et al. article makes the important point that when administrative data and survey data are both used for an evaluation, there can be differences in the levels of the outcome variable with or without differences in the impact of the intervention. Ford et al. emphasize that, when an observation is missing from the administrative data, the outcome value is generally assumed to be zero, even though the missing data may result from noncoverage or a data error. This can occur with education outcomes considered by Ford et al. and for earnings outcomes analyzed in most of the other papers in the Special Issue. In surveys, however, missing data are generally not automatically assumed to be zero. In addition, outcome values can differ between survey and administrative data for other reasons, such as survey respondents misinterpreting questions about an outcome.
In the context of program evaluation, differences in impacts are generally more important than differences in outcome levels because it is the program impact that determines whether a program’s benefits exceed the costs. Nevertheless, outcome levels can also be important. For example, Scott-Clayton and Wen not only examine the impact of higher education on earnings but also examine the impact of college attendance on the probability of having annual earnings exceed a minimum level and a living wage level. Analyses of welfare-to-work programs sometimes look at how programs affect the probability of remaining in poverty or remaining eligible for benefits after participation.
Limitations of Administrative Data in Absence of Random Assignment
When random assignment is used to determine which individuals receive a treatment (or more formally, are eligible to receive the treatment) and which are relegated to a control group, and is properly implemented, the process usually ensures that the treatment and control groups are well matched on all prerandomization characteristics that can affect the outcome of interest. However, random assignment is often not feasible and often not conducted even when feasible. Under these circumstances, it is necessary to control statistically for factors that affect the outcome of interest, typically through a statistical technique, such as regression analysis, or a matching process, such as propensity score matching.
Evaluations of programs not using random assignment that rely on administrative data face two problems. First, because administrative data include only the information needed for administrative purposes, the background data (and preprogram outcomes) they contain are often extremely limited. For example, although UI wage record data can include several years of preprogram employment and earnings data, they are often very limited in terms of the basic demographic data they provide, and in most states, they do not include hours worked or the hourly wage rate. This makes it difficult to apply such statistical control methods as regression analysis and propensity score matching.
The second problem is that appropriate comparison group members may not be present in the administrative data. For example, Scott-Clayton and Wen stress that state administrative data on higher education typically do not include persons who did not receive PSE. Thus, such data can only be used to compare the effect of receiving a degree in comparison to those who enrolled but did not complete college; they cannot be used to estimate the impact of college completion relative to receiving no PSE.
Closing Thoughts
The decision as to whether to use administrative data, survey data, or both for an evaluation is ultimately a judgment call, but an important one. In Barnow and Greenberg (2015), we found several cases where the two sources yielded different conclusions about program impacts, and the general finding that surveys produce higher levels of earnings than administrative data can alter findings in a cost-benefit framework. In the design stage, researchers should carefully weigh considerations that might indicate whether a survey is likely to be worth the extra cost. Factors to consider include: whether the administrative data are likely to have full coverage; whether nonresponse is likely to be large or vary by treatment-control status or population characteristics, likely causing survey-based impacts to be plagued by nonresponse bias; and whether the characteristics of the program being investigated might cause its estimated impacts to be especially subject to biases resulting from noncoverage in the administrative data or survey nonresponse.
The papers in this Special Issue and the articles and reports reviewed in Barnow and Greenberg (2015) make it clear that while the decision on whether to use survey or administrative data can be highly consequential, there is much that remains to be learned. The Editor in Chief of Evaluation Review has assured us that he is very interested in publishing other articles that shed light on this important issue.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
