Abstract
Background:
Impact evaluations draw their data from two sources: surveys conducted for the evaluation and administrative data collected for other purposes. Both types of data have been used in impact evaluations of social programs.
Objective:
This study analyzes the causes of differences in impact estimates when survey data and administrative data are used to evaluate earnings impacts in social experiments and discusses the differences observed in eight evaluations of social experiments that used both survey and administrative data.
Results:
There are important trade-offs between the two data sources. Administrative data are less expensive but may not cover all income and may not cover the time period desired, while surveys can be designed to avoid these problems. We note that errors can be due to nonresponse or reporting, and errors can be balanced between the treatment and the control groups or unbalanced. We find that earnings are usually higher in survey data than in administrative data due to differences in coverage and likely overreporting of overtime hours and pay in survey data. Evaluations using survey data usually find greater impacts, sometimes much greater.
Conclusions:
The much lower cost of administrative data makes their use attractive, but they are still subject to underreporting and other problems. We recommend further evaluations using both types of data with investigative audits to better understand the sources and magnitudes of errors in both survey and administrative data so that appropriate corrections to the data can be made.
Introduction
There are two possible sources of data that can be used to estimate the impacts of social programs: data obtained from government agencies that maintain the data for use in administering their programs and data collected from surveys of a research sample for the specific purpose of measuring program effects. Each type of data has its own advantages and disadvantages. Most importantly, they can produce different impact estimates. To investigate this issue, we focus on findings from previously conducted randomized social experiments. Because social experiments are often referred to as the “gold standard” for measuring the impacts of social programs, one might expect the impact estimates they produce to be valid and precise, regardless of whether administrative or survey data are used. However, as will be seen, there is reason to doubt this. Some large social experiments have used data from both sources to estimate the same impacts and produced divergent estimates. The issues raised here apply to nonexperimental evaluations as well. Indeed, as is well documented in the literature, nonexperimental evaluations frequently rely on strong, untestable assumptions and are subject to other threats in addition to the data issues raised here. Thus, social experiments remain the gold standard, and if we are to learn whether programs and policy innovations work, it is essential that they continue to be conducted and used to estimate program impacts. Evaluators nonetheless need to be aware of the threats that can arise even when random assignment is used.
In the remainder of this article, we first review previous research that compares administrative and survey data and briefly discuss the advantages and disadvantages of each. We then look at eight previous experiments that have used both administrative data and survey data to estimate the same impacts and examine the extent to which these estimates differ and how these differences are treated in evaluations of the programs tested by social experiments. Following that, we develop a simple model of the mechanisms that can cause impacts estimated with administrative data to differ from those estimated with survey data. We then use the model in investigating the sources of the differences between survey- and administrative-based impact estimates in the eight experiments mentioned earlier. The implications of these differences in impacts, and some ideas for addressing them, are discussed at the end of the article.
Our discussion focuses mostly on impacts on earnings because most previous comparisons of impacts produced by administrative and survey data have emphasized this measure, although impacts on other outcomes have sometimes been compared as well. Moreover, estimates of program impacts on earnings usually drive the findings from cost–benefit analyses of many of the programs tested by social experiments. 1
Differences Between Survey and Administrative Measures of Earnings
Some Trade-Offs
Conducting surveys to evaluate a treatment being tested by a social experiment is much more expensive than using administrative data. Administrative data that cover earnings include the records that employers submit quarterly to state governments for the purpose of administering unemployment insurance (UI) programs, which are the administrative data most widely used for evaluations in the United States, as these data would exist in the absence of the experiment. Indeed, according to Kornfeld and Bloom (1999, p. 193), survey data can be over 100 times more costly than UI earnings data per observation, although another source (Baj, Fahey, & Trott, 1992, p. 41) estimated survey data to be only 9.5 times as costly. For that reason, relatively few surveys are likely to be conducted during the follow-up to an experiment, while administrative data can often be obtained for each month or calendar quarter during the follow-up period. Longer follow-up periods are also more feasible with administrative data. The use of administrative data for impact analysis has greatly reduced the cost of conducting social experiments and allowed some experiments to be conducted that otherwise would have been prohibitively expensive. Because survey data are much more costly than administrative data, but, as discussed next, can provide richer data in terms of the questions that can be asked, a small number of evaluations have used the two data sources in combination.
An important characteristic of surveys is that they can be tailored to the specific needs of an evaluation and subsequently used to estimate the key impacts of interest, whatever they may be. For example, surveys can collect information not only on employment status and earnings but also on hours and wage rates. Administrative data such as UI records, in contrast, can be used to determine employment and earnings but not (in most states) the number of hours worked or the hourly wage rate. However, survey data are subject to recall errors, as well as simple misreporting, especially in the case of earnings from irregular work, work in the informal sector, and illegal activities. As will be seen, there is evidence that some survey respondents report implausibly high hours and earnings, especially pertaining to overtime work. On the other hand, some respondents may fail to recall brief informal jobs or correctly remember their hours and earnings in occupations that tend to have irregular hours, such as construction and housekeeping in private residences. 2 Also, there is evidence that survey respondents tend to understate transfer payments (Hotz & Scholz, 2001), either intentionally or inadvertently. Administrative data may contain errors as well, but these errors are more likely to be random with respect to treatment and control status. Errors in survey data, in contrast, may not be random, especially if the experimental treatment involves transfer payments or financial incentives.
For example, in the income maintenance experiments, which tested negative income tax programs (i.e., income-conditioned transfer programs) for low-income families, participants in the treatment group self-reported their hours and earnings (Greenberg, Moffitt, & Friedmann, 1981; Greenberg & Halsey, 1983). Because of the nature of the treatments tested, treatment group members could increase the cash payments they received by underreporting their work effort and earnings. Although the information used to determine the payment amounts was collected separately from the survey data used for the evaluation, one might expect that many members of the treatment group were consistent in what they self-reported, especially if they thought the latter might be used to audit the former. Members of the control group did not have similar incentives. To the extent that program participants engaged in underreporting, the estimated impact of the treatment on reducing their labor supply was biased upward. We present some evidence below that such an upward bias did exist in the income maintenance experiments. There are other experiments that test programs with financial incentives that reward participants for employment. In such experiments, there is an incentive for members of the treatment group to overstate their hours and earnings, although again collection of the payment data and survey data is usually conducted separately. Of course, many experiments do not involve programs that provide incentives to misstate earnings, but it is important to keep in mind that some do.
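The direction of this bias can be illustrated with a small numerical sketch. The earnings figures and the 90% reporting rate below are hypothetical values chosen only for illustration; they are not taken from the income maintenance experiments.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def impact(treatment, control):
    """Estimated impact: mean treatment outcome minus mean control outcome."""
    return mean(treatment) - mean(control)


# Hypothetical true quarterly earnings in dollars (illustrative values only).
true_treatment = [1800.0, 2000.0, 2200.0]
true_control = [1900.0, 2100.0, 2300.0]

# Assumption: treatment members report 90% of their true earnings in the survey,
# while control members, who face no payment incentive, report accurately.
REPORTING_RATE = 0.90
reported_treatment = [REPORTING_RATE * e for e in true_treatment]

true_impact = impact(true_treatment, true_control)
survey_impact = impact(reported_treatment, true_control)

print(f"true impact:   {true_impact:.0f}")
print(f"survey impact: {survey_impact:.0f}")
```

Under these assumptions the true earnings reduction is about $100 per quarter, but the survey-based estimate is about $300, so the apparent reduction in work effort is overstated. Reversing the incentive (overreporting by the treatment group) would bias a positive impact upward in the same way.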
Depending on the source of the administrative data, certain individuals in the research sample may not be included. 3 For example, if either employers or the evaluation organizations have incorrect social security numbers for some sample members, then there will be a failure to match UI administrative data with some members of the research sample. Employers may also fail to report some earnings to avoid the tax used to finance UI benefit payments. 4 Moreover, data maintained by state welfare or UI agencies usually exclude sample members who moved out of the state after they were randomly assigned. UI data also do not cover most individuals who work for the federal government, on small farms, for railroads, for selected nonprofits, at out-of-state jobs, in casual or irregular employment, or who are self-employed or independent contractors. Hotz and Scholz (2001, p. 303) suggest that these gaps in coverage account for at least 13% of all jobs. 5 Wallace and Haveman (2007, p. 738) indicate that 9% of the workers in Wisconsin are not covered by UI records.
Of course, surveys also miss individuals, usually because they cannot be located or refuse to be interviewed. This is related to various socioeconomic characteristics such as age, gender, marital status, household structure, education, and income (Groves & Couper, 2001). Nonresponse is especially likely if the evaluation follow-up period is lengthy. If nonresponse is disproportionately high among nonworkers or workers with low earnings, which is likely, this will bias mean survey-reported earnings upward. However, various techniques, such as paying respondents for submitting address cards when they move and for participating in interviews, are available to reduce the loss of survey respondents. 6
Missing observations in survey and administrative data must necessarily be treated differently when estimating the impacts of a social experiment. When members of a sample are missed by a survey, they obviously cannot be used in estimating impacts. If the nonrespondents are not random with respect to the treatment outcomes of interest (e.g., if in a training program experiment in which the program’s effect on earnings is a key impact, nonworkers are more likely to be nonrespondents than workers), then the impact estimates will be affected by “survey response bias.” When individuals are missing from administrative earnings records, such as UI or Social Security Administration (SSA) data, they are usually treated in evaluations as if they are nonworkers and, hence, have zero earnings. Many are actually nonworkers, of course, but others are simply missed for the reasons discussed earlier. This causes earnings to be understated and is one of several factors that cause impact estimates to be affected by “reporting error bias.”
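The two treatments of missing observations can be contrasted in a minimal sketch. The five sample members, the field names (`true`, `in_admin`, `responded`), and all values are hypothetical, and survey respondents are assumed to report their earnings accurately so that only the handling of missing cases differs.

```python
# Hypothetical research sample (illustrative values only).
# "in_admin": the person's earnings appear in the administrative records.
# "responded": the person completed the follow-up survey.
sample = [
    {"true": 0.0, "in_admin": False, "responded": False},    # nonworker, nonrespondent
    {"true": 0.0, "in_admin": False, "responded": True},     # nonworker, respondent
    {"true": 1000.0, "in_admin": True, "responded": True},   # covered job
    {"true": 2000.0, "in_admin": False, "responded": True},  # e.g., self-employed, not covered
    {"true": 3000.0, "in_admin": True, "responded": True},   # covered job
]


def true_mean(members):
    """Mean of true earnings over the full sample."""
    return sum(m["true"] for m in members) / len(members)


def admin_mean(members):
    """Administrative mean: anyone without a record is assigned zero earnings."""
    total = sum(m["true"] if m["in_admin"] else 0.0 for m in members)
    return total / len(members)


def survey_mean(members):
    """Survey mean: nonrespondents are dropped from the calculation."""
    reported = [m["true"] for m in members if m["responded"]]
    return sum(reported) / len(reported)


print(true_mean(sample), admin_mean(sample), survey_mean(sample))
```

In this example, assigning zeros to uncovered workers pulls the administrative mean below the true mean, while dropping a nonworking nonrespondent pushes the survey mean above it. Whether impact estimates are biased depends on whether these errors differ between the treatment and control groups.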
Administrative data usually cover a month, calendar quarter, or year. Random assignment, however, is likely to take place somewhere in the midst of this time period. Thus, it is difficult to line up the receipt of a treatment being tested by an experiment with the time period covered by the administrative data. For example, if an individual is randomly assigned in the middle of the time period, interpretation of the impacts estimated for the period becomes problematical. This is obviously more of a problem the longer the time period. Survey questions, in contrast, can be asked for specific dates either prior to or after receipt of the treatment. However, the responses may be subject to recall error about exactly when events took place, especially as the duration between the events and the survey lengthens.
Previous Comparisons of Survey and Administrative Data
Relatively few studies have used both survey and administrative data to evaluate social programs experimentally. These evaluations are examined in some detail later in this article. However, there are numerous studies that have matched survey and administrative data and compared the two (e.g., Abraham, Haltiwanger, Sandusky, & Spletzer, 2013; Baj & Trott, 1991; 2013; Bound & Krueger, 1991; Duncan & Hill, 1985; Kapteyn & Ypma, 2007; Kreiner, Lassen, & Soren, 2013; Mellow & Sider, 1983; Pedace & Bates, 2001; Rodgers, Brown, & Duncan, 1993; Wallace & Haveman, 2007). 7 Because these studies do not evaluate specific programs, but instead focus on comparisons of survey and administrative data, they do not directly consider whether the two types of data are likely to produce different impact estimates, the subject of this article. Still, they have findings that are germane to our analysis. These findings are briefly described in this section.
Studies that compare administrative- and survey-based earnings almost always find substantial differences between the two for substantial percentages of individual observations, although mean earnings are not necessarily very different (e.g., Abowd & Stinson, 2005; Bound & Krueger, 1991; Hotz & Scholz, 2001; Monti & Gathright, 2013; Rodgers et al., 1993; Wallace & Haveman, 2007). Wallace and Haveman (2007), who focus on welfare recipients in Wisconsin, find that less than 10% of the difference between administrative- and survey-based earnings in their comparison can be explained by differences in the type of earnings (e.g., tips or commissions) or type of job (irregular work or out-of-state work) or recall error from intervals without work prior to the survey. Bound and Krueger (1991) obtain similar findings.
One common finding in previous comparisons of administrative and survey data is a negative correlation between administrative-based earnings and the absolute difference between administrative- and survey-based earnings (Bound, Brown, Duncan, & Rodgers, 1994; Duncan & Hill, 1985; Hotz & Scholz, 2001; Pedace & Bates, 2001; Rodgers et al., 1993), suggesting that the difference between survey- and administrative-based earnings tends to be positive for persons with lower earnings and negative for those at the upper end of the earnings distribution. In addition, Abraham et al. (2013) find evidence that differences in employment status between the Current Population Survey and UI data are most likely for marginal workers and workers in marginal or nonstandard jobs. As will be seen in the third section, these findings have important implications for earnings impacts estimated with the two data sources because the target groups for most social programs that have been evaluated through random assignment tend to be at the lower end of the earnings distribution. However, as also indicated in the third section, the possibility that the size of the differences in measured earnings between the two data sources may differ between treatment and control groups is potentially even more important.
Smith (1997) found that a survey decomposition approach that asked about earnings on individual jobs resulted in higher reported earnings than an approach that asked directly for total annual earnings. Thus, reported survey-based earnings depend on the survey instrument used to collect them. The decomposition approach also resulted in earnings that varied more from those obtained from employer administrative records. Of course, which survey approach was the more accurate depends on whether the administrative-based earnings were themselves accurate or understated.
Baj, Trott, and Stevens (1991) explored the advantages and disadvantages of switching from telephone survey data to UI wage-record data for assessing performance and program impacts in the Job Training Partnership Act (JTPA). Similar to the findings discussed later regarding results from social experiments, they found evidence that there were differences in postprogram survey response rates related to employability. For example, survey response rates were lower for men (62%) than women (67.5%), for high school dropouts (60.8%) than high school graduates (66.4%), and for those unemployed when they left the program (49.6%) than those employed (70.2%). A later study by two of the same authors (Baj et al., 1992) reached similar conclusions. The 1992 study included some additional observations. The authors found that while in about 80% of the cases examined the survey and administrative data sources agreed on postprogram employment status, each source found roughly 10% of the observations to be employed when the other source characterized them as being unemployed. In an analysis of Illinois data, the authors found that the most common reasons for postprogram employment in the survey but not in the UI wage-record data were employment out of state, followed by self-employment.
Several studies have compared administrative- and survey-based regression estimates of the determinants of earnings, sometimes finding that they are similar (e.g., Bound & Krueger, 1991; Wallace & Haveman, 2007) and sometimes finding that they are rather different (e.g., Duncan & Hill, 1985). Findings that they are similar might suggest that estimates of the earnings impact of social programs would be insensitive to the data source used to obtain them. As will be seen in the following section, however, this does not seem to be the case.
Experiments That Have Compared Earnings Impacts Estimated With Administrative and Survey Data
Relatively few social experiments have used both administrative and survey data to estimate the same impacts, thereby allowing the estimates to be compared. In this section, we describe the findings from eight experiments that did permit such comparisons 8 and discuss how the evaluators treated differences in findings based on the two alternative data sources. Impacts on earnings were of critical interest in all eight experiments, and the comparisons focused on this impact and to a lesser extent the impact on employment status. We emphasize the earnings impacts. 9
Findings From Eight Previous Experiments
Table 1 lists the eight experiments we examine, briefly describes the treatment or treatments tested in each experiment, and compares earnings impacts estimated with survey data for each experiment with impacts estimated with administrative data. Many of the differences between impacts estimated with the two types of data are not statistically significant. Indeed, some of the estimated earnings impacts are themselves not significant. Still, the administratively based impact estimates for the two income maintenance experiments are consistently less negative (or more positive) than the survey-based impact estimates, while almost all the administratively based impact estimates for the remaining six experiments are less positive than the survey-based impact estimates. Moreover, these differences are often very substantial. The directions of these differences are consistent with those implied by the model presented in the fourth section.
Table 1. Social Experiments That Have Compared Survey- and Administrative-Based Earnings Estimates.
Note. a Standard errors are in parentheses. “UI” refers to employer-reported quarterly earnings data to state unemployment insurance offices. “SSA” refers to employer-reported annual earnings data to the federal Social Security Administration.
b Data are for the 10th quarter after random assignment. Adapted from Greenberg and Halsey (1983), table 3. Findings held when SSA data used instead of UI data (Robins & West, 1980).
c Impacts are averaged over the eight quarters covering approximately the middle 2 years of the experiment. Adapted from Greenberg, Moffitt, and Friedmann (1981), table 3.
d All quarters for which a sample member is not missing in either data set are pooled. The authors do not provide standard errors. *p < .1. **p < .05. ***p < .01. Adapted from Kornfeld and Bloom (1999), table 3.
e Impact estimates were computed for four quarters after application for unemployment insurance benefits for both the Pennsylvania and the New Jersey experiments. The estimated impacts of the different tested programs reached a peak or close to it in the second quarter, and these are presented in Table 1. Quarters in the survey data are defined relative to the UI benefit application date, while quarters in the administrative data are defined as full calendar quarters after the quarter of UI benefit application. Thus, the second quarter in the survey data actually occurs about a month and a half earlier on average than the second quarter in the administrative data. For this reason, the administrative-based impact estimates presented in Table 1 are an average of the first and second calendar quarter impact estimates. These two calendar quarters bracket the second quarter in the survey data and provide the most appropriate comparison with the impact estimates based on the survey data. Adapted from Corson, Decker, Dunstan, and Gordon (1989).
f Two different sources of administrative data were used to estimate earnings impacts, namely, annual social security earnings records and quarterly UI earnings records. The former overlapped with the survey data during three of the follow-up years, while the latter overlapped during only two calendar quarters. Impacts reported in the table that are based on the former are for 3 or 4 years after random assignment (i.e., for 1997 and 1998) or 1–3 years after enrollment in the Job Corps, while those based on the latter are for the 15th and 16th quarters after random assignment. The authors do not provide standard errors. *p < .1. **p < .05. ***p < .01. Adapted from Schochet, McConnell, and Burghardt (2003), tables 1 and 2.
g Impacts are for single mothers who participated in the New Deal for Lone Parents (NDLP) program and who were not employed at the time of random assignment. The earnings impact estimates from the survey are for the fifth year after random assignment, while the administrative data are for the 2008–2009 tax year, a period that partially, but not entirely, overlaps the fifth year after random assignment. (A tax year in the United Kingdom begins on April 6th and ends on April 5th of the following year.) The administrative database was assembled by the British Department for Work and Pensions from several different sources, was national in scope, and contained data on earnings, employment status, and the receipt of government transfer payments. The data on earnings were obtained from information submitted by employers to the Treasury for purposes of administering income taxes. The major limitation of the earnings data is that they excluded informal work, self-employment, and some low-paying jobs in the formal sector, especially those at which workers are employed for 15 or fewer hours a week. A comparison of earnings impacts from the survey and administrative data is only possible for one of the program’s three target groups. Adapted from Hendra et al. (2011), Tables 4.1 and 4.3.
h Impact estimates pertain to 22 quarters after random assignment. The evaluation report does not provide information on the statistical significance of the impact estimates. Adapted from Perez-Johnson, Moore, and Santillano (2011), appendix table 3.1.
How the Biases Were Treated by the Evaluators: Implications for Cost–Benefit Analysis
Differences between survey- and administrative-based earnings impacts were treated in various ways by the evaluators of the experiments listed in Table 1. One reason this is important is that their decisions on this matter had important implications for cost–benefit analyses of the tested programs. These topics are discussed next for each experiment listed in Table 1.
The New Jersey and Pennsylvania reemployment experiments
Three of the four authors of the final reports of the New Jersey and Pennsylvania reemployment experiments were the same, the two reports were separated by only a little over 2 years, and impacts were estimated with both survey and UI administrative data in both evaluations. Nevertheless, the report for the experiment in New Jersey emphasizes findings based on the survey data, while the report for the experiment in Pennsylvania focuses on findings that rely on the UI data. The decision for New Jersey “was based primarily on the fact that, because the [UI] wage records data applied to calendar quarters, it was not possible to use the wage records data to focus on impacts that occurred immediately following the date of claim” when most of the impacts seemed likely to happen (Corson, Decker, Dunstan, & Gordon, 1989, p. 365). The emphasis on the administrative-based findings for Pennsylvania was because of concern over response bias in the survey data, although as will be discussed in the fifth section, this bias is likely to be small.
Although the evaluators made different choices in the New Jersey and Pennsylvania reemployment experiments concerning whether to emphasize findings based on survey data or the UI administrative data, cost–benefit analyses were conducted in both experiments using both employer-reported UI administrative data and survey data. Thus, it is possible to examine the implications of these choices. These implications are quite dramatic, as can be seen in Table 2, where the net benefits for society as a whole are shown, and the only difference between the two sets of estimates is the data source used to compute the earnings impacts that were incorporated into cost–benefit studies. Net benefits are always larger when based on earnings impacts estimated with the survey data.
Table 2. Estimates of the Social Net Benefits in New Jersey and Pennsylvania Demonstrations Computed With Alternative Data Sources.
Note. Adapted from various tables in Corson et al., 1989 and 1991. The treatments for New Jersey were (1) JSA only; (2) JSA plus training or relocation assistance; and (3) JSA plus reemployment bonuses. The treatments for Pennsylvania were (1) low bonus, short qualification period; (2) low bonus, long qualification period; (3) high bonus, short qualification period; (4) high bonus, long qualification period; and (5) initially high bonus that declines, long qualification period.
The Individual Training Account experiment
After estimating earnings impacts with both survey and UI administrative data, the evaluators of the Individual Training Account (ITA) experiment argued that because “the UI data excludes a number of types of employment,” the survey-based impact estimates are the more reliable ones (Perez-Johnson, Moore, & Santillano, 2011, p. 91). Consequently, that is the result they emphasize in the final report and use exclusively in a cost–benefit analysis of the ITA experiment. The cost–benefit analysis based on the survey data found that switching from a guided choice model, the experimentally tested model most closely approximating practices in the offices administering the Workforce Investment Act, to a structured choice model would result in social benefits of over US$40,000 per enrollee over the worker’s lifetime (Perez-Johnson et al., 2011), a very large amount for a social program, particularly given that the training involved generally costs under US$10,000 per participant. It seems evident from the impacts reported in Table 1 that cost–benefit findings based instead on the UI data would have been far less positive.
The National Job Corps study
The random assignment evaluation of the Job Corps used administrative data from both quarterly UI wage records and annual social security wage records as well as survey data. The same report that investigated the survey- and administrative-based earnings impact differences also presented findings from a cost–benefit analysis of the Job Corps (Schochet, McConnell, & Burghardt, 2003). In conducting this analysis, the evaluators relied on the survey earnings data, but they attempted to adjust for response bias by multiplying survey earnings by the ratio of average social security earnings for the full sample to average social security earnings for survey respondents, a ratio that is usually less than one. This was done separately for the treatment and control groups. In addition, they reduced survey earnings by 10% to account for the overreporting of hours and also assumed that, after the period covered by the survey used in the evaluation, the earnings impacts decayed by 68.3% per year, the rate implied by earnings impacts estimated with the social security data used in the evaluation. The rationale for using the survey data findings for the cost–benefit analysis, rather than the administrative data, was that they “include informal earnings and other sources of income not reported on the [administrative] data” (Schochet, McConnell, & Burghardt, 2006, p. 46). The rather small 10% adjustment in the survey earnings is somewhat surprising because, as will be seen in the fourth and fifth sections, the observed difference between the survey- and administrative-based earnings is much larger than 10%, and much of this difference appears to be due to biases in the survey data, rather than in the administrative data, especially in the case of the social security earnings records. Still, the cost–benefit analysis found that program costs greatly exceeded program benefits from the perspective of society as a whole. This gap would only have been greater with a larger adjustment.
Indeed, a more recent study that is based on a slightly larger adjustment for nonresponse bias and overreporting (and also a somewhat larger decay rate) has very similar findings (Schochet et al., 2006). Interestingly, a cost–benefit analysis that predates the 2003 study, and that was conducted before the 68% annual decay in earnings impacts implied by the social security data could be observed, assumed there was no decay in earnings impacts and concluded that the net benefits to society of the Job Corps were positive and substantial (McConnell & Glazerman, 2001). Thus, although the evaluators chose to use impacts estimated with survey data for their cost–benefit study, administrative data played a critical role in the ultimate findings.
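The adjustments to survey earnings described above for the 2003 Job Corps cost–benefit analysis can be sketched as follows. This is only an illustration under stated assumptions: the function names and all dollar figures are hypothetical, the two adjustments are assumed to be applied multiplicatively, and the 68.3% annual decay is interpreted as geometric. The evaluators' actual computations may differ in detail.

```python
def adjust_survey_earnings(survey_mean, ssa_mean_full_sample,
                           ssa_mean_respondents, overreport_cut=0.10):
    """Adjust mean survey earnings for response bias and overreporting.

    The response-bias factor is the ratio of average social security (SSA)
    earnings for the full sample to average SSA earnings for survey
    respondents. Earnings are then reduced by `overreport_cut` (10% here).
    Assumption: both adjustments are applied multiplicatively.
    """
    response_ratio = ssa_mean_full_sample / ssa_mean_respondents
    return survey_mean * response_ratio * (1.0 - overreport_cut)


def project_impact(last_observed_impact, years_after, annual_decay=0.683):
    """Project an earnings impact forward, assuming geometric annual decay."""
    return last_observed_impact * (1.0 - annual_decay) ** years_after


# Hypothetical annual figures in dollars; the adjustment is applied
# separately to the treatment and control groups.
treatment_adj = adjust_survey_earnings(10000.0, 4500.0, 5000.0)
control_adj = adjust_survey_earnings(9000.0, 4400.0, 5000.0)
adjusted_impact = treatment_adj - control_adj

print(f"adjusted impact: {adjusted_impact:.0f}")
for year in (1, 2, 3):
    print(f"year +{year}: {project_impact(adjusted_impact, year):.0f}")
```

With a decay rate this steep, less than a third of the impact remains after one year, which helps explain why the assumed decay matters so much for the net benefit estimates.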
The U.K. Employment Retention and Advancement Demonstration
As will be discussed in the fifth section, there is evidence from administrative data that earnings impacts estimated with survey data collected as part of the Employment Retention and Advancement (ERA) Demonstration were subject to severe response bias. Attempts made to correct this bias through various weighting schemes were unsuccessful. Thus, the ERA final report relied on the administrative data to the extent possible, 10 although earlier reports had put more emphasis on findings based on the survey data than on findings based on the administrative data. Earnings estimates from the administrative data were also used in the cost–benefit analysis, resulting in considerably smaller estimates of net benefits than would have resulted had they been based on the survey data instead.
The National JTPA study
The JTPA evaluators did not view either administrative or survey data as superior—as discussed previously, each has advantages and disadvantages. Thus, except for the small subgroup of male youths with arrest records, they combined the data from the two sources to maximize the research sample available for the 30-month follow-up period used in the analysis. This was necessary because, on one hand, the employer-reported earnings data were only available for 12 of the 16 research sites and, on the other hand, as a cost-saving measure, the second survey, which covered the later part of the 30-month follow-up period, was limited to only a random subsample of the full research sample (fewer than 30% of the adults and a little over 60% of the youths). Thus, the survey data were exclusively used in the four sites for which the administrative data were unavailable, and the administrative data were used in the 12 remaining sites for observations that were not included in the second survey and observations that did not respond to the survey (Orr, Bloom, Bell, Doolittle, & Lin, 1996, Appendix B). To make the data from the two sources as compatible as possible, several adjustments were made to each. 11
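The rule for combining the two data sources can be summarized in a short sketch. The function and argument names are hypothetical, and the final branch (using survey data for second-survey respondents in sites that also had administrative data) is an assumption inferred from the description above rather than something the text states explicitly.

```python
def earnings_source(site_has_admin_data, in_second_survey, responded_to_survey):
    """Return which data source supplies a sample member's earnings.

    A simplified reading of the JTPA combination rule described above.
    """
    if not site_has_admin_data:
        # The four sites without employer-reported data relied on the survey alone.
        return "survey"
    if not in_second_survey or not responded_to_survey:
        # In the 12 remaining sites, administrative data filled in for members
        # outside the second-survey subsample and for survey nonrespondents.
        return "admin"
    # Assumption: respondents in sites with both sources keep their survey data.
    return "survey"


print(earnings_source(False, True, True))   # survey
print(earnings_source(True, False, False))  # admin
print(earnings_source(True, True, False))   # admin
print(earnings_source(True, True, True))    # survey
```

A rule of this kind maximizes the usable sample, but it also means that the measured outcome for a given sample member depends on which source happened to be available.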
The Seattle–Denver Income Maintenance Experiment
Even after estimates of program impacts on work effort based on administrative data became available for the Seattle–Denver Income Maintenance Experiment and suggested possible underreporting in the survey, the evaluators continued to emphasize impacts based on the interview data, typically simply cautioning that the findings may be subject to bias. For example, the final report of the Seattle–Denver experiment (SRI International, 1983, p. 112) states that “it is probably fair to say that the evidence suggests a possible bias in the [Seattle-Denver interview] data … Because the estimates presented here do not adjust for underreporting, they should be interpreted as reflecting potential reporting errors as well as true reductions in labor supply.”
There are a number of reasons why the Seattle–Denver evaluators continued to focus on findings based on the interview data. First, there was concern about certain assumptions made in conducting the analysis of reporting errors. Second, the administrative data could be used to estimate impacts on employment status and earnings, but not hours of work, one of the key outcomes of interest. Third, although earnings reductions due to reductions in labor supply are more costly to the overall economy than the underreporting of earnings, both would be similarly costly to taxpayers. Finally, most of the research based on the administrative data came late in the evaluations of the Seattle–Denver Experiment. An enormous amount of previous work based on the interview data would have been rendered moot, had a decision been made to rely on administrative data rather than the survey data.
Mechanisms Causing Administrative-Based and Survey-Based Impacts on Earnings to Differ
There are two key causes of differences between impacts estimated with administrative data and survey data, namely, reporting errors and nonresponses to surveys. These two sources of differences in impact estimates are discussed sequentially. We first examine reporting errors, assuming that there is no nonresponse to the survey. We then discuss the implications of survey nonresponse, assuming no reporting errors. These assumptions are relaxed near the end of this section.
Reporting Errors
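Assuming 100% response to the survey, the program impact on earnings according to the survey data (S) and the program impact according to the administrative data (A) would, respectively, equal:
$$S = O_T \bar{Y}_T - O_C \bar{Y}_C \tag{1}$$
$$A = U_T \bar{Y}_T - U_C \bar{Y}_C \tag{2}$$
where T is the treatment group, C is the control group, j = T or C, $\bar{Y}_j$ denotes the true mean earnings of group j, O j (≥ 1) is the proportional overstatement of survey-based earnings for group j, and U j (≤ 1) is the proportional understatement of administrative-based earnings for group j.
Equation 1 implies that if the survey data are subject to reporting errors, the resulting bias to the survey-based impact estimates is
$$S - (\bar{Y}_T - \bar{Y}_C) = (O_T - 1)\,\bar{Y}_T - (O_C - 1)\,\bar{Y}_C$$
Together, Equations 1 and 2 imply that the absolute difference between the survey- and administrative-based earnings impact is:
$$S - A = (O_T - U_T)\,\bar{Y}_T - (O_C - U_C)\,\bar{Y}_C \tag{3}$$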
If there is no misreporting, then O j and U j will equal one for both the treatment and the control groups and S and A will both equal the true impact.
“Balanced reporting errors” occur when a program outcome such as earnings is overstated or understated in either the survey data or the administrative data, but, as implied by the word “balanced,” by the same proportion for the treatment group and the control group. That is, O T = O C = O j and U T = U C = U j and, hence, the gap between S and A equals (O j − U j ) times the true impact.
“Unbalanced reporting errors” occur when misreporting differs between the treatment and the control groups. That is, O T ≠ O C or U T ≠ U C . Empirical evidence that unbalanced reporting exists would be implied if O T /U T ≠ O C /U C .
If O T > O C and U T = U C , then O T /U T > O C /U C , (O T − U T ) > (O C − U C ), and, as implied by Equation 3, the extent to which unbalanced reporting increases the gap between the survey- and administrative-based impacts will depend on the degree to which O T exceeds O C .
Because, contrary to our maintained assumption, no survey actually has a 100% response rate, the usual procedure used to determine the importance of reporting error is to compare survey-based impact estimates with those from administrative data, that is, S with A.
Table 3 presents parameter values obtained from evaluation reports for the social experiments listed in Table 1 that estimated program impacts with both administrative and survey data. The first six columns of the table pertain to reporting errors and the last four, which will be discussed in the following subsection, to survey nonresponse. The first and fourth columns, which are based on survey respondents, respectively, show the values of the O j /U j ratio for the treatment and control groups. Based on the assumption that the minimum value of O j is 1 and the maximum value of U j is also 1, the remaining columns show the maximum values of O j and the minimum values of U j for the two groups. Note that if the observed O j /U j ratio is below 1, assuming that U j is 1 implies that the maximum value of O j is less than 1, but assuming that O j is 1 implies that the minimum value of U j is more than 1. This is obviously untenable. Thus, in the two instances in Table 3 in which the observed O j /U j ratio is slightly less than 1, we assume that the maximum value of O j is 1 and the minimum value of U j is also 1.
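The bounding rule described in this paragraph can be sketched in a few lines of Python. This is only an illustration: the function name `bound_O_and_U` is hypothetical, and the ratio of 1.90 used in the example is the Job Corps treatment-group value cited in the text.

```python
def bound_O_and_U(ratio):
    """Return (max O_j, min U_j) implied by an observed O_j/U_j ratio.

    Assumes O_j >= 1 (survey overstatement) and U_j <= 1 (administrative
    understatement), as in the text.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if ratio < 1:
        # A ratio below 1 cannot satisfy both constraints, so both
        # bounds are set to 1, following the convention in the text.
        return 1.0, 1.0
    max_O = ratio        # reached when U_j = 1
    min_U = 1.0 / ratio  # reached when O_j = 1
    return max_O, min_U


max_O, min_U = bound_O_and_U(1.90)
print(max_O, round(min_U, 3))
```

With U j fixed at 1, the largest possible value of O j − U j for this ratio is 0.9, which matches the upper limit discussed for the Job Corps experiment.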
Table 3. Key Parameters From Experiments Using Both Administrative and Survey Data. a
Note. Adapted from sources reported in Table 1. O j /U j = Ratio of survey-based earnings to administrative-based earnings for group j; Max O j = Maximum value of the proportional overstatement of survey-based earnings for group j; Min U j = Minimum value of the proportional understatement of administrative-based earnings for group j; R j = Survey response rate for group j; f j = Ratio of the earnings of survey nonresponders to those of responders for group j; NA = Not Available; JSA = job-search assistance; ERA = employment retention and advancement; WTC = Working Tax Credit; ITA = Individual Training Account; NDLP = New Deal for Lone Parents.
a Unless otherwise indicated, the administrative data were obtained from unemployment insurance records.
b There were three treatment groups but only one control group in this experiment.
c Administrative data based on tax records.
d Estimates based on only survey respondents.
The O j /U j ratios are generally consistent with the assumptions made earlier. That is, most are well above one, implying that earnings were overreported in the survey data and/or underreported in the administrative data. 16 In addition, the ratios reported in Table 3 for the treatment groups tend to be larger than those for the control groups, implying that O T exceeds O C. The only exception occurs in the National JTPA experiment, where the ratios are similar for four of the five subgroups and larger for the control groups for a small subsample of 386 male youths with an arrest record at the time of random assignment (Kornfeld & Bloom, 1999).
Based on the values presented in Table 3, Figure 1 illustrates the potential effect of misreporting on the gap between earnings impacts estimated with administrative data and those estimated with survey data. For the illustrative purposes of the figure, it is assumed that the true annual earnings of the treatment group in a hypothetical program evaluated through random assignment were US$11,000 and the true annual earnings of the control group were US$10,000. Hence, the true impact of the program was US$1,000.

Figure 1. Absolute gap between survey and administrative data earnings impacts at alternative values for O and U. Note. The figure is based on a hypothetical program in which true treatment group earnings are US$11,000 and true control group earnings are US$10,000, resulting in a true impact of US$1,000. It is assumed that the survey response rate (R) is 100%. The horizontal axis shows the difference between the proportional overstatement of survey-based earnings (O ≥ 1) and the proportional understatement of administrative-based earnings (U ≤ 1). The gray-shaded area represents the likely range of this value according to Table 3, although it could be larger. The three positively sloped lines in the figure allow for differences between the treatment and the control groups in the overstatement of survey-based earnings (O T − O C ), with the highest line resulting from the largest plausible difference according to Table 3 and the lowest line resulting from the smallest plausible difference. The legend in the figure indicates the assumed difference on which each line is based. The vertical axis indicates the absolute difference between the impacts estimated with the two data sources, i.e., the “$gap” resulting from alternative values for O − U and O T − O C .
As discussed previously, the size of the gap between the earnings impacts estimated with the two data sources depends on the difference between O j and U j . The vertical axis in Figure 1 shows the size of this gap for each of the values for the difference between O j and U j that appear on the horizontal axis, which are allowed to vary over a range between 0 and 1. The value of the gap was computed using Equation 3.
The smallest possible difference between O j and U j that is found in Table 3 is 0, which occurred in the New Jersey Reemployment experiment. The largest possible difference between O j and U j that is implied by Table 3 is 0.9, which would have transpired for the treatment group in the Job Corps experiment if U j equaled 1 and, thus, O j was at its maximum value of 1.90. If U j were smaller than 1, the value of O j would have to shrink in order to maintain the observed O j /U j ratio, and the difference would be smaller. Moreover, most of the other maximum values for O j in Table 3 imply that the difference between O j and U j is less than 0.5. It appears likely, therefore, that the left side of Figure 1, which is shaded, is its more relevant part.
As discussed previously, the extent to which unbalanced reporting errors increase the gap between the survey- and the administrative-based impacts depends on the degree to which O T exceeds O C . A comparison of the maximum values for O j for the treatment and control groups in Table 3, which occur when U T = U C = 1, implies that in some experiments, misreporting was more or less balanced, and when it was not balanced, O T − O C was unlikely to have been much larger than 0.1. If O j was not at its maximum, this value would be smaller.
The lowest of the three lines in Figure 1 is based on the assumption that reporting errors are balanced, that is, O T − O C = 0. If O j − U j is less than 0.5 for both the treatment and control groups (and we suggested earlier that it probably is in most evaluations), then this implies that the difference in earnings impacts estimated with survey and administrative data would be under US$500, relative to a true impact of US$1,000. The other two lines in Figure 1 allow for possible unbalanced reporting. The highest line is based on the assumption that O T − O C = 0.1, which is near the upper limit suggested by Table 3. The middle line is based on the assumption that O T − O C = 0.05. 17 Figure 1 suggests that the gap between the survey- and administrative-based estimated impacts is much larger if reporting errors are unbalanced than if they are balanced. For example, if O T − U T = O C − U C = 0.4, the gap would equal US$400; but if O T − U T = 0.425 and O C − U C = 0.375, then the gap would equal US$925. This last amount is almost as large as the true impact of US$1,000. If the assumption that U T = U C is valid and, hence, unbalanced reporting errors are solely due to O T > O C , then this would suggest that the gap between survey- and administrative-based impact estimates is largely attributable to reporting errors in survey earnings.
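The arithmetic behind the US$400 and US$925 figures can be checked with a short Python sketch. It assumes the gap takes the form (O T − U T ) times true treatment group earnings minus (O C − U C ) times true control group earnings, which reproduces both figures. The function and variable names are illustrative only, and the individual O and U values are arbitrary choices that yield the stated differences.

```python
def impact_gap(y_t, y_c, o_t, u_t, o_c, u_c):
    """Survey-based minus administrative-based impact (S - A)."""
    survey_impact = o_t * y_t - o_c * y_c   # S: overstated earnings
    admin_impact = u_t * y_t - u_c * y_c    # A: understated earnings
    return survey_impact - admin_impact


# True earnings from the hypothetical program in the text.
Y_T, Y_C = 11_000, 10_000

# Balanced errors: O - U = 0.4 in both groups (O = 1.2, U = 0.8 assumed).
balanced = impact_gap(Y_T, Y_C, 1.2, 0.8, 1.2, 0.8)

# Unbalanced errors: O_T - U_T = 0.425 and O_C - U_C = 0.375 (U = 0.8 assumed).
unbalanced = impact_gap(Y_T, Y_C, 1.225, 0.8, 1.175, 0.8)

print(round(balanced), round(unbalanced))  # 400 925
```

A difference of only 0.05 between O T and O C more than doubles the gap, because each group's error rate is applied to the level of its earnings rather than to the much smaller impact.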
Survey Nonresponse
Nonresponse to a survey can cause response bias in impact estimates if nonresponders differ from responders, for example, if they have lower postprogram earnings. This tends to be the case because nonresponders are both more mobile and more difficult to locate for interviews 18 and, as a result, they tend to have lower earnings on average. The longer the time between random assignment and the survey, the more difficult individuals are to locate for interviews. Hence, response bias is likely to be larger with longer follow-up periods.
When individuals are missing in administrative data, unlike the situation with survey data, they are not dropped from the analysis but are instead assigned a 0 value for earnings. Thus, any bias this may cause shows up as a reporting error, not as a response bias. Consequently, response bias can occur with survey data but not with administrative data.
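Assuming the absence of reporting errors, the true average earnings for the full sample (nonresponders as well as responders), $\bar{Y}_j$, equal θ j times the average earnings of survey responders, where R j is the survey response rate for group j and f j is the ratio of the earnings of nonresponders to those of responders. The survey-based impact, which is computed from responders only, is therefore
$$S = \frac{\bar{Y}_T}{\theta_T} - \frac{\bar{Y}_C}{\theta_C} \tag{4}$$
where θ T = R T + f T (1 − R T ) and θ C = R C + f C (1 − R C ).
Equation 4 implies that response bias in the survey data is equal to
$$S - (\bar{Y}_T - \bar{Y}_C) = \frac{1 - \theta_T}{\theta_T}\,\bar{Y}_T - \frac{1 - \theta_C}{\theta_C}\,\bar{Y}_C \tag{5}$$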
“Balanced nonresponse” to surveys occurs when both R T = R C = R j and f T = f C = f j . Thus, θ T = θ C = θ j and Equation 5 reduces to (1 − θ j )/θ j times the true impact.
“Unbalanced nonresponse” occurs when either R T ≠ R C or f T ≠ f C . Survey response rates appear likely to be larger for the treatment group than for the control group. After all, the segment of the treatment group that actively participates in the program is probably more readily located and likely feels more committed to it (especially if the tested program involved positive financial incentives). Consequently, a typical member of the treatment group may be more likely to respond to a survey related to the program than the average member of the control group. Moreover, if the program provides financial incentives that influence behavior by rewarding work, and responders tend to be those who work more as a result of the program, responders in the treatment group will have higher average earnings than responders in the control group, and nonresponders will have lower earnings than their control group counterparts. 19 This will also be the case if the program is successful in improving the human capital and, hence, the earnings of those who actively participate.
As can be seen in Equation 4, if R T > R C , the denominator of the first term to the right of the equal sign will be larger than that of the second term and response bias will be smaller than if R T = R C . In contrast, if f T < f C , the denominator of the first term to the right of the equal sign will be smaller than that of the second term and response bias will be larger than if f T = f C .
A procedure that can be and has been used to examine response bias in social experiments that have access to both survey and administrative data is to compare the earnings impact for the entire sample with the impact for only those who respond to the survey. Only the administrative data can be used for this comparison. Response bias is implied if the impact for responders is larger than the impact for the full sample, which comprises both responders and nonresponders. One problem with this approach is that nonresponders may have both lower true earnings than responders and earnings that are more understated because of reporting biases. For example, at least in the case of UI data, nonresponders are more likely than responders to work at out-of-state jobs or in casual or irregular employment and thus have their earnings missed. This will tend to result in the difference between the earnings impact for the entire sample and that for only responders being larger than it would be in the absence of reporting errors. It is difficult to assess the importance of this problem.
The last four columns of Table 3 present the values of R j and f j for the treatment and control groups for the subset of the listed experiments for which they were available. The values for f j were computed from the administrative data.
The survey response rates vary from a high of 82% to a low of 60%. It was hypothesized above that these rates would be higher for the treatment than the control group and, on balance, they are. However, there are several instances in which they are equal, and they are actually slightly larger for control youth in the JTPA experiment. And even when they are in the hypothesized direction, the differences are small, no more than five percentage points.
The information needed to compute f j is not available in most of the evaluation reports on the experiments listed in Table 3. When it is available, f j ranges from a low of 0.67 to a high of 0.99. For the Job Corps and U.K. ERA experiments, the values for f j are, as hypothesized, higher for controls than for the treatment group. Moreover, the differences are substantial, ranging between .15 and .18. However, f j is slightly larger for the treatment groups than for the controls in the New Jersey Reemployment experiment.
Figure 2, which is based on the values for R j and f j in Table 3, illustrates the potential effect of survey nonresponse on the gap between earnings impacts estimated with administrative data and those estimated with survey data. Because the figure assumes the absence of reporting errors, this gap can be viewed as identical to response bias. The vertical axis in Figure 2 indicates the size of the gap for survey response rates between 60% and 100%. These rates appear on the horizontal axis. Table 3 suggests these rates are likely to range between 60% and 80%. Thus, it is the left shaded half of the figure that is most pertinent.

Figure 2. Absolute gap between survey and administrative data earnings impacts at alternative values for R and f. Note. The figure is based on a hypothetical program in which true treatment group earnings are US$11,000 and true control group earnings are US$10,000, resulting in a true impact of US$1,000. It is assumed that there is no misreporting (O j = 1 and U j = 1). The horizontal axis shows the survey response rate (R j ), which in accordance with Table 3 is assumed to be between 60% and 100%. The gray-shaded area represents the likely range of R j according to Table 3. The negatively sloped curves in the figure allow for alternative values of the ratio of the earnings of survey nonresponders to those of responders (f j ), which may differ between the treatment and the control groups. The legend in the figure indicates the assumed values of f j , which are all within the plausible range implied by Table 3. As indicated by the legend, one curve also allows R j to differ between the treatment and control groups. This difference is slightly larger than any of those shown in Table 3. The vertical axis indicates the absolute difference between the impacts estimated with the two data sources, i.e., the “$gap” resulting from alternative values for f j and R j .
The lowest of the curves in Figure 2 depicts a situation in which nonresponse is balanced and f j = .9, which Table 3 suggests is relatively high. As a result, the gap between the survey- and administrative-based earnings impact estimates is small relative to the true program impact of US$1,000, even when the survey response rate is low. As the second lowest curve shows, however, as long as nonresponse remains balanced, the gap does not become much larger when f j = .7, a value that is close to the smallest reported in Table 3. It is only when nonresponse is unbalanced that the gap becomes relatively large. For example, the highest of the curves in Figure 2 represents a situation in which f j is not only low but also considerably smaller for the treatment group than the control group (.6 vs. .8). Moreover, the survey response rates for the two groups are assumed to be equal. In this situation, the gap would be about the same size as the true program impact of US$1,000 if the survey response rate was also low. However, as indicated by the second highest curve in Figure 2, if the survey response rate were 5 percentage points higher for the treatment group than the control group, which is slightly larger than any of the differences shown in Table 3, the gap would diminish by about US$200.
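The scenarios described for Figure 2 can be explored with a small Python sketch. It assumes that full-sample mean earnings equal θ j = R j + f j (1 − R j ) times responder mean earnings, so that a survey covering responders only yields an impact of true treatment earnings divided by θ T minus true control earnings divided by θ C . The function names are hypothetical, the 60% response rate is simply the low end of the range discussed in the text, and the results are not the exact values plotted in the figure.

```python
def theta(r, f):
    """Full-sample mean earnings as a fraction of the responder mean."""
    return r + f * (1.0 - r)


def response_bias(y_t, y_c, r_t, f_t, r_c, f_c):
    """Responder-only impact minus the true impact, with no misreporting."""
    responder_impact = y_t / theta(r_t, f_t) - y_c / theta(r_c, f_c)
    return responder_impact - (y_t - y_c)


# True earnings from the hypothetical program in the text.
Y_T, Y_C = 11_000, 10_000

# Each scenario is (R_T, f_T, R_C, f_C).
scenarios = {
    "balanced, f = .9": (0.60, 0.9, 0.60, 0.9),
    "balanced, f = .7": (0.60, 0.7, 0.60, 0.7),
    "unbalanced f (.6 vs .8)": (0.60, 0.6, 0.60, 0.8),
    "unbalanced f, R_T 5 points higher": (0.65, 0.6, 0.60, 0.8),
}

results = {
    label: response_bias(Y_T, Y_C, *params)
    for label, params in scenarios.items()
}
for label, bias in results.items():
    print(f"{label}: US${bias:,.0f}")
```

Under these assumptions the two balanced cases stay well below the true impact of US$1,000, the unbalanced case is of roughly the same size as the true impact, and a higher response rate for the treatment group narrows the gap.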
Combining Reporting Errors and Survey Nonresponse
So far, we have considered reporting error and survey nonresponse separately, but actually, of course, they will exist in combination. Specifically, it is only survey responders who can overstate their earnings. Thus, it is what respondents report that is subject to survey reporting bias. Consequently, the estimated survey-based impact on earnings, S, becomes
$$S = \frac{O_T\,\bar{Y}_T}{\theta_T} - \frac{O_C\,\bar{Y}_C}{\theta_C}$$
where $\bar{Y}_j$ is the true mean earnings of group j and θ j = R j + f j (1 − R j ).
Figure 3, which is identical in format to Figure 1, illustrates some possible combinations of errors caused by misreporting and survey nonresponse. Like Figure 1, the left-side shaded area displays the more realistic possibilities. The lowest curve in Figure 3 is identical to the lowest curve in Figure 1 and shows the gap between survey- and administrative-based earnings impacts in the absence of survey nonresponse and unbalanced reporting errors. Because the gap would only slightly increase if balanced survey nonresponse occurs, a curve is not shown for this possibility. However, as can be seen by the middle curve in the figure, even if reporting errors remain balanced, the gap greatly increases if unbalanced survey nonresponse occurs, and, as a result, f j is substantially smaller for the treatment group than for the control group. As shown by comparing the middle with the highest curves in Figure 3, another sizable jump in the gap results if reporting errors are also unbalanced. Indeed, if both unbalanced reporting errors and unbalanced survey nonresponse occur, the gap between survey- and administrative-based earnings impacts may be larger than the true impact of US$1,000.

Figure 3. Absolute gap between survey and administrative data earnings impacts at alternative values for R, f, O, and U. Note. The figure is based on a hypothetical program in which true treatment group earnings are US$11,000 and true control group earnings are US$10,000, resulting in a true impact of US$1,000. It is assumed that both misreporting and survey nonresponse can occur. The horizontal axis shows the difference between the overstatement of survey-based earnings (O) and the understatement of administrative-based earnings (U). The gray-shaded area represents the likely range of this value according to Table 3, although it could be larger. The three positively sloped lines in the figure allow for differences between the treatment and the control groups in the overstatement of survey-based earnings (O T − O C ), for different survey response rates (R) and for alternative values of the ratio of the earnings of survey nonresponders to those of responders (f). f differs between the treatment and the control groups in two of the lines. The assumed values for O T − O C , R, and f are shown in the legend and are consistent with those suggested by Table 3. The vertical axis indicates the absolute difference between the impacts estimated with the two data sources, i.e., the “$gap” resulting from alternative values for O − U, O T − O C , R, and f.
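To see how the two sources of error interact, the sketch below combines them in Python. It assumes that responders report O j times their true earnings, so the survey-based impact is O T times true treatment earnings divided by θ T minus O C times true control earnings divided by θ C , while the administrative-based impact is U T times true treatment earnings minus U C times true control earnings. The names and parameter values are hypothetical, chosen within the ranges discussed for Table 3; they are not the values behind Figure 3.

```python
def theta(r, f):
    """Full-sample mean earnings as a fraction of the responder mean."""
    return r + f * (1.0 - r)


def combined_gap(y_t, y_c, o_t, u_t, o_c, u_c, r_t, f_t, r_c, f_c):
    """Survey-based minus administrative-based impact with both error types."""
    survey_impact = o_t * y_t / theta(r_t, f_t) - o_c * y_c / theta(r_c, f_c)
    admin_impact = u_t * y_t - u_c * y_c
    return survey_impact - admin_impact


# True earnings from the hypothetical program in the text.
Y_T, Y_C = 11_000, 10_000

# Balanced reporting errors (O - U = 0.4) and full survey response.
base = combined_gap(Y_T, Y_C, 1.2, 0.8, 1.2, 0.8, 1.0, 1.0, 1.0, 1.0)

# Same reporting errors plus unbalanced nonresponse (R = .7, f_T = .6, f_C = .8).
with_nonresponse = combined_gap(Y_T, Y_C, 1.2, 0.8, 1.2, 0.8, 0.7, 0.6, 0.7, 0.8)

# Unbalanced nonresponse plus unbalanced reporting errors (O_T - O_C = 0.05).
with_both = combined_gap(Y_T, Y_C, 1.225, 0.8, 1.175, 0.8, 0.7, 0.6, 0.7, 0.8)

for label, gap in [
    ("balanced reporting errors only", base),
    ("plus unbalanced nonresponse", with_nonresponse),
    ("plus unbalanced reporting errors", with_both),
]:
    print(f"{label}: US${gap:,.0f}")
```

The ordering of the three results mirrors the ordering of the three curves described in the text: each additional source of imbalance widens the gap, and with both present the gap exceeds the true impact of US$1,000.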
Adjusting for Biases in the Survey Data
As seen in the third section, the usual response to differences between survey- and administrative-based impact estimates is simply to select one set of estimates over the other. For example, one can argue in favor of impacts estimated with administrative data because, unlike survey-based impact estimates, they are not subject to response bias and are less likely to be subject to unbalanced reporting errors. Administrative data have an important disadvantage in estimating earnings impacts, however, because individuals who are missing from administrative data are usually assumed not to be working. This is an important potential source of balanced reporting errors, especially in the case of UI data.
If impacts estimated with survey data are preferred to those estimated with administrative data, the former can be adjusted for nonresponse bias and unbalanced reporting errors (although not balanced reporting errors) by using the following four-step procedure, which is based on the framework developed in this section and is illustrated in the fifth section for the Job Corps experiment: (1) Multiply the survey-based average earnings of responders in each group by θ j to adjust for survey nonresponse. (2) Using the new values for survey-based earnings, recompute O T /U T and O C /U C . (3) Assuming that the administrative data are not subject to unbalanced underreporting and thus O T /U T exceeds O C /U C only because O T is larger than O C , 21 multiply the adjusted survey-based earnings of the treatment group by (O C /U C )/(O T /U T ). (4) Recompute the survey-based impact using the adjusted estimate of treatment group earnings.
The four-step procedure outlined above requires that administrative data, as well as survey data, be available, but it only uses available information from these two sources. However, it does not eliminate biases resulting from balanced reporting errors. It also relies on a fairly strong assumption that the administrative data are not subject to unbalanced reporting errors.
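One possible reading of the four-step procedure is sketched below in Python. It is an interpretation rather than the authors' implementation: it assumes that step 1 rescales responder means by θ j = R j + f j (1 − R j ), that the ratios in step 2 are taken against full-sample administrative earnings, and that step 3 scales treatment group earnings by the control-to-treatment ratio of O j /U j . All names and input values are hypothetical, and the inputs are generated from known "true" values so that the result can be checked.

```python
def theta(r, f):
    """Full-sample mean earnings as a fraction of the responder mean."""
    return r + f * (1.0 - r)


def adjust_survey_impact(survey_t, survey_c, admin_t, admin_c,
                         r_t, f_t, r_c, f_c):
    """Adjust a survey-based impact for nonresponse and unbalanced reporting."""
    # Step 1: rescale responder means to full-sample means.
    full_t = survey_t * theta(r_t, f_t)
    full_c = survey_c * theta(r_c, f_c)
    # Step 2: recompute the O_j/U_j ratios from the rescaled survey means.
    ratio_t = full_t / admin_t
    ratio_c = full_c / admin_c
    # Step 3: remove the treatment group's excess overstatement, assuming
    # administrative underreporting is the same in both groups.
    full_t_adjusted = full_t * (ratio_c / ratio_t)
    # Step 4: recompute the survey-based impact.
    return full_t_adjusted - full_c


# Hypothetical "true" values used only to construct test inputs.
true_t, true_c = 11_000, 10_000
o_t, o_c, u = 1.3, 1.2, 0.9          # unbalanced survey overstatement
r, f_t, f_c = 0.7, 0.6, 0.8          # unbalanced nonresponse

survey_t = o_t * true_t / theta(r, f_t)   # responder mean, treatment
survey_c = o_c * true_c / theta(r, f_c)   # responder mean, control
admin_t, admin_c = u * true_t, u * true_c

raw_impact = survey_t - survey_c
adjusted_impact = adjust_survey_impact(
    survey_t, survey_c, admin_t, admin_c, r, f_t, r, f_c
)
print(round(raw_impact), round(adjusted_impact))
```

In this constructed case the adjusted impact equals O C times the true impact (US$1,200 rather than US$1,000): nonresponse bias and the unbalanced part of the overstatement are removed, but the balanced overstatement remains, in line with the limitation noted above.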
The Role of Survey Response Bias and Reporting Error Bias in Experiments that Have Compared Earnings Impacts Estimated with Administrative and Survey Data
In this section, we use the model described in the previous section to examine reporting errors and nonresponses to surveys as sources of the differences in impacts on earnings produced by administrative and survey data shown in Table 1.
Survey Response Bias
In interpreting the comparisons of impacts for the two data sources that appear in Table 1, it is important to recognize that for four of the experiments (the Seattle-Denver and the Gary Income Maintenance Experiments, the National JTPA Study, and the ITA Experiment) the differences between impacts estimated with the two types of data are not subject to survey response bias because the impact estimates for both survey and administrative data were based on the same sample: one limited to survey respondents. This was done because those evaluating the experiments wanted to focus on reporting errors.
The U.K. ERA evaluation
Of the four sets of differences between survey- and administrative-based impact estimates shown in Table 1 that allow for survey response bias, only the one for the U.K. ERA Demonstration appears mainly attributable to response bias. Although ERA had multiple target groups, survey- and administrative-based earnings impacts were available for only one group, that is, single mothers who were enrolled in the New Deal for Lone Parents (NDLP) program and not working at the beginning of the experiment. A major concern with the 60-month survey, which was critical to the 5-year follow-up period used in the evaluation, was that the response rate was only 64.2% for the NDLP treatment group and 59.8% for the NDLP control group (Hendra et al., 2011, table A.3). It was higher, 70.6% and 67.7%, respectively, for a second target group of single mothers drawn from the Working Tax Credit (WTC) program, who, unlike the NDLP parents, were working at the time they were randomly assigned (Hendra et al., 2011, table A.4). Given the low survey response rates and the characteristics of the nonresponders, it would not be surprising if response bias occurred. As might be expected of an experimental test of a program that featured financial incentives, there was also evidence of unbalanced nonresponse to the survey. The ratio of the earnings of NDLP nonresponders to the earnings of NDLP responders was .67 for the treatment group and .84 for the control group. 22 The corresponding figures for the WTC target group were .79 and .95, respectively. 23
The evaluators determined that response bias was indeed important by using the administrative data to compare ERA’s impact on earnings for the full ERA sample with its impact for only those who responded to the 60-month survey (Hendra et al., 2011). This comparison, which necessarily excluded workers in jobs that were not covered by the administrative data, was made for the NDLP and WTC target groups and for 4 tax years. Over the 4 tax years, the estimated total earnings impact for the survey respondent sample was £2,018 (about US$3,027) for the NDLP target group and £2,322 (US$3,483) for the WTC group; for the full sample, in comparison, these impacts were £539 (US$809) and £920 (US$1,380), respectively. Moreover, the impacts for the respondents were statistically different from 0 in six of the eight possible instances, but only twice for the full sample, although the full sample obviously was considerably larger (Hendra et al., 2011, tables A.10 and A.11).
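The size of the discrepancy reported here is easier to see as a ratio. The short Python sketch below uses only the four pound-sterling impacts quoted in this paragraph; the variable names are illustrative.

```python
# Four-year earnings impacts (in pounds) quoted from Hendra et al. (2011).
impacts_gbp = {
    "NDLP": {"respondents": 2018, "full_sample": 539},
    "WTC": {"respondents": 2322, "full_sample": 920},
}

ratios = {
    group: values["respondents"] / values["full_sample"]
    for group, values in impacts_gbp.items()
}

for group, ratio in ratios.items():
    print(f"{group}: respondent impact is {ratio:.1f} times the full-sample impact")
```

The impact estimated for survey respondents is thus roughly 3.7 times the full-sample impact for the NDLP group and about 2.5 times for the WTC group.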
The National Job Corps Study
The evaluators of the National Job Corps Study ascribed only about one quarter of the rather striking differences in impacts resulting from the survey data and those produced by the administrative data (see Table 1) to response bias in the survey, attributing the remainder to reporting differences between the two data sources (Schochet et al., 2006). Response bias in the National Job Corps study was mitigated by the fact that the overall response rate for the 48-month follow-up survey, the last one conducted and the one on which the findings reported in Table 1 are based, was quite high (79.9%) and was also a little higher for the treatment group (81.5%) than for the control group (77.8%; Schochet et al., 2003). 24 However, some response bias nonetheless resulted because, according to the administrative data, the earnings of responders were higher than those of nonresponders and much more so for the treatment group than for the control group, suggesting that nonresponse to the survey was unbalanced. According to the social security data, the ratio of the 1998 earnings of nonresponders to those of responders was .84 for the treatment group and .99 for the control group. Using the social security data, the evaluators found that the earnings impact for responders was almost twice that for the full sample (nonresponders as well as responders).
The New Jersey and Pennsylvania Reemployment Experiments
Survey response bias seems unlikely to explain much of the differences in the earnings impact estimates appearing in Table 1 for either the New Jersey Unemployment Insurance Reemployment Demonstration Project or the Pennsylvania Reemployment Bonus Demonstration. As indicated in Table 3, the overall response rate was identical in New Jersey for the treatment and the control groups. It was somewhat higher for the treatment group in Pennsylvania, but, as indicated in the fourth section, this would tend to reduce response bias. Moreover, although the UI data indicate that survey respondents had higher earnings after random assignment in the experiment in New Jersey, the ratio of nonrespondent to respondent earnings is similar for the treatment and control groups (Corson et al., 1989, table B2). As shown in the fourth section, balanced nonresponse implies that survey response bias is probably small. Small response bias in the Pennsylvania experiment is implied by estimates that used UI data to examine program impacts on UI outcomes (Corson, Dunstan, & Kerachsky, 1991, table A.3). For example, the impact on weeks of UI benefit collection indicated that the tested programs resulted in a 1-week reduction when estimated over only survey respondents and a 0.8-week reduction when estimated over the entire sample (nonrespondents as well as respondents).
The ITA Experiment
Although, as previously mentioned, the difference between the survey- and administrative-based earnings impact estimates in Table 1 for the ITA experiment does not reflect survey-response bias, the impact estimates that rely on the survey are still subject to such a bias. In fact, the response rate to the survey used to produce the impact estimate shown in Table 1 was only 69% (Perez-Johnson et al., 2011), a rate not much larger than that for the U.K. ERA demonstration, where the response bias appeared to be large. The evaluators attempted to control for possible response bias by using weights that “adjust for differences between [survey] respondents and nonrespondents in baseline characteristics” (Perez-Johnson et al., 2011, p. B3). In other words, they attempted to make the respondents representative of the full sample of respondents and nonrespondents. The weighted and unweighted survey-based estimates of the earnings impacts are very similar, US$522 and US$470, respectively (Perez-Johnson et al., 2011, table D.1), suggesting perhaps that either the weighting was not necessary or the weighting was not very successful. In fact, if the weighting corrected for response bias, one would expect the weighted estimate to be smaller than the unweighted estimate.
The Seattle-Denver and Gary Income Maintenance Experiments
As mentioned earlier, the studies of underreporting in the Gary and Seattle-Denver Income Maintenance Experiments abstracted from possible survey response bias. Nonetheless, such a bias appeared possible because, in order to receive their NIT payments, members of the treatment groups had a stronger incentive to remain available for interviews than members of the control groups. Moreover, treatment group members with low earnings received larger payments than those with higher earnings and thus had stronger incentives to be available for interviews. On the other hand, as is more generally true of surveys, it seemed possible that members of the control groups with lower earnings were less likely to be interviewed than those with higher earnings. To the extent such unbalanced nonresponse to the surveys actually occurred, the survey-based earnings impacts would appear more negative than they actually were. However, an analysis of this possibility for the Seattle-Denver Experiment found no evidence of such a bias (Robins & West, 1980).
Reporting Error Bias
The U.K. ERA Evaluation
As indicated in the previous subsection, with the exception of the U.K. ERA demonstration, survey response bias appears to play a relatively small role in impacts estimated with survey data. Although survey response bias seems to have caused much of the difference between impacts based on survey and administrative data in the ERA Demonstration, relatively little of this difference appears likely attributable to reporting errors. Looking only at responders to the demonstration’s 60-month survey, the ratio of survey-based earnings to administrative-based earnings was 1.13 for the treatment group and 1.11 for the control group. These ratios are relatively close to 1.0 and nearly the same. 25
The New Jersey Reemployment Experiment
If response bias seems unlikely to explain much of the differences between earnings impacts produced by UI administrative data and those estimated with survey data in the remaining seven experiments listed in Table 1, these differences must be mostly attributable to reporting errors. For example, there is some evidence of unbalanced reporting errors among those responding to the survey conducted in evaluating the New Jersey Unemployment Insurance Reemployment Demonstration. 26 Specifically, the second quarter ratios of survey-based earnings to administrative-based earnings that appear in Table 3 are higher for the three treatment groups than for the control group. 27 Figure 1 suggests that even moderate unbalanced reporting error can cause substantial differences between impact estimates based on survey and administrative data. Thus, unbalanced reporting errors would appear likely to explain much of the difference between the New Jersey findings relying on survey data and those based on the UI administrative data.
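To see why even a modest imbalance in reporting can matter, consider a stylized calculation in which each group's survey earnings equal its administrative earnings times a group-specific survey-to-administrative ratio. The earnings levels and ratios below are hypothetical, and administrative earnings are treated as the benchmark purely for illustration; the function names are likewise assumptions.

```python
def survey_impact(admin_treat, admin_ctrl, ratio_treat, ratio_ctrl):
    """Survey-based impact when survey earnings = ratio * administrative earnings."""
    return ratio_treat * admin_treat - ratio_ctrl * admin_ctrl

def decompose(admin_treat, admin_ctrl, ratio_treat, ratio_ctrl):
    """Split the survey-based impact into two parts.

    balanced part:   the administrative impact scaled by the control ratio
    unbalanced part: the extra reporting by the treatment group, applied to
                     the treatment group's entire earnings level
    """
    balanced_part = ratio_ctrl * (admin_treat - admin_ctrl)
    unbalanced_part = (ratio_treat - ratio_ctrl) * admin_treat
    return balanced_part, unbalanced_part

# Hypothetical mean earnings; not taken from any of the experiments.
admin_treat, admin_ctrl = 5200.0, 5000.0
admin_impact = admin_treat - admin_ctrl

# Same ratio in both groups: the impact is only scaled up.
balanced = survey_impact(admin_treat, admin_ctrl, 1.10, 1.10)

# Treatment ratio 0.05 higher: the gap applies to the full earnings level.
unbalanced = survey_impact(admin_treat, admin_ctrl, 1.15, 1.10)
parts = decompose(admin_treat, admin_ctrl, 1.15, 1.10)

print(round(admin_impact, 2), round(balanced, 2), round(unbalanced, 2))
print([round(p, 2) for p in parts])
```

Because the difference in ratios multiplies the treatment group's whole earnings level rather than the impact alone, a gap of only a few hundredths between the two ratios can more than double the estimated impact in this example.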
The National Job Corps Study
As shown in Table 1, estimates of the earnings impacts of the Job Corps that are based on survey data are much larger than those based on either UI or SSA administrative data. From their thorough investigation of these rather striking differences, which arguably is the most detailed yet conducted of differences between impact estimates based on survey and administrative data, the evaluators concluded that they are attributable to a considerable extent to reported earnings levels being much higher in the survey data than in the administrative data (Schochet et al., 2006). It is not surprising that this was important. The Job Corps targets youths from disadvantaged backgrounds who are especially likely to hold unstable jobs. Reflecting this, perhaps, the ratio of survey-based earnings to social security-based earnings was 1.68 for the treatment group and 1.62 for the control group in 1998, and the ratio of survey-based earnings to UI-based earnings was 1.90 for the treatment group and 1.80 for the control group in the 16th quarter after random assignment. 28 Thus, there is considerable indication of reporting errors that are unbalanced.
Apparently, reporting errors were quite common within the Job Corps research sample: Among workers, survey-based earnings were higher per job than administrative-based earnings for about 75% of those with earnings, according to both administrative data sources (Schochet et al., 2003). The evaluators conclude that an important source of the reporting errors in the survey seemed to be the overreporting of weekly hours worked and, hence, earnings. The overreporting of hourly wage rates and weeks worked was less important. Their investigation also indicated that some workers and their earnings were missed in the UI earnings records because of errors in social security numbers. This was less likely in the social security data because, unlike the state agencies administering UI, the SSA verifies reported social security numbers. This is at least part of the reason reporting errors appear worse in the UI data than in the social security data and, hence, the differences between the survey- and administrative-based earnings impacts are larger for the former than for the latter. The evaluators did not find evidence that the noncoverage of jobs in the administrative data accounts for very much of the difference between the survey- and administrative-based earnings impacts, although because of data limitations a firm conclusion could not be reached about this issue. However, consistent with the discussion in the second and fourth sections, they did find some evidence that UI-based earnings were especially small relative to survey-based earnings for workers in less stable jobs (Schochet et al., 2003).
The National JTPA Study
As part of the JTPA evaluation, Kornfeld and Bloom (1999) conducted a special study of differences between earnings estimated with survey data and those estimated with employer-reported UI data. As shown in Table 3, the earnings amounts from the survey data collected in the National JTPA Study, like those for the National Job Corps Study, were considerably larger than the employer-reported earnings amounts. This occurs although, based on a separate study that suggested overtime hours were exaggerated, 29 the study reduced the gap between survey- and administrative-reported earnings by about one third by reducing survey-reported overtime earnings by 50%. 30 About half the remaining difference in earnings levels is attributed by Kornfeld and Bloom (1999) to earnings that were not reported by employers to state UI agencies. The ratio of survey earnings to UI earnings was larger for youth, especially those with an arrest record, than for adults and somewhat larger for males than females. Because the survey-based earnings amounts were larger than the employer-reported earnings amounts, the survey-based impacts were also larger, as can be seen in Table 1. However, these differences were not sizable except for a small subgroup of male youths with arrest records. As suggested by Table 1, this lack of large differences between the survey- and administrative-based impact estimates for most subgroups seems to be more the exception than the rule. As shown in Table 3, with the exception of the small sample of male youths with arrest records, the ratio of survey earnings to UI earnings (O/U) was very similar for the treatment and control groups. 31 Thus, balanced reporting errors occurred, but, except for one group, unbalanced reporting errors did not. As illustrated in Figure 1, this tends to greatly diminish the gap between impacts estimated with the two data sources.
The ITA Experiment
Looking at just survey responders in the ITA Experiment, there is strong evidence of unbalanced reporting errors; for example, in the 22nd quarter after random assignment, the ratio of survey-based earnings to UI-based earnings is 1.41 for those assigned to the structured choice program model and 1.30 for those assigned to the guided choice model who served as a control group. 32 This presumably explains much of the very large difference between the earnings impacts based on the two data sources shown in Table 1 for the structured choice model.
The evaluators attempted to determine the reasons survey-based earnings were so much larger than UI-based earnings, and they concluded that over two thirds of the difference is attributable to differences in employment rates, rather than differences in earnings among persons who worked. More specifically, according to the evaluators, approximately one third of the gap in average earnings between the two data sources is due to out-of-state wages that are missed by the UI data and jobs that are uncovered by the UI data. Informal work did not appear to play an important role. The evaluators also found some evidence that is consistent with the possibility that weekly hours worked are overreported by survey respondents or that earnings are underreported by employers in the UI data (Perez-Johnson et al., 2011, appendix I).
The Seattle-Denver and Gary Income Maintenance Experiments
The Seattle-Denver and the Gary Income Maintenance Experiments appear to be the earliest experiments in which impact findings from survey and administrative data were compared. As previously discussed, it was suspected that estimates of program impacts on work effort and earnings were biased by differences between treatment and control groups in self-reporting these outcomes. In other words, it was suspected that the survey data were subject to unbalanced reporting errors. To determine whether this was the case, the evaluators obtained employer-reported UI earnings data from the three states in which the experiments were conducted. As anticipated and indicated in Table 1, the findings from both experiments indicated that impacts on earnings that were based on self-reported data were more negative (or less positive) than those based on the employer-reported administrative data, and for some groups of participants, consistent with the possibility of unbalanced reporting errors, substantially so. As previously mentioned, this finding is based on only survey respondents, thereby abstracting from the possibility of survey response bias.
Differences Among Target Groups in Biases
The researchers who evaluated some of the experiments that estimated earnings impacts using both survey and administrative data looked at demographic factors that may have influenced survey response rates and (to a lesser extent) reporting errors. For example, nonrespondents to the Job Corps Study survey were more likely to be male, to be childless, and to have not completed high school prior to random assignment (Schochet et al., 2003, p. 120). Nonrespondents in both the New Jersey and Pennsylvania Reemployment Experiments were more likely to be Black or Hispanic, younger, and male, characteristics that were similar for the treatment and control groups (Corson et al., 1989 and 1991). Survey response rates in the U.K. ERA Demonstration were especially low for persons with less education, living alone, and with low pretreatment earnings (Hendra et al., 2011, table A.9). Overall, such findings suggest, as might be expected, that less stable, more disadvantaged populations are less likely to respond to surveys than more stable, less disadvantaged groups. Moreover, as Schochet, McConnell, and Burghardt (2003) found in the National Job Corps Study, UI-based earnings were especially small relative to survey-based earnings for workers in less stable jobs, thereby resulting in balanced reporting error bias. Somewhat similarly, the ratios of UI-based earnings to survey-based earnings were smaller for youths than for adults in the JTPA experiment (see Table 3).
The demographic characteristics of the target population do not necessarily determine which experiments are most likely to produce great differences between survey- and administrative-based estimates of earnings impacts. For example, these differences were substantial for the New Jersey and Pennsylvania Reemployment experiments, but they were also ample for the ERA Demonstration and the National Job Corps Study. The target group for the two reemployment experiments was UI claimants, who had to have stable jobs prior to losing them in order to qualify for benefits, while the ERA target group was nonworking single parents who were on welfare at random assignment and the Job Corps target group was disadvantaged youths. Moreover, it is not apparent that the demographic characteristics of target groups are associated with unbalanced survey nonresponse or unbalanced reporting errors; as previously shown, it is this lack of balance that is responsible for much of the differences between survey- and administrative-based earnings impact estimates.
The nature of the treatment itself may be associated with differences between survey- and administrative-based earnings impact estimates. For example, we earlier suggested that survey-based impact estimates may be especially subject to biases in the case of experiments that test financial incentives or disincentives—for instance, reemployment bonuses and negative income taxes. In addition, simply receiving a tested treatment may influence the program group in ways that do not affect the control group, thereby causing unbalanced nonresponse. In the New Jersey Unemployment Insurance Reemployment Demonstration, for example, the response rate was 62% for those who received no services in the experiment compared to 82% for those who received training, relocation assistance, or a reemployment bonus (Corson et al., 1989). Similarly, in the Pennsylvania Reemployment Bonus Demonstration, the response rate was 72% for those who did not apply for a bonus and 90% for those who did (Corson et al., 1991). In principle, this can cause response rates to be larger for treatment groups than control groups. Moreover, if the treatment has a positive impact on earnings, it can also cause the ratio of nonrespondent earnings to respondent earnings to be smaller for treatment groups than control groups. As previously indicated, however, this did not seem to actually occur in the case of either the New Jersey Unemployment Insurance Reemployment Demonstration or the Pennsylvania Reemployment Bonus Demonstration. The only exception was that the survey response rate in Pennsylvania was somewhat higher for the treatment group than that for the control group, but this would tend to reduce survey response bias.
Adjusting for the Biases
The report investigating the differences in the survey- and administrative-based impact estimates in the National Job Corps Study (Schochet et al., 2003) provides all the information needed to adjust the survey-based impact estimates for the 1998 estimates using the four-step procedure outlined in the fourth section and social security earnings records for 1998. 33 Thus, we use this experiment to illustrate the procedure.
In 1998, as indicated by Table 1, the earnings impact estimated with the survey data is US$972, and the impact estimated with the social security data is US$220. After adjusting for nonresponse bias (step 1), the survey-based impact falls from US$972 to US$682. After additionally adjusting for unbalanced reporting (steps 2 through 4), the impact is further reduced to US$388. The remaining difference between the survey- and administrative-based impact estimates (US$388 versus US$220) is presumably attributable to balanced overreporting in the survey data and underreporting in the social security data. Indeed, the estimated impact in percentage terms is 4.2% for the survey data once adjusted and 3.9% based on the social security data. As shown in the fourth section, the percentage impacts would be the same, although the absolute impacts would not be, if the impact estimates are biased only by balanced misreporting.
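The last point can be checked with a short calculation. The sketch below assumes that balanced misreporting multiplies both groups' mean earnings by the same factor and that the percentage impact is measured relative to the control group mean; the earnings levels and the two factors are hypothetical and are not the Job Corps figures.

```python
def impacts(mean_treat, mean_ctrl):
    """Return (absolute impact, impact as a percentage of the control mean)."""
    absolute = mean_treat - mean_ctrl
    percent = 100.0 * absolute / mean_ctrl
    return absolute, percent

# Hypothetical "true" mean earnings for the two groups.
true_treat, true_ctrl = 10400.0, 10000.0

# Hypothetical balanced misreporting: one factor per data source, applied
# equally to the treatment and control groups.
k_survey = 1.2   # survey overreports by 20% in both groups
k_admin = 0.8    # administrative data miss 20% in both groups

true_abs, true_pct = impacts(true_treat, true_ctrl)
survey_abs, survey_pct = impacts(k_survey * true_treat, k_survey * true_ctrl)
admin_abs, admin_pct = impacts(k_admin * true_treat, k_admin * true_ctrl)

print(round(survey_abs, 2), round(admin_abs, 2))   # absolute impacts differ
print(round(survey_pct, 2), round(admin_pct, 2))   # percentage impacts agree
```

Under these assumptions the absolute impacts differ by the ratio of the two factors, while the percentage impacts coincide. Adjusted percentage impacts that are close but not identical, such as 4.2% and 3.9%, are therefore what one would expect if most, though perhaps not all, of the remaining gap reflects balanced misreporting.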
Implications
It appears evident from the examples shown in Table 1 that, in evaluating programs tested by social experiments, earnings impacts based on survey data can often be very different from those based on administrative data, with the former apparently almost inevitably larger than the latter. 34 Of the eight experiments we investigated, the differences were substantial and important in all but one. 35 Differences between treatment and control groups in whether they responded to surveys and how accurately they reported their earnings (that is, unbalanced nonresponse to surveys and unbalanced reporting errors) appear to have played a major role in this.
The one experiment in which the survey/administrative differences in impacts were modest was the National JTPA Study, and in that case, the comparison of impacts based on the two data sources did not allow for the possibility of response bias. The role of response bias is also unclear in the ITA Experiment. In general, however, biases resulting from reporting errors, especially unbalanced reporting errors, appear to be a more important source of the differences than response bias. 36 The bias from reporting errors seems to arise both because administrative data understate earnings impacts and because survey data overstate them; in the latter case, the overstatement is often worse for the treatment group than for the control group. Because understatements occur in one data type and overstatements in the other, it is obviously crucial that data from the same source be used in measuring outcomes for both the experimental and the control groups. 37
The one experiment in which response bias was extremely important was the U.K. test of the ERA program, which featured a financial incentive that encouraged work. The response rate to the survey used in the ERA evaluation was exceptionally low, and the earnings of nonrespondents were much lower than those of respondents and more so for the treatment group than for the control group. If survey data are to be used to estimate program impacts, it is obviously imperative that response rates be kept as high as possible so as to minimize the loss of low earners. 38
In those experimental evaluations that have analyzed data from both surveys and administrative sources, differences in impacts in earnings have been treated in diverse ways, but they have been very difficult to reconcile. The choice of how to treat these differences appears to have especially important implications for cost–benefit analyses of the tested programs. Both sources of data are subject to weaknesses, in considerable part because some members of the sample population are inevitably missed by both, albeit for different reasons. As a result, unfortunately, using either source of data is likely to produce substantial errors in the experimental impact findings. When it is possible to obtain data from both sources, they can at least serve as checks on one another. Moreover, we earlier suggested a four-step procedure that uses administrative data to adjust survey data for both nonresponse bias and unbalanced overreporting (although not balanced overreporting). Most evaluations, however, are not so fortunate as to have data from both sources available. And even those that do usually cannot measure all the impacts of interest with both. In these instances, the biases resulting from having to rely on one data source cannot be known.
Many experimental evaluations depend solely on administrative data because they are much less expensive to obtain than survey data. If one is willing to assume that unbalanced reporting errors are minimal in the administrative data, but these data are subject to balanced underreporting (an assumption we have generally maintained throughout this article), an upper bound can be placed on the estimated earnings impact by simply multiplying it by an appropriate survey–administrative earnings ratio. Although this has not previously been done (or even suggested) to the best of our knowledge, such a ratio might be obtained from a previous evaluation in which both data sources were used and the demographic characteristics of the study population are similar to those of the current evaluation. In determining this ratio, data for controls should be used because the data for the treatment group will likely be affected by the treatment and the data should be adjusted to the extent possible for misreporting. If the population used to compute this ratio is similar to the target population in the current evaluation, multiplication by this ratio would result in an impact estimate similar to that which would be produced by survey data, if they existed and were not subject to unbalanced misreporting and response bias, but were subject to balanced misreporting.
As an illustration of the sorts of ratios that might be used, consider Kornfeld and Bloom’s (1999) examination of the data used in the National JTPA Study. After adjusting the survey data for overstated overtime hours, they found the survey–UI earnings ratios among the control group to be 1.24 for adult women, 1.30 for adult men, 1.36 for female youths, and 1.53 for male youths. Use of these ratios would adjust earnings impacts estimated by administrative data upward by around 25–50%. The resulting impact estimates would be overstated to the extent earnings in the survey data are overstated. Thus, they should be viewed as upper bounds. In using the National JTPA Study to compute upper bounds, they should be refined to match the target population of the current evaluation as closely as possible. 39 For example, if the target population is female family heads, only JTPA data for female family heads should be used. 40
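The upper-bound calculation can be sketched as follows, using the four control group ratios quoted above. The administrative-based impact of US$1,000 is hypothetical, as are the function and variable names, and the sketch assumes a positive impact so that multiplying by a ratio above 1 yields an upper bound.

```python
# Survey-to-UI earnings ratios for the control group, as quoted in the text
# from Kornfeld and Bloom (1999), after their adjustment for overtime.
JTPA_CONTROL_RATIOS = {
    "adult women": 1.24,
    "adult men": 1.30,
    "female youths": 1.36,
    "male youths": 1.53,
}

def upper_bound(admin_impact, ratio):
    """Scale an administrative-based earnings impact by a survey/admin ratio.

    Assumes the administrative data are subject only to balanced
    underreporting, the maintained assumption in the text.
    """
    return admin_impact * ratio

# Hypothetical administrative-based impact, in dollars.
admin_impact = 1000.0

bounds = {
    group: upper_bound(admin_impact, ratio)
    for group, ratio in JTPA_CONTROL_RATIOS.items()
}

for group, bound in bounds.items():
    print(f"{group}: {bound:.0f}")
```

In practice the ratio would be chosen from the subgroup that most closely resembles the target population of the current evaluation, as the text recommends.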
Conclusions
This article has examined the relatively small number of experiments in which both administrative and survey data were used to estimate impacts on earnings. The impact estimates were often very divergent, with survey data typically producing substantially larger impacts. Nonexperimental evaluations are also subject to the biases that cause this divergence as well as other problems. The divergence is troubling but does not imply that impact evaluation should be discouraged. Instead, it suggests that both data sources should be used to estimate impacts whenever possible—that is, when appropriate administrative data are available and funding for the evaluation is sufficient to conduct surveys. It is also evident that more research is urgently needed to resolve the resulting uncertainty. In particular, although great caution should be used in mixing data from different sources, consideration should be given to how each can best be used to adjust for weaknesses in the other. We suggested a four-step procedure in which administrative data can be used to adjust survey data for both nonresponse bias and unbalanced overreporting. Others have suggested alternative approaches. In the National Job Corps Study, for example, the evaluators attempted to account for response bias by multiplying survey-reported earnings by the ratio of social security-reported earnings for the full sample to social security-reported earnings for respondents. In addition, after determining that hours of work were overstated by survey respondents, they adjusted survey earnings downward by 10% to account for this. Somewhat similarly, in the JTPA evaluation, survey-reported overtime earnings, which appeared to be seriously overstated, were adjusted downward. In addition, employer-reported UI earnings were scaled up by multiplying them by the ratio of mean survey earnings to mean earnings in the UI records. 
Alternatively, when positive earnings are reported in a survey for individuals in uncovered or out-of-state jobs, but are missed in administrative data, it might be possible to use the survey information to impute the earnings of the workers missed in the administrative data. It would be useful to examine how effective such procedures are. With sufficient investigation, perhaps approaches can be developed that could systematically be used to adjust reported earnings in both administrative data and survey data. Ideally, these methods could be used in evaluations that use only one source of data as well as those with data available from multiple sources. As an example, we suggested earlier that an upper bound can be placed on administrative-based earnings impacts by multiplying them by an appropriate survey–administrative earnings ratio.
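The two adjustments described above for the National Job Corps Study (rescaling survey earnings by the ratio of full-sample to respondent administrative earnings, and then reducing them by 10% for overstated hours) might be combined along the following lines. This is only a sketch of the arithmetic as summarized in this article: whether the evaluators applied the two steps jointly, and whether they did so separately by research group, are assumptions, and the function name and toy values are hypothetical.

```python
from statistics import mean

def adjusted_survey_mean(survey_resp, admin_resp, admin_full, hours_discount=0.10):
    """Adjust mean survey earnings for response bias and overstated hours.

    survey_resp:    survey-reported earnings of survey respondents
    admin_resp:     administrative earnings of those same respondents
    admin_full:     administrative earnings of the full sample
    hours_discount: share by which survey earnings are reduced (10% in the text)
    """
    # Respondents tend to earn more than nonrespondents, so this factor is
    # typically below 1 and pulls the survey mean toward the full sample.
    response_factor = mean(admin_full) / mean(admin_resp)
    return mean(survey_resp) * response_factor * (1.0 - hours_discount)

# Hypothetical toy data for one research group (two respondents, two
# nonrespondents); not figures from the Job Corps evaluation.
survey_resp = [12000.0, 8000.0]
admin_resp = [7000.0, 5000.0]
admin_full = [7000.0, 5000.0, 3000.0, 3000.0]

adjusted = adjusted_survey_mean(survey_resp, admin_resp, admin_full)
print(round(adjusted, 2))
```

Applying the function to the treatment and control groups separately and differencing the results would give an adjusted survey-based impact under these assumptions.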
If insufficient funds are available to conduct a survey, an evaluation based on administrative data is likely to be better than no evaluation at all. We recommend, however, that the evaluators take note of the biases often found in evaluations based on administrative data, and if evaluations based on both survey and administrative data are available for a similar group of people, the evaluators should note the differences that have occurred in prior evaluations. 41
In recent years, the U.S. Department of Labor and other agencies have stressed the importance of basing policies on evidence, including randomized controlled trials (RCTs) whenever possible. This review demonstrates that even when RCTs are used, the size and sometimes the direction of program impacts on employment and earnings can vary substantially depending on whether administrative or survey data are used. Thus far, evaluations have resorted to ad hoc adjustments to account for possible but unproven anomalies in the data. Until we learn more definitively about the magnitudes of bias in administrative and survey data and how best to adjust for anticipated biases, we strongly recommend that the federal government encourage the collection of both administrative and survey data for purposes of evaluating social experiments and, in addition, conduct investigative audits in conjunction with upcoming evaluations when both administrative and survey data are used. For example, some families for whom there are large disparities in earnings between data sources could be contacted to determine how these discrepancies can be reconciled. In addition, audits of employer-reported earnings data could be conducted. All these steps, of course, are contingent on the availability of sufficient funding.
Footnotes
Acknowledgments
The authors wish to acknowledge useful comments on a previous version provided by David Grubb, Bruce Meyer, Jeffrey Smith, participants in the Institute for Research on Poverty Summer Workshop, two anonymous referees, and an editor of this journal.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
