Abstract
Realistic response rate expectations are important for successfully allocating and managing data collection efforts under limited resources. Interviewer performance is often evaluated against response rate standards, and the performance of face-to-face interviewers can vary in part because of the socioeconomic characteristics of the neighborhoods in which they work, reflecting well-documented differences in survey participation by these characteristics. In this article, we describe a method to establish unified, data-driven response rate standards to evaluate field interviewers. Using U.S. Census and American Community Survey data at the block-group level of geography, we identify correlates of 2010 census participation and cluster the block groups into homogeneous strata. The resulting clusters differ in ways that meaningfully distinguish participation rates in household surveys. We use response rates by cluster to establish interviewer response rate standards for these sample surveys. We describe the process for simulating response rate distributions and the procedure for identifying performance-level cutoffs.
Introduction
It is common to use interviewer performance standards to set expectations and to evaluate interviewer productivity in both face-to-face and telephone interviewing environments (Fowler 2008; Tarnai and Moore 2007). Performance standards are often established for cooperation rates or response rates because these are critical metrics of overall survey performance and depend in part on the interviewers. Standards are, by definition, common across interviewers and are meant for evaluating interviewers because of the known variability across interviewers in their performance. At the same time, sample cases to be interviewed vary in their likelihood of being contacted and of cooperating once contacted (Groves and Couper 1998).
Factors such as urbanicity, mobility, and ethnic diversity are associated with lower survey participation rates. In a telephone interviewing environment, it is likely that these characteristics will vary randomly across interviewer workloads as interviewers call households in various geographic areas. However, interviewers working in different shifts are likely to have different contact rates, and the likelihood of cooperation is often affected by other interviewers’ prior interactions with a case. In contrast, in a face-to-face interviewing environment, interviewers have autonomy to schedule their work and usually do not inherit cases that have been worked by other interviewers. However, they are typically assigned to geographic areas that vary substantially in difficulty across interviewers. For these reasons, survey organizations are examining new measures of interviewer performance that account for the variation in case difficulty.
Durand (2005), Laflamme and St-Jean (2011), and West and Groves (2013) each propose methods to standardize the evaluation of interviewers based on the likelihood of contact and cooperation. Each of these works defines metrics that assign higher weight to interviews obtained in calls (or visits) with a low likelihood of contact and cooperation than to interviews obtained in calls with a higher likelihood of contact and cooperation. Of the three metrics, only the propensity-adjusted interviewer performance score developed by West and Groves (2013) was designed for the evaluation of both telephone and face-to-face interviewers. Despite its applicability in both modes, however, it is not appropriate for the evaluation of field interviewers employed by the U.S. Census Bureau for a few reasons.
First, the score relies on the ability to estimate a strong model of contact-level response propensity. To estimate such a model, the authors use contact-history paradata that are collected throughout the survey fieldwork period. This is problematic because our interviewers must receive performance expectations before fieldwork begins. In addition, the paradata include highly subjective items that are recorded by the interviewer (e.g., “maximum resistance” from a sample member), making it possible for interviewers to manipulate the scores. Finally, the U.S. Census Bureau collects data for other federal statistical agencies that are held to the Office of Management and Budget (2006) standards, which require that sample surveys be designed to achieve the highest practical rates of response. Because our performance standards must align with the goals of sponsoring agencies, they need to include specific response rate targets for each sample survey.
Although response (or cooperation) rate is a common and, in our case, necessary measure of interviewer performance, it is often just one component of interviewer evaluations, supplemented by measures of productivity and data quality. This is true at the U.S. Census Bureau where interviewers are evaluated in the following four categories: (1) customer service, which includes the protection of property and data; (2) administrative and automation activities; (3) production and cost, which is essentially measured by hours per case; and (4) interviewing, which includes a variety of data quality measures (such as item nonresponse, collection of survey supplements, time spent on various portions of interviews, etc.), and response rates that vary by survey. While all of these performance elements are essential, this work focuses on the response rate component of interviewing.
During noncensus years, the U.S. Census Bureau employs approximately 6,500 interviewers who are managed out of regional offices. Some interviewers work on only one sample survey during a year, but most work on two to four sample surveys throughout a year. Each interviewer is rated on a scale of 1–5 in each of the four performance categories and, for interviewers working on multiple sample surveys, performance scores are calculated as the average of each survey’s score, weighted by the time spent on each sample survey. To better account for differences in case difficulty and to establish comparable expectations for interviewers across the country, the U.S. Census Bureau has implemented a new set of response rate standards for the interviewing element of performance evaluations. In this article, we discuss the prior system for assessing interviewer performance with respect to response rate and the construction of the new system.
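As a small worked illustration of the time-weighted averaging described above, the following Python sketch computes an overall score from per-survey scores. The function name, the argument names, and the example numbers are hypothetical and are not taken from U.S. Census Bureau systems.

```python
def overall_score(survey_scores, hours_worked):
    """Time-weighted average of per-survey performance scores (1-5 scale).

    survey_scores: one score per sample survey worked.
    hours_worked:  time spent on each of those surveys, in the same order.
    Both inputs are illustrative assumptions about how the data are stored.
    """
    if len(survey_scores) != len(hours_worked):
        raise ValueError("need one time value per survey score")
    total_hours = sum(hours_worked)
    if total_hours <= 0:
        raise ValueError("total time worked must be positive")
    weighted_sum = sum(s * h for s, h in zip(survey_scores, hours_worked))
    return weighted_sum / total_hours


# An interviewer scoring 4, 3, and 5 on three surveys, with 100, 50, and 50
# hours spent on them, receives (4*100 + 3*50 + 5*50) / 200 = 4.0 overall.
example = overall_score([4, 3, 5], [100, 50, 50])
```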
In the prior performance system, interviewers were expected to achieve response rates that varied with the level of urbanicity of the county in which the majority of their cases resided. More specifically, each county was assigned to one of the following six strata: (1) metropolitan: areas that are considered to be very difficult; (2) urban: moderately difficult areas; (3) low-density urban to suburban: areas that are less difficult than higher-density urban areas; and (4) slightly rural, (5) moderately rural, or (6) rural: areas that are considered relatively easy. Following the assignment of counties to strata, management of the 12 regional offices determined ranges of response rates corresponding to the five performance levels for each sample survey and stratum. Because regional offices adjusted the response rate expectations as they thought appropriate, there were inconsistent standards across the nation. Further, regional offices sometimes combined two or more strata into one, with one set of response rate performance levels for the combination. As a result, there was a great deal of variability in the standards for all sample surveys and managing offices.
The performance system came under review with the recent restructuring of the regional offices. Although the prior response rate expectations allowed for different response rates depending on geography, in the interest of consistent management across regions, the differences in performance standards were examined. It was clear that the standards were not consistent across the nation. The performance standards of a highly urban county on the East Coast were likely to be different from the performance standards of a highly urban county on the West Coast, even if it is equally difficult to obtain interviews in each county. A less apparent issue was that within many counties there is significant variability in socioeconomic characteristics known to be related to survey response propensities. Los Angeles County, which if considered as a state would rank ninth in population, has areas that are very urban and others that are very suburban, yet all areas were treated equally under the prior system.
Our objective is to determine data-driven expectations for response rates in six of the U.S. Census Bureau’s national sample surveys to establish consistent, scientifically defensible interviewer response rate standards. Specifically, we have the following two goals: (1) to account for the variability below county level in the difficulty of obtaining interviews and (2) to create national response rate standards. In the next section, we describe the data used in our analyses; in the section after that, we discuss the methods used to develop new response rate standards that address the aforementioned issues. In the final section, we discuss future enhancements of our work.
Data
The Census Planning Database was developed to identify reasons that people are missed in decennial censuses. The first planning database was compiled from 1990 census data and contained tract-level characteristics of households and people that are associated with survey response rates (Bruce and Robinson 2003). In June 2012, the U.S. Census Bureau released the first block-group-level version of the database. In this analysis, we use block group as the unit to approximate a neighborhood. The 2012 planning database contains estimates of many person and household characteristics compiled from the 2006–2010 American Community Survey (ACS) and the 2010 census as well as several operational variables that describe the 2010 census mail-back behavior. A full description of the variables in the block-group-level planning database is given in Supplemental Table 1.
For purposes of this project, we supplemented the planning database with block-level summaries of the 2010 census nonresponse follow-up operation that include the number of contact attempts made, the proportion of interviews completed, and the proportion of interviews obtained by proxy.
Methods and Results
Recall that one of our goals is to account for the variability in the difficulty of obtaining survey responses below the county level. Because the 2012 planning database is at the block-group level, it is an obvious tool for accomplishing this. However, despite the wealth of information in the planning database and the supplemental nonresponse follow-up data, a block group very often contains only a single sampled case or none at all, which means that we cannot estimate sample survey response rates directly at this level of geography. To circumvent this issue, we first fit logistic regression models to predict response propensities at the case level using block-group-level covariates. However, we found that neighborhood-level characteristics alone do not have much predictive power at the case level. Although we do have case-level predictors in the form of paradata collected throughout each sample survey’s fieldwork period, as in West and Groves (2013), they are obtained only after an interviewer has interacted with a case, so they cannot be used to establish objective baseline estimates of response propensity.
For these reasons, we modeled block-group-level 2010 census mail return rates as a proxy for sample survey response rates using block-group-level characteristics. Once we found a set of characteristics that are predictive of census return rates, we then clustered block groups based on these characteristics and validated the clusters with past sample survey response rates. Finally, for each sample survey, we bootstrapped interviewer response rates in each cluster of neighborhoods to estimate sample survey response rate distributions and suggested performance standards based on these distributions.
Regression Analysis
To determine the set of variables that are most predictive of block-group-level 2010 census mail return rates, we perform an exhaustive search for the best subset of predictors in ordinary least squares regression using the method implemented in Lumley and Miller (2009). A summary of the resulting regression model is shown in Table 1. Although it is highly probable that there are significant interactions among the variables in the model, the purpose of this search is to find the set of characteristics that are most related to return rates, rather than the best regression model.
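The exhaustive search described above can be sketched in a few lines of Python. This is a minimal illustration and not the Lumley and Miller (2009) implementation: the function name is hypothetical, and ranking subsets by adjusted R² is an assumption, since the text does not state which selection criterion was used. Brute-force enumeration is practical only for a modest number of candidate predictors.

```python
from itertools import combinations

import numpy as np


def best_subset_ols(X, y, names, max_size=None):
    """Exhaustively search predictor subsets for an OLS fit.

    Returns (adjusted R^2, list of chosen predictor names).
    Adjusted R^2 as the ranking criterion is an illustrative assumption.
    """
    n, p = X.shape
    max_size = p if max_size is None else max_size
    tss = float(((y - y.mean()) ** 2).sum())  # total sum of squares
    best_score, best_names = -np.inf, []
    for k in range(1, max_size + 1):
        for cols in combinations(range(p), k):
            # Design matrix: intercept plus the k candidate predictors.
            A = np.column_stack([np.ones(n), X[:, list(cols)]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ beta
            rss = float(resid @ resid)
            adj_r2 = 1.0 - (rss / (n - k - 1)) / (tss / (n - 1))
            if adj_r2 > best_score:
                best_score = adj_r2
                best_names = [names[c] for c in cols]
    return best_score, best_names
```

In use, `X` would hold the (transformed) block-group characteristics, one column per candidate predictor, and `y` the block-group mail return rates.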
Table 1. Predicting 2010 Census Block-Group Mail Return Rates.
Note: Most predictors are either log or square root transformed; the dependent variable is not transformed. Residual standard error: 5.593 on 214,569 degrees of freedom. Multiple R²: .5653, adjusted R²: .5653. F statistic: 1.329e+04 on 21 and 214,569 degrees of freedom, p value: < 2.2e−16.
*Significant at the .0001 level.
In general, the model results are in line with the literature concerning characteristics related to survey nonresponse (see Dixon and Tucker 2000; Griffin 2002; Groves and Couper 1998; Johnson et al. 2002). Neighborhoods with higher proportions of children (age groups <5 and 5–17) and people over age 65 have higher return rates, while neighborhoods with higher proportions of the college-aged population (18–24) have lower return rates. Neighborhoods with higher proportions of racial and ethnic minorities have lower return rates, as do neighborhoods with higher poverty levels, vacancy rates, crowded households, and those with no plumbing or telephone service. Conversely, neighborhoods with higher proportions of single-family homes, married couples, and college graduates have higher return rates. Only two variables, population density and the proportion of nonresponse follow-up interviews by proxy, have coefficients that are in a direction one might not expect based on the literature.
In the following section, we cluster neighborhoods based on these variables and examine sample survey response rates in each of the resulting clusters.
Cluster Analysis
We partition the country on the variables from Table 1 using k-medoids clustering (Kaufman and Rousseeuw 1990) as implemented in Maechler et al. (2012). The algorithm is as follows:
1. Randomly select k of the n observations (in this case, block groups) as centers or “medoids.”
2. Assign each observation to the closest medoid, where closeness is measured by the distance between each observation, o, and medoid, m.
3. For each medoid m and each nonmedoid observation o, swap m and o and compute the total cost of the partition, that is, the sum of the distances from each observation to its closest medoid.
4. Choose the partition with the lowest cost.
5. Repeat steps 2 through 4 until there is no change in the medoids.
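The steps of the algorithm can be sketched in Python as follows. This is an illustrative sketch only and is not the Maechler et al. (2012) implementation: the function name is hypothetical, Euclidean distance is an assumption (the distance formula is not reproduced here), and the full pairwise distance matrix makes this version suitable only for small examples, not for more than 200,000 block groups.

```python
import numpy as np


def k_medoids(X, k, seed=0, max_iter=100):
    """Partition the rows of X around k medoids by greedy swapping.

    Returns (medoid row indices, cluster label per row, total cost).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Pairwise Euclidean distances (an assumption; other metrics are possible).
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))

    def total_cost(meds):
        # Sum over observations of the distance to the closest medoid.
        return float(D[:, meds].min(axis=1).sum())

    # Step 1: random initial medoids.
    medoids = [int(i) for i in rng.choice(n, size=k, replace=False)]
    best_cost = total_cost(medoids)
    for _ in range(max_iter):
        best_swap = None
        # Steps 2-3: try every (medoid, nonmedoid) swap and cost the result.
        for i in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = o
                cost = total_cost(trial)
                if cost < best_cost - 1e-12:
                    best_cost, best_swap = cost, trial
        # Step 4: keep the lowest-cost partition; step 5: stop when unchanged.
        if best_swap is None:
            break
        medoids = best_swap
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels, best_cost
```

Because the medoids are actual observations, each cluster center corresponds to a real block group, which can make the clusters easier to describe than centroid-based solutions.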
In cluster analysis, there are several methods for determining the optimal number of clusters. In this particular problem, however, the number of block groups to be clustered is greater than 200,000, and the methods that depend on the number of observations such as Hartigan’s rule of thumb (Hartigan 1975) and choosing

Figure 1. Percentage of variance explained by cluster solutions.
Table 2 displays the prior county-level strata versus the new clusters. The “metropolitan” stratum represents the most difficult areas under the prior performance system, and these comprise about 36% of the nation’s counties. Of the block groups (or neighborhoods) within these major metropolitan counties, only 11.8% are in the low (response rate) cluster, over half are classified as medium (response rate), and approximately 35% are in the high (response rate) cluster. Similarly, of the block groups in counties classified as urban, the second most difficult of the prior strata, an even smaller percentage (2.2%) of the block groups are classified as low, about 40% are medium, and the majority (almost 60%) are in the high cluster. These numbers demonstrate that within counties—metropolitan counties in particular—there is significant variability in the attributes of the people and homes that characterize the neighborhoods. These differences are detailed in Table 3.
Table 2. Percentage of Prior Strata in Each New Cluster.
Table 3. Demographics by Cluster.
When we examine socioeconomic characteristics by cluster (in Table 3), we see that each of the clusters distinguishes well between characteristics associated with survey response rates. In general, the characteristics of the low-cluster neighborhoods are the most urban, the characteristics of the medium cluster are slightly less urban, and those of the high cluster are the most suburban or rural. Specifically, the low and medium clusters have more racial and ethnic minorities, many more multiunit buildings and homes occupied by renters, and higher rates of poverty. Also, while the medium and low clusters have lower median household incomes than the high cluster, they have significantly higher median house values (US$258,049 and US$368,794, respectively, vs. US$98,748 for the high cluster). This is to be expected due to the urbanicity of the medium and low clusters. Figure 2 displays medium-cluster block groups in gray and low-cluster block groups in black—both of these clusters are concentrated in urban areas.

Figure 2. Map of cluster solution. Block groups in the high cluster are displayed in white, block groups in the medium cluster are in gray, and block groups in the low cluster are in black.
Other notable differences across the clusters are the age distributions, family and household compositions, mobility, language, and education. The high cluster has the highest proportion of people in both the 45–64 age group and the 65+ age group, while the medium and low clusters have higher proportions of people in the 18–24 age group. It is no surprise, then, that the medium and low clusters also have higher mobility, with greater proportions of people who have moved within the previous one and five years. The high cluster has more married family households, whereas the medium and low clusters have higher proportions of female-headed households, nonfamily households, and single-person households. The three clusters have roughly equal proportions of people over age 25 who have attended college, but the medium and low clusters have higher proportions of people over age 25 without a high school diploma or equivalent. The medium and low clusters also have much higher proportions of households in which a language other than English is spoken and higher proportions of linguistically isolated households, that is, those in which no one over age 14 speaks English well or very well.
In addition to the socioeconomic differences among the clusters, other striking distinctions are found in the results of the 2010 census nonresponse follow-up operations. As a first attempt at enumerating the population, the U.S. Census Bureau sends census questionnaires to all mailable addresses where people are believed to live. If, after a period of time, a questionnaire has not been mailed back, a second questionnaire is sent. Finally, if a questionnaire is still not received, enumerators attempt census interviews by phone or personal visit. Of the mailable addresses in the high and medium clusters, approximately 64% and 63%, respectively, returned the first mailed census questionnaire, while only 54% of the mailable addresses in the low cluster returned the first questionnaire. Once telephone and personal visit follow-up began, an average of two contact attempts per address was needed to obtain interviews in the high cluster, approximately 2.3 attempts on average were needed to obtain interviews in the medium cluster, and an average of 2.6 attempts was needed to get interviews in the low cluster. These statistics convey the differences in the efforts required to gain cooperation among the clusters and suggest how the clusters discriminate in survey response behaviors, which we examine next.
Figure 3 displays sample survey response rates from January through July 2012 for each cluster. For each sample survey, the high cluster has the highest response rate, followed by the medium cluster, and then the low cluster. On average, the high cluster’s response rates are 2.6 percentage points higher than those of the medium cluster, and the medium cluster’s response rates are 2.6 percentage points higher than those of the low cluster. The numbers of cases per survey and cluster are given in Supplemental Table 2. This response rate pattern is even more pronounced for the 2010 census, which has final mail return rates of 81.4%, 76.8%, and 71.6% for the high, medium, and low clusters, respectively. With these clusters, we have accomplished our first goal: to account for the variability in the difficulty of obtaining interviews below the county level. In the following section, we use these clusters to tackle our second goal: to create objective, national response rate standards for each of these sample surveys.

Figure 3. Response rates by cluster (January to July 2012). L = low; M = medium; H = high. ACS = American Community Survey; CED = Consumer Expenditure, Diary; CEQ = Consumer Expenditure, Quarterly; CPS = Current Population Survey; NCVS = National Crime Victimization Survey; NHIS = National Health Interview Survey.
Bootstrap Analysis
Under the prior performance evaluation system, cutoffs for each of the five levels of rating were independently set by the management of each regional office. To determine the cutoffs for response rate standards, managers considered historical response rates and the goals of survey sponsors. For example, the Bureau of Labor Statistics has a goal of at least 90% response for the Current Population Survey (CPS), so the interviewer performance standards for CPS must align with this.
To aid the U.S. Census Bureau in determining new national cutoffs for interviewer ratings, we use bootstrapping to estimate distributions of the first through the 100th percentile of interviewer response rates for each combination of survey and cluster. We begin with data from the first half of 2012 and calculate interviewer response rates for each sample survey and cluster worked. Because half a year is a relatively short period of time, we find that many response rates are 0%, 50%, or 100%, as only one or very few cases were worked by each interviewer for some combinations of survey and cluster. To mitigate this and to estimate longer-run averages, we increase each interviewer’s caseload by a factor of 10. We then create a synthetic sample by drawing case outcomes from within survey and cluster and recalculate interviewer response rates for each survey–cluster combination. Finally, for each combination, we take 1,000 samples with replacement and compute smoothed bootstrap (or kernel density) estimates of each percentile. Specifically, let $n_{sc}$ be the total number of interviewers who worked on survey s in cluster c.
The bootstrap distributions of the low cluster’s percentiles generally have the lowest center (mean and median) and the most variation, followed by the medium cluster and then the high cluster. As an example, Supplemental Figure 1 displays the resulting smoothed bootstrap distribution of the 25th percentile of CPS interviewer response rates in each cluster.
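To make the simulation procedure more concrete, the following Python sketch mimics it for a single survey and cluster combination. It is a sketch under stated assumptions rather than the procedure actually run: the function names are hypothetical, case outcomes are assumed to be coded 1 (interview) and 0 (noninterview), each bootstrap sample is assumed to resample as many interviewers as were observed, the smoothing is assumed to be Gaussian noise with a user-chosen bandwidth, and the 95% limits are assumed to be simple percentile intervals.

```python
import numpy as np


def synthetic_rates(caseloads, outcomes, factor=10, rng=None):
    """Interviewer response rates after inflating each caseload by `factor`.

    caseloads: observed number of cases per interviewer in this cell.
    outcomes:  pooled 0/1 case outcomes for the same survey and cluster.
    """
    rng = np.random.default_rng() if rng is None else rng
    outcomes = np.asarray(outcomes, dtype=float)
    rates = [
        rng.choice(outcomes, size=int(m) * factor, replace=True).mean()
        for m in caseloads
    ]
    return np.array(rates)


def bootstrap_percentiles(rates, percentiles, n_boot=1000, bandwidth=0.01, rng=None):
    """Smoothed bootstrap estimates of percentiles of interviewer rates.

    Returns (estimate, lower 95% limit, upper 95% limit), one per percentile.
    """
    rng = np.random.default_rng() if rng is None else rng
    rates = np.asarray(rates, dtype=float)
    n = len(rates)
    draws = np.empty((n_boot, len(percentiles)))
    for b in range(n_boot):
        # Resample interviewers with replacement, then add kernel noise.
        sample = rng.choice(rates, size=n, replace=True)
        sample = np.clip(sample + rng.normal(0.0, bandwidth, size=n), 0.0, 1.0)
        draws[b] = np.percentile(sample, percentiles)
    estimate = draws.mean(axis=0)
    lower, upper = np.percentile(draws, [2.5, 97.5], axis=0)
    return estimate, lower, upper
```

Running both functions for every survey and cluster combination, with `percentiles` set to 1 through 100, would yield tables of percentile estimates and limits analogous to those described in the text.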
We provide U.S. Census Bureau management with the bootstrap estimates of each percentile along with 95% confidence limits to describe our best estimate of long-run interviewer response rates and their variability. This allows management to select national percentile cutoffs so that interviewer response rate targets align with survey response rate targets and to estimate the effect of the new response rate expectations on the distribution of interviewer ratings. For example, if the 20th, 40th, 60th, and 80th percentiles are chosen as the cutoffs for ratings of two through five, respectively, it is likely that approximately 20% of interviewers will receive each rating.
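The mapping from percentile cutoffs to ratings can be illustrated with a short Python sketch. The function name and the example cutoffs are hypothetical, and the convention that a response rate exactly equal to a cutoff earns the higher rating is an assumption.

```python
import numpy as np


def assign_ratings(response_rates, cutoffs):
    """Map response rates to ratings 1-5 using four ascending cutoffs.

    cutoffs[0] is the minimum rate for a rating of 2, cutoffs[1] for a
    rating of 3, and so on. A rate equal to a cutoff receives the higher
    rating (an illustrative convention).
    """
    cutoffs = np.asarray(cutoffs, dtype=float)
    rates = np.asarray(response_rates, dtype=float)
    return 1 + np.searchsorted(cutoffs, rates, side="right")


# Hypothetical cutoffs for one survey: rates of 0.75, 0.80, 0.87, 0.93,
# and 0.99 receive ratings of 1, 2, 3, 4, and 5, respectively.
example_ratings = assign_ratings(
    [0.75, 0.80, 0.87, 0.93, 0.99], [0.80, 0.85, 0.90, 0.95]
)
```

When the cutoffs are the 20th, 40th, 60th, and 80th percentiles of the same distribution that the interviewers’ rates follow, this mapping places roughly one fifth of interviewers in each rating, as noted above.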
Discussion
When beginning this project, we had the following two goals: (1) to account for the variability below the county level in the difficulty of obtaining interviews and (2) to create national response rate standards for interviewers. We have demonstrated that the cluster solution presented in this article identifies three groups of neighborhoods that are distinct in characteristics related to survey response, and these groups also differ in actual survey response behavior including both rate of response and the difficulty in obtaining interviews. In the third section, we simulated sampling and bootstrapped to provide U.S. Census Bureau management with a complete description of the estimated distributions of interviewer response rates. This allowed a single set of national response rate standards to be set from data-driven estimates of both response rates and their variability. As cutoffs for the five levels of rating, management selected sets of the bootstrap estimates of percentiles that varied by survey. The new, national standards are to take effect in the 2013 rating year.
The national standards set from our bootstrap distributions will improve on the prior performance standards in three main ways. First, the new standards will be based on current data, whereas the prior standards were based on an analysis of 1990 and 2000 census data and historical survey outcomes. The cluster solution we provided was developed from both the 2010 census and the 2006 to 2010 ACS five-year estimates, and the bootstrap distributions were derived from 2012 survey outcomes. In addition, we are able to update the cluster solution annually using ACS estimates. Second, the new standards will account for the variability within counties and differences across interviewer caseloads. The cluster solution captures the differences that are present across the neighborhoods of densely populated, metropolitan areas. The response rate standards will therefore better reflect the actual conditions of the locations worked rather than just their geography. Third, the new standards allow interviewers across the nation to be held to a single set of response rate expectations.
Although the new response rate standards will greatly improve the interviewer performance rating system, we acknowledge some shortcomings of our analyses. First, we had only six months of survey response outcomes from which to estimate response rates due to the deadline for implementing the 2013 performance standards and the regular purging of data. Ideally, we would have at least a full year of data to estimate these annual rates and to capture any seasonal variation that may occur. Second, review of our cluster solution by the management of the U.S. Census Bureau’s remaining six regional offices revealed that the clusters do not always capture high-crime neighborhoods, which are regarded as difficult. The cluster analysis considered many variables that are associated with crime but did not explicitly include crime rate as we are not aware of any robust crime estimates at the neighborhood level. This is a topic for future research. Third, although response rate is the most heavily weighted component of the U.S. Census Bureau’s interviewer performance standards, interviewers are evaluated on additional, measurable criteria including “production and cost” and various data quality metrics. To fully move to national performance standards, we must also examine these measures. In working with the field operations staff to implement the new response rate standards, we have already seen opportunities to improve the interviewer rating system and have begun research to determine how we might refine and extend our methodology.
Footnotes
Author’s Note
This report is released to inform interested parties of ongoing research and to encourage discussion. The views expressed on statistical issues are those of the authors and not necessarily those of the U.S. Census Bureau.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
