Abstract
The Current Population Survey (CPS) has been the nation’s primary source of information about employment and unemployment for decades. The data are widely used by social scientists and policy makers to study labor force participation, poverty, and other high-priority topics. An underutilized feature of the CPS is its short-run panel component. This paper discusses the unique challenges encountered when linking basic monthly data as well as when linking the March basic monthly data to the Annual Social and Economic (ASEC) Supplement in the 1976–1988 period. We describe strategies to address linking obstacles and document linkage rates.
Introduction
The Current Population Survey (CPS) is the primary source of information about employment and unemployment in the United States. It has been a key data resource for the social science research and policy making community for decades, providing monthly snapshots of the civilian labor force, with microdata available since 1976. The survey has a short-run panel component that is largely unknown and underutilized by the research community. CPS respondents participate in the survey eight times, answering surveys for four consecutive months, rotating out of the survey for eight months, then answering for four more consecutive months. Linked CPS data present numerous opportunities for charting and understanding dynamics of short-run change over the last half century. Examples of research that could be conducted using linked CPS data include analyzing the demographic and employment correlates of families transitioning in and out of poverty; examining the extent to which engagement in volunteering changes following transitions out of employment among older adults; and studying whether families’ employment arrangements remain stable or change following the birth of a child. Other investigations might focus on how individuals organize their work and family lives in response to recessions, the effects of policy changes on employment, or the labor force participation of veterans from different wars.
IPUMS1 (ipums.org) is a leading disseminator of CPS data, streamlining and simplifying access to these vital social, economic, and demographic data [1]. IPUMS CPS (cps.ipums.org) delivers Annual Social and Economic (ASEC) Supplement data from 1962 to present and basic monthly survey (BMS) data from 1976 to present along with nearly all topical supplement data. IPUMS codes variables consistently across time, provides access to original unrecoded versions of the variables, documents all variables for ease of use, and enables users to create customized datasets that include only variables from the months and years of data they want to analyze. This work substantially reduces duplicated effort across researchers.
Previously we have documented linkages across months from 1989 forward [2] and between the March BMS and the ASEC [3]. These linkages can be replicated using the IPUMS-constructed linking keys CPSIDP and MARBASECIDP, respectively; these linking keys are available via IPUMS CPS. While researchers could use original keys to link CPS files over time, IPUMS-constructed linking keys and accompanying documentation provide the research community with a common starting point, save individual researchers hours of effort, and reduce potential for error in the linking process. In this paper, we document the specific challenges linking individuals across months and between the ASEC and March BMS in the 1976 to 1988 period. We outline the solutions we implemented and provide linkage rates against which researchers can check their linkages.
CPS rotation pattern and linking
The CPS is a rotating panel household-level survey in which all individuals residing in a household are surveyed for four months, rotate out of the survey for eight months, and then rotate back into the survey for another four months (known as the 4-8-4 rotation pattern). The first time a household appears in the CPS is their first “month-in-sample” (MIS)2 and this is indicated in the data (MIS coded as 1). Each subsequent month that a household is included in the CPS, their MIS value increments by one. A household’s MIS value does not increment in months when the household is out of rotation (the eight-month break). Thus, the final month of the 4-8-4 rotation pattern in the CPS has an MIS value of eight.
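The 4-8-4 rotation pattern described above can be written out as a short calculation. The following Python sketch is illustrative only and is not part of the IPUMS processing code; the function name `interview_months` is hypothetical. It lists the calendar month associated with each MIS value for a household that enters the CPS in a given month.

```python
def interview_months(first_year, first_month):
    """Return [(mis, year, month)] for a household first interviewed in the given month."""
    # Calendar-month offsets from the first interview under the 4-8-4 pattern:
    # four consecutive months, an eight-month break, then four more months.
    offsets = [0, 1, 2, 3, 12, 13, 14, 15]
    schedule = []
    for mis, offset in enumerate(offsets, start=1):
        total = (first_month - 1) + offset  # months elapsed since January of first_year
        schedule.append((mis, first_year + total // 12, total % 12 + 1))
    return schedule


for mis, year, month in interview_months(1981, 12):
    print(f"MIS {mis}: {year}-{month:02d}")
```

A household entering in December 1981 is therefore scheduled through March 1982, is out of the sample from April through November 1982, and returns for MIS 5 in December 1982.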
The Census Bureau provides a very brief set of guidelines for linking individual observations across CPS data files [4], but these are insufficient for creating linkages. The guidelines list the variables needed to link observations over time but give little detail about how to carry out the linking. Furthermore, the instructions focus on linking within a set of years when linking keys are stable and provide limited direction on how to bridge changes in linking keys over time. Finally, the documentation [5] also indicates that it is not possible to link some years of data together due to changes in the survey (i.e., 1976 to 1977 and 1985 to 1986).
CPSID(P)
To overcome barriers related to the rotation pattern and limited guidance for linking CPS data over time, IPUMS created CPSID and CPSIDP. CPSID is a linking key that accounts for changes in original CPS linking keys and consistently identifies households across time. CPSIDP is a person-level linking key that consistently identifies persons across time. CPSID(P)3 is created by applying Census Bureau rules for linking across time. CPSID(P) eliminates the need for every researcher to navigate the complex CPS rotation pattern, changes in linking keys over time, and additional data quality challenges that arise. CPSID(P) provides a common starting point for the research community, which should increase reproducibility of research using linked CPS data.
The CPS is a household survey, which has implications for the creation of CPSID(P). A unique CPSIDP value, based on original linking keys, is assigned to a single individual each time they appear in the CPS. If one or more of the individuals in the household move out of the household between CPS interviews, individuals who have moved are not followed. If everyone in the household moves, and new people occupy the dwelling, the new people are interviewed the next time the household is included in the CPS. In 1976 to 1988, a household is identified using a household identification number (HRHHID) and a household number (HUHHNUM).4 If an entire household is replaced, the household number (HUHHNUM) value changes. Under these circumstances, IPUMS CPS generates a new CPSID value for the new household [2]. For example, a household that was first observed in the CPS in December of 1981 may have the following CPSID value 19811203287400; the two individuals living in the household would have CPSIDP values of 19811203287401 and 19811203287402, respectively. A third individual who joins the household in January of 1982 would be assigned a CPSIDP value of 19811203287403; the two returning household members would be assigned the CPSIDP values they were first assigned in December.5
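The example values above suggest a simple relationship between the household and person keys. The Python sketch below assumes, based only on that example, that the first six digits of the 14-digit key record the year and month of first appearance, the next six digits index the household, and the final two digits index persons within the household (00 for the household-level CPSID). The function names are hypothetical and the layout should be confirmed against the IPUMS CPS documentation before use.

```python
def split_id(value):
    """Split a 14-digit CPSID or CPSIDP value into its apparent components."""
    digits = f"{value:014d}"
    return {
        "year": int(digits[:4]),      # assumed: year of first appearance
        "month": int(digits[4:6]),    # assumed: month of first appearance
        "household": digits[6:12],    # assumed: household index within that month
        "person": int(digits[12:]),   # assumed: person index (0 for a household key)
    }


def person_key(cpsid, person_index):
    """Build a person-level key from a household key and a person index."""
    if cpsid % 100 != 0:
        raise ValueError("expected a household-level key ending in 00")
    if not 1 <= person_index <= 99:
        raise ValueError("person index must fit in two digits")
    return cpsid + person_index


print(person_key(19811203287400, 3))
print(split_id(19811203287403))
```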
We encountered several problems when linking CPS respondents in the 1976 to 1988 period. Broadly, these included issues with 1) inconsistent household identifiers, and 2) duplicated person identifiers within households. We describe these problems in detail along with solutions we employed to extend the IPUMS CPS variables CPSID(P) and MARBASECID(P) to basic monthly and ASEC data in the 1976 to 1988 period.
Creating CPSID(P) for basic monthly files
Each BMS data file contains only individuals interviewed in that specific month. In some months, topical supplement data are also collected. Regardless of whether topical supplement data are collected, a file containing only the basic monthly variables (hereafter referred to as the basic monthly-only file) is always released. In months where a topical supplement is also fielded, a second file (the supplement-containing file) is released later that includes the topical supplement variables appended to the end of the records. The basic monthly portions of the basic monthly-only and supplement-containing files should be identical, though this is not always the case in the 1976–1988 period. We detail the differences we encounter and our efforts to reconcile the basic monthly-only and supplement-containing files in Appendix A. We also encounter challenges linking across months in the 1976 to 1988 period. The two major obstacles to linking monthly files are inconsistent household identifiers and duplicated person identifiers.
Household identifiers and linking across months
To address problems with the household identifiers that affected linking across months between 1976 and 1988, we made modifications to the original household identifier prior to creating CPSID(P). HHID refers to the original, unmodified household identifier and HRHHID refers to the IPUMS CPS-modified household identifier used for linking. Except in specified months, HHID and HRHHID are the same.
Non-unique household identifiers
In 1976 and 1977, the original household identifier (HHID) does not uniquely identify all households in basic monthly-only files. Through trial and error, we located a set of variables that together uniquely identify households within and across months. We used these to create HRHHID for the basic monthly-only files as follows: we used the first nine digits of the twelve-digit HHID and replaced the last three digits of HHID with information from the first two digits of the third and sixth sets of columns marked as “blank” in the original CPS codebook (i.e., BLANK3 and BLANK6).6
In 1976 and 1977, HRHHID, constructed as just described, yields more plausible households within basic monthly-only files than the original household identifier (HHID). Specifically, households identified using HHID were extremely large and contained multiple household heads; using HRHHID results in smaller households with only one head per household. HHID uniquely identifies households in the supplement-containing files; accordingly, we gauge the quality of HRHHID in the basic monthly-only files by making comparisons with the supplement-containing files from the same month (in months that contain supplements). In 1976, HRHHID from the basic monthly-only files matched HHID in the supplement-containing files for 75% of individuals. For the remaining quarter of the records, the final three digits of HRHHID (which correspond to BLANK3 and BLANK6) matched across files, but the first nine digits of HRHHID (which correspond to the first nine digits of HHID) did not match. In these instances, the first nine digits differ systematically between the basic monthly-only and supplement-containing files, so we made a second adjustment to HRHHID (see Appendix A). In 1977, neither HHID nor HRHHID yields successful matches between basic monthly-only and supplement-containing files, so we are unable to check the validity of our constructed HRHHID household identifier as we did for 1976. However, HRHHID in the basic monthly-only files results in each household containing only one household head, which is almost always consistent with the organization of households in the supplement-containing files and represents an improvement over HHID.
Non-numeric household identifiers
In January 1976 to June 1985, some records contained non-numeric characters in HHID. We made adjustments to the original household identifier that allow us to treat HRHHID as a numeric variable in all months. In 1976 and 1977 basic monthly-only files, a few records in each monthly file, usually less than 10, had a ‘
Unlinkable months
Census Bureau documentation indicates that it is not possible to link some months of CPS data across years (1976 to 1977 and 1985 to 1986) due to changes in the survey [5]. We elaborate on these known limitations in the 1970s and remark briefly on our investigation in the 1980s.
Unlinkable months within 1977 and between 1976, 1977, and 1978 are the result of a phased-in sample size increase. Starting with the supplement-containing months of 1977, additional households were included in the CPS, resulting in 15,000 to 25,000 more individuals in these months (known as the D-sample). This sample size increase was extended to include all months in 1978. To prevent identification of these additional individuals in 1977, the Census Bureau altered the procedures for generating HHID so that supplement-containing months that included the D-sample could not be linked to basic monthly-only months without it [6].
Despite HRHHID uniquely identifying households within the basic monthly-only files in 1976 to 1978, this identifier does not always yield linkages across months within years or across years. Linkages are possible within 1976 and 1978 between months that are basic monthly-only and supplement-containing (see Fig. 1a). This is not the case, however, in 1977 due to the D-sample. Figure 1b shows linkages between types of files across years. In 1976, both basic monthly-only and supplement-containing files can be linked to 1977 basic monthly-only files, but not to 1977 supplement-containing files. No linkages are made between 1976 and 1978 even when possible given the 4-8-4 rotation pattern (i.e., MIS 1 in October, November, and December 1976). In 1977, supplement-containing files can be linked to both 1978 basic monthly-only and supplement-containing files; no 1977 basic monthly-only file will link to either basic monthly-only or supplement-containing files in 1978.
a. Linkages possible between file types within years, 1976–1978. b. Linkages possible between file types across years, 1976–1978.
We find that no month through June of 1985 can be linked to any month from July of 1985 onward. This is consistent with Census Bureau technical documentation, which provides no explanation of why linkages are not possible [5]. We suspect that this linking barrier may be due to a CPS redesign that began in April of 1984 and concluded by July of 1985 [7]. We find a similar linking discontinuity between September and October of 1985: no month from October of 1985 onward can be linked to any month through September of 1985. This break in linking is due to a change in Census Bureau confidentiality rules beginning in October of 1985 that allows for the identification of smaller individual metropolitan areas in the public use data [12]. As a result of these two barriers to linking in 1985, July, August, and September of that year can only link to one another and to no other months.
Non-unique person identifiers
Person identifiers are required along with household identifiers to match individuals across months of the CPS. While person identifiers should be unique within household in a given data file, the same person identifier is sometimes assigned to multiple individuals in the same household in months between 1976 and 1983, resulting in duplicate values of the person line number (LINENO) within households. This presents a problem for linking individuals across months using only person (LINENO) and household identifiers (HRHHID and HUHHNUM).
In the basic monthly data, if two or more records in a household have identical person identifiers, we do not allow linkages across months for those individuals. CPSIDP is a linking key based solely on Census Bureau identifiers. Accordingly, we do not use additional information about the individuals to try to ascertain which record with the duplicated person identifier should be linked to a single record with the same identifier in a surrounding month. When we encounter pairs of records with duplicate LINENO values, we assign them both a new (and unique) CPSIDP value. Consider the following household with the HRHHID value of 003147831701 that appears in three months of the CPS, beginning in December of 1981.
Household from the December 1981 Current Population Survey file.
The first two records in the household in December of 1981, shown in Fig. 2, have the same identifiers (HRHHID, HUHHNUM, and LINENO). Though the identifiers are the same, there is variation between these records on the other variables, so the records are not complete duplicates. Since the first two records have the same identifiers, we assign them unique CPSIDP values.
This household is also in the January 1982 CPS (see Fig. 3). In January, HRHHID and HUHHNUM values match and the LINENO values are 01 and 02, respectively. Without additional information, we do not know which of the people in this household in December 1981 with LINENO
Household from the January 1982 Current Population Survey file.
As of January 1982, the household with an HRHHID value of 003147831701 and an HUHHNUM value of one has appeared in the CPS twice and has four CPSIDP values. When we see the same household in February of 1982 (see Fig. 4) with LINENO values of 01 and 02, we assign the CPSIDP values to them that they were assigned in January of 1982. This occurs because we create CPSIDP by looking to the previous month for a match and continuing backward in time through all linkable months until a match is found.
Household from the February 1982 Current Population Survey file.
Our approach is conservative and undoubtedly misses some plausible links across months. However, CPSIDP is intended to be a mechanical match that uses only household and person identifiers. Duplicate person line numbers are uncommon, representing less than 1% of cases each month from January 1976 to December 1983.
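The mechanical rule described in this section can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the IPUMS implementation: the function name `assign_person_ids`, the integer IDs, and the household label "H1" are hypothetical, and the sketch ignores the rotation schedule and the unlinkable months discussed earlier. Records whose identifiers are duplicated within a month always receive fresh IDs; other records reuse the ID from the most recent earlier month in which the same identifiers appear uniquely.

```python
from collections import Counter
from itertools import count


def assign_person_ids(months):
    """months: chronological list of monthly files, each a list of
    (hrhhid, huhhnum, lineno) tuples. Returns a parallel list of ID lists."""
    next_id = count(1)
    history = []   # one dict per earlier month: identifier tuple -> assigned ID
    assigned = []
    for records in months:
        key_counts = Counter(records)
        linkable = {}
        ids = []
        for key in records:
            if key_counts[key] > 1:
                # Duplicated identifiers are never linked; each record gets a new ID.
                ids.append(next(next_id))
                continue
            # Look to the previous month first, then continue backward in time.
            match = next((m[key] for m in reversed(history) if key in m), None)
            person_id = match if match is not None else next(next_id)
            linkable[key] = person_id
            ids.append(person_id)
        history.append(linkable)
        assigned.append(ids)
    return assigned


# Hypothetical household: two records share line number 1 in the first month.
december = [("H1", 1, 1), ("H1", 1, 1)]
january = [("H1", 1, 1), ("H1", 1, 2)]
february = [("H1", 1, 1), ("H1", 1, 2)]
print(assign_person_ids([december, january, february]))
```

In this hypothetical case the household accumulates four distinct IDs by the second month, and the third month reuses the two IDs assigned in the second month, which mirrors the pattern described for the example household above.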
Duplicate person line numbers are not an issue in the 1984 to 1988 period. The challenge in this period is that the CPS basic monthly files include three different versions of demographic variables and the Census Bureau documentation on which version to use for these purposes is unclear (see Appendix B for more details). Importantly, these files include three person line number variables. However, only one of the person line number variables (Item 18A in columns 541–542) uniquely identifies persons within households. We use the variable that uniquely identifies persons within households for linking.
Children under 14 are not included in basic monthly-only files until 1982. However, children are included in supplement-containing months prior to 1982: October 1976, October 1977, and May 1978 through December 1981 (see Appendix A). Despite inclusion in the data file, we do not attempt to link persons under 14 across months in this period, as they do not appear in all months; they have CPSIDP values of 0.
Basic monthly survey linkage rates across time
Linkage rates across months in 1976 to 1988 using HRHHID, HUHHNUM, and LINENO as linking keys are lower than in recent years [2] because of the comparatively poor data quality in the earlier period, as outlined above. We provide sample sizes and retention rates before and after validating links using AGE, SEX, and RACE for CPS data collected in 1976–1977 and 1987–1988 (see [2] for linkages in 1994–1995 and 2009–2010). Table 1 shows the total number of records in each month-in-sample group from January 1976 to April 1977 and January 1987 to April 1988.
Based on the 4-8-4 rotation pattern of the CPS, 75% of respondents are eligible to link between consecutive months (MIS 1-3 and MIS 5-7 in a given month can link to the next month; MIS 4 and 8 rotate out of the survey). Table 2 shows linkages and retention rates between January and February in 1976 and in 1987. More than 90% of eligible records in January are observed in February and nearly all of them are plausible based on age, sex, and race. Plausible links are those which have the same values for SEX and RACE in all time points and whose AGE does not increase by more than two years.
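The plausibility rule stated above can be expressed as a small check. The following Python sketch is one possible reading of that rule and is not the authors' code: the function name and the dictionary layout are hypothetical, and the age test compares the last observation with the first, which is an assumption. The text does not say how age decreases are handled, so the sketch does not reject them.

```python
def is_plausible(records, max_age_increase=2):
    """records: one linked person's observations in chronological order,
    each a dict with 'SEX', 'RACE', and 'AGE'."""
    same_sex = len({r["SEX"] for r in records}) == 1
    same_race = len({r["RACE"] for r in records}) == 1
    # Assumption: the increase is measured from the first to the last observation.
    age_ok = records[-1]["AGE"] - records[0]["AGE"] <= max_age_increase
    return same_sex and same_race and age_ok


january = {"SEX": 2, "RACE": 1, "AGE": 34}
february = {"SEX": 2, "RACE": 1, "AGE": 35}
print(is_plausible([january, february]))
```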
Linkages are also possible across non-consecutive months. Table 3 shows links two months apart, from October to December, in 1976 and in 1987. Half of individuals in October are eligible to participate in the CPS in December. Of those who are eligible, 73% are linked in 1976 compared to 90% in 1987. Most of these linkages are plausible based on comparisons of age, sex, and race.
The CPS rotation pattern also allows for linking the same month across adjacent years (see Table 4). The individuals in MIS 1-4 in 1976 and 1987 are eligible to participate in the CPS in the same month in 1977 and 1988, respectively. About 75% of eligible individuals are observed the next year and just over 70% of these links are plausible based on age, sex, and race.
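The eligibility shares quoted in this section follow directly from the 4-8-4 schedule. The Python sketch below is an illustration with hypothetical function names; the shares assume that the eight month-in-sample groups are of equal size, which is only approximately true in the data.

```python
# Calendar-month offset of each month-in-sample (MIS) value from MIS 1.
OFFSETS = dict(zip(range(1, 9), [0, 1, 2, 3, 12, 13, 14, 15]))


def eligible_mis(gap_months):
    """MIS values whose households are scheduled again gap_months later."""
    scheduled = set(OFFSETS.values())
    return [mis for mis, offset in OFFSETS.items() if offset + gap_months in scheduled]


def eligible_share(gap_months):
    """Share of a month's respondents eligible to link, assuming equal-size groups."""
    return len(eligible_mis(gap_months)) / len(OFFSETS)


for gap in (1, 2, 12):
    print(gap, eligible_mis(gap), eligible_share(gap))
```

Under these assumptions, a one-month gap leaves MIS 1-3 and 5-7 eligible (75%), a two-month gap leaves MIS 1, 2, 5, and 6 eligible (50%), and a twelve-month gap leaves MIS 1-4 eligible (50%).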
Respondents may also be linked across up to eight months of participation in the CPS. Table 5 shows the number of people starting the CPS in January 1976 and January 1987 who appear in the CPS up to eight times. The individuals who are first observed in January 1976 link well in 1976; 93% are also observed in February 1976 and 85% are also observed in February, March, and April of 1976. However, no individuals who started the CPS in January 1976 are linked to January 1977, as it is a supplement-containing month and such linkages are not possible (as described above). The individuals who begin the CPS in January 1987 link well within 1987; 85.91% are observed in all four months between January and April 1987. Linkage rates decrease after the eight month gap between MIS 4 and MIS 5 (see also [2]). About 59% of respondents who started the CPS in January 1987 are observed all eight times. In both 1976 and 1987, most of the linkages we make are plausible based on age, sex, and race.
Number of people responding to the CPS, by calendar month, month-in-sample group, and year
Note: Table reports unweighted sample sizes for the number of people participating in the CPS in each calendar month, by month-in-sample group.
Sample size and retention rate, CPS respondents linked across two consecutive calendar months
Note: Table reports the unweighted number and percentage of CPS respondents in January of Year X (the shaded box) who responded to the CPS in February of that year. Under “Year X,” entries report the month and year in which respondents were in MIS1. Because of the rotation group structure, not all respondents in January are eligible to respond in February. The column labeled “plausible” omits apparent matches when respondents’ sex or race/ethnicity differs or when their age differs implausibly.
Sample size and retention rate, CPS respondents linked across two non-consecutive calendar months
Note: Table reports the unweighted number and percentage of CPS respondents in October of Year X (the shaded box) who responded to the CPS in December of that year. Under “Year X,” entries report the month and year in which respondents were in MIS1. Because of the rotation group structure, not all respondents in October are eligible to respond in December. The column labeled “plausible” omits apparent matches when respondents’ sex or race/ethnicity differs or when their age differs implausibly.
Sample size and retention rate, CPS respondents linked in March across two consecutive years
Note: Table reports the unweighted number and percentage of CPS respondents in March of Year X (the shaded box) who responded to the CPS in March of the next year. Under “Year X,” entries report the month and year in which respondents were in MIS1. Because of the rotation group structure, not all respondents in March are eligible to respond the following March. The column labeled “plausible” omits apparent matches when respondents’ sex or race/ethnicity differs or when their age differs implausibly.
Number and percentage of people responding to subsequent CPS surveys among those beginning the CPS in January 1976 and 1987
Note: Separately for people entering the CPS in January 1976 and January 1987, the table reports unweighted sample sizes for the number of people participating in all of the CPS surveys for which they were eligible up through the focal month. For example, among the 12,287 people who began the CPS in January of 1976, there were 10,486 (or 85.34%) who participated in all four surveys between January and April 1976 and 0 who participated in all eight surveys between January 1976 and April 1977 due to the linking discontinuity between samples with and without supplements in these years. The column labeled “plausible” omits apparent matches when respondents’ sex or race/ethnicity differs or when their age differs implausibly.
Number and percentage of people responding to subsequent CPS surveys among those beginning the CPS in January 1976 and 1987
Note: Separately for people entering the CPS in January 1976 and January 1987, the table reports unweighted sample sizes for the number of people participating in ANY of the CPS surveys for which they were eligible up through April of the following year. For example, among the 12,287 people who began the CPS in January of 1976, there were 11,775 (or 95.83%) who participated in at least one more survey between February 1976 and April 1977. The column labeled “plausible” omits apparent matches when respondents’ sex or race/ethnicity differs or when their age differs implausibly.
Number and percentage of people responding to subsequent CPS surveys among those beginning the CPS in January 1976 and 1987
Note: Separately for people entering the CPS in January 1976 and January 1987, the table reports unweighted sample sizes for the number of people participating in any of the CPS surveys for which they were eligible between January and April of the following year. For example, among the 12,287 people who began the CPS in January of 1976, there were 9,475 (or 77.11%) who participated in at least one more survey between February 1976 and April 1977. The column labeled “plausible” omits apparent matches when respondents’ sex or race/ethnicity differs or when their age differs implausibly.
Tables 6–8 show linkage rates for a variety of different linkage scenarios. Table 6 shows those individuals who entered the CPS in January 1976 and January 1987 (MIS 1) and appear in any of the subsequent seven months that their household could have been in the CPS (either MIS 2 or MIS 3 or MIS 4 or MIS 6 or MIS 7 or MIS 8). Most respondents (95% in 1976 and 97% in 1987) appear in at least two months of the CPS. Table 7 shows the percent of individuals in January 1976 (and 1987) who appear in the CPS in MIS 2-4. About three-quarters of individuals who begin the CPS in January 1976 and January 1987 are observed in at least one additional month between February and April of 1976 and 1987, respectively. Finally, Table 8 shows attrition between MIS 4 and 5.
The ability to easily link the Annual Social and Economic (ASEC) Supplement with the CPS BMS creates many research possibilities. Information only available in the ASEC may be combined with multiple data points from the BMS or used in combination with CPS topical supplements. To make these linkages easier, we add CPSID(P) to the ASEC files. Research utilizing these linking keys, for example, combines information on union membership from the monthly data with tax and public benefit receipt from the ASEC [8] and analyzes family income and health insurance from the ASEC along with smoking behavior from the Tobacco Use Supplement [9].
This section of the paper details the creation of MARBASECIDH and MARBASECIDP, hereafter MARBASECID(P), for the 1976 to 1988 period.9 MARBASECIDH and MARBASECIDP are IPUMS CPS variables that link the March basic monthly data to ASEC data from the same year and enable the addition of CPSID(P) to the ASEC files. Data quality issues in 1976 to 1988 such as differing numbers of records in the household between the March basic monthly and ASEC files and mismatched or duplicated person identifiers necessitated a methodology distinct from the 1989 forward period, which relied on Census Bureau identifiers only (with a few exceptions). The methodology for creating MARBASECID(P) in 1976 to 1988 is also distinct from our CPSID(P) methodology. Because all records from the March BMS should theoretically appear in the ASEC file for a given year, we are more persistent in our attempts to create linkages between the March BMS and the ASEC than we are for links across months of the BMS. We describe the problems encountered in attempting to link March BMS and ASEC files in this period, detail the solutions we implemented to generate MARBASECID(P), and compare our methodology with alternatives.
Missing March basic monthly records in the ASEC
Between 1976 and 1988, the ASEC file should include all March basic monthly households plus a Hispanic oversample drawn from the previous November CPS [4]. However, in all years during this period, the March basic monthly file contains individuals who do not appear in the ASEC file. Table 10 shows the total number of individuals in the March basic monthly survey for each year between 1976 and 1988 (Panel A), the number that merge to the ASEC (Panel B), and details about unmerged individuals (Panel C). Panel B shows that, except in 1977, over 98% of individuals in the March basic monthly file are merged with the ASEC; the majority are located in merge stage 1, which we describe in more detail below in the “Strategy for Linking March basic monthly to ASEC” section. For those unmerged, we differentiate between individuals whose household (using HRHHID) is or is not in the ASEC. In the first instance, the same HRHHID value is in both the ASEC and the March BMS. Either the household in the ASEC contains fewer persons than the same household in the March BMS or we are unable to confidently match records within a household across files due to duplication or mismatch of variables (described in the next section). The number of basic monthly records for which the HRHHID does not appear in the ASEC is less than 100 per year in 1976 to 1985 and is in the thousands in 1986 to 1988. In the 1986–1988 period, the Census Bureau scrambled household identifiers in the ASEC file for privacy reasons,10 and we do not attempt to link the basic monthly and ASEC in these years.
Sample size and retention rate, CPS respondents in month-in-sample 4 linked to month-in-sample 5
Note: Table reports the number and percentage of CPS respondents in month-in-sample four who responded to the CPS in month-in-sample five nine months later. Under “Year X,” entries report the month and year in which respondents were in MIS1. The rows labeled “plausible” omit apparent matches when respondents’ sex or race/ethnicity differs or when their age differs implausibly.
We encounter difficulties uniquely identifying records within files and in matching records across BMS and ASEC files from 1976–1988. This is not the case in the 1989 forward period where HRHHID and LINENO are sufficient to uniquely identify almost all records within and link across March BMS and ASEC files [2].
Difficulty uniquely identifying records
With the exception of the 1982 and 1983 March BMS files, no March BMS or ASEC files contain records that are complete duplicates between 1976 and 1988. However, even though entire records are unique in most of the files, we are often unable to find a single set of variables to uniquely identify all records in households that appear in both the March BMS and the ASEC file for a given year. The 1989-onward method of simply using linking keys to match March BMS and ASEC data is insufficient. Furthermore, we are unable to identify a single set of variables to use as linking keys that allow us to link all March basic monthly records to their ASEC counterparts for all years in the 1976–1988 period.
Household 509037594903 in the 1982 March BMS and ASEC.
Household 202962182216 in the 1985 March BMS and ASEC.
Consider the following example household (HRHHID
In other cases, person identifiers (LINENO) do not match across the March BMS and ASEC files and the demographic variables are insufficient for uniquely identifying individuals when the person identifier is omitted. For example, consider the following household (HRHHID
Household 046112033020 in the 1976 March BMS and ASEC.
Household 50007137050 in the 1976 March BMS and ASEC.
There are two ways that records may not be linked even if identifiers are unique in 1976 to 1988. Both occur because LINENO values do not match across BMS and ASEC files, which undermines the utility of LINENO as a linking key.11 For example, in this household from 1976 (HRHHID
Second, some individuals are uniquely identified by demographic variables and line number and appear in the same order within the household, but their person identifiers differ across the March BMS and ASEC files. This is another instance where, despite unique person line numbers, linkages across March BMS and ASEC files are not possible. We illustrate this situation in a household from 1976. The two versions of the household contain the same number of people with the same age, sex, race, and work hours, but have different LINENO values in the March BMS and the ASEC. This data quality issue is not present in the data from 1989 forward but presents a major obstacle to linking March BMS and ASEC records from 1976–1988.
Strategy for linking March basic monthly to ASEC
Given the problems detailed above, we apply a multi-stage merging process to maximize linkage rates while minimizing spurious matches. Demographic information is indispensable from 1976 to 1988 for making matches between the March BMS and ASEC, though it is only used in a handful of cases in the 1989–2019 period. Reliance on demographic and other auxiliary information for linking represents a departure from our general approach of creating mechanical links for CPSID(P) and MARBASECID(P). However, the unique challenges during this period necessitate a different approach. We identify matches between the March BMS and ASEC in up to six stages: four in 1976–1981, five in 1982–1987, and six in 1988. We use person identifiers and demographic information as linking keys in some stages and exclude them in other stages.
Preliminary steps
For an individual to be matched across BMS and ASEC files, the household identifier must appear in both files. Before attempting to link, we exclude those individuals whose HRHHID value does not appear in both the BMS and ASEC files (see Table 10). Because there are no persons under the age of 14 in the March BMS from 1976 to 1981, we also exclude individuals under age 14 in the ASEC from linking to the March BMS. Individuals excluded from the matching are assigned non-linking MARBASECIDP values and CPSIDP values of 0 in the ASEC file.12
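The preliminary exclusions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: records are plain dictionaries, the function name `eligible_for_linking` and the toy HRHHID values are hypothetical, and the age cutoff is passed as a parameter so that it is applied only in the years where it is relevant (1976 to 1981).

```python
def eligible_for_linking(bms, asec, min_age=None):
    """Split records into those eligible for linking and those excluded.

    bms, asec: lists of dicts with at least 'HRHHID' and 'AGE' keys.
    min_age:   if set (e.g. 14 for 1976-1981), ASEC persons younger than
               this are excluded because the March BMS has no such persons.
    """
    # Household identifiers that appear in both files.
    shared = {r["HRHHID"] for r in bms} & {r["HRHHID"] for r in asec}

    bms_in = [r for r in bms if r["HRHHID"] in shared]
    bms_out = [r for r in bms if r["HRHHID"] not in shared]

    asec_in, asec_out = [], []
    for r in asec:
        too_young = min_age is not None and r["AGE"] < min_age
        excluded = r["HRHHID"] not in shared or too_young
        (asec_out if excluded else asec_in).append(r)
    return bms_in, bms_out, asec_in, asec_out


# Toy data (hypothetical identifiers, not real CPS households).
bms = [{"HRHHID": "A", "AGE": 40}, {"HRHHID": "B", "AGE": 30}]
asec = [{"HRHHID": "A", "AGE": 40}, {"HRHHID": "A", "AGE": 10},
        {"HRHHID": "C", "AGE": 50}]
bms_in, bms_out, asec_in, asec_out = eligible_for_linking(bms, asec, min_age=14)
```

In this toy example only the 40-year-old in household A remains eligible in each file; the other records would receive non-linking identifiers.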
We use original Census variables for linking March BMS to ASEC with three exceptions: HRHHID in the 1976 BMS, race in the 1988 ASEC, and hours worked in the 1982–1988 BMS. The adjustment made to the 1976 BMS household identifier is described in the “Problem 1: Household Identifiers and Linking across Months” section above. In 1988, race codes are different in the March BMS and ASEC; we standardize race codes by recoding values of 3, 4 or 5 in the ASEC (“American Indian or Aleut Eskimo”, “Asian or Pacific Islander”, and “Other”) to 3, which represents “Other” in the March BMS. In the “hours worked” variable (AHRSWORKT) in all March BMS files from 1982–1988, there are two varieties of missing values, one for adults who were not working last week and one for children. In the ASEC files from these years, all NIU cases have a value of 0. We recode the two missing values in the March basic monthly files to 0 before matching. This missing value code harmonization is available in the IPUMS CPS variable AHRSWORKT.
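The two recodes described above might look like the following minimal sketch. The function names are hypothetical, and the missing-value codes in `BMS_HOURS_MISSING` are placeholders: the text does not state the actual codes used in the original March BMS files.

```python
# Placeholder codes for the two kinds of missing hours in the March BMS
# (adults not working last week, and children). These values are
# hypothetical; the original codes are not given in the text.
BMS_HOURS_MISSING = {-1, -2}


def harmonize_asec_race_1988(race):
    """Collapse 1988 ASEC race codes 3, 4 and 5 into 3 ('Other' in the BMS)."""
    return 3 if race in (3, 4, 5) else race


def harmonize_bms_hours(hours):
    """Recode both BMS missing-value codes to 0, the ASEC NIU value."""
    return 0 if hours in BMS_HOURS_MISSING else hours


recoded_race = [harmonize_asec_race_1988(r) for r in (1, 2, 3, 4, 5)]
recoded_hours = [harmonize_bms_hours(h) for h in (40, -1, -2, 0)]
```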
Linking methods
Before each linking stage, we set aside, for later use, all records that are not uniquely identified by that stage’s linking keys (Table 9). After linking, we combine the set-aside records with the unlinked records (separately for the March BMS and ASEC); these files serve as the inputs for the next linking stage. We detail the number of linkages between the March BMS and the ASEC at each stage for every year from 1976 to 1988 in Table 10. The linking stage in which a given record was linked is available to researchers via IPUMS CPS in the variable MARBASECSTAGE, which is available on both the March BMS and ASEC files from 1976 to 1988.
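The set-aside-then-link procedure can be illustrated with a short Python sketch. It is not the production IPUMS code: the function names `link_stage` and `multi_stage_link` and the toy household are hypothetical, and records are assumed to be dictionaries keyed by variable name.

```python
from collections import Counter


def link_stage(bms, asec, keys):
    """Run one linking stage on lists of record dicts.

    Records whose key values are not unique within their own file are
    set aside; the rest are linked on exact key equality. Returns
    (links, bms_rest, asec_rest), where the *_rest lists contain both
    set-aside and unlinked records and feed the next stage.
    """
    def key_of(r):
        return tuple(r[k] for k in keys)

    bms_counts = Counter(key_of(r) for r in bms)
    asec_counts = Counter(key_of(r) for r in asec)
    asec_unique = {key_of(r): r for r in asec if asec_counts[key_of(r)] == 1}

    links, bms_rest, linked_keys = [], [], set()
    for r in bms:
        k = key_of(r)
        if bms_counts[k] == 1 and k in asec_unique:
            links.append((r, asec_unique[k]))
            linked_keys.add(k)
        else:
            bms_rest.append(r)
    asec_rest = [r for r in asec if key_of(r) not in linked_keys]
    return links, bms_rest, asec_rest


def multi_stage_link(bms, asec, stages):
    """Apply stages in order, tagging each link with its stage number."""
    all_links = []
    for stage_no, keys in enumerate(stages, start=1):
        links, bms, asec = link_stage(bms, asec, keys)
        all_links.extend((stage_no, b, a) for b, a in links)
    return all_links, bms, asec


# Toy household in which the second person's LINENO differs across files.
base = {"HRHHID": "H1", "HUHHNUM": 1, "RACE": 1}
bms = [dict(base, LINENO=1, AGE=45, SEX=1), dict(base, LINENO=2, AGE=43, SEX=2)]
asec = [dict(base, LINENO=1, AGE=45, SEX=1), dict(base, LINENO=3, AGE=43, SEX=2)]
stages = [
    ("HRHHID", "HUHHNUM", "LINENO", "AGE", "SEX", "RACE"),  # with LINENO
    ("HRHHID", "HUHHNUM", "AGE", "SEX", "RACE"),            # without LINENO
]
links, bms_left, asec_left = multi_stage_link(bms, asec, stages)
```

Here the first person links in the first pass and the second person links only once LINENO is dropped from the keys; the stage number stored with each link plays the role that MARBASECSTAGE plays in the released data.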
Multi-stage merge linking keys, 1976–1988
All BMS files in the 1976–1988 period contain multiple versions of the person identifier and age, sex, and race variables, which differ slightly from one another (see Appendix B). We use the version of these variables that yields the highest match rate with the ASEC. These variables are available via IPUMS CPS in LINENO, AGE, SEX, and RACE. The ASEC only has one person identifier and one set of demographic variables in all years.
Multi-stage merge of March basic monthly and ASEC files, 1976–1988
Stage 1: Stage 1 records are those that are uniquely identified by and linked using HRHHID, HUHHNUM, LINENO, AGE, SEX, and RACE. The majority of March BMS records are linked to the ASEC in this stage (Table 10).
Stage 2: In the second stage of linking, we exclude LINENO to mitigate the problem of non-consecutive and mismatched LINENO values across March BMS and ASEC files. We retain and attempt to link unlinked stage 1 records that are uniquely identified in each file using HRHHID, HUHHNUM, AGE, SEX, and RACE. The result is hundreds or thousands of additional matches (Table 10).
Stage 3: Stage 3 linking adds the number of hours worked last week (HOURS) to the Stage 2 linking keys. We link a handful of records in this stage (Table 10).
Stage 4: In stage 4, we use only household and person identifiers to link records, which is consistent with our approach to assigning MARBASECIDP in 1989 forward. This approach accounts for many successful merges between 1982 and 1988 (Table 10).
Stage 5: In this stage, we link singletons who are in households that have the same number of individuals in the March BMS and the ASEC and where all other household members have already been linked. Stage 5 uses household identifiers and the number of persons in the household (NUMPER) as linking keys.13 These singletons have not merged in previous stages due to mismatch of either LINENO or demographic variables but are the only remaining possible matches. For example, in the household shown in Fig. 9 (HRHHID
Stage 6: Stage 6 linking addresses a specific problem with household number (HUHHNUM) in 1988. There are many households in 1988 where HUHHNUM does not match across the March BMS and ASEC files, but whose members have the same values for AGE, SEX, and RACE across files. Most of these households have an HUHHNUM value of 0 in the ASEC file, despite HUHHNUM generally having a minimum value of 1. In this case, we eliminate HUHHNUM as a linking key and add number of persons in the household (NUMPER) as a linking key. This yields 336 additional linked records. Figure 10 shows an example of persons merged between March BMS and ASEC files in stage 6.
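The singleton rule of stage 5 can be sketched as follows. This is an illustration only: the function names are hypothetical, NUMPER is computed here as a simple count of person records per HRHHID, and treating HRHHID plus HUHHNUM as the "household identifiers" is an assumption.

```python
from collections import Counter


def add_numper(records):
    """Attach NUMPER: the number of person records sharing an HRHHID."""
    sizes = Counter(r["HRHHID"] for r in records)
    return [dict(r, NUMPER=sizes[r["HRHHID"]]) for r in records]


def link_singletons(bms_unlinked, asec_unlinked):
    """Pair records that are the only unlinked person left in a household
    whose identifiers and NUMPER agree across the two files."""
    def key_of(r):
        return (r["HRHHID"], r["HUHHNUM"], r["NUMPER"])

    bms_n = Counter(key_of(r) for r in bms_unlinked)
    asec_n = Counter(key_of(r) for r in asec_unlinked)
    asec_single = {key_of(r): r for r in asec_unlinked
                   if asec_n[key_of(r)] == 1}
    return [(r, asec_single[key_of(r)]) for r in bms_unlinked
            if bms_n[key_of(r)] == 1 and key_of(r) in asec_single]


# Toy three-person household; NUMPER is computed on the full files
# before any records are removed by earlier stages.
bms_full = add_numper([{"HRHHID": "H1", "HUHHNUM": 1, "LINENO": n, "AGE": a}
                       for n, a in [(1, 50), (2, 48), (3, 17)]])
asec_full = add_numper([{"HRHHID": "H1", "HUHHNUM": 1, "LINENO": n, "AGE": a}
                        for n, a in [(1, 50), (2, 48), (4, 18)]])
# Suppose the first two persons were linked in earlier stages.
pairs = link_singletons(bms_full[2:], asec_full[2:])
```

The third person differs on both LINENO and AGE across the toy files, so no earlier key set would pair the two records, yet each is the only remaining candidate for the other.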
Comparison between multi stage and single stage merges, 1976–1988
March BMS-ASEC merge stages for household 80079928707.
An example of persons linked between March BMS and ASEC in stage 6.
Our multi-stage linking process yields more matches across March basic monthly and ASEC files than a single-stage linking approach. Table 11 compares the number of linkages resulting from single-stage linking using HRHHID and LINENO; single-stage linking using HRHHID, LINENO, AGE, SEX, and RACE; and our multi-stage approach. Our multi-stage approach yields more linkages than both single-stage linkage approaches and higher quality linkages as we illustrate next.
Validating March basic and ASEC linkages
Table 12 shows the number of records from our multi-stage linking process that match on AGE, SEX, and RACE in each year. Note that because AGE, SEX, and RACE were linking keys in the first three stages, the number of linked records that match on AGE, SEX, and RACE is extremely high across all years. AGE mismatches are most common, followed by SEX and RACE.
Validation on demographic characteristics broken out by merge stage
Table 13 shows validation broken out by linking stage for 1982–1988. Stage 4, where we use only household and person identifiers, validates most poorly.
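A validation tally of this kind, broken out by stage, could be computed along the following lines. The sketch assumes links are stored as (stage, BMS record, ASEC record) triples; that layout and the function name `validate_links` are hypothetical and are not part of the IPUMS CPS data.

```python
from collections import defaultdict


def validate_links(links, fields=("AGE", "SEX", "RACE")):
    """Count, per stage, linked pairs and agreement on the given fields.

    links: iterable of (stage, bms_record, asec_record) triples.
    Returns {stage: {"linked": n, "all_match": n, "<FIELD>_mismatch": n}};
    counters that would be zero are simply absent.
    """
    report = defaultdict(lambda: defaultdict(int))
    for stage, b, a in links:
        report[stage]["linked"] += 1
        mismatched = [f for f in fields if b[f] != a[f]]
        if not mismatched:
            report[stage]["all_match"] += 1
        for f in mismatched:
            report[stage][f + "_mismatch"] += 1
    return {stage: dict(counts) for stage, counts in report.items()}


# Toy links: one stage 1 pair that agrees, one stage 4 pair that differs on AGE.
person = {"AGE": 30, "SEX": 1, "RACE": 1}
links = [(1, person, dict(person)), (4, person, dict(person, AGE=31))]
report = validate_links(links)
```

Researchers who prefer a stricter standard could apply the same comparison to the released data and drop links from stages that use only household and person identifiers.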
Generating a linking key that performs well for merging across months of the CPS and to ASEC files from 1976 to 1988 is an involved process. Different numbers of people across files and within households, duplicate and unlinkable records, and different coding schemes in linking keys across files amplify the complexity of this endeavor. This documentation, along with the linking keys in the IPUMS CPS data, is intended as a resource for the research community. Our goal is to facilitate linkages between CPS files and to supplement existing documentation about CPS linking. As part of this effort, we also strive to be transparent about our processes and to give researchers the flexibility to retain or drop the linkages we have made, based on their comfort with them. We have described the many problems we encountered, the several steps we took to resolve problems, and the multi-stage methodologies we used to create and add a single linking key to all CPS files from 1976 to 1988. By creating CPSID(P) and MARBASECID(P), IPUMS saves time, eliminates duplication of effort, reduces errors for individual researchers, and provides the research community with a common starting point for linking across CPS data files.
The primary issues we encountered in linking across BMS files to create CPSID(P) were with household and person identifiers. Our description details the adjustments we made to household identifiers based on extensive investigation and a series of checks to ensure the quality of the adjustments. Our approach does not overcome the problem of duplicate person identifiers for linking across BMS files.
Despite our efforts, BMS linkage rates in the 1976 to 1988 period are lower than observed in the 1989 forward period [2]. Some linkages are completely impossible. Particularly complex is 1977, in which linkages between basic monthly-only and supplement-containing months are not possible, though some 1977 months can be linked to adjacent years. Basic monthly-only months in 1977 may only be linked to 1976 and the supplement-containing months in 1977 may only be linked to 1978. Extending existing Census Bureau documentation about linking problems between 1985 and 1986, we find that the break occurs between June and July of 1985; no months may be linked across the break, though linkages are possible prior to and following the break.
We also created linkages between March BMS and ASEC records. The creation of MARBASECID(P) allows us to add CPSID(P) to the ASEC and enable researchers to use ASEC data linked to any month of the CPS. We expected the ASEC to contain all individual records in the March BMS, but this was not always the case. In addition, we encountered problematic identifiers; to address this issue, we implemented a multi-stage process to identify matches between the March BMS and the ASEC and make information about the merge stage available to researchers in the IPUMS CPS variable MARBASECSTAGE. We were unable to link the 1977 March BMS and ASEC, meaning that the 1977 ASEC cannot be linked to any other months of CPS data using CPSID(P). Though no documentation clarifies the exact reason, we strongly suspect that this is related to the “D sample” which was added to the CPS for the purpose of measuring the efficacy of a jobs training program (see more information in Appendix A).
Our approach for creating CPSID(P) and MARBASECID(P) in the 1976 to 1988 period deviated from that employed in the 1989 forward period. In general, we followed the rules elaborated previously [2] for creating CPSID(P) while addressing the challenges described above. Creating linkages between the March BMS and ASEC and constructing MARBASECID(P) in 1976 to 1988 deviated considerably from our approach in 1989 forward. Using the same logic in 1976 to 1988 as we did with the more recent data misses many linkages, which reduces the value of being able to use the ASEC in combination with monthly CPS data. We have demonstrated that our multi-stage linking approach is superior to the single-stage approaches. It strikes a balance between maximizing the number of possible links and applying a consistent linking algorithm across years from 1976 to 1988.
In short, this work serves as a resource for the community of researchers who wish to leverage the underutilized panel component of the CPS. It sheds light on some of the mysteries encountered when attempting to systematically link CPS data across months for a period of more than two decades. It also serves to catalog the challenges we encountered and solutions we employed when creating MARBASECID(P) and CPSID(P) in the 1976 to 1988 period. With this information, researchers can make informed choices when balancing the risks and rewards associated with leveraging the panel component of the CPS back to 1976.
Footnotes
IPUMS is an organization that provides census and survey data from around the world integrated across time and space. IPUMS originally stood for Integrated Public Use Microdata Series, but as of 2016, we no longer treat IPUMS as an acronym given the growth in our data collection beyond microdata and access conditions in some instances that limit usage. For additional information about IPUMS, please see
We use MIS to indicate month-in-sample; this information is contained in the IPUMS CPS variable MISH.
Hereafter, we use CPSID(P) to refer to both CPSID, the household-level identifier, and CPSIDP, the person-level identifier.
A note on terminology: Prior to 1989, there are often variables without proper variable names in codebooks; rather, variables were referenced as “items” such as “Item 18A. Line no.” For convenience, we refer to the variables that should uniquely identify households and persons within households in the original census CPS files as HRHHID and HUHHNUM and LINENO, respectively. These are IPUMS CPS variable names.
BLANK3 is found in columns 25–27 in the original data. Only the first two columns, 25–26 are used to uniquely identify households; the third column is blank for all records. BLANK6 is found in column 107 in the original data.
We used an inductive approach to replace spaces and ‘
We looked for the individuals in adjacent months with problematic HHIDs; based on our analysis, we replaced this character with a 4.
This is based on a conversation with staff at the U.S. Census Bureau, not official documentation.
See Appendix B for more information on versions of the person identifier in 1984–1988. Only one version of the line number variable exists in the ASEC files.
As a result of this data quality issue, it will not be possible to distinguish ASEC oversample records from unlinked March basic monthly records in 1976–1988 and those under the age of 14 in 1976–1981.
We generate the count of individuals per household used for linking (NUMPER) based on HHID values. It may differ from the IPUMS CPS variable NUMPREC for some households in the ASEC files during this period. It is necessary to use this generated count rather than the variable from the original data file or NUMPREC due to the fact that some households as defined by HHID are split across multiple households as defined by HHSEQ in the ASEC files. For more information on split households in the ASEC files, see Appendix C.
Supplementary data
The supplementary files are available to download from https://dx-doi-org.web.bisu.edu.cn/10.3233/JEM-210480.
