Abstract
This article reports an investigation of errors of measurement in self-reports of financial data in the Health and Retirement Study (HRS), one of the major social science data resources available to those who study the demography and economics of aging. Results indicate significantly lower levels of reporting reliability of the composite variables in the HRS relative to those found for “summary” income approaches used in other surveys. Levels of reliability vary by type of income source—reports of monthly benefit levels from sources such as Social Security or the Veterans Administration achieve near-perfect levels of reliability, whereas somewhat less regular sources of household income that vary across time in their amounts are measured less reliably. One major area of concern resulting from this research, which may be beneficial to users of the HRS surveys, involves the use of imputation in the handling of missing data. We found that imputation of values for top-end open income brackets can produce a substantial number of outliers that affect sample estimates of relationships and levels of reliability. Imputed income values in the HRS should be used with great care.
Introduction
This article reports results of a systematic investigation of estimates of measurement reliability of data concerning income and assets from the Health and Retirement Study (HRS), collected by the Survey Research Center (SRC) at the University of Michigan. The HRS is an important study of the economics of health and aging in the United States and is arguably the most expensive single project in the history of the social sciences. 1 Beginning with the baseline study in 1992, and having followed their initial respondents for 20 years, the HRS is one of the most important sets of surveys conducted by social scientists over the past 25 years in the United States (excepting the U.S. censuses, the Current Population Surveys, and other large-scale governmental surveys). Internationally recognized, and innovative in many ways, the HRS (and its original companion study, the Asset and Health Dynamics Among the Oldest Old [AHEAD] study) fashions itself as the premier study on retirement, pensions, and the interrelationships between health and socioeconomic status in middle-aged and older persons (see Juster and Suzman 1995; Soldo et al. 1997).
One of the features of the HRS study involves the use of bracketing (and imputation) of responses to questions about income and assets when respondents refuse to answer the question or respond that they “don’t know.” For the HRS, this was explicitly an effort undertaken to reduce the missing data problem in household wealth surveys, and the investigators showed improved estimates of income and assets when imputed values based on the bracketed responses were included (Juster and Smith 1997). The HRS is by no means the first to apply this approach, as it has been practiced by survey researchers for many decades. However, this approach is becoming more widely used in many types of surveys in many different cultural contexts. Some “within-bracket” imputation is required under such regimes of measurement in order to arrive at an “income value” to assign to the respondent. These imputations are widely used, and little or no research has been undertaken to investigate whether these “unfolding” or “bracketing” approaches introduce more error, or whether other approaches to missing data, for example, full-information maximum likelihood (or FIML), may be superior (Wothke 2000). Our point is that, although the aim of bracketing in this case is to obtain a response that is within a relatively narrow range, depending upon the nature of the bracketing categories, the within-bracket imputations may introduce measurement error. Another feature of the HRS, increasingly typical in other studies, is the separation of financial questions according to the sources and/or types of income and assets (e.g., self-employment income or income from stocks and bonds), involving more than two dozen separate series of questions to gain source-specific income estimates (e.g., see Moore, Stinson, and Welniak 2000; Hurd, Juster, and Smith 2003).
In this article, we investigate three questions using the publicly available HRS data. First, how reliable are financial reports in the HRS study? Using widely accepted methods for assessing measurement errors in longitudinal survey data, which overcome the limitations of cross-sectional designs, we address this question by obtaining estimates of reliability for each of several source-specific questions concerning household income and assets. Second, what differences, if any, are there in the reliability of reports across different sources of income and assets? For example, do reports of monthly benefit levels from sources such as Social Security or Veterans Administration achieve any more or less reliability than less regular sources of household income, such as self-employment income? Third, what are the effects of the bracketing approach (mentioned earlier and described in detail below) on the estimates of reliability in economic measurement in the HRS? Specifically, does the bracketing and imputation approach employed by the HRS study (to decrease nonresponse and increase reporting levels) improve or diminish the accuracy of financial data? After reviewing the research problem and the available literature on these issues, we describe in as much detail as possible the methods employed by the HRS in the gathering of financial data and the methods we employed to estimate patterns of measurement error. We then present a set of results on the reliability of reports of income and assets from 21 specific sources, which vary in methods of handling incomplete data, that can hopefully serve as a baseline for future studies of source-specific financial reports in panel data.
In the next section of the article, we present a brief discussion of the main approaches to obtaining financial data in surveys, including the “bracketing” approach employed by the HRS and other surveys to remedy the problem of nonresponse. Within this section, we summarize the objectives of the present study, which are to obtain estimates of the amount of measurement error associated with financial reports in the HRS. In the subsequent section, we describe the methods of the HRS, including sampling methods and the specific procedures used by the HRS to measure specific sources of income and assets. Following this, we describe the analytic methods we employ in this study to estimate the extent of random error of measurement in the HRS reports, namely, the use of quasi-simplex models applied to longitudinal data. We describe these methods in detail and provide a summary of the modeling assumptions employed to identify parameters that estimate the extent of random measurement error in the HRS data. This section is followed by a discussion of the challenges posed by the use of the HRS data and the approach we take to the analysis. We then present the results of our study, which suggest that the reliability of the composite variables made available by the HRS project for income and assets is significantly lower than the levels of reporting reliability of “summary” approaches used in other surveys. The results section also examines levels of reliability by type of income source, which indicates that reliability varies across different sources of income and assets, and it examines the impact of the bracketing approach employed by the HRS on reliability of measurement. Finally, our concluding section summarizes our findings and the implications of these results for the measurement of financial reports in surveys.
As noted earlier, results indicate significantly lower levels of reporting reliability of the composite variables in the HRS relative to those found for “summary” income approaches used in other surveys. Levels of reliability nonetheless vary by type of income source: reports of monthly benefit levels from sources such as Social Security or Veterans Administration (VA) benefits achieve near-perfect levels of reliability, whereas somewhat less regular sources of household income that vary across time in their amounts are measured much less reliably. One major area of concern resulting from this research, which may be beneficial to users of the HRS surveys, involves the use of imputation in the handling of missing data. We found that imputation of values for top-end open income brackets can produce a substantial number of outliers that affect sample estimates of population quantities, including estimates of reliability. Our results suggest that one might be better off accepting refusals and “Don’t Know” responses as missing data and employing techniques such as FIML in the analysis, rather than using the bracketing approach to deal with such types of nonresponse.
Research Problem
The typical approach to measuring household income in most surveys is to directly request a summary measure from the respondent, and if the respondent has difficulty or refuses to answer the question, some type of “income brackets” or grouped categories is used to elicit a response and get a “ball park estimate” (BPE) of the relevant income variables, employing one of a variety of methods to assign an actual value to the respondent. Some surveys use only bracketed categories, whereas others attempt to obtain an actual value. The income question in the General Social Survey (Davis, Smith, and Marsden 2008), for example, uses solely bracketed information, as follows: “In which of these groups did your total family income, from all sources, fall last year before taxes, that is … ? Just tell me the letter of the category” (using a showcard listing two dozen income categories, ranging from “under $1,000” to “$110,000 or over” in the 1998 survey).
One of the problems with this type of stylized reporting of income is that too many people say they don’t know or refuse to answer. Many researchers assume, however, that this is still a preferable approach to measuring household or personal income, when survey time is limited and only a general idea of the household’s economic position is needed. On the other hand, when it is necessary to know how much income people have from specific sources, for example, income from assets, or income from VA benefits, or Supplemental Security Income (SSI), or welfare, such a general question is too imprecise (e.g., Hurd, Juster, and Smith 2003). This stylized approach has two problems. The first is the need for the respondent to come up with a number representing “total family income, from all sources,” and the second is the fact that relatively coarse categories of income are used, for which some imputation of income values is normally required.
It is generally agreed, however, that like many other variables assessed in surveys, income measurement is subject to a variety of sources of measurement error, including errors of recall or confusion regarding the meaning of income concepts and definitions and tendencies to overreport or to underreport one’s income (Moore, Stinson, and Welniak 2000). Studies of the reliability of income data nonetheless suggest that such stylized measures produce highly reliable information, registering reliabilities in the range of .9, comparable to other types of factual information about the respondent, for example, years of schooling completed or occupational position (see Alwin 2007:302-4).
Others argue that these stylized measures of household income do not produce valid data (in the sense of producing accurate mean levels of income), and in the past few decades, some of our major public use survey data sets have tried to measure specific sources of income in surveys, examples of which are the Current Population Survey and the Survey of Income and Program Participation (SIPP; both conducted by the Bureau of the Census), the HRS, and the National Longitudinal Study (NLS). 2 The present study focuses on the HRS, which is innovative in a number of ways with respect to the measurement of income using survey methods. One of the features of the HRS study is that it queries the respondent about some 60 sources of income and assets, which are among the major topics assessed by the survey. The HRS staff then adds these various sources of income and assets into two composite variables, one representing the amount of income and one the value of assets. In addition, the HRS survey employs a strategy, which has been practiced by survey researchers for many decades, involving the use of a combination of direct estimation of income, along with bracketing (or coarse categorization) of responses to questions about income and assets, when respondents refuse the initial question or indicate they “Don’t Know.” Instead of accepting these answers, the HRS asks a series of questions about amounts, for example, “more or less than $1,000”; respondents who answer “more” are then asked, “Do you have more or less than $5,000?” and so on. In the present study, we address (1) the general level of reliability of self-reports of specific sources of income, assets, and equities in the HRS and (2) the effects of these kinds of “bracketing” approaches in the collection of information about different sources of income (see Juster and Smith 1997).
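The unfolding sequence described above can be illustrated with a minimal sketch. It assumes a strictly ascending walk through a list of thresholds; only the $1,000 and $5,000 values come from the text, and the third threshold, the function name, and the return format are hypothetical and are not taken from the HRS instrument.

```python
from typing import Callable, Optional, Sequence, Tuple

# (lower bound, upper bound or None, whether the sequence was completed)
Bracket = Tuple[float, Optional[float], bool]


def unfold_bracket(
    thresholds: Sequence[float],
    answer: Callable[[float], Optional[str]],
) -> Bracket:
    """Walk ascending thresholds, asking "more or less than X?" at each step.

    `answer(x)` returns "more", "less", or None (don't know / refused).
    An upper bound of None with completed=True is a top-end open bracket.
    """
    lower = 0.0
    for threshold in thresholds:
        reply = answer(threshold)
        if reply == "more":
            lower = threshold          # keep climbing
        elif reply == "less":
            return (lower, threshold, True)
        else:
            return (lower, None, False)  # respondent left the sequence
    return (lower, None, True)           # above every threshold: open bracket


# Hypothetical thresholds; only the first two appear in the text.
THRESHOLDS = [1_000.0, 5_000.0, 25_000.0]


def respondent_with(true_amount: float) -> Callable[[float], str]:
    """Simulated respondent who answers consistently with a true amount."""
    return lambda x: "more" if true_amount > x else "less"


print(unfold_bracket(THRESHOLDS, respondent_with(3_000.0)))
print(unfold_bracket(THRESHOLDS, respondent_with(100_000.0)))
```

The second call ends in a top-end open bracket, which has no upper bound; any value assigned to such a case must come entirely from the imputation procedure.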
This article reports an investigation of errors of measurement in self-reports of income and assets in the HRS survey. As we already noted, the HRS is one of the major social science data resources available to researchers interested in the demography and economics of health and aging. Internationally recognized, and innovative in many ways, the HRS fashions itself as the premier study on retirement, pensions, and the interrelationships between health and socioeconomic status in middle-aged and older persons. Yet, little attention has been given to the quality of the HRS income and asset data—particularly with respect to their reliability and validity.
This Study
The present study is part of a larger project focusing on the quality of the HRS survey data (Alwin 2013). The HRS and its companion AHEAD study contain a wide range of measures, including information on health status, health care, cognitive functioning, functional status, sensory function, employment, and other important measures of respondents’ health and well-being. The HRS data sets contain constructed summary variables that specify current household income and assets in the year prior to interview as well as variables for each of the component parts included in the summary measures. Compared with other data sets, the HRS and AHEAD are particularly rich in income and asset information because of their innovative use of “unfolding” techniques aimed at eliciting greater economic detail from respondents and fewer missing values. In the present study, we focus only on the income, asset, and equities measures. We first examine the reliability of the summary composite variables put together by the HRS study staff for income and assets. Because the summary variables for income and assets are not composed of identical components, an effort was made to investigate the measurement errors involved in source-specific components, and this constitutes the second major set of results. In these source-specific analyses, we investigate the contributions of bracketing income to measurement error using reliability estimates as an indicator of measurement errors.
Methods of the Health and Retirement Study
In this section, we describe the samples and the key measures of income in the HRS. We also outline our analysis plans for the HRS data and the procedures we used to implement those plans.
HRS Samples
The HRS, funded under a cooperative agreement between the National Institute on Aging (NIA) and the SRC at the University of Michigan, is a longitudinal survey of adults over the age of 50 and their spouses and is based on a nationally representative probability sample of the U.S. households. Surveys are conducted biennially, and additional cases from the appropriate birth cohorts are added every six years in order to maintain a sample of the U.S. household-based population aged 51 and older. The HRS was designed as a set of parallel studies by University of Michigan researchers and panels of other national experts on current employment and job history, family and social supports, health and function as well as economic status (Juster and Suzman 1995; Soldo et al. 1997). As of 2004, the HRS is comprised of four samples: (1) AHEAD, a sample of persons born prior to 1924, first interviewed in 1993 and interviewed for the fifth time in 2002; (2) CODA (Children of the Depression Era), a sample of persons born from 1924 to 1930, first interviewed in 1998 and interviewed for the third time in 2002; (3) HRS, a sample of persons born from 1931 to 1941, first interviewed in 1992 and interviewed for the sixth time in 2002; and (4) WB (War Babies), a sample of persons born from 1942 to 1947, first interviewed in 1998 and interviewed in 2002 for the third time (see Table 1 for details). 3 Spouses were interviewed in all data collections regardless of age. All samples included oversamples of African Americans and Hispanic Americans.
Sample Sizes in Health and Retirement Study by Wave.
Note. AHEAD = Asset and Health Dynamics Among the Oldest Old; HRS = Health and Retirement Study.
Includes only cases present at first occasion of measurement.
aIncludes AHEAD data collected in 1993 and 1994. bIncludes AHEAD data collected in 1995 and 1996.
The HRS, AHEAD, CODA, and WB data were collected through the use of computer-assisted interviews, conducted by telephone or in person by trained SRC interviewers. The initial response rate for HRS and AHEAD was over 80 percent and was over 70 percent for both CODA and WB. The overall response rates for each sample at each follow-up wave average near 90 percent. In the present study, we rely on these panel data from the 1998, 2000, and 2002 waves of HRS data. These are the first three waves available using a common questionnaire across all four subsamples (i.e., HRS, AHEAD, CODA, and WB subsamples). Table 1 presents the sample sizes for the available HRS/AHEAD/CODA/WB data through 2002. Throughout the remainder of this article, we use the acronym HRS to refer to the entire collection of subsamples, not simply those cohorts originally studied as the HRS subsample.
HRS Questions About Income and Assets
As we have already noted, the HRS questionnaire painstakingly goes through (primarily in Section J) all the possible sources of income a household might have, one by one, seeking information from the financial respondent regarding income from all possible sources. Similarly, interspersed with these questions are questions about ownership of assets and their value, in addition to the questions about income from such types of wealth. The sources of income and assets the HRS questions focused on are listed in Tables 2 and 3 (from the 1998 questionnaire). This is an exhaustive list of HRS income content—about 60 separate questions on sources and amounts of income, assets, and home equity. There was only one person designated as the financial respondent for the household and that did not change. In this study, if data were not available from this person, the case was deleted from the analysis sample, that is, the designated financial respondent had to be present in all three waves.
Sources of Income Questions in the Health and Retirement Study.
Note. “Sample” inclusion code: the sample was judged to be too small for analysis purposes. “Object” inclusion code: the object of the question was not consistent over waves and the triad therefore cannot be analyzed.
aNonannual reporting period, for example, monthly, but converted into annual amounts by Health and Retirement Study (HRS).
Sources of Assets and Equities Questions in the Health and Retirement Study.
Note. “Sample” inclusion code: the sample was judged to be too small for analysis purposes. “Object” inclusion code: the object of the question was not consistent over waves and the triad therefore cannot be analyzed.
The bulk of these questions occur in Section J of the HRS questionnaire, but in addition, Section F includes questions about housing. Many of the questions about the value of certain assets occur within the context of obtaining information from the financial respondent on income generated from particular assets. Hence, questions about assets and income from assets are interspersed in the stream of questions focusing on income. In Tables 2 and 3, we cross-reference the two sets of income and asset questions that occur in proximity in the questionnaire, in order to clarify their differences as “income” and “assets” and their relation to one another.
For all HRS, AHEAD, and so on, waves, the HRS staff combined these sources of income and assets into a composite income variable or “index” for household income and a similar “index” or composite variable for household assets. These composites are formed from the sum of values for the variables listed in Tables 2 and 3. Here we analyze the reliability of self-reports of income and assets for these composite variables, but even more importantly, for the components as well. For reasons given subsequently, when we analyze the separate income components, we analyze only a subset of these variables—about one-third of the total number of variables (designated in Tables 2 and 3 as “included”). The HRS data release also makes the raw data on the components of these composite variables available for analysis as well, allowing one to ignore imputed data if necessary. One of the advantages of being able to study the reliability of reporting with the components is that we can exclude imputed data, either data imputed in the “ball parking” (BPE) or bracketed data or data imputed because the question had a missing data code.
In Tables 2 and 3, we indicate several characteristics of each of the HRS income and asset questions. First, we note “unit” of the question, that is, whether it involves self-reports by the financial respondent, proxy reports on the income and assets of spouses, or reports of the financial respondent on the household as a unit. We also note of which HRS index (Income or Assets) the question was a component, whether bracketing was used in the question, and whether the question used a nonannual reporting period that was later converted into annual amounts by the HRS study staff. Finally, the table indicates which questions we can treat as “measures” for the purpose of assessing reliability (questions included in the analysis are shaded in Tables 2 and 3), and we indicate the reason why specific questions were excluded from the analysis. These are of two types. First, we excluded variables for which there were fewer than 100 cases reporting any income from a given source with data at all three waves—designated “sample” in Tables 2 and 3. Second, we excluded those measures where the object of the question was not consistent over waves—designated “object” in Tables 2 and 3, for example, we do not know whether the first individual retirement account (IRA) reported on at wave 1 is the same first IRA discussed at wave 2. If it is likely that the object of the survey report changes over waves, it cannot be used as a basis for quantifying reliability of measurement.
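The two exclusion rules just described can be written as a small decision function. This is only a sketch under stated assumptions: we read “reporting any income … with data at all three waves” as a positive, non-missing amount at each of the three waves, the order in which the two rules are checked is our choice, and all names are hypothetical. The judgment about whether the object of a question is consistent over waves is supplied as an input, since it cannot be computed from the data.

```python
from typing import Optional, Sequence

MIN_CASES = 100  # threshold stated in the text


def count_complete_triads(reports: Sequence[Sequence[Optional[float]]]) -> int:
    """Count cases with a positive, non-missing amount at all three waves.

    Assumption: "any income" means amount > 0 at every wave.
    """
    return sum(
        1
        for triad in reports
        if len(triad) == 3 and all(v is not None and v > 0 for v in triad)
    )


def inclusion_code(
    reports: Sequence[Sequence[Optional[float]]],
    object_consistent: bool,
    min_cases: int = MIN_CASES,
) -> str:
    """Return "included", "sample", or "object" for one income or asset source."""
    if not object_consistent:
        return "object"   # question may refer to different things across waves
    if count_complete_triads(reports) < min_cases:
        return "sample"   # too few complete three-wave cases
    return "included"


# Tiny illustration with made-up amounts (not HRS data).
demo = [(1200.0, 1300.0, 1250.0)] * 99 + [(500.0, None, 700.0)]
print(inclusion_code(demo, object_consistent=True))
```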
In the HRS, income sources were not treated uniformly. There are five “types” of income and assets questions asked in the 1998, 2000, and 2002 waves of the HRS, differentiated in terms of the question formats employed for a given case. We have numbered them 1 to 5 in Tables 2 and 3. A brief explanation of each “type of income variable” follows:
1. The financial respondent was asked whether the household had any income from a source. If the answer was yes, then they were asked for the amount. If the amount was a refusal or they didn’t know, then the “bracketing” questions were asked.
2. The financial respondent was asked whether they or their spouse had any income from a source. If the answer was yes, then the question was asked whether it was the respondent, the spouse, or both. The next question was the amount of income, and if the amount was a refusal or they didn’t know, then the “bracketing” questions were asked.
3. The same as “2” except the “bracketing” questions were not asked.
4. The respondent was asked whether they had income from a source; if yes, then the amount of income was asked. No “bracketing” questions were asked for refusals or don’t knows.
5. A dollar amount was asked, with the “bracketing” questions for don’t knows or refusals. These were asked of income from assets of some type.
The 2002 questions all had the “bracketing” questions asked of the financial respondent. The categories where the bracketing was omitted were in the 1998 and 2000 variables.
An Example of Income Questions
In order to reduce some of the confusion surrounding the procedures followed by the HRS, as an example of the income questions, we discuss the questions employed to obtain information on respondent’s income from wages and salary. The original question in the HRS on respondent’s wages and salary income appears in Section J of the HRS questionnaire and asks the financial respondent the questions shown in Figure 1. As shown in Figure 1, these measures use an unfolding approach, beginning with a filter question, “Did any of your income (last calendar year) come from wages and salary?” If the respondent answered “Yes” to this question, the follow-up question was then asked: “About how much wage and salary income did you receive in (last calendar year) before taxes and other deductions?” Respondents who answered “Don’t Know” or who refused the question were automatically asked a bracketed version of the question, as shown in Figure 1.

Example of the use of bracketing questions in the Health and Retirement Study (HRS) for amount of earnings from wages and salaries.
Table 4 presents a description of the types of possible responses/outcomes to this question and their frequencies in the 1998 HRS survey as reported in the HRS codebook. In this case, a large percentage of those who had wages and salary income (5,108/5,858 = 87 percent) provided an actual value of the income. Another 569 (10 percent) cases provided responses within brackets, and 3 percent used the brackets but did not ultimately provide a response; these are the cases coded 6 in Table 4. This latter group is considered missing, along with those coded 1 in Table 4 (n = 8,389). The cases coded 7 or “Imputed Ownership and Amount” (n = 13) involve those cases that were coded “Don’t Know” or “Refused” in response to Question J7 (see Figure 1). Except for the cases where there is no financial respondent (n = 135), the HRS provides an imputed value for all cases with codes 1, 3, 4, 5, 6, and 7. The final variable for Respondent’s Wages and Salary used by the HRS to build a total income variable includes values for 14,260 cases in the 1998 HRS survey year (see, e.g., the sample sizes in Table 4). The majority of these cases have imputed values (which includes cases imputed to zero, where the report indicated there was no income from a particular source). Thirty-six percent of these cases (n = 5,108) provided an actual value in response to question J8 (see Figure 1), another 4 percent provided a BPE using the income brackets, and the remainder (roughly 60 percent) are missing values that were imputed by the HRS staff. 4 We have not to date surveyed the other components of income that are used for the composite total household income variable with respect to the breakdown of the data the HRS study provides in terms of valid income values versus bracket-based imputations as well as imputation of values without brackets. This would in general be something of interest to the users of the HRS income data.
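The two sets of percentages in this paragraph use different denominators, which is easy to overlook. The short calculation below reproduces them from the counts given in the text; the variable names are ours, and the remainder is obtained by subtraction rather than read from Table 4.

```python
# Counts quoted in the text for Respondent's Wages and Salary, 1998.
ACTUAL_VALUE = 5_108   # gave an actual dollar amount
BRACKET_ONLY = 569     # gave a response within brackets
WITH_INCOME = 5_858    # denominator used in the text for the 87 percent figure
ALL_CASES = 14_260     # cases in the final (imputed) variable


def pct(numerator: int, denominator: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * numerator / denominator)


# Shares among those with wage and salary income.
print(pct(ACTUAL_VALUE, WITH_INCOME))   # 87
print(pct(BRACKET_ONLY, WITH_INCOME))   # 10

# Shares among all cases in the final variable.
print(pct(ACTUAL_VALUE, ALL_CASES))     # 36
print(pct(BRACKET_ONLY, ALL_CASES))     # 4

# Remainder, derived by subtraction: neither an actual value nor a bracket.
REMAINDER = ALL_CASES - ACTUAL_VALUE - BRACKET_ONLY
print(pct(REMAINDER, ALL_CASES))        # 60
```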
An Example of Health and Retirement Study (HRS) Income Data: Question Dealing With Respondent Wages and Salary Income, 1998 (Wave F) Questionnaire, Section J.
Each of the sources of total household income listed in Tables 2 and 3 was treated in a similar manner by the HRS study staff, as follows:
1. A filter question was used to determine whether an income source was relevant.
2. If respondents had income from a given source, they were directly asked to provide an amount of income (or assets).
3. If respondents refused or responded Don’t Know to the direct question, they were entered into a sequence of questions using income brackets to determine which income category they were in.
4. Bracketed data were used as a basis of imputation of actual values.
5. Missing and INAPP (respondents who had no income from that source were coded as “0”) were imputed, and the imputed data combined with the actual data.
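The first three steps of this sequence amount to sorting each report into one of a few outcome types, which is what later allows imputed and non-imputed values to be separated. The sketch below shows one way to express that sorting. The category labels and the function are hypothetical, and the imputation steps themselves are deliberately left out because the HRS imputation procedure is not specified here.

```python
from typing import Optional, Tuple

# (lower bound, upper bound or None for a top-end open bracket)
Bracket = Tuple[float, Optional[float]]


def classify_report(
    has_income: Optional[bool],
    amount: Optional[float] = None,
    bracket: Optional[Bracket] = None,
) -> str:
    """Classify one source-specific report by how its value was obtained.

    has_income: filter-question answer; None means don't know / refused.
    amount:     directly reported dollar amount, if any.
    bracket:    completed bracket from the unfolding questions, if any.
    """
    if has_income is None:
        return "missing"           # ownership itself unknown
    if not has_income:
        return "no_income"         # INAPP, carried as 0
    if amount is not None:
        return "reported_amount"   # actual value, no imputation needed
    if bracket is not None:
        return "bracketed"         # value must be imputed within the bracket
    return "missing"               # amount must be imputed without a bracket


print(classify_report(True, amount=42_000.0))
print(classify_report(True, bracket=(25_000.0, None)))
print(classify_report(False))
```

Keeping such a flag alongside each value makes it straightforward to restrict an analysis to reported amounts only.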
The components of the composite measure of assets were treated similarly according to the above-mentioned steps. For both the composite household income and the household assets measures provided by the HRS, the scores were based on a sum of values across all imputed component variables. This approach to providing imputations on the full data set creates some potentially serious problems because, not only are INAPP respondents given an imputed value, the data include a substantial number of outliers (extreme values) resulting from the imputed data. Roughly 60 percent of the outliers (extreme values) we identified in the income data involved cases with imputed values on one or more of the components used to form the income composites. We return to a discussion of this problem when we consider estimates of measurement reliability below.
Reliability of Measurement
It can be argued that for purposes of assessing the reliability of survey data, longitudinal data provide an optimal design (see Alwin 2007). Indeed, the idea of replication of questions in panel studies as a way of getting at measurement consistency has been present in the literature for decades, with “test–retest correlations” as an estimate of reliability being the principal example of a longitudinal approach. The limitations of the test–retest design are well known, but they can be overcome by incorporating three or more waves of data separated by lengthy periods of time (see Alwin 2007:96-116). The multiple-wave reinterview design discussed in this article goes well beyond the traditional test–retest design (see Moser and Kalton 1972:353-54); specifically, employing models that permit change in the underlying true score (using the quasi-Markov simplex approach) allows us to overcome one of the key limitations of the test–retest design. 5 And through the use of design strategies with relatively distant reinterview intervals, for example, 2-year intervals, the problem of consistency due to retest effects or memory can be remedied or at least minimized. There are two main advantages of the reinterview design for reliability estimation. First, the estimate of reliability obtained includes all reliable sources of variation in the measure, both common and specific variance. Second, under appropriate circumstances, it is possible to eliminate the confounding of the systematic error component discussed earlier, if systematic components of error are not stable over time. In order to address the question of stable components of error, the panel survey must deal with the problem of memory, because in the panel design, by definition, measurement is repeated. So, while this overcomes one limitation of cross-sectional surveys, it raises the problem of whether respondents can remember what they said and are motivated to provide consistent responses.
If reinterviews are spread over months or years, this can help rule out sources of bias that occur in cross-sectional studies. Given the difficulty of estimating memory functions, estimation of reliability from reinterview designs makes sense only if one can rule out memory as a factor in the covariance of measures over time, and thus, the occasions of measurement must be separated by sufficient periods of time to rule out the operation of memory.
The analysis of panel data in the estimation of reliability must be able to cope with the fact that people change over time, so that models for estimating reliability must take the potential for individual-level change into account (see Coleman 1964, 1968; Wiggins 1973; Goldstein 1995). Given these requirements, techniques have been developed for estimating measurement reliability in panel designs where there are three or more waves, wherein change in the latent variable is incorporated into the model. With this approach, there is no need to rely on multiple indicators within a particular wave or cross-section in order to estimate the measurement reliability, and there is no need to be concerned about the separation of reliable common variance from reliable specific variance. That is, there is no decrement to reliability estimates due to the presence of specific variance in the error term; here, specific variance is contained in the true score. This approach is possible using modern structural equation modeling (SEM) methods for longitudinal data and is discussed further in related literature (see Alwin 1989, 1992, 2007; Alwin and Krosnick 1991; Saris and Andrews 1991).
The idea of the simplex model began with Guttman (1954) who introduced the idea of the simplex structure for a series of ordered tests, which demonstrated a unique pattern of correlations. Following Anderson’s (1960) discussion of stochastic processes in multivariate statistical analysis, Jöreskog (1970) summarized a set of simplex models, the parameters of which could be fit using confirmatory factor analytic methods, and for which tests of goodness of fit could be derived. He made a distinction between “perfect simplex” and “quasi-simplex” models—the former being those in which measurement errors were negligible and the latter being those allowing for substantial errors of measurement. The Markov simplexes he discussed were scale-free in the sense that they could be applied in cases where the units of measurement were arbitrary (see Jöreskog 1970:121-22). In the survey methods literature, the model has come to be known as the “quasi-simplex model” (Saris and Andrews 1991) or simply the “simplex model” (Alwin 1989, 2007).
In the general case, the simplex model for a single variable assessed in a multiwave panel study can be defined as:

\[ Y = \Lambda_Y T + E \]
\[ T = B T + Z \]
Here Y is a (P × 1) vector of observed scores; T is a (P × 1) vector of true scores; E is a (P × 1) vector of measurement errors; Z is a (P × 1) vector of disturbances on the true scores; ΛY is a (P × P) identity matrix; and B is a (P × P) matrix of regression coefficients linking true scores at adjacent timepoints. The general path diagram for the case of P overtime measures is given in Figure 2.

Path diagram of the quasi-Markov simplex model—general case (p > 4).
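For the case where P = 4, we can write the matrix form of the model as:

\[
\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} T_1 \\ T_2 \\ T_3 \\ T_4 \end{bmatrix}
+
\begin{bmatrix} E_1 \\ E_2 \\ E_3 \\ E_4 \end{bmatrix}
\]
\[
\begin{bmatrix} T_1 \\ T_2 \\ T_3 \\ T_4 \end{bmatrix}
=
\begin{bmatrix} 0 & 0 & 0 & 0 \\ \beta_{21} & 0 & 0 & 0 \\ 0 & \beta_{32} & 0 & 0 \\ 0 & 0 & \beta_{43} & 0 \end{bmatrix}
\begin{bmatrix} T_1 \\ T_2 \\ T_3 \\ T_4 \end{bmatrix}
+
\begin{bmatrix} Z_1 \\ Z_2 \\ Z_3 \\ Z_4 \end{bmatrix}
\]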
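The reduced form of the model is written as:

\[ Y = \Lambda_Y (I - B)^{-1} Z + E \]

with the implied covariance matrix of the observed scores given by

\[ \Sigma_{YY} = \Lambda_Y (I - B)^{-1} \Psi (I - B')^{-1} \Lambda_Y' + \Theta^2 \]

where B and Λ_Y are of the form described earlier, Ψ is a (P × P) diagonal matrix of variances of the disturbances on the true scores, and Θ² is a (P × P) diagonal matrix of measurement error variances. This model for multiple-wave data (where P ≥ 3) can be estimated using any SEM approach.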
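The covariance structure implied by the quasi-simplex model can be sketched numerically. The following is a minimal illustration, not the authors' estimation code: it assumes identity loadings, and the function name, argument names, and parameter values are all hypothetical. The first-wave true-score variance is passed separately as `var_t1` (it plays the role of the first diagonal element of Ψ).

```python
from typing import List


def simplex_implied_cov(var_t1: float, betas: List[float],
                        psis: List[float], thetas: List[float]) -> List[List[float]]:
    """Implied covariance matrix of the observed scores Y for a
    quasi-simplex model with identity loadings (hypothetical helper)."""
    p = len(thetas)
    if len(betas) != p - 1 or len(psis) != p - 1:
        raise ValueError("need P-1 betas and P-1 disturbance variances")
    # True-score variances follow the lag-1 recursion:
    # VAR(T_t) = beta_{t,t-1}^2 * VAR(T_{t-1}) + VAR(Z_t)
    var_t = [var_t1]
    for b, psi in zip(betas, psis):
        var_t.append(b * b * var_t[-1] + psi)
    sigma = [[0.0] * p for _ in range(p)]
    for s in range(p):
        for t in range(s, p):
            # Cov(T_s, T_t) = VAR(T_s) times the product of the betas
            # on the path from wave s to wave t.
            path = 1.0
            for b in betas[s:t]:
                path *= b
            cov = var_t[s] * path
            if s == t:
                cov += thetas[s]  # error variance enters only on the diagonal
            sigma[s][t] = sigma[t][s] = cov
    return sigma


# Hypothetical three-wave parameter values, chosen only for illustration.
sigma = simplex_implied_cov(4.0, [0.9, 0.8], [0.5, 0.6], [1.0, 1.0, 1.0])
for row in sigma:
    print([round(x, 4) for x in row])
```

Because measurement errors are uncorrelated across waves, only the diagonal of the implied matrix depends on the error variances; the off-diagonal covariances carry information about the true scores alone, which is what makes reliability estimable from three or more waves.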
Heise (1969) developed an application of this model based on three-wave quasi-simplex models within the framework of a model that permits change in the underlying variable being measured (see also Achen 1975). This same approach can be generalized to multiwave panels, as described earlier. The three-wave case is a special case of the above-mentioned model that is just identified, and it is therefore instructive to illustrate how the approach produces an estimate of reliability of measurement. Consider the quasi-simplex model for three occasions of measurement of Yg. The measurement model linking the true and observed scores is given in scalar form as follows:

\[ Y_1 = T_1 + E_1 \]
\[ Y_2 = T_2 + E_2 \]
\[ Y_3 = T_3 + E_3 \]
The first set of equations (given earlier) represents a set of measurement assumptions indicating that (1) overtime measures are assumed to be τ-equivalent, except for true score change and (2) that measurement error is random. By assuming that measurement error is random, it is uncorrelated with the true scores and also uncorrelated across waves. To address the issue of taking individual-level change into account, this class of autoregressive or quasi-simplex models specifies two structural equations for a set of P overtime measures of a given variable Yt (where t = 1, 2, 3) as follows:

\[ Y_t = T_t + E_t \]
\[ T_t = \beta_{t,t-1} T_{t-1} + Z_t \]
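Recall that these models invoke all the assumptions of the Classical True Score Theory (CTST) models, that is,

\[ \mathrm{E}(E_t) = 0, \qquad \mathrm{Cov}(E_t, T_t) = 0, \qquad \mathrm{Cov}(E_t, E_s) = 0 \;\; (t \neq s) \]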
This model assumes a lag-1 or Markovian process in which the distribution of the true variables at time t is dependent only on the distribution at time t − 1 and not directly dependent on distributions of the variable at earlier times. If these assumptions do not hold, then this type of simplex model may not be appropriate. In order to estimate such models, it is necessary to make some assumptions regarding the measurement error structures and the nature of the true change processes underlying the measures. All estimation strategies available for such three-wave data require a lag-1 assumption regarding the nature of the true change. This assumption in general seems a reasonable one, but erroneous results can follow if it is violated. The various approaches differ in their assumptions about measurement error. One approach assumes equal reliabilities over occasions of measurement (Heise 1969). This is often a realistic and useful assumption, especially when the process is not in dynamic equilibrium, that is, when the observed variances vary with time. Another approach to estimating the parameters of the above-mentioned model is to assume constant measurement error variances rather than constant reliabilities (Wiley and Wiley 1970). Where P = 3, either model is just-identified, and where P > 3, both models are over-identified with degrees of freedom equal to .5[P(P + 1)] − 2P.
Wiley and Wiley (1970) show that by invoking the assumption that the measurement error variances are equal over occasions of measurement, the P = 3 model is just-identified and parameter estimates can be defined. They suggest that measurement error variance is “best conceived as a property of the measuring instrument itself and not of the population to which it is administered” (p. 112). Following this reasoning, one might expect that the properties of one’s measuring instrument would be invariant over occasions of measurement and that such an assumption would be appropriate. Following the CTST model for reliability, the reliability for the observed score Yt is the ratio of the true score variance to the observed variance, that is,

\[ \frac{\mathrm{VAR}(T_t)}{\mathrm{VAR}(Y_t)} = \frac{\mathrm{VAR}(T_t)}{\mathrm{VAR}(T_t) + \mathrm{VAR}(E_t)} \]
To summarize, the three-wave model is just-identified, that is, there are no overidentifying restrictions that allow for the possibility of testing the model against a null model of interest. Where P > 3, the simplex model is over-identified, and a test of the model (with degrees of freedom equal to .5[P(P + 1)] − 2P) is possible.
Modeling Assumptions
While the assumption of random measurement error is a feature of this model, it should be pointed out that the assumption of the equality of error variances across waves is not essential for obtaining an unbiased estimate of reliability in these models. These models provide an unbiased estimate of the reliability of wave-2 measurement, and it is necessary to apply some equality constraints on the error structure of the model to obtain estimates of reliability for waves 1 and 3, if one wants to have more than the wave-2 estimate. In other words, for an estimate of wave-2 reliability, the only assumption required is that measurement error is random—no assumptions are required concerning the equivalence of the measurement error variances across waves.
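To make the wave-2 result concrete, the sketch below computes the wave-2 reliability of a three-wave quasi-simplex model directly from observed moments. It relies only on the random-error and lag-1 assumptions: the covariances satisfy Cov(Y1,Y2)·Cov(Y2,Y3)/Cov(Y1,Y3) = VAR(T2), so dividing by VAR(Y2) gives the reliability, which in correlation form is r12·r23/r13. The function names and input values are hypothetical and chosen for illustration only.

```python
import math


def wave2_reliability(cov12: float, cov23: float, cov13: float, var2: float) -> float:
    """Wave-2 reliability from three-wave covariances (hypothetical helper).

    Uses VAR(T2) = cov12 * cov23 / cov13, which holds when errors are random
    and true-score change follows a lag-1 process.
    """
    if cov13 == 0:
        raise ValueError("cov13 must be nonzero")
    var_t2 = cov12 * cov23 / cov13
    return var_t2 / var2


def wave2_reliability_from_r(r12: float, r23: float, r13: float) -> float:
    """Same quantity expressed with correlations: r12 * r23 / r13."""
    return r12 * r23 / r13


# Hypothetical observed moments for waves 1 to 3.
v1, v2, v3 = 5.0, 4.74, 3.9936
c12, c23, c13 = 3.6, 2.992, 2.88

rho2 = wave2_reliability(c12, c23, c13, v2)
rho2_r = wave2_reliability_from_r(c12 / math.sqrt(v1 * v2),
                                  c23 / math.sqrt(v2 * v3),
                                  c13 / math.sqrt(v1 * v3))
print(round(rho2, 3), round(rho2_r, 3))  # both about 0.789
```

Neither function uses an equality constraint on the error variances, which mirrors the point that the wave-2 estimate does not depend on one.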
When estimates of reliability for waves 1 and 3 are also needed, the Wiley and Wiley (1970) approach to identifying the parameters of this model (the assumption of homogeneity of error variances over time) is often used. This appears on the face of it to be a reasonable assumption, but we have found it to be somewhat questionable in practice. As we have repeatedly stressed in prior work (see Alwin 1989, 2007), measurement error variance is a property of both the measuring device and the population to which it is applied. It may therefore be unrealistic to believe that it is invariant over occasions of measurement. If the true variance of a variable changes systematically over time because the population of interest is undergoing change, then the assumption of constant error variance necessitates a systematic change in the reliability of measurement over time, an eventuality that may not be plausible. For example, if the true variance of a variable increases with time, as is the case with many developmental processes, then, under the assumption of constant measurement error variance, measurement reliability will by definition increase over time.
Thus, it can be seen that the Wiley–Wiley assumption requires a situation of what might be called “dynamic equilibrium”—constant true score variance over time—one that may not be plausible in the analysis of developmental processes. Such a state of affairs is one in which the true variances are essentially homogeneous with respect to time. In order to deal with the possibility that true variances may not be constant over time, Heise (1969) proposed a solution to identifying the parameters of this type of quasi-simplex model which avoids this problem. He assumed that the reliability of measurement of variable Y is constant over time. Because of the way reliability is defined, as the ratio of true to observed score variance, Heise’s model amounts to the assumption of a constant ratio of variances over time. This model is frequently considered to be unnecessarily restrictive because it involves a strong set of assumptions compared to the Wiley–Wiley model. However, it is often the case that it provides a more realistic fit to the data (see Alwin and Thornton 1984; Alwin 1989, 1992, 2007; Alwin and Krosnick 1991). 6
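The contrast between the two identifying assumptions can be illustrated with a small sketch. It is not the authors' Mplus specification; the function names and the covariance matrix are hypothetical. Under the Wiley–Wiley constraint a single error variance is estimated at wave 2 and carried to waves 1 and 3, so reliability moves with the observed variances. Under the Heise constraint a single reliability is applied at every wave, so the implied error variances move instead.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def wiley_wiley(cov: Matrix) -> Tuple[float, List[float]]:
    """Equal error variances: return (error variance, reliabilities by wave)."""
    var_t2 = cov[0][1] * cov[1][2] / cov[0][2]   # VAR(T2) from the covariances
    theta = cov[1][1] - var_t2                   # common error variance
    rels = [(cov[t][t] - theta) / cov[t][t] for t in range(3)]
    return theta, rels


def heise(cov: Matrix) -> Tuple[float, List[float]]:
    """Equal reliabilities: return (reliability, implied error variances by wave)."""
    rho = cov[0][1] * cov[1][2] / (cov[0][2] * cov[1][1])
    thetas = [(1.0 - rho) * cov[t][t] for t in range(3)]
    return rho, thetas


# Hypothetical three-wave covariance matrix in which observed variance declines.
cov = [[5.0, 3.6, 2.88],
       [3.6, 4.74, 2.992],
       [2.88, 2.992, 3.9936]]

theta_ww, rel_ww = wiley_wiley(cov)
rho_h, theta_h = heise(cov)
print("Wiley-Wiley reliabilities:", [round(r, 3) for r in rel_ww])
print("Heise reliability:", round(rho_h, 3))
print("Heise implied error variances:", [round(v, 3) for v in theta_h])
```

Both approaches agree at wave 2 by construction; they differ only in what they assert about waves 1 and 3, which is why the choice matters most when observed variances differ across occasions.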
Analysis of the HRS Financial Data
We employed the currently available SEM software, specifically Mplus (Muthén and Muthén 1998–2010), to obtain the information necessary to provide reliability estimates using the above-mentioned model. As indicated earlier, we first analyzed the income composite variables created by the HRS study staff for the public use files. The variables employed in this analysis include substantial amounts of imputed data, and one may reasonably question the sense it makes to estimate parameters of models for data that are so heavily laden with chance. One of the major problems with estimating reliability of these composite variables is that they are a complex result of several different stages of measurement, from the response process, to the use of brackets, to imputation. Normally, when we estimate reliability of survey responses, we exclude data generated by a process other than the respondent (see Alwin 2007). We nonetheless employed these data but moved beyond the analysis of the reliability of these composites to a consideration of reliability estimation for reports of income from the various sources. The results reported below suggest that the imputation creates a serious problem of outliers, and this is in part why, in our analysis of self-reports of the separate components, we do not use any imputed data (except where we consider the bracketed reports).
In order to include a case from the HRS database in our analysis of a given income source, there were several requirements it had to satisfy. First, the designated financial respondent must be present at all three waves. Second, the eligibility of the income source must be recorded at all three waves (e.g., for the analysis of wages and salaries, we need to know whether R is working at each of the waves, and for the analysis of the reporting of Social Security, we need to record whether the respondent is eligible at all three waves) so we can determine whether a given case should be in the sample in a given analysis. Third, within the above-mentioned constraints, the total sample was defined to include cases that have at least one wave of income data for a particular source, regardless of whether the data were actual reported values ($$$), “income brackets” (BPE), or “imputed values” (IMP) developed by HRS staff. (Note that there are two types of imputation involved, within-bracket imputation for the BPE estimates and IMP for nonresponse or otherwise missing data. These are both imputed values, but we distinguish between the two for present purposes.)
The total sample can be partitioned into several nonoverlapping subsamples that are combinations of these three types of data across the waves of the study. These combinations of data are shown in Table 5 for the “Respondent wages and salary” variable we used earlier to illustrate the nature and types of income data provided in the HRS data set. (Note that in Table 5, R_WAGE = 0 means that these are cases in which no actual dollar amounts were reported, R_WAGE = 1 means that there was one wave in which actual dollar amounts were reported, etc.) 7 Here there are 1,796 cases (of a total of 5,910) in which the financial respondent provided actual income data in response to the initial question about wages and salary at all three waves. There are another 1,534 cases that provided actual income data at two waves, with the other wave being either missing (designated with a “.” in Table 5) or a bracketed response (designated as “BPE” in Table 5), and another 1,946 that provided actual income data at one of the three waves (i.e., coded 1 on R_WAGE). Finally, the remaining cases (N = 634) provided at least one wave of bracketed responses (what we here call “ball park estimates” or BPE). Cases that were missing at all three waves are excluded from the analysis, that is, a given case had to have either $$$ or BPE data at one or more waves. Also, because we account for type of data, cases were removed if the coding at any wave did not make clear whether the value was actual, ball-parked, or imputed.
Distribution of Cases by Number of Waves in Which Actual Dollar Amounts Were Reported.
Note. $$$ = respondent reports an amount; BPE = a “ball park estimate” is provided using brackets; “.” = missing due to nonresponse; R_WAGE = the number of waves in which $$$ is available.
Our analysis of the HRS data was done using two different approaches. First, method A included the total sample cases (i.e., cases with at least one wave of data on $$$ or BPE) and considered the IMP data to be missing (and handled by FIML in our analysis). There is also a listwise version of this sample, which requires nonmissing data at all three waves. For the example given in Table 5, there are N = 2,396 cases with either actual value ($$$) or bracketed (BPE) data at each of the three waves; this excludes cases with missing data at one or more of the three waves. Second, method B uses only actual values ($$$) and treats all other data at a given wave, whether imputed or BPE, as missing. This second approach clearly produces a set of cases that is a subset of method A. There is also a listwise version that includes only the 1,796 cases having actual income values (non-bracketed) at all three waves.
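The two coding schemes can be sketched as a recoding rule applied wave by wave. This is an illustration only: the data layout (a value paired with a source code for each wave), the function names, and the example case are hypothetical and do not reflect actual HRS variable names. The rule that a method B case needs at least one actual value to enter the total sample is also an assumption made here for the sketch.

```python
from typing import List, Optional, Tuple

# One report per wave: (dollar value or None, source code).
Report = Tuple[Optional[float], str]

ACTUAL, BRACKET, IMPUTED, MISSING = "$$$", "BPE", "IMP", "."


def analysis_values(waves: List[Report], method: str) -> List[Optional[float]]:
    """Recode one case: method A keeps $$$ and BPE, method B keeps only $$$.
    Everything else becomes None, to be handled as missing (e.g., by FIML)."""
    keep = {ACTUAL, BRACKET} if method == "A" else {ACTUAL}
    return [value if kind in keep else None for value, kind in waves]


def in_total_sample(waves: List[Report], method: str) -> bool:
    """At least one wave with usable data after recoding."""
    return any(v is not None for v in analysis_values(waves, method))


def in_listwise_sample(waves: List[Report], method: str) -> bool:
    """Usable data at every wave after recoding."""
    return all(v is not None for v in analysis_values(waves, method))


def r_wage(waves: List[Report]) -> int:
    """Number of waves with an actual dollar amount (the R_WAGE count)."""
    return sum(kind == ACTUAL for _, kind in waves)


# Hypothetical case: actual report, bracketed report, fully imputed value.
case: List[Report] = [(42000.0, ACTUAL), (38000.0, BRACKET), (40000.0, IMPUTED)]

print(analysis_values(case, "A"))   # [42000.0, 38000.0, None]
print(analysis_values(case, "B"))   # [42000.0, None, None]
print(r_wage(case), in_total_sample(case, "A"), in_listwise_sample(case, "A"))
```

Under this rule the set of values retained by method B is always a subset of those retained by method A, which matches the nesting of the two samples described above.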
In Table 6, we show the percentage of cases at each wave that involved imputation based on the bracketed data, and the percentage of cases in which data were missing and hence were imputed to come up with an income value, for the 21 income and assets self-reports we analyze. (Note that the N given in this table for each triad corresponds to the number of cases for the total sample employed by method A; see Tables 8 and 9.) In general, the proportion of missing (IMP) cases is not large, and the bulk of the imputation involves the bracketed data (BPE). For several of the assets measures (e.g., value of business or farm, or value of stocks), the extent of cases imputed from bracketed data involves roughly one-quarter of the cases, with another 10 percent or so imputed from missing data. In other words, the largest subset of the data we analyze here employs actual income values for which respondents gave an income figure in response to the initial questions.
Percentage of Cases Employing Brackets and Imputation: Selected Measures, Health and Retirement Study (HRS) 1998 to 2002.
Note. BPE = ball park estimate; IMP = imputed values. Method A: includes both actual and bracketed responses; imputed values are handled with full-information maximum likelihood (FIML). Method B: bracketed responses are treated as missing and, along with imputed values, are handled with FIML.
aThe data set in this case has been trimmed of outliers.
For several of the HRS income variables, we experienced some severe effects on the parameters of the models when there was a substantial number of “high end” outliers. In several cases, we had to delete up to 5 percent of the data at the top of the income distribution for a given source/type of income in order to reduce the effects of these extreme values on the variances of the distributions. In cases where this was done, we were able to achieve some reasonable stability in the variance properties of the data and estimates of reliability. One of the things we learned from analyzing the HRS data is that one must be very wary of the imputed data—roughly 60 percent of the cases we considered outliers were imputed values (IMP rather than BPE data). In the discussion below, we indicate the cases in which we “trimmed” off “high end” outliers.
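The effect of a few extreme values on the variance can be illustrated with a small sketch. The exact trimming rule used in the analysis is not spelled out here, so the rule below (drop the largest values, up to a fixed share of cases, from a single distribution) is an assumption, and the function name and data are hypothetical.

```python
import math
from statistics import pvariance
from typing import List


def trim_top(values: List[float], share: float = 0.05) -> List[float]:
    """Drop the largest values, removing at most `share` of the cases.
    Returns the retained values in ascending order (hypothetical helper)."""
    if not 0.0 <= share < 1.0:
        raise ValueError("share must be in [0, 1)")
    n_drop = math.floor(len(values) * share)
    ordered = sorted(values)
    return ordered[:len(ordered) - n_drop]


# Hypothetical income reports: 19 ordinary values plus one extreme value
# of the kind that top-bracket imputation might produce.
reports = [20000.0 + 1000.0 * i for i in range(19)] + [5_000_000.0]
trimmed = trim_top(reports, share=0.05)

print(len(reports), len(trimmed))
print(pvariance(reports) > pvariance(trimmed))  # True
```

Because reliability is a ratio of variances, a single extreme value of this kind can dominate the observed variance at one wave and destabilize the estimate, which is the motivation for trimming.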
Results
This section of the article begins with an examination of the reliability of two composite variables created by the HRS study staff for the public use data files: household income and assets. Here we present the key findings from our examination of the reliability of these composites and their components. We also document some of the problems that occur in the estimation of parameters of standard reliability models due to the HRS use of imputation. In short, we found some serious outliers that affected the results, about 60 percent of them due to imputation. Following a presentation of these findings, the article provides a detailed examination of the source-specific estimates of measurement reliability for several of the major components that are contained in these composite variables. Due to our skepticism about the veracity of the imputed values provided in the HRS data sets, these later analyses ignore any imputed data other than data that involved bracket-related imputation.
Reliability of Composite Variables
The results in Table 7 present two estimates of the reliability of measurement for the two HRS composite variables: household income and household assets. This table presents the Heise estimates (i.e., the wave-2 reliability estimates) in all cases. Results indicate significantly lower levels of reporting reliability of the composite variables in the HRS relative to levels found for “stylized” income approaches used in other surveys, which obtain a summary estimate of household income for the past calendar year from a single question (see Alwin 2007:302-4). What may seem odd, however, is that the HRS composite measures are no more reliable than these stylized measures and in some cases their reliability is substantially lower. This could be the basis for a concern about the exhaustive, time-consuming approach to measuring income in the HRS, employing hundreds of questions and a substantial amount of survey interview time, given that it does not seem to yield a more reliable estimate of household income. As is the case in other studies, these data show that the measurement reliability for measures of assets is slightly lower than that for reports of household income, but overall, both scores are modestly reliable, at .73 to .79 (as shown in Table 7).
Reliability Estimates for Health and Retirement Study (HRS) Income Variables from Quasi-Markov Simplex Models—Heise Estimates.
aMethod of handling missing data in the total sample is by full-information maximum likelihood (FIML). bNote the data sets have been trimmed for outliers.
Given the extent of effort and the time taken up in the HRS questionnaire by income questions, why is the ultimate composite variable not more reliable? This seems to be the obvious question that arises from these results. This may well be a case where the gains in validity achieved by the HRS in the measurement of income far outpace the gains in reliability. The HRS data, particularly in their use of bracketing and of domain-specific reports of income, are considered much more valid (see Juster and Smith 1997; Hurd, Juster, and Smith 2003), but these approaches may not produce salutary effects on the reliability of the data. We turn to the further examination of some of the reasons for this in the next section of the article, where we separately analyze each of several of the components of the HRS composite income and assets variables.
Reliability of Component Variables
In Tables 8 and 9, we report the results of our analysis of 21 component variables among the HRS measures of household income and assets. We include all measures for which three waves of data were available from the financial respondent (see Tables 2 and 3), where the question clearly referred to the same object of measurement over time, and there were sufficient cases to analyze (at least 100 cases). As noted earlier, the analysis of the income data was done separately by source (see Tables 2 and 3), although there were several variables that had to be deleted from the analysis because of ambiguity of the target of the question across waves. As noted in the foregoing, following the assumptions of the classical true score model, the reliability for the observed score Yt is the ratio of the true score variance to the observed variance, that is,

\[ \frac{\mathrm{VAR}(T_t)}{\mathrm{VAR}(Y_t)} \]
Estimates of Reliability from Quasi-Markov Simplex Models for Income and Asset Reports from the Health and Retirement Study, 1998 to 2002.
Note. FIML = full-information maximum likelihood. Method A: includes both actual and bracketed responses; imputed values are handled by FIML. Method B: bracketed responses are treated as missing and, along with imputed values, are handled with FIML.
aEstimates for this triad were slightly above 1.0, likely due to sampling error. For the purposes of these analyses, we have fixed the reliability estimates to 1.0 in this case in order to keep them within the theoretical range.
Estimates of Reliability from Quasi-Markov Simplex Models for Income and Asset Reports from the Health and Retirement Study, 1998 to 2002.
Note. Method A: includes both actual and bracketed responses; imputed values are handled by FIML. Method B: bracketed responses are treated as missing and, along with imputed values, are handled with FIML.
aEstimates for this triad were slightly above 1.0, likely due to sampling error. For the purposes of these analyses, we have fixed the reliability estimates to 1.0 in this case in order to keep them within the theoretical range.
Our analysis of the HRS data, such as that shown in Tables 8 and 9, was done using the two different approaches discussed earlier—method A, which included the total sample cases (i.e., must have at least one wave of data on $$$ or BPE) and the “imputed” data are recoded to be missing and method B, which includes only those cases with actual values ($$$ cases) and considers all other cases—BPE and imputed data at a given wave—as missing data. As we already mentioned, there are also listwise versions of both of these samples, and we present all four types of analyses in Tables 8 and 9.
The results in Tables 8 and 9 indicate on average relatively high estimates of reliability of measurement of components of income (generally above .8 on average). There is obviously some variation in the mean levels of reliability. These results indicate that levels of reliability vary by type of income source. For example, our results indicate that reports of monthly benefit levels from sources such as Social Security or VA achieve near-perfect levels of reliability (i.e., reliability estimates of 1.0), whereas somewhat less regular sources of household income are measured less reliably (closer to reliability estimates of .6 or .7). In general, we find the reliability of many of the components of the household composites of income and assets to be substantially higher than the reliability of the composites, again suggesting that the income and asset composites may introduce substantial amounts of random error, thereby reducing the reliability of the measures.
Factors Affecting Reliability of Components
The results in Tables 8 and 9 are presented according to four different combinations of approaches, essentially resulting from the cross-classification of method A versus method B and nature of the sample/approach to missing data: total sample (FIML) or the listwise sample. The systematic analysis of these different approaches allows us to examine whether these different approaches introduce differences in average levels of reliability. Here we employ the Heise estimates of reliability in an analysis of the factors affecting the reliability of the components (note that in this case, these are the wave-g [wave-2] estimates given in Tables 8 and 9).
The numbers in Table 10 show that method B is slightly superior to method A, that is, those reliability estimates that were based on actual values, using FIML for all other responses, are slightly greater in magnitude than those based on data that include the imputations within brackets. This is a remarkable finding in that it suggests that from the point of view of high reliability, accepting initial Don’t Know responses or refusals as missing data and using FIML tools to deal with their “missingness” is a better strategy than the bracketing approach. Recall that the bracketing approach routes all Don’t Know responses and refusals through a series of brackets, selecting a category and then using imputation techniques for producing values within the brackets. This suggests that the bracketing approach basically adds to error variance and hence unreliability. The use of FIML techniques appears to be slightly better, at least with respect to the level of reliability.
Average Estimates of Reliability from Quasi-Simplex Models for Income Reports in the Health and Retirement Study (HRS), 1998 to 2002.
The implications of these numbers were examined using regression models that adjust for the clustering of measures across the cross-classification scheme in Table 10. Each triad has four values, and the lack of independence in these four estimates for each triad is taken into account in our regression models. Table 11 displays the results of a regression analysis of factors affecting levels of reporting reliability for the HRS income variables. The number of cases in this regression corresponds to the number of cases for which we have separate estimates of reliability (i.e., 84 estimates). These results indicate that inclusion of the bracketed data in the analysis of reliability has a negative effect of roughly .036 on reliability, and the use of listwise versus FIML approaches to handling missing data has virtually the same effect on estimates of reliability. The coefficient for the FIML versus listwise variable is nonsignificant in this particular set of measures, and there is no evidence for any statistical interaction between the two sets of factors. Finally, the results suggest that these factors reflect a small, but significant, contribution to variation in reliability estimates.
Regression Analysis of Factors Affecting Reliability Estimates for Health and Retirement Study (HRS) Income Variables.
Conclusion
The limitations of one-shot cross-sectional designs for reliability estimation of survey questions have been known for many years (see Coleman 1964, 1968; Goldstein 1995). One major disadvantage of the cross-sectional approach derives from the inability to design replicate measures that can be used within a single survey that meets the requirements of the statistical models for estimating reliability. The assumption of the independence of errors is problematic—specifically, in order to interpret the parameters of such models in terms of reliability of measurement, it is essential to be able to rule out the operation of memory, or other forms of consistency motivation, in the organization of responses to multiple measures of the same construct (see Alwin 2007).
Here we have employed a longitudinal approach to assessing reliability of survey measurement which overcomes the limitations of the cross-sectional designs for reliability estimation. Although the traditional test–retest designs also have several limitations, the multiple-wave reinterview design used here goes well beyond them: the limitations of the test–retest design can be overcome by incorporating three or more waves of data separated by lengthy periods of time (see Moser and Kalton 1972:353-54; Alwin 2007:96-110). Specifically, by using models that permit change in the underlying true score, the quasi-Markov simplex approach allows us to overcome one of the key limitations of the test–retest design (see Heise 1969; Wiley and Wiley 1970; Alwin 2007). And through the use of design strategies with relatively distant reinterview intervals, the problem of consistency due to retest effects or memory could be remedied or at least minimized. Finally, given the fact that reliability estimation is based on the idea of replicate measures, the assumption of the independence of errors is substantially less problematic when such longitudinal designs are used.
When we apply these models for reliability estimation to HRS income data, we find that levels of reliability for the two HRS composite variables (household income and household assets) are significantly lower than levels found for “stylized” income approaches. This suggests that the major gains in measurement quality in the HRS have more to do with validity than they do with reliability. Of course, another reason for asking about specific sources of income and types of assets is that these are important in their own right as well as for estimating total household income and wealth. Our analysis of the specific sources or components that make up the HRS composite measures indicates that levels of reporting reliability vary by type of income source. For example, our results indicate that reports of monthly benefit levels from sources such as Social Security or VA achieve near-perfect levels of reliability (i.e., reliability estimates of 1.0), whereas somewhat less regular sources of household income are measured less reliably (closer to reliability estimates of .8 or .85). In general, we find the reliability of many of the components of the household composites of income and assets to be substantially higher than the reliability of the composites. It deserves mention that by imputing so many values in these income sources, substantial amounts of random error are produced, reducing the reliability of the data. Thus, one major conclusion resulting from this research, which may be beneficial to users of the HRS surveys, concerns the use of imputation in the handling of missing data. We found that imputation of values for top-end open income brackets can produce a substantial number of outliers that affect sample estimates of relationships and levels of reliability.
Finally, recall that one of the features of the HRS approach to income and asset measurement involves the use of bracketing (or coarse categorization) of responses to questions about income and assets when respondents refuse or respond that they “don’t know.” For the HRS studies, this was explicitly rationalized and undertaken on the basis that it would reduce the missing data problem in household wealth surveys, which showed improved estimates of income and assets when imputed values based on the bracketed responses were included. However, this approach clearly introduces measurement error (within brackets), and our results indicate that assessments of levels of reliability with and without the bracketed data are substantially different. To our knowledge, while these imputation approaches are becoming more widely used, there has been little or no research undertaken to investigate whether these “unfolding” or “bracketing” approaches introduce more error, or whether other approaches to missing data, for example, FIML, may be superior (Wothke 2000). Although the aim is to obtain a response that is within a relatively narrow range, depending upon the nature of the bracketing categories, the within-bracket imputations may introduce measurement error.
We estimate the reliability of data that include bracketed responses to be some .035 lower on average than estimates based on nonbracketed data (levels of about .775 versus .810). These bracketing approaches are becoming more widely used in many types of surveys in many different cultural contexts, and users of these surveys may not understand that the “within-bracket” imputation involved actually introduces measurement error. Our results suggest that one might be better off accepting the refusals and Don’t Know responses as missing data, using techniques such as FIML in the analysis, rather than employing the bracketing approach to dealing with such types of nonresponse. Of course, we understand that from the point of view of validity, these measurement approaches may be required, but it is important to understand that while these approaches may be shown to raise reporting levels, they may also contribute to greater levels of measurement error variance.
Footnotes
Acknowledgment
The authors acknowledge the assistance of Halimah Hassan, Mona Ostrawski, and Alyson Otto in conducting this research. Richard Campbell and two anonymous reviewers provided valuable comments on an earlier draft.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research reported here was supported in part by a grant “Aging and the Reliability of Measurement” awarded by the Behavioral and Social Science division of the National Institute on Aging (R01-AG020673).
