Abstract
The Social Security Administration funded the development of a screener that could accurately classify household members into one of four disability groups (likely disabled, possibly disabled, not disabled, or current social security disability beneficiaries) for purposes of a larger national study. The authors developed a questionnaire, tested five screener algorithms on data from a pilot study, and assessed the performance of each screener in identifying individuals who are likely disabled, possibly disabled, and not disabled. An algorithm using the item response theory methodology of Rasch modeling offered the greatest improvement over the original screener algorithm and appeared to be quite superior to the other alternatives.
The Social Security Disability Insurance (DI) program is the largest social insurance program serving the working-age population with disabilities, paying more than US$108 billion in benefits in 2009 to approximately 9 million disabled workers, widow(er)s, and adult children. The Supplemental Security Income (SSI) program is the largest means-tested program serving people with disabilities, paying more than US$40 billion in benefits in 2009 to over 8 million needy aged, blind, and disabled people. Changes in the labor market or changes to the DI or SSI programs often lead to changes in the size and composition of the population entering the programs, and planning for these changes is critical to managing the programs properly. To plan for such changes, the Social Security Administration (SSA) and disability policymakers need reliable data on the working-age nonbeneficiary population with a health-related impairment that is severe enough to qualify for the programs.
One of the major challenges that SSA faces in obtaining data on this population is that it uses a unique definition of disability. The Social Security Act (1935) defines disability as the inability to engage in any substantial gainful activity by reason of any medically determinable physical or mental impairment that (a) can be expected to result in death or (b) has lasted or can be expected to last for a continuous period of not less than 12 months. Surveys using a less specific definition of disability, therefore, cannot correctly estimate the number of potential Social Security disability beneficiaries in the United States.
SSA made some attempts to gather information relevant to its disability determination process by fielding its own disability surveys in the 1960s and 1970s. More recently, SSA researchers developed a model for determining SSA disability by matching Survey of Income and Program Participation (SIPP) data to SSA disability determination records (Hu, Lahiri, Vaughan, & Wixon, 2001). They then used this model to estimate that 2.9% of the general population who are not SSA disability beneficiaries qualified medically for SSA disability benefits (Dwyer, Hu, Vaughan, & Wixon, 2002/2003).
In an effort to identify a method to collect better data on potential Social Security disability beneficiaries, SSA awarded a contract to design and pilot a study that provided the opportunity to collect original data on the size and characteristics of the disabled, nonbeneficiary population of the United States. In this article, we compare the performance of five screener algorithms used to identify persons who medically met the Social Security definition of disability. The algorithms were tested on data collected in a pilot study conducted in four U.S. counties (Cobb County, Georgia; Carroll County, Maryland; Granville County, North Carolina; Washington County, Pennsylvania) in 2002. The pilot included a random sample of 3,900 households, representing 7,465 household members aged 18 to 69.
Method
The pilot study used a four-step process to establish whether household members were technically disabled according to the Social Security definition of disability. 1 Each step moved the study along from a population-based household sample to subsamples of individual household members who were assessed more thoroughly to determine their eligibility for Social Security disability benefits.
This article describes the administration and validation of a short set of questions we refer to as a “Household Screener.” We used the Household Screener to classify members into one of four disability groups—not likely disabled, possibly disabled, likely disabled, and current beneficiary. The screener was designed to be brief, administered over the telephone, and answered by a household respondent aged 18 years or older who could report on all household members.
The Household Screener was designed to emphasize sensitivity over specificity. Thus, the aim in developing the Household Screener was to include all the items needed to maximize the likelihood of identifying persons with moderate or severe physical or mental impairment. One aspect of maximizing sensitivity was to ensure that the screener identified persons that a household reporter might have difficulty identifying through an interview, such as persons with certain mental illnesses. The main concern in developing the Household Screener was to minimize the risk of misclassifying someone “likely disabled” as “not likely disabled.” Minimizing this type of misclassification was important because persons classified as not likely disabled were subsampled at a low rate for subsequent stages of the study and thus carried a large sampling weight to the data analysis. The large weight given to subsampled persons classified as not likely disabled occurs because each of these persons serves to represent a large number of persons in the general population. If some people are classified into the not likely disabled study group by the Household Screener algorithm, and are found to be disabled later in the study, their large sampling weights would seriously degrade the precision of the survey estimates.
Disability is a complex and multifaceted concept. In recognition of this fact, multiple areas of functioning and task performance were covered in the Household Screener, in addition to specific health conditions and assistive devices that are indicative of severe impairment. Allocation of individuals to the four disability groups was accomplished using algorithms that took multiple indicators into account and that produced scores that were used to classify participants. Both the approach to defining presence and severity of disability and the algorithm used to implement the definition were based on Nagi’s model of disability (Nagi, 1976; 1991) and the Institute of Medicine’s conceptual framework (Pope & Tarlov, 1991).
Sample
The final sample of individuals with complete information on the Household Screener items included 7,465 individuals (6,998 nonbeneficiaries and 467 beneficiaries). We examined the reliability of the responses to the Household Screener items by comparing them to the responses from individuals in the follow-up interview.
From these 7,465 individuals, we identified 792 to use in three validation groups for our analysis. The first group includes 467 Social Security Disability Insurance (SSDI) beneficiaries, the second includes 70 individuals who were identified in our pilot as allowed nonbeneficiaries (i.e., they were determined to meet the SSA disability medical criteria but were not collecting disability benefits), and the third includes 255 individuals identified in our pilot as nonallowed nonbeneficiaries (i.e., they were determined to not meet the SSA disability medical criteria and were not collecting benefits). The last two validation groups went through medical examinations and had disability folders constructed to determine whether they would be allowed onto the SSDI program based on their medical condition. These “known” validation groups provided several opportunities to evaluate the performance of the Household Screener.
Procedures
We constructed the original household questionnaire to perform three functions. First, it was designed to collect information that was required for administering the interview (e.g., identifying all household members of appropriate age) and for sampling purposes (e.g., number of telephone lines for households sampled by telephone). Second, information was gathered to describe the screened population and to compare characteristics of this population with those of other national population-based surveys. Third, information was obtained to classify individuals into one of the four disability groups. The majority of the content of the questionnaire was focused on questions that would assess the likelihood that respondents have a disability. We refer to these sections throughout this article as the Household Screener.
The Household Screener was developed through consultation with subject matter experts and from analyses of data from the National Health Interview Survey’s Disability Supplement (NHIS-D; National Center for Health Statistics, 1994a, 1994b). We developed questions used in the Household Screener either specifically for the study or by drawing from previous national surveys and extant instruments.
We used previously developed items whenever possible for two main reasons. First, these items provided a basis for comparing our sample with national and/or normative data. Second, existing items had already been tested for reliability and validity. The relevant item sets concerned functional limitations, activities of daily living (ADLs) and instrumental activities of daily living (IADLs), health conditions, and perceptions of disability. Two additional sets of existing questions included in the screener were of particular interest to SSA: the six-item set developed for the Census 2000 and the 12-item version of the Short Form-36 (the SF-12), a widely used measure of physical and mental health status developed for the Medical Outcomes Study (Ware, Kosinski, & Keller, 1996). Each of these sets of items had the potential of providing very brief alternative algorithms on which to base the disability classification.
The Household Screener algorithm used the questionnaire data in a priority-based staged approach to classify household members into one of the four disability groups. At Stage 1, individuals were identified as either SSDI or SSI beneficiaries. All those not classified as current beneficiaries proceeded to Stage 2.
At Stage 2, a subset of the remaining household members were classified as likely disabled based on disease/impairment indicators, task limitations and limitations in social role functioning, and functioning of major body systems and senses, necessary to meet SSA’s listing of impairments. We classified indicators of disease/impairment as present or absent, and classified persons as likely disabled if a reported disease/impairment lasted or was expected to last 12 months or more. 2 SSA’s definition of disability requires that an impairment last 12 months or more. Within the task limitation category, ADL functioning was measured by adding individual scores (on a scale ranging from 0 = no difficulty performing to 4 = needs help performing) across six tasks—bathing, dressing, eating, transferring, toileting, and getting around inside the home. Individuals with scores above 12 (out of a total possible of 24) were considered as likely disabled. A similar additive score (this time using a different scale: 0 = no difficulty performing the task to 3 = needs help or supervision to perform the task) was developed for IADL functioning using six tasks. The tasks included preparing meals, shopping for personal items, managing money, doing light housework, doing heavy housework, and managing medication. Scores of this type were also developed for each of the functional limitation indicators in the algorithm. For example, a lower extremity score was constructed using scores from four items (standing, climbing stairs, sitting, and using assistance/equipment to sit). The literature on measurement of functioning, advice from technical experts, and empirical analysis of items from the NHIS-D were used as the basis for developing these indicators and cut points. Those household members not yet classified moved on to Stage 3.
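The additive ADL score described above can be sketched as follows. This is a minimal illustration, assuming each of the six tasks is coded from 0 (no difficulty) to 4 (needs help); the task keys and function names are hypothetical and are not taken from the study's own code.

```python
# Hypothetical keys for the six ADL tasks named in the text.
ADL_TASKS = (
    "bathing", "dressing", "eating",
    "transferring", "toileting", "getting_around",
)

def adl_score(responses):
    """Sum the 0-4 item scores over the six ADL tasks (maximum 24)."""
    total = 0
    for task in ADL_TASKS:
        value = responses[task]
        if not 0 <= value <= 4:
            raise ValueError(f"{task}: score {value} is outside 0-4")
        total += value
    return total

def adl_likely_disabled(responses, cut=12):
    """Apply the Stage 2 rule: a score above 12 indicates likely disabled."""
    return adl_score(responses) > cut
```

Under this rule a person scoring 2 on every task (total 12) is not flagged on the ADL indicator, because the threshold is strictly above 12.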
At Stage 3, a subset of the remaining household members was classified as possibly disabled using the same indicators and functional measures as in Stage 2, but with less stringent thresholds. The threshold for possibly disabled was set at a less severe level of impairment, based on an analysis of NHIS-D data.
Household members still not classified after Stages 1, 2, and 3 were allocated to the not likely disabled study group in Stage 4.
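The priority ordering of the four stages can be summarized in a short sketch. It assumes that the Stage 2 and Stage 3 indicator checks have already been reduced to two Boolean flags; the field names are hypothetical stand-ins for the full set of indicators and cut points.

```python
def classify_household_member(person):
    """Assign one of the four study groups using the staged priority order.

    `person` is a dict with three hypothetical Boolean fields:
      is_beneficiary      - reported SSDI or SSI beneficiary (Stage 1)
      meets_likely_rule   - any indicator passes the stricter threshold (Stage 2)
      meets_possibly_rule - any indicator passes the looser threshold (Stage 3)
    """
    if person["is_beneficiary"]:
        return "current beneficiary"
    if person["meets_likely_rule"]:
        return "likely disabled"
    if person["meets_possibly_rule"]:
        return "possibly disabled"
    # Stage 4: everyone still unclassified.
    return "not likely disabled"
```

Because each stage only sees members left unclassified by the earlier stages, a beneficiary who also meets the likely disabled criteria is still assigned to the current beneficiary group.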
We constructed the Household Screener algorithm prior to the collection of the pilot study data. The study plans called for an extensive analysis of those data to see whether an improved algorithm could be developed. With input from a technical advisory panel consisting of disability, Social Security, and survey development experts, four alternative algorithms for allocating household members to the likely disabled, possibly disabled, and not disabled study groups were developed.
Measures
One alternate approach to the Household Screener algorithm involved a revision of the original algorithm in light of the pilot study experience. A second and third approach used two alternate algorithms embedded in the questionnaire, the Census 2000 disability items and the SF-12, to screen for disability. Finally, a fourth approach used Rasch modeling (Wright & Stone, 1979) to construct a single measure of disability based on a special subset of data produced from the pilot study. An assessment of the performance of each alternative was conducted by applying each new algorithm to the three validation groups discussed previously (beneficiaries, allowed nonbeneficiaries, and nonallowed nonbeneficiaries) and comparing the results with the original algorithm.
Using a revised pilot study Household Screener algorithm
Analyses of the pilot study data resulted in two changes to the original Household Screener algorithm. One change involved moving the cut points in classifying persons as either possibly disabled or likely disabled. The second change concerned revisions to the definitions of the various components of the algorithm. These changes were made with two objectives in mind. The first objective was to retain sensitivity to persons who are likely disabled while increasing sensitivity to those who are possibly disabled. The second objective was to decrease the likelihood of classifying possibly disabled household members as likely disabled. We present a summary of these changes in Tables 1 and 2. As noted in the tables, no changes were made in the revised Household Screener algorithm with respect to the use of selected assistive devices or to problems in communicating, seeing, or breathing.
Table 1. Components in the Original Household Screener Algorithm and Their Status in the Revised Algorithm: Likely Disabled.
Note: ADL = activities of daily living; IADL = instrumental activities of daily living.
Table 2. Components in the Original Household Screener Algorithm and Their Status in the Revised Algorithm: Possibly Disabled.
Note: ADL = activities of daily living; IADL = instrumental activities of daily living.
We classified persons receiving help or using special equipment in three or more ADLs or three or more IADLs as likely disabled in the original and the revised algorithms. However, persons receiving help in only one or two of either ADLs or IADLs were reclassified in the revised algorithm from likely disabled to possibly disabled. Among persons reporting a lot of difficulty, but no assistance, we considered only those with a lot of difficulty in four or five ADLs or more than one IADL possibly disabled in the revised algorithm. Persons reporting only some difficulty, regardless of the number of ADLs or IADLs, were classified as not likely disabled in the revised algorithm.
In the original algorithm, we assessed lower extremity difficulty by verbal reports about climbing, standing, and sitting. We added walking across a small room (mobility) to the revised algorithm. We classified individuals likely disabled if it was reported that they had a lot of difficulty or were unable to walk across a small room, or reported they had a lot of difficulty or were unable to perform any two of the other three items. We classified individuals as possibly disabled if they reported that they had a lot of difficulty or were unable to climb, stand, and sit, and had some difficulty in additional items (e.g., walking across a small room). The items used to assess upper extremity limitations (reaching, grasping, lifting, and using hands together) did not change. However, reports of a lot of difficulty or inability were required in at least two of these areas to be classified likely disabled on these grounds, and for at least one area to be classified possibly disabled. We classified persons with only some difficulty on all items possibly disabled under the original algorithm but not likely disabled under the revised algorithm.
The major change in definition of an algorithm component was made in assessing disability resulting from the presence of mental disorders. In the original algorithm, we based the classification of likely disabled resulting from a mental disorder on items from the NHIS-D. We classified as likely disabled individuals who reported having any condition on a list of serious mental disorders or another emotional condition that seriously interfered with daily activity. We classified as possibly disabled persons who reported that any one of several mental symptoms seriously interfered with daily activity, or that they used mental health-related services or medications and “accomplished less” or were “less careful” than usual. These items were examined in both the nonbeneficiary Household Screener sample and the sample of beneficiaries (salted cases in addition to self-reported), as shown in Table 3. Among beneficiaries endorsing any mental health item, the highest percentage of individuals (45.1%) reported a diagnosis and the occurrence of symptoms that interfered with activities. By contrast, the most commonly reported items among nonbeneficiaries endorsing any mental health item were medications or SF-12 items (45.1%)—without any accompanying report that mental health symptoms seriously interfered with daily activity.
Table 3. Distribution of Type of Mental/Emotional Items Endorsed Among Persons Endorsing at Least One Mental Health Item.
Note: SF-12 = Short Form-12.
Consists of salted beneficiaries and others reported as beneficiaries in the Household Screener.
Medications for ongoing emotional problems and SF-12 items may or may not also be endorsed.
Two SF-12 items: (a) accomplishes less because of emotional problems and (b) is not careful in activities because of emotional problems.
Based on the data in Table 3, we decided to modify the criterion for classifying persons who endorse a mental item as likely disabled to include an endorsement of interference in daily activities. Further support for this change is that the use of criteria to classify a person as likely disabled based on functional outcomes of mental disorders (e.g., symptoms that interfere with activities), rather than diagnoses alone, is more consistent with the approach used in classifying persons with physical disorders. Thus, the proposed revision to the algorithm takes this approach to classify persons as likely disabled. Because some persons in the beneficiary group also reported (a) diagnoses only, (b) symptoms that interfered with activities but absent diagnoses, and (c) medications for ongoing emotional problems, we decided that these items would be used in the revised algorithm to classify nonbeneficiary household members as possibly disabled.
In the original algorithm, problems with interpersonal interaction were treated as a separate indicator for classifying persons likely disabled or possibly disabled. In the revised algorithm, we combined this indicator with the indicator for mental disorders as one of the symptoms evaluated in terms of interference with activities.
We added an item on severe hearing impairment to the revised algorithm. We made the addition because of concerns that the original algorithm could miss a person with profound deafness. These persons would meet criteria for SSA disability benefits, but they may not report any interference with their activities.
The original and revised algorithms classify persons into the four study groups (current beneficiary, likely disabled, possibly disabled, and not likely disabled) by assessing first beneficiary status and then disability across the indicators described in the previous section. Because we did not change the criteria for determining disability beneficiary status in the revised algorithm, the proportion of persons classified as beneficiaries is unaltered. Table 4 presents the differences in the proportions of nonbeneficiaries classified into the disability categories.
Table 4. Classifications of the Pilot Study Household Screener Sample According to the Original and Revised Screener Algorithms (N = 7,465).
Note: SSA = Social Security Administration.
Criteria are the same for original and revised algorithm.
The decrease from 7.0% to 5.4% in the percentage of persons classified likely disabled under the revised algorithm is mainly attributable to the change in the mental health criteria. Under the revised algorithm, we classify persons with mental health disorders as likely disabled based on symptoms or diagnoses that interfere with activities, rather than on diagnoses alone.
The increase in the percentage of possibly disabled household members with the revised algorithm is a combination of two counteracting effects. On one hand, changes to criteria for ADL limitations, IADL limitations, lower extremity limitations, and upper extremity limitations reduced the percentage of persons falling into the possibly disabled group according to these measures. We classify many of the individuals with mild disability in these areas (reporting some difficulty in one or more items) as not disabled in the revised algorithm. On the other hand, changes to mental health criteria substantially increase the percentage of persons qualifying as possibly disabled on this indicator. The net effect is an increase in the percentage of persons classified as possibly disabled from 2.9% to 7.4%. Because of the algorithm changes, the overall percentage of persons classified as not disabled nonbeneficiaries decreases in the revised algorithm by 3 percentage points, from 84.9% to 81.9%.
Using the Census items
The six disability-related items presented in Table 5 were included in the long form of Census 2000. These items provide an indication of the prevalence of disability in the nation. The first two questions (1a and 1b) attempt to identify persons with sensory impairments and physical impairments that limit activity. Question 2a intends to identify persons with mental impairments. Question 2b intends to identify persons with ADL limitations, and Question 2c intends to identify persons with IADL limitations. Question 2d addresses the ability to perform the social role function of working. Andresen, Fitch, McLendon, and Meyers (2000) argue that these items do not provide valid measures of disability when compared with traditional survey questions used to measure disability. However, for the current purpose, the aim is not to measure disability per se but to distinguish among persons who are likely disabled, possibly disabled, and not likely disabled. We examined them here for this purpose.
Table 5. Disability Items Contained in the Census 2000.
Note: To make the Household Screener questionnaire more efficient for screening purposes, the Census items were modified to ask about the entire household at once (e.g., Does anyone in the household have . . . ) rather than about each person individually. If an affirmative answer was given, the person in the household with the given problem was specified in a follow-up question.
Inability to work (defined as “inability to engage in any substantial gainful activity”) is a requirement in the statutory definition of eligibility for being a disability beneficiary. Thus, current beneficiaries may be expected to endorse the item about having difficulty working at a job or business. However, the question remains whether this item will work equally well among nonbeneficiaries who are disabled enough to meet the medical eligibility criteria for disability, but who still work—precisely one group of persons that SSA would like to know about.
With only six dichotomous items, it was difficult to develop cut points for the three disability groups. We decided that persons who endorsed any one of the impairment-related items (1a, 1b, 2a, 2b, or 2c) should be classified as possibly disabled. In addition, people endorsing only the difficulty working at a job or business item were classified as possibly disabled, because they may have an impairment that is not covered under 1a, 1b, 2a, 2b, or 2c. We classified persons endorsing the difficulty working at a job or business item in addition to one of the other items as likely disabled, because inability to work is one of the criteria for receiving Social Security disability benefits. We classified all other persons as not likely disabled.
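The Census-item rule just described reduces to a few set checks. The sketch below is illustrative only; it assumes the item labels 1a through 2d used in the text, and the function name is hypothetical.

```python
IMPAIRMENT_ITEMS = {"1a", "1b", "2a", "2b", "2c"}  # sensory, physical, mental, ADL, IADL
WORK_ITEM = "2d"                                   # difficulty working at a job or business

def classify_census(endorsed_items):
    """Classify a person from the set of Census 2000 items endorsed."""
    endorsed = set(endorsed_items)
    has_impairment = bool(endorsed & IMPAIRMENT_ITEMS)
    has_work_difficulty = WORK_ITEM in endorsed
    if has_work_difficulty and has_impairment:
        return "likely disabled"
    if has_work_difficulty or has_impairment:
        return "possibly disabled"
    return "not likely disabled"
```

Note that endorsing several impairment items without the work item still yields possibly disabled; only the combination with the work item moves a person to likely disabled.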
Using the SF-12
The SF-36 was designed for use in clinical research, health-policy evaluations, and general population surveys (Ware et al., 1996). The SF-12 is an empirically derived subset of 12 items from the 36-item SF-36, derived with the purpose of reducing respondent burden while maintaining acceptable precision. The trade-off between using the SF-12 or the SF-36 involves a choice between respondent burden and precision of scores, with the SF-12 requiring 2 to 5 min to administer on average and the SF-36 requiring 5 to 10 min on average. 3 We selected the SF-12 to be a part of the Household Screener because we considered the loss in reliability acceptable for screener purposes, given the decreased respondent burden.
The SF-36 and the SF-12 have been used as measures of physical and mental health status and quality of life. It remains to be seen whether these measures could be used to distinguish among persons who are likely disabled, possibly disabled, and not likely disabled. Thus, to examine the utility of retaining these items for future studies in place of the Household Screener algorithm, we tallied the 12 items according to the directions in the scoring manual (Ware, Kosinski, & Keller, 1998) to calculate physical and mental component summary scores. Scores were calculated for all 7,465 individuals for whom complete Household Screener data existed. Given the SF-12 physical component and mental component summary scores, the next step was to use these scores in combination to classify these individuals. Instead of determining these groups by inspection or heuristically, we used a binary recursive partitioning algorithm (Ciampi, Chang, Hogg, & McKinney, 1987; Clark & Pregibon, 1992) as implemented in S-PLUS (S-PLUS 6 for Windows, 2001). To apply this algorithm, we constructed a data set of the SF-12 summary scores for the entire sample of 351 beneficiaries for whom an SF-12 score was available and a random sample of 488 nonbeneficiary nonfolder cases. The idea was to partition the bivariate continuum of SF-12 scores in this data set into homogeneous regions where the probability (density) function of the response variable (beneficiary status) at a given score is independent of the other SF-12 scores. The SF-12 scores are first partitioned into two mutually exclusive sets (left and right regions), so that the probability functions over these two regions have the largest difference (as measured by the log likelihood deviance) when compared with any other partition of the SF-12 scores. Thus, the cut points are determined by identifying probability functions that yield the largest deviance in terms of the log likelihood function for each partition.
Further partitioning of the left (and right) region continues in a similar vein, yielding additional cut points as desired. For the current application, the objective was to form three distinct but homogeneous partitions that we could use to classify individuals as not disabled, possibly disabled, and likely disabled.
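To make the splitting criterion concrete, the sketch below searches for the single cut point on one score that gives the largest reduction in binomial deviance for a binary response such as beneficiary status. It is a simplified illustration of the deviance criterion under stated assumptions, not the S-PLUS implementation used in the study; the function names and the toy data in the usage note are hypothetical.

```python
import math

def node_deviance(labels):
    """Binomial deviance of a node: -2 * log likelihood at the node's own proportion."""
    n = len(labels)
    k = sum(labels)
    if n == 0 or k == 0 or k == n:
        return 0.0  # an empty or pure node contributes no deviance
    p = k / n
    return -2.0 * (k * math.log(p) + (n - k) * math.log(1.0 - p))

def best_split(scores, labels):
    """Return (cut, deviance_reduction) for the best split `score < cut`.

    Candidate cuts are midpoints between adjacent distinct score values.
    Returns (None, 0.0) when no split reduces the deviance.
    """
    pairs = list(zip(scores, labels))
    values = sorted(set(scores))
    parent = node_deviance(labels)
    best_cut, best_gain = None, 0.0
    for low, high in zip(values, values[1:]):
        cut = (low + high) / 2.0
        left = [y for x, y in pairs if x < cut]
        right = [y for x, y in pairs if x >= cut]
        gain = parent - node_deviance(left) - node_deviance(right)
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain
```

For example, with toy scores `[20, 25, 28, 40, 45, 50]` and labels `[1, 1, 1, 0, 0, 0]`, the best cut falls at 34.0, midway between 28 and 40. Applying `best_split` again within the left or right region would give further cut points, which is the recursive step described above.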
The binary recursive partitioning analysis produced the following classification. We classified individuals with a physical component score greater than 43.5 and a mental component score greater than 45.5 as not likely disabled. This category had a total sample of 472: 55 beneficiaries and 417 nonbeneficiaries. We classified individuals with a physical component score below 30.5 as likely disabled irrespective of their mental component score. This category had a total sample of 181: 168 beneficiaries and 13 nonbeneficiaries. We classified all other individuals as possibly disabled (that is, individuals with a physical component score equal to or greater than 30.5 and a mental component score less than or equal to 45.5, and individuals with a physical component score between 30.5 and 43.5, irrespective of their mental component score). This category had a total sample of 186: 128 beneficiaries and 58 nonbeneficiaries. This classification does not, of course, distinguish perfectly between beneficiaries and nonbeneficiaries. In particular, 55 of the 351 beneficiaries in this data set (16%) are classified as not likely disabled, and the rate of misclassification is likely to be higher in a different data set. This type of misclassification is the most serious one for the screener.
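The SF-12 cut points reported above translate into a simple decision rule. The sketch below encodes that rule and recomputes the share of beneficiaries classified as not likely disabled from the counts given in the text; the function and variable names are hypothetical.

```python
def classify_sf12(pcs, mcs):
    """Classify from SF-12 physical (pcs) and mental (mcs) component summary scores."""
    if pcs < 30.5:
        return "likely disabled"          # regardless of the mental score
    if pcs > 43.5 and mcs > 45.5:
        return "not likely disabled"
    return "possibly disabled"            # everything in between

# Beneficiary counts per category, as reported in the text.
beneficiaries = {
    "not likely disabled": 55,
    "likely disabled": 168,
    "possibly disabled": 128,
}
total_beneficiaries = sum(beneficiaries.values())            # 351
missed_share = beneficiaries["not likely disabled"] / total_beneficiaries
print(f"{missed_share:.1%} of beneficiaries classified as not likely disabled")
```

The printed share is about 15.7%, which rounds to the 16% quoted in the text.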
We also used data from the SSA’s Accelerated Benefits (AB) Demonstration project to provide a comparison group outside of the study. The AB project was designed to test the impact of providing health insurance to newly entitled SSDI beneficiaries during the Medicare waiting period. The scores included in this table come solely from the control group (SSDI beneficiaries aged 18–54) in the AB project, to avoid any treatment effects. We used the SF-36, which uses the same eight scales as the SF-12, to compute the physical and mental component summary scores for the AB sample. Of the 612 in the AB sample, 29 were classified not likely disabled, 269 were classified possibly disabled, and 314 were classified likely disabled.
Using a Rasch-constructed measure of disability
In a fourth attempt to develop a strategy for classifying nonbeneficiary household members, all of the content-relevant questionnaire items from the Household Screener were subjected to a Rasch analysis (Wright & Stone, 1979). Under the Rasch model, the response to any given item is a function of the respondent’s position on the latent trait plus certain parameter(s) associated with that item (survey question). From responses to the Household Screener items, Rasch modeling estimates the subject (respondent) and item (question) parameters using individual response patterns. In essence, the Rasch approach used the full set of Household Screener items (including the SF-12 and Census items) to determine whether there was any subset that could provide a meaningful measure of disability.
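As a point of reference, the standard dichotomous form of the Rasch model gives the probability of endorsing an item as a logistic function of the difference between the person's position on the latent trait and the item's difficulty. The sketch below shows that textbook form only; the article does not state which Rasch variant or software was used, and the parameter names are assumptions.

```python
import math

def rasch_probability(theta, difficulty):
    """Probability of endorsing a dichotomous item under the Rasch model.

    theta      - person's position on the latent trait (here, disability severity)
    difficulty - item location parameter on the same logit scale
    """
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))
```

When the person and item parameters are equal the probability is 0.5, and it rises toward 1 as the person's position exceeds the item's difficulty. Items kept with three scale points would need a polytomous extension, which is not shown here.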
To address measure development and measure validation, we used a carefully selected subset of the pilot study data to develop the Rasch measure. We decided that, where possible, we would not use the sets of data used for validation in the development of the Rasch measure of disability to avoid tainting the subsequent validation exercise. The full screener data set for the pilot study consisted of 7,465 completed screeners divided into four groups: 467 beneficiaries, 70 allowed nonbeneficiaries, 255 nonallowed nonbeneficiaries, and 6,673 nonbeneficiaries (nonfolder cases for whom no medical determination with regard to disability status was made). The first three groups comprise the data sets used to assess the validity (estimates of sensitivity and specificity) of the original Household Screener algorithm and were used later to evaluate the four alternative algorithms introduced in this article. For purposes of developing the Rasch model of disability, we decided to treat the 467 beneficiaries as known disability cases and to take them plus a random sample of 500 nonbeneficiary nonfolder cases (from the remaining group of 6,673) as the data set for the model development. The entire group of 467 disability beneficiaries was included in this data set to cover the wide range of conditions represented in the SSA regulations governing eligibility.
Although, ideally, the data sets used for development would not be used for validation to avoid tainting the validation exercise, it was not possible in this case. Because we used the entire group of disability beneficiaries in the data set for the Rasch model development, we had no “nondevelopment” beneficiary data for validation and thus sacrificed the potential for complete independence in the validation. However, the allowed nonbeneficiary group would remain intact for this purpose. 4
The Rasch analysis using the data set of 967 cases proceeded through several iterations before a satisfactory measure of disability was constructed. The final measure has an internal consistency (reliability) of .85. Table 7 presents the final set of 26 items, along with the original and modified scale points used in the Rasch analysis. As indicated in the table, the scale points for all but two items were reduced to a dichotomy, scored 0 and 1. The functional limitations items and the IADL items each began with four scale points, and the ADL items began with five scale points; all of them ended with only two scale points. The Health Limits Light Activities and Health Limits Strenuous Activities items each began with four scale points and ended with three.
We computed a Rasch score for each of the 7,465 individuals for whom Household Screener data existed. We calculated the scores by summing item scores over the 26 items selected by the Rasch analysis. The minimum possible was 0 and the maximum was 28 (recall two items were scored 0, 1, or 2).
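The scoring rule described above amounts to a simple sum, which can be sketched in Python as follows. The function name and the way the items are passed in are assumptions made for illustration; only the item counts (24 items scored 0 or 1, 2 items scored 0, 1, or 2) come from the text.

```python
N_DICHOTOMOUS = 24   # items scored 0 or 1
N_THREE_POINT = 2    # items scored 0, 1, or 2

def rasch_raw_score(dichotomous, three_point):
    """Sum item scores over the 26 retained items (hypothetical helper)."""
    if len(dichotomous) != N_DICHOTOMOUS or len(three_point) != N_THREE_POINT:
        raise ValueError("expected 24 dichotomous and 2 three-point item scores")
    if any(x not in (0, 1) for x in dichotomous):
        raise ValueError("dichotomous items must be scored 0 or 1")
    if any(x not in (0, 1, 2) for x in three_point):
        raise ValueError("three-point items must be scored 0, 1, or 2")
    return sum(dichotomous) + sum(three_point)

# The extremes reproduce the stated score range.
MIN_SCORE = rasch_raw_score([0] * 24, [0] * 2)
MAX_SCORE = rasch_raw_score([1] * 24, [2] * 2)
print(MIN_SCORE, MAX_SCORE)
```

The maximum of 28 follows from 24 × 1 + 2 × 2.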
As with the setting of the SF-12 cut points, we used binary recursive partitioning to determine cut points for the Rasch scores, using the algorithm implemented in S-PLUS. The procedure was applied to a data set that consisted of the entire sample of 467 beneficiaries combined with a new random sample of 500 nonbeneficiary nonfolder cases.
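The core step of binary recursive partitioning on a single score can be sketched as a search for the threshold that best separates the two groups. The Python sketch below is a generic illustration that uses Gini impurity as the splitting criterion; it is not the S-PLUS routine, whose criterion and settings are not described here, and the toy data are made up.

```python
def gini(pos, neg):
    """Gini impurity of a node holding pos beneficiaries and neg nonbeneficiaries."""
    n = pos + neg
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)

def best_cut(scores, labels):
    """Return the threshold t that minimizes the weighted Gini impurity
    of the split 'score <= t' versus 'score > t'."""
    pairs = list(zip(scores, labels))
    n = len(pairs)
    best_t, best_impurity = None, float("inf")
    for t in sorted(set(scores))[:-1]:          # the largest value cannot split the data
        left = [y for s, y in pairs if s <= t]
        right = [y for s, y in pairs if s > t]
        impurity = (len(left) * gini(sum(left), len(left) - sum(left))
                    + len(right) * gini(sum(right), len(right) - sum(right))) / n
        if impurity < best_impurity:
            best_t, best_impurity = t, impurity
    return best_t

# Made-up toy data: 1 = beneficiary, 0 = nonbeneficiary.
toy_scores = [0, 1, 1, 2, 2, 3, 4, 5, 6, 9, 12, 15]
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
first_cut = best_cut(toy_scores, toy_labels)
print(first_cut)
```

Repeating the same search within one of the two resulting subsets would yield a second cut point and hence three score ranges. How the procedure was adjusted in practice to produce two cut points is not reproduced in this sketch.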
We adjusted the binary recursive partitioning algorithm to yield two cut points, enabling classification of the Rasch scores into the three nonbeneficiary study groups. Individuals with a Rasch score of 0 to 2 are classified as not likely disabled. There were 481 in that category: 19 beneficiaries and 462 nonbeneficiaries. Those with Rasch scores of 3 or 4 are classified as possibly disabled. There were 72 in that category: 50 beneficiaries and 22 nonbeneficiaries. Those with Rasch scores from 5 to 28 are classified as likely disabled. There were 414 in that category: 398 beneficiaries and 16 nonbeneficiaries. With this classification, 19 of the 467 beneficiaries (4%) are classified as not likely disabled.
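The resulting classification rule can be written as a short function. The sketch below is in Python; the function name is hypothetical, while the cut points and the counts in the cross-tabulation are those reported above.

```python
def classify_rasch_score(score):
    """Map a Rasch raw score (0 to 28) to a nonbeneficiary study group."""
    if not 0 <= score <= 28:
        raise ValueError("Rasch raw score must be between 0 and 28")
    if score <= 2:
        return "not likely disabled"
    if score <= 4:
        return "possibly disabled"
    return "likely disabled"

# Counts reported for the 967-case data set used to set the cut points.
REPORTED_COUNTS = {
    "not likely disabled": {"beneficiaries": 19, "nonbeneficiaries": 462},
    "possibly disabled": {"beneficiaries": 50, "nonbeneficiaries": 22},
    "likely disabled": {"beneficiaries": 398, "nonbeneficiaries": 16},
}

# Consistency check: the column totals should match the two samples.
total_beneficiaries = sum(c["beneficiaries"] for c in REPORTED_COUNTS.values())
total_nonbeneficiaries = sum(c["nonbeneficiaries"] for c in REPORTED_COUNTS.values())
print(total_beneficiaries, total_nonbeneficiaries)
```

The column totals recover the 467 beneficiaries and the 500 sampled nonbeneficiaries.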
Results
Initial Analysis
To evaluate the responses to the Household Screener, we first assessed reliability by comparing the disability status of individuals as reported by the household head with the disability status reported by the individual in the follow-up interview. A high level of agreement between the two indicates that the household head is able to respond adequately for individuals within the household.
The results suggest that the Household Screener identifies about the same number of SSA disability beneficiaries as that reported in SSA publications, and it performs a little better than the NHIS questions. The study yielded two opportunities to assess Household Screener sensitivity to the identification of disability beneficiaries. One opportunity was to check the status of beneficiaries “salted” into the study. The Household Screener identified 94.4% of those beneficiaries who were listed as living at the “salted” addresses. The second opportunity to assess Household Screener sensitivity was provided by participants who gave written permission to have their SSA records reviewed. Of the 540 cases for which records were searched, the screener yielded an overall accuracy of 95%.
We then evaluated the validity of our five methods for classifying individuals into three groups: likely disabled, possibly disabled, and not disabled. We used three validation groups: SSDI beneficiaries, individuals identified as allowed nonbeneficiaries, and nonallowed nonbeneficiaries. We report the sensitivity and specificity of each of the five methods for each validation group.
Reliability
We repeated the questions used in the Household Screener during an in-person follow-up interview for the express purpose of investigating the consistency of responses from one application to the next. Several variables relate to the consistency of the responses between the screener and follow-up interviews, including the type of respondent (i.e., proxy vs. self) and the mode of data collection (telephone vs. in-person).
The Household Screener classification results based on interviews with household reporters were compared with the classification results obtained from interviews directly with the individual. No significant differences were found, suggesting that the household reporter provided information about disability as well as the target person himself or herself did.
We also compared the Household Screener classification results obtained over the telephone to classification results obtained in person, in cases where both were available. Again, the results revealed that the consistency of the disability classification was the same whether the interview was conducted in person or over the telephone.
Table 6 presents the results obtained by applying the original and each of the four alternative algorithms to the Household Screener data for the 467 beneficiaries and the AB sample. Note that the cut points for the SF-12 and Rasch algorithms were developed using the beneficiary group, whereas the cut points for the revised Household Screener and Census algorithms were set independently of that group. The ability of the SF-12 and Rasch algorithms to classify the disability status of beneficiaries is therefore likely overestimated.
Table 6. Percentages of Beneficiaries Assigned to the Three Disability Classes by the Original and Four Alternative Algorithms.
Note: SF-12 = Short Form-12.
Note: Sample sizes below 467 reflect dropped cases due to missing data for questions comprising the particular algorithm.
Recall that the SF-12 and Rasch cut points both were constructed using the 467 beneficiaries.
The percentage of beneficiaries who were identified by each algorithm as not likely disabled, which we refer to as the false negative rate, ranged between 4.1% and 16.7%. Similar to the method used by Houtenville, Erickson, and Bjelland (2009) to calculate false negative rates using SSI and survey data, we used SSDI data combined with each screener algorithm to calculate false negative rates. The Census and the Rasch algorithms had the lowest false negative rates, at 5.8% and 4.1%, respectively. The Rasch measure performed best by this criterion. It also captured the most beneficiaries in the likely disabled group (85.2%). In general, the revised Household Screener algorithm functioned better than the original, missing only 11.8% of the beneficiaries compared with the 16.7% missed by the original algorithm. The revised algorithm performed better than the SF-12 but not as well as either the Census or Rasch algorithms.
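The false negative rate defined above is the share of known beneficiaries placed in the not likely disabled class. As a worked check in Python, the sketch below applies that definition to the beneficiary counts reported earlier for the Rasch cut points (19, 50, and 398 of 467). The function name is hypothetical.

```python
def class_percentages(counts):
    """Convert class counts to percentages of the group total, to one decimal."""
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 1) for label, n in counts.items()}

# Beneficiary counts by Rasch class, as reported with the cut points.
rasch_beneficiary_counts = {
    "not likely disabled": 19,   # false negatives
    "possibly disabled": 50,
    "likely disabled": 398,
}

rasch_pct = class_percentages(rasch_beneficiary_counts)
false_negative_rate = rasch_pct["not likely disabled"]
print(rasch_pct)
```

These counts give a false negative rate of about 4.1% and a likely disabled share of about 85.2%, in line with the Rasch figures quoted in the text.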
Validity
The allowed nonbeneficiary validation group was the purest of the three validation groups, but it contained only 70 nonbeneficiaries who met the SSA medical eligibility criteria for disability. Table 7 presents the results of applying the various algorithms to the allowed nonbeneficiary group. Success in identifying allowed nonbeneficiaries is in many ways similar to the results found with the beneficiaries. We mainly measured success by the smallest percentage of false negatives (i.e., “alloweds” classified as not likely disabled). The Census algorithm resulted in the largest false negative classification with 15.7%. The Rasch algorithm resulted in the smallest percentage of false negative classifications with 4.3%. The other three algorithms missed between 7.1% and 14.3%. Given the false negative rates produced by the various alternative algorithms using these data, it appears that only the Rasch and SF-12 algorithms have the broad sensitivity to disability needed for screening purposes.
Table 7. Percentages of Allowed Nonbeneficiaries Assigned to the Three Disability Classes by the Original and Four Alternative Algorithms.
Note: SF-12 = Short Form-12.
When applied to the nonallowed nonbeneficiary validation group, a successful algorithm would result in a large percentage of cases classified as not likely disabled, a small-to-modest percentage classified as possibly disabled, and a very small-to-negligible percentage classified as likely disabled. Table 8 presents the results of applying the alternative algorithms to the nonallowed nonbeneficiary group. First, a look at the key criterion involving the true negative rates suggests that the Census and Rasch algorithms are the most successful in correctly classifying people into the not likely disabled group. They classified 60.2% and 57.5% not likely disabled, respectively. The two Household Screener algorithms were considerably less successful in this regard, with the revised Household Screener algorithm classifying 27.2% as not likely disabled and the original algorithm so classifying 23.2%. The SF-12 classified a very low 6.7% as not likely disabled.
Table 8. Percentages of Nonallowed Nonbeneficiaries Assigned to the Three Disability Classes by the Original and Four Alternative Algorithms.
Note: SF-12 = Short Form-12.
The false positive rates reveal a substantially different pattern among the algorithm alternatives. The SF-12 and Census algorithms were most successful with regard to this criterion. They classified 13% and 17.7% as likely disabled, respectively. The Rasch algorithm comes in a not-too-distant third with a false positive rate of 24.2%. Neither the original nor the revised Household Screener algorithm performed well: the original had a false positive rate of 67.3%, and the revised had a false positive rate of 43.3%.
Discussion
All four alternative algorithms improve on the original Household Screener algorithm in their ability to differentiate between persons who met the SSA medical eligibility criteria for disability and those who did not. The revised Household Screener functioned better than the original, but not as well as the Census, SF-12, and Rasch algorithms. The Census, SF-12, and Rasch algorithms are also shorter, which may reduce the burden on respondents and lower the cost of administering the screener.
The Rasch algorithm offered the greatest improvement over the original screener and appears to be superior to the other alternatives. With regard to sensitivity to persons with known disabilities, the Rasch algorithm missed only 4.1% of the beneficiaries and only 4.3% of the allowed nonbeneficiaries, compared with the original algorithm, which missed 16.7% and 11.4%, respectively. The Census algorithm came closest to the Rasch algorithm, missing 5.8% and 15.7%, respectively.
The specificity of the Rasch algorithm was also more impressive than that of the other algorithms. The true negative rate (60.2%) was better than the true negative rate for the Census algorithm (57.5%) and far better than that of any of the other algorithms (6.7%, 23.2%, and 27.2%). It does well in classifying nondisabled persons as not likely disabled and also does well in classifying disabled persons as possibly or likely disabled.
The six-item Census algorithm offers the least amount of respondent burden of all the alternative screener algorithms. It requires less than a minute to administer. It performed somewhat better than the Household Screener algorithms and the SF-12 algorithm. Its major weakness was the number of people with a disability that it missed: almost 6% of the beneficiaries and nearly 16% of the allowed nonbeneficiaries. Although it could be argued that the Census algorithm provides the greatest household screening advantage given its very short length, the losses in sensitivity and specificity outweigh the potential minimal gain that could be achieved by shortening the questionnaire by just a few minutes compared with Rasch items.
The SF-12 performed better than the original and revised Household Screeners, but not as well as the Census and Rasch algorithms. A major weakness was its very small true negative rate: only 6.7% of nonallowed nonbeneficiaries were identified as not likely disabled. Our use of the SF-36 was more promising, however, with a lower rate of false negatives (beneficiaries placed in the not likely disabled group). In this case, gains in sensitivity could outweigh the negative impacts of adding 24 items.
Limitations of the Study
Improvements to the data on the size and composition of the working-age population that is at risk of entering the SSDI program can play an important role in helping SSA plan for potential changes in workloads and allocate resources to provide good service to the public. This pilot gave us the opportunity to compare various sets of survey questions in terms of each set’s ability to accurately predict Social Security disability, and it provided the best evidence to date on the means of identifying our population of interest. However, it used a limited sample in a limited number of sites. To determine whether any of these algorithms are truly useful, testing on a wider scale is necessary.
It is important to note that the presence of a disability is not the only factor that establishes potential eligibility for disability benefits under the Social Security programs and that SSDI beneficiaries must have significant and recent work histories. An estimate based on disability alone may, therefore, overestimate the number of potential SSDI beneficiaries. Individuals meeting the SSA disability definition and not meeting nonmedical SSDI qualifications may, however, still qualify for federal disability benefits under the SSI program if they meet the SSI income and resource eligibility rules.
Another limitation of this study is the low response rate of the participants, which may lead to biased estimates. However, “the nonresponse rate of a survey alone is not a very good predictor of the magnitude of the bias” (Groves, 2006, p. 662). In some sense, the observed response rates can actually be considered high, given that they reflect the percentage of people who completed each of the three steps of the study, from beginning to end. Although there is no reason to believe this level of nonresponse leads to any bias in the study, we nonetheless urge caution when extrapolating these results outside this particular study. Even so, the results from this pilot project provide the first evidence that alternative algorithms offer a more reliable method of classifying individuals as nondisabled or disabled. Future large-scale applications of these algorithms are required to confirm these results.
Conclusion
The primary purpose of this study was to examine the possibility that we could monitor future changes in disability in a cost-effective manner through household- or self-reported survey data. Although we must weigh the potential costs of such an effort with its benefits, we believe that these initial results hold enough promise to warrant further exploration.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) declared receiving the following financial support for the research, authorship, and/or publication of this article: Co-authors Frey, Riley, Gonin, and Kalton received funding from the Social Security Administration for this research.
