Abstract
Background
Cross-national comparisons provide a window on the aging experience across varying societal contexts (Schoeni & Ofstedal, 2010). With the proportion of persons aged 60 and above worldwide projected to increase from 11% in 2007 to 22% by 2050 (Kinsella & He, 2009; United Nations, 2007), cross-national research on social, economic, and cultural variability in aging is more relevant than ever. A unique contribution of cross-national studies is the opportunity to identify aspects of the disablement process that are modifiable through policy or interventions, including behavior change and modifications to living environments (Figueras & McKee, 2012).
Longitudinal studies suggest that a significant proportion of older people experience cognitive decline (Yaffe et al., 2009), and that cognitive capacity, or the global ability to think and process information, is predictive of functioning in self-care and other activities of daily living (Njegovan, Hing, Mitchell, & Molnar, 2001). Prior research has shown that cognitive skills, such as memory, reasoning, and processing speed, may be modifiable by training individuals on strategies to improve cognitive performance (Ball et al., 2002; Gross & Rebok, 2011; Willis et al., 2006). For example, training older adults in mnemonic techniques of rehearsal, association, categorization, or imagery has improved performance on memory tests (Gross et al., 2012). Recognition that cognitive ability plays an important role in functioning (e.g., Gross & Rebok, 2011) has led to inclusion of cognitive measures in population-based studies of older people. One of the earliest to include such measures was the Asset and Health Dynamics of the Oldest Old Survey (AHEAD; Herzog & Rodgers, 1999; Herzog & Wallace, 1997), but cognitive assessments are now included in several large international studies of aging. Among these are the Health and Retirement Study (HRS) and the English Longitudinal Study of Aging (ELSA), which have a shared focus on health and economic issues related to aging. Pooling existing data across these major surveys, if done appropriately, can create novel opportunities for international research into the aging experience.
Two concerns often arise in pooling data to conduct cross-national research—whether items that are identically worded are being interpreted and responded to similarly across surveys and how to analyze a construct of interest when the number and content of items differ across surveys. For cognitive assessments, the logistical challenges of creating and norming cognitive assessments that are comparable cross-culturally yet appropriate in different contexts are well recognized (Ferraro, 2002; Hendrie, 2006; Nell, 1999). For example, a project to harmonize the Cambridge Cognitive Examination (a neuropsychological test battery used in dementia diagnosis; Roth, Huppert, Mountjoy, & Tym, 1999) across seven European countries used an intensive iterative consensus process to achieve cultural and translational comparability (Verhey et al., 2003). This type of cross-national standardization is costly and time-consuming to implement. Vignettes offer another methodology for harmonizing measures across international surveys (Bago d’Uva, O’Donnell, & van Doorslaer, 2008; King, Murray, Salomon, & Tandon, 2004; Salomon, Tandon, & Murray, 2004). In this approach, respondents are asked to rate a common set of vignettes, known as anchoring vignettes, and then provide a self-assessment along the trait of interest. As the underlying trait level in the anchoring vignettes is constant across different raters, comparing an individual’s ratings on the anchoring vignettes along with his or her self-rating provides a means to adjust for differences in how individuals approach the rating process. However, as vignette assessment data have to be collected for each item of interest, this approach to harmonization is not applicable after primary data collection is completed.
The varying objectives and breadth of issues covered by different surveys often limit the set of items included to measure constructs such as cognitive performance. The result is the use of different items, different numbers of items, and a limited set of common items across surveys. This holds true even for HRS and ELSA, where greater coordination of content has been achieved than is typical. Faced with varying items across data sets, researchers often exclude non-common items and collapse categories to achieve a set of comparable items that can be pooled (Pluijm, 2005). However, this approach leads to loss of information because it ignores the measurement contribution of excluded items.
Item response theory (IRT) offers two major advantages for harmonizing constructs, such as cognitive performance, that are assessed through multiple items. First, IRT can leverage common items across surveys to align scores along the same scale and generate comparable scores while retaining the additional information from available survey-specific items. This can potentially enhance score precision, particularly in the range of performance assessed by these survey-specific items. Second, IRT allows differential item functioning (DIF; Hambleton, Swaminathan, & Rogers, 1991; Holland & Wainer, 1993), perhaps due to cultural differences in interpretation, to be identified and accounted for during scoring. Empirical studies that ignore this information during scoring may introduce unnecessary measurement error or conflate actual group differences with measurement artifacts, potentially limiting the validity of their findings.
In this study, we investigated the value of IRT-based strategies for harmonizing measures of general cognitive performance across international surveys on aging using data from HRS and ELSA. We compared measurement consequences of using an IRT score: (a) based solely on the set of nine items common to both surveys; (b) using the common set of nine items, but adjusting for differentially functioning items; and (c) using all available items from each survey, with adjustment for items that show DIF. The main hypothesis tested is the following:
IRT scores based on all available items from each survey, adjusted for DIF, will have better measurement precision than IRT scores based on the common set only, whether adjusted for DIF or not.
Method
Data Sources
Data are drawn from the HRS and ELSA. Both surveys are longitudinal and share a focus on the social, economic, and health aspects of the lives of people aged 50 and above. The HRS is nationally representative of the United States and has been ongoing since 1992 (Juster & Suzman, 1995). The ELSA is representative of the United Kingdom and is one of a family of surveys patterned after the HRS that now extends to many countries (Banks, Marmot, Oldfield, & Smith, 2006). In this study, we used closely aligned years of data with HRS data from 2002 and ELSA data from March 2002 to March 2003 (Wave 1, Release 2).
Our analyses focus on participants aged 65 or above who were administered cognitive assessments in each survey. Final sample sizes for our analyses were 9,471 in HRS and 5,444 in ELSA. Age, sex, and education of the sample by survey are presented in Table 1. The dichotomous education variable uses items specific to each survey: HRS provides items on completed education and degrees; ELSA provides a seven-level categorical variable (500 cases classified as “foreign/other” were excluded).
Sample Characteristics by Survey (Unweighted).
Note. HRS = Health and Retirement Study; ELSA = English Longitudinal Study of Aging.
Measures
Nine measures were common to both surveys. Four of these items assessed orientation to date (i.e., today’s date, month, year, and day of week). Respondents were asked, “What is today’s date?” Credit was given for the correct date, month, year, and day of week separately. Probes were used for components not reported spontaneously. In addition, both HRS and ELSA included three numeracy items (one on disease prevalence, one on savings, and one on lottery winnings) and two word recall items (immediate and delayed).
Each survey also fielded unique items. The HRS fielded 16 cognitive items that were not found in ELSA. These items included counting backward (from 20 and from 86), serial subtraction from 100 by 7, defining five words from the Wechsler Adult Intelligence Scale (WAIS-R), naming the president and vice president, and recognizing and naming two familiar items (cactus and scissors). The four items administered only in ELSA included two numeracy items. One involved calculating a half price discount. The other involved determining the original cost of a used car priced at two thirds of its new car price. Naming animals and letter cancelation were the two remaining ELSA-only items. Further details on these assessments are available from the documentation for each survey (http://hrsonline.isr.umich.edu; http://www.esds.ac.uk/longitudinal/access/elsa/l5050.asp).
An important IRT assumption is that a set of items measures a single, or unidimensional, construct (Embretson & Reise, 2000; Stout, 1990). We used exploratory factor analysis to examine this assumption for the set of cognitive assessments in our samples. This analysis was conducted separately for HRS and ELSA as contemporary restrictions in missing data estimation prevent testing for dimensionality for HRS and ELSA items together (Curran et al., 2008). At present, no definitive criterion for determining unidimensionality exists. However, “sufficient” unidimensionality for IRT analysis (McHorney & Cohen, 2000) may be demonstrated if the proportion of the variance explained by the first factor is ≥20% (Reckase, 1979) and if the ratio of eigenvalues between the first and second factor is ≥4 (Reeve et al., 2007). Finally, strong factor loadings (>0.40) observed for all items on the first factor provide support for sufficient unidimensionality for valid IRT modeling. As a sensitivity analysis, we also implemented bifactor models to examine whether study results would substantially change if excess correlation among the numeracy items were accounted for. Bifactor models were implemented using Mplus (Version 7.11; Muthén & Muthén, 1998-2012).
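The three screening criteria described above can be expressed as a short check. The following is a minimal sketch in Python, not the procedure used in this study: the function name, the toy eigenvalues, and the toy loadings are hypothetical, and the variance share is computed relative to the sum of the eigenvalues supplied, which is an assumption about how the proportion is defined.

```python
def check_unidimensionality(eigenvalues, first_factor_loadings,
                            min_variance=0.20, min_ratio=4.0, min_loading=0.40):
    """Apply the three 'sufficient unidimensionality' criteria.

    eigenvalues: factor eigenvalues from an exploratory factor analysis.
    first_factor_loadings: loading of each item on the first factor.
    Thresholds default to the values cited in the text.
    """
    ordered = sorted(eigenvalues, reverse=True)
    # Share of variance attributed to the first factor (assumed: relative
    # to the sum of the supplied eigenvalues).
    variance_share = ordered[0] / sum(ordered)
    # Ratio of the first eigenvalue to the second.
    ratio = ordered[0] / ordered[1]
    # Weakest loading on the first factor, in absolute value.
    weakest = min(abs(loading) for loading in first_factor_loadings)
    return {
        "variance_share": variance_share,
        "eigenvalue_ratio": ratio,
        "weakest_loading": weakest,
        "sufficient": (variance_share >= min_variance
                       and ratio >= min_ratio
                       and weakest >= min_loading),
    }


# Hypothetical values for illustration only (not taken from Table 2).
toy_eigenvalues = [3.4, 0.4, 0.4, 0.4, 0.4]
toy_loadings = [0.70, 0.60, 0.80, 0.50, 0.75]
result = check_unidimensionality(toy_eigenvalues, toy_loadings)
```

Keeping the thresholds as parameters makes it easy to see how sensitive the conclusion is to the cutoffs, which is relevant because no definitive criterion exists.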
Score Derivation
Using the cognitive measures available from both surveys, we created three alternative sets of IRT scores based on (a) the common-items set, (b) the DIF-adjusted common-items set, and (c) the DIF-adjusted all-items set. The first score set was generated using IRT-estimated parameters for the nine items that were fielded in the two surveys, assuming no DIF was present. The second set uses the same common items, but DIF was evaluated and adjusted for during IRT modeling and scoring. The final set of scores adjusted for identified DIF items and added survey-specific items to the item set for parameter estimation and scoring. For all IRT modeling and scoring, we used the HRS sample as the reference group. Parameter and score estimates were scaled to the HRS (with HRS group mean set to 0; each unit of the scale = 1 standard deviation of the HRS sample).
Estimating item parameters
The first step in generating each IRT score set is the estimation of item parameters for both binary (e.g., correct, incorrect) and ordinal (e.g., partial credit for WAIS vocabulary items: 2 for completely correct, 1 for partially correct; quartiles for naming animals and letter cancelation; score ranging between 0 and 7 for immediate or delayed word recall) items. We collapsed scores >7 on the two word recall items to 7 and created quartiles based on the distribution of the letter cancelation and naming animal tests to facilitate IRT modeling. We used the graded response model (GRM; Samejima, 1969), which can accommodate both binary and ordinal items with different numbers of categories, to model item characteristics and to generate scores. The GRM estimates one discrimination (a) and k − 1 boundary location ($b_1, \ldots, b_{k-1}$) parameters, where k = number of response categories, for each item. The a parameter reflects the ability of an item to discriminate among persons with different levels of underlying cognitive performance. Higher a values indicate better discrimination. For binary items, the GRM is equivalent to the two-parameter logistic model, and the item location parameter (b) is the point on the cognitive performance trait scale where the probability of responding correctly to the item is 50%. For ordinal assessments, k − 1 binary comparisons are created from k response categories. For example, the letter cancelation item is categorized in quartiles: The three boundary parameters were based on the first quartile relative to all others ($b_1$), the first two quartiles relative to the last two ($b_2$), and the first three quartiles relative to the last ($b_3$). We implemented IRT models using Multilog (Thissen, Chen, & Bock, 2003).
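To make the role of the discrimination and boundary parameters concrete, the category probabilities implied by the GRM can be computed directly. The sketch below is illustrative only: the function names and parameter values are hypothetical, it is not the Multilog implementation, and it assumes a plain logistic link with no additional scaling constant.

```python
import math


def logistic(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def grm_category_probs(theta, a, bs):
    """Category probabilities under the graded response model.

    theta: latent cognitive performance level.
    a: item discrimination.
    bs: ordered boundary locations b_1 <= ... <= b_{k-1}.
    Returns a list of k probabilities, one per response category.
    """
    # Cumulative probabilities of responding in category j or higher;
    # by definition this is 1 for the lowest category and 0 beyond the highest.
    cumulative = [1.0] + [logistic(a * (theta - b)) for b in bs] + [0.0]
    # Each category probability is the difference of adjacent cumulative curves.
    return [cumulative[j] - cumulative[j + 1] for j in range(len(bs) + 1)]


# Binary item (one boundary): reduces to the two-parameter logistic model,
# so a respondent located exactly at b has a 50% chance of a correct answer.
binary_probs = grm_category_probs(theta=0.0, a=1.5, bs=[0.0])

# Ordinal item with four categories (e.g., quartiles), hypothetical parameters.
ordinal_probs = grm_category_probs(theta=0.3, a=1.2, bs=[-1.0, 0.0, 1.0])
```

With a single boundary the function returns two probabilities that sum to one, which is the binary special case described above; with three boundaries it returns four, one per quartile category.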
DIF identification
Likelihood ratio (LR) difference tests were used to test whether item parameters functioned differently by survey. As part of these tests, we first identified a set of “anchor items,” or items that do not demonstrate DIF (Teresi et al., 2007), among the nine common items. Identification of anchor items involved iterative LR tests to identify and exclude items that show DIF. In these LR tests, an assumption is made that all items other than the item being tested serve as adequate anchors in the initial round. Subsequent LR tests are performed only within the set of preliminary “anchor items” identified by the previous LR test. Additional items identified with DIF are excluded from the anchor set. This process is repeated until the set of anchor items includes no items demonstrating DIF.
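The iterative logic of this anchor selection can be sketched as a simple loop. This is a schematic illustration under stated assumptions rather than the software workflow used here: `has_dif` is a hypothetical callable standing in for an LR test of one item against the current anchors, and the item labels and the stub's behavior are invented for the example.

```python
def purify_anchors(items, has_dif):
    """Iteratively remove items showing DIF until the anchor set is clean.

    items: candidate anchor items (here, the common items).
    has_dif: hypothetical callable (item, anchors) -> bool that wraps an
             LR test of `item` using `anchors` as the linking set.
    """
    anchors = list(items)
    while True:
        # Test each remaining item, treating all the others as anchors.
        flagged = [item for item in anchors
                   if has_dif(item, [other for other in anchors if other != item])]
        if not flagged:
            return anchors
        # Drop flagged items and retest within the reduced anchor set.
        anchors = [item for item in anchors if item not in flagged]


def stub_lr_test(item, anchors):
    """Toy stand-in for an LR test: q2 always shows DIF, and q4 shows DIF
    only once q2 is no longer contaminating the anchor set."""
    if item == "q2":
        return True
    if item == "q4":
        return "q2" not in anchors
    return False


final_anchors = purify_anchors(["q1", "q2", "q3", "q4", "q5"], stub_lr_test)
```

The stub shows why more than one round can be needed: an item with DIF in the anchor set can mask DIF in another item until it is removed.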
Using the final set of anchor items, we tested for a difference in discrimination and location parameters by survey for each non-anchor item. To complement the statistical testing for DIF, we examined the magnitude of the DIF for each item by comparing the item characteristic curves (ICCs) for each survey. The ICCs are estimated from the model in which item parameters showing DIF are freely estimated for each survey. These curves plot the probability of endorsing the item over the range of cognitive performance. Differences in these curves for the two surveys reveal the magnitude and direction of the DIF at the item level. Non-overlapping ICCs by survey indicate DIF; coincident curves reflect absence of DIF.
Given the large samples in our study, we considered both results from statistical LR tests and the magnitude of the DIF identified through graphic analysis in determining whether the item should be modeled separately by survey before scoring. Specifically, if DIF was identified after the Benjamini–Hochberg adjustment for multiple comparisons (Benjamini & Hochberg, 1995; Thissen, Steinberg, & Kuang, 2002), we generated graphic displays of the ICC curves for each group to illustrate the nature of the DIF over the entire range of cognitive performance. For significant DIF items, we also produced expected score difference along the latent trait. To facilitate comparison between binary and ordinal items with different score range, we standardized the score range by dividing by the item score range. A between-group difference of 0.16 in the expected score for a binary item (e.g., 1 = correct, 0 = incorrect) would have a scaled difference of 0.16; a 0.89 difference for an ordinal item with a score range of 3 would have a scaled difference of 0.30. These score differences were subsequently used to evaluate the importance of the DIF.
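The two screening steps in this paragraph, the multiple-comparison adjustment and the rescaling of expected score differences, can be illustrated as follows. The sketch uses hypothetical function names and toy p values. The false discovery rate of 0.05 is an assumption, because the level used in the study is not stated here.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans marking which tests are rejected under the
    Benjamini-Hochberg procedure (alpha is an assumed FDR level)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank whose p value falls under its step-up threshold.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if pvalues[index] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[index] = True
    return rejected


def scaled_score_difference(expected_a, expected_b, max_score, min_score=0):
    """Between-group difference in expected item score, divided by the
    item's score range so binary and ordinal items are comparable."""
    score_range = max_score - min_score
    if score_range <= 0:
        raise ValueError("max_score must exceed min_score")
    return (expected_a - expected_b) / score_range


# Toy p values for four hypothetical DIF tests.
flags = benjamini_hochberg([0.01, 0.02, 0.03, 0.50])

# The two worked cases from the text: a 0.16 difference on a binary item,
# and a 0.89 difference on an ordinal item with a score range of 3.
binary_scaled = scaled_score_difference(0.66, 0.50, max_score=1)
ordinal_scaled = scaled_score_difference(1.89, 1.00, max_score=3)
```

Here `binary_scaled` is 0.16 and `ordinal_scaled` is approximately 0.30 (0.89 / 3), matching the arithmetic in the text.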
To ensure that only important DIF was mapped, any item with scaled difference of ≥0.10 (Perkins, Stump, Monahan, & McHorney, 2006) at any point along the latent trait was selected for further evaluation using the standardization methods described by Dorans and Kulik (2006). This approach evaluates the overall impact of the score difference across the range of latent trait to determine whether the item should be modeled for DIF. The sample size for ELSA at each score level served as the weight to calculate the standardized p-difference (STD PDIF), which can range between −1.0 and 1.0. We determined items with STD PDIF values between −0.05 and 0.05 to have negligible DIF as recommended (Dorans & Kulik, 2006).
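As a worked illustration of the standardization step, the standardized difference is a weighted average of the between-survey difference at each score level, with the ELSA sample size at that level as the weight. The sketch below uses hypothetical function names and toy numbers. Treating ELSA as the focal group with the difference taken as ELSA minus HRS, and treating the ±0.05 bounds as inclusive, are assumptions.

```python
def std_p_dif(focal_scores, reference_scores, focal_counts):
    """Standardized p-difference across score levels.

    focal_scores: scaled expected item score for the focal group (assumed
                  here to be ELSA) at each score level.
    reference_scores: the same quantity for the reference group (HRS).
    focal_counts: focal-group sample size at each level, used as weights.
    """
    if not (len(focal_scores) == len(reference_scores) == len(focal_counts)):
        raise ValueError("inputs must have the same length")
    total = sum(focal_counts)
    weighted = sum(n * (f - r)
                   for f, r, n in zip(focal_scores, reference_scores, focal_counts))
    return weighted / total


def is_negligible(value, bound=0.05):
    """Negligible DIF if the value lies within [-bound, bound] (assumed inclusive)."""
    return -bound <= value <= bound


# Toy example with three score levels (values are illustrative only).
elsa = [0.40, 0.60, 0.80]
hrs = [0.50, 0.65, 0.82]
counts = [100, 300, 600]
value = std_p_dif(elsa, hrs, counts)
```

In this toy case the weighted sum is −37 over 1,000 respondents, giving −0.037, which would fall in the negligible range. Because the weights follow the focal-group distribution, a large difference in a sparsely populated region contributes little to the summary.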
IRT scoring
Common-items set scores
For these scores, we estimated only one set of parameters for each item and assumed no DIF existed. Using these parameters, scores were estimated for each respondent based on their responses to the nine common items.
DIF-adjusted common-items set scores
These scores account for DIF identified among the nine common items. If an item did not demonstrate DIF, only one set of parameters is estimated for the item; items with DIF are modeled separately for HRS and ELSA respondents. Scores estimated for each respondent using this set of parameters are adjusted for DIF.
DIF-adjusted all-items scores
These scores are based on parameters estimated using all available cognitive assessments from the two surveys, with separate parameters estimated for items showing DIF.
Analysis
We compared the standard error associated with score estimates at each point along the cognitive performance spectrum among the scoring methods. The size of the standard error is influenced by the discrimination and the number of items located within a region of the underlying trait. Higher discrimination and more items located at a given score reduce standard error for that score (because items measure best at their location described by the b parameter). We hypothesized that adding survey-specific items would improve measurement precision and thus lower standard error of the score estimates.
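The link between item parameters and the standard error can be illustrated with the item information function of the GRM, where the standard error at a trait level is the reciprocal of the square root of the summed item information. The sketch below is a simplified illustration with hypothetical item parameters. It ignores any contribution from a prior distribution during scoring, which is an assumption about the scoring method.

```python
import math


def item_information(theta, a, bs):
    """Fisher information of one graded response model item at theta.

    a: discrimination; bs: ordered boundary locations (one for a binary item).
    """
    # Cumulative response curves, fixed at 1 and 0 at the two ends.
    cumulative = ([1.0]
                  + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in bs]
                  + [0.0])
    # Slope of each cumulative curve; zero at the fixed ends.
    slopes = [a * c * (1.0 - c) for c in cumulative]
    info = 0.0
    for j in range(len(bs) + 1):
        prob = cumulative[j] - cumulative[j + 1]
        slope = slopes[j] - slopes[j + 1]
        if prob > 0:
            info += slope * slope / prob
    return info


def standard_error(theta, items):
    """Standard error at theta from the summed information of all items."""
    total = sum(item_information(theta, a, bs) for a, bs in items)
    return 1.0 / math.sqrt(total)


# Hypothetical parameters: one common item, plus one extra item located
# in the lower range of the trait.
common_items = [(2.0, [0.0])]
all_items = common_items + [(1.5, [-1.0])]

se_common = standard_error(-1.0, common_items)
se_all = standard_error(-1.0, all_items)
```

For a binary item the information reduces to a² P(1 − P), which peaks at the item's location. This is why adding an item located near −1.0 lowers the standard error most for respondents near that level.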
The practical impact of the different scoring strategies cannot be inferred directly from the standard error functions, as these functions do not account for the distribution of cognitive performance in the HRS and ELSA samples. Specifically, if most survey respondents are located in a region of the trait where the differences in standard errors by scoring method are small, minimal differences in the standard error for the overall sample would be observed. To examine overall impact, we compared, for each sample, the average standard errors across the different scoring methods.
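The dependence of the sample-level result on the trait distribution can be shown with a small numerical example. The standard error curves and samples below are entirely hypothetical and chosen only to show how the ranking of two scoring methods can reverse with the sample's distribution.

```python
def average_standard_error(thetas, se_function):
    """Mean standard error over the trait levels of a sample."""
    return sum(se_function(theta) for theta in thetas) / len(thetas)


def se_method_a(theta):
    """Hypothetical method that is more precise at lower trait levels."""
    return 0.40 if theta < 0 else 0.60


def se_method_b(theta):
    """Hypothetical method that is more precise at higher trait levels."""
    return 0.50 if theta < 0 else 0.45


# A sample concentrated at lower trait levels, and one concentrated higher.
mostly_low = [-1.0] * 8 + [1.0] * 2
mostly_high = [-1.0] * 2 + [1.0] * 8

low_a = average_standard_error(mostly_low, se_method_a)
low_b = average_standard_error(mostly_low, se_method_b)
high_a = average_standard_error(mostly_high, se_method_a)
high_b = average_standard_error(mostly_high, se_method_b)
```

In the sample concentrated at lower levels, method A has the smaller average (0.44 versus 0.49). In the sample concentrated at higher levels, the ordering reverses (0.56 versus 0.46), even though the two curves are unchanged.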
Results
Age and gender distribution of HRS and ELSA respondents were similar, although a slightly higher percentage of HRS respondents were aged 85 and above and female. Differences in education were more pronounced, with HRS respondents having greater educational attainment than ELSA respondents (Table 1).
Unidimensionality
Table 2 presents the results from the exploratory factor analysis. The goal is to determine whether sufficient unidimensionality exists for valid IRT modeling. For both HRS and ELSA, loadings on the first factor were ≥ 0.40 for all items. In addition, the proportion of variance explained by the first factor was 70% for HRS and 80% for ELSA and the ratio of the eigenvalue of the first factor to the second factor was 5.32 and 4.93. These results all exceed suggested criteria (McHorney & Cohen, 2000; Reckase, 1979; Reeve et al., 2007) and indicate that IRT modeling for these items is appropriate.
Dimensionality of Cognitive Items in HRS and ELSA.
Note. HRS = Health and Retirement Study; ELSA = English Longitudinal Study of Aging; WAIS = Wechsler Adult Intelligence Scale.
Factor retained (eigenvalue > 1.0); exploratory principal factor analysis.
Item Parameters and DIF Findings
Table 3 presents the discrimination (a) and location (b) parameters for items that were unique to each survey or that did not show DIF across surveys. Item discrimination varied, ranging from 0.70 (WAIS vocabulary) to 3.08 (serial subtraction by 7: fourth task). Furthermore, most of the location parameters are negative. This indicates that measurement of cognitive performance in these surveys is generally better in the more impaired range, because easier items have location parameters of lower numerical value than harder items. Finally, 16 items were fielded solely in the HRS. Therefore, we expect measurement precision to be greater for this sample given the information contributed by these extra items.
Item Parameters for Survey-Specific Items and Items That Show No DIF.
Note. A (column 3) refers to item discrimination; a higher value of A reflects a stronger relationship of the item to cognitive performance. B (columns 4-6) indicates the item location, or where the item (or item category) measures best on the cognitive performance trait; for example, a lower B value indicates that the item taps an “easier” functioning task. DIF = differential item functioning; HRS = Health and Retirement Study; ELSA = English Longitudinal Study of Aging.
“Wechsler Adult Intelligence Scale” (word from Word List 1/word from Word List 2). Word List 1 or 2 randomly assigned to respondent.
The pattern of location parameters observed is as expected. For example, in the serial subtraction tasks (in which the respondent is asked to subtract 7 from 100 and continue to subtract 7 from the prior answer), the easiest task is the first subtraction (b = −0.99), and the hardest is the last subtraction (b = −0.18). Naming the president was easier (b = −2.44) than naming the vice president (b = −0.73). Counting backward from 20 is also easier (b = −2.13) than counting back from 86 (b = −1.43).
Of the nine common items, three demonstrated DIF across the two surveys (Table 4). Day of date and delayed word recall demonstrated discrimination and location DIF, while the immediate word recall demonstrated DIF only for location. For all three items, the location parameter values were higher (less negative or more positive) for ELSA respondents suggesting that these items are more challenging for ELSA respondents compared with HRS respondents with the same level of cognitive performance.
DIF in Common Cognitive Items From the Final All-Items Model.
Note. A (column 4) refers to item discrimination; a higher value of A reflects a stronger relationship of the item to cognitive performance. B (columns 5-11) indicates the item location, or where the item (or item category) measures best on the cognitive performance trait; for example, a lower B value indicates that the item taps an “easier” functioning task. DIF = differential item functioning; HRS = Health and Retirement Study; ELSA = English Longitudinal Study of Aging.
Score Comparisons
Figure 1 presents the standard error functions for the three scoring methods for both a unidimensional model and a bifactor model. Compared with the two common-items scores, the scores based on all available items had consistently smaller standard errors (or greater measurement precision), with the greatest difference at the lower end of cognitive performance between 0 and −2.5, the region of the trait where most of the survey-specific items are located (Table 3). The standard error functions were comparable for the two common-items scores, although the two scores alternated having higher standard errors across different cognitive performance levels. The standard error patterns for the three scores were generally similar in the bifactor model, suggesting that excess covariation among numeracy items had limited influence on the measurement precision results.

Standard error by scoring method and IRT model.
Impact of DIF Adjustment and Including All Items on Measurement Precision
The average standard errors of the three scores for HRS and ELSA respondents are presented in Table 5. For HRS respondents, average standard errors become progressively smaller from the common-items scores to the DIF-adjusted common-items scores to the DIF-adjusted all-items scores. In contrast, the average standard error was modestly smaller for the common-items scores compared with the other two DIF-adjusted scores for ELSA respondents. Most results were comparable whether the unidimensional or bifactor model was used, except for the all-items DIF-adjusted score in ELSA respondents. For this score, average standard errors were slightly larger than those for the common-items score with the unidimensional model but were modestly smaller when the bifactor model was used.
Average Standard Errors for HRS and ELSA Respondents by Scoring Approaches.
Note. HRS = Health and Retirement Study; ELSA = English Longitudinal Study of Aging; DIF = differential item functioning.
The difference in the behavior of the three scores between HRS and ELSA samples appears to be related to the number and location of survey-specific items, the shifts in item location resulting from DIF adjustment, and the underlying distribution of cognitive performance of each sample (see Figure 2). In the ELSA sample, DIF adjustment shifted item locations higher for the common items, resulting in higher standard errors at lower cognitive performance levels but lower standard errors at higher cognitive performance levels for the two DIF-adjusted scores. Furthermore, the four ELSA-specific items primarily contribute to measurement precision at higher cognitive performance levels (Table 3). How DIF adjustment and addition of ELSA-specific items affect measurement precision depends on the level of cognitive performance. ELSA respondents in Region A of the cognitive performance trait, shown in Figure 2, have the smallest average standard errors with the common-items scores, modestly larger standard errors for all-items DIF-adjusted scores, and the largest standard errors for the common-items DIF-adjusted scores. The situation differs for ELSA respondents with higher cognitive performance in Region B of the trait, for whom the all-items DIF-adjusted scores produced the smallest standard errors. As a large proportion of ELSA respondents are at lower levels of cognitive performance where the common-items score had the smallest standard errors, the average standard errors for the ELSA sample were lowest for these scores (Table 5).

Cognitive performance trait distribution effects on sample standard error.
Although the pattern of standard error functions for HRS respondents differs, the same issues appear to influence the average standard errors for the HRS sample reported in Table 5. The two common-items scores were generally similar, although the DIF-adjusted scores had slightly smaller standard errors. However, the addition of HRS-specific items substantially reduced standard errors for the all-items DIF-adjusted scores, particularly at lower levels of cognitive performance (trait <0.0), consistent with the item locations of most HRS-specific items (Table 3). From Figure 2, it is apparent that the practical impact of different scores again depends on the region of the trait HRS respondents are located in. For HRS respondents located approximately at −1.0 on the trait (Region A), the all-items DIF-adjusted scores had substantially smaller standard errors than the common-item scores (with and without DIF adjustment). In contrast, the standard errors for all three scores were very similar for HRS respondents located near 0.5 (Region B). A large proportion of HRS respondents had trait levels <0.5 where the all-items DIF-adjusted scores performed best. These findings from Figure 2 appear to account for the pattern of average standard errors for the HRS sample in Table 5.
Discussion
Our findings demonstrated that IRT methods can be effectively utilized for harmonizing cognitive performance assessments across two major international surveys. Our study provided insight into the effects of this strategy on score comparability and measurement precision. First, differential item functioning was observed for cognitive measures, highlighting the often unrecognized role that measurement non-equivalence can play in international group comparisons, even when common test items are used. In particular, substantial DIF was found for the two word recall items, which have been included in a number of cross-country comparisons of cognitive performance (Oksuzyan et al., 2010; Rohwedder & Willis, 2010; Skirbekk, Loichinger, & Weber, 2012). Assuming measurement equivalence when the same items or set of items are used can conflate true group differences with measurement artifacts in how groups respond to measures. Adjusting for DIF has different implications for measurement precision, depending on the level of cognitive performance respondents possess. However, accounting for differentially functioning items before pooling HRS and ELSA data for comparative studies is still important to ensure score validity.
Second, our findings showed that the number and item location of available survey-specific items interact with the distribution of the underlying trait to influence the effect of using all available measures for estimating scores. For example, in the HRS, adding the 16 survey-specific items greatly improved measurement in the lower range of the cognitive performance trait, where the location parameters for these items indicate these items are most informative. Because nearly half of the HRS sample in our study were in the lower cognitive performance levels, using the all-items DIF-adjusted scores substantially lowered the average standard error for the entire sample. This resulted in greater overall measurement precision for HRS respondents. In contrast, ELSA contributed only four items with location parameters in the upper range of the cognitive performance trait. Because only a small proportion of ELSA respondents were in this region, adding these items did not reduce the average standard errors for the overall sample compared with using the common-items scores. In fact, location parameters for the three DIF items shifted toward higher cognitive performance (i.e., these items were more difficult for the ELSA respondents). This improved measurement at upper cognitive performance levels, while reducing measurement precision at lower cognitive levels. However, because most ELSA respondents were in the lower cognitive performance regions, the average standard errors for the DIF-adjusted common-items scores were higher than those for the common-items scores without DIF adjustment. The situation was similar for the all-items DIF-adjusted scores in the ELSA sample. The addition of the four ELSA-specific items improved measurement at the upper cognitive performance region. However, as only a small proportion of ELSA respondents have cognitive performance at the higher levels where the ELSA-specific items performed best, average standard errors for the all-items DIF-adjusted scores were similar to those for the DIF-adjusted common-items scores. Regardless of the average standard error of scores observed for each sample, the DIF adjustment correctly apportions error to appropriate parts of the range.
The consequence of DIF adjustment and adding survey-specific items will differ for individuals at different levels of the trait even within the same sample. For HRS respondents, measurement precision for individuals at the lower end of cognitive performance would be substantially improved with the addition of the HRS-specific items, while scores for HRS respondents with high cognitive performance levels would not differ much regardless of which score was used. In contrast, although using the four ELSA-specific items would improve score precision for ELSA respondents with higher cognitive performance, it would not have the same effect for ELSA respondents at lower cognitive performance levels.
Our IRT analysis also offers insight into the measurement properties of cognitive items included in two major surveys on aging. For the most part, the common items demonstrated relatively good discrimination. However, both common and survey-specific items provided the most information in the lower range of cognitive performance. Few cognitive performance measures assessed the less impaired range of functioning (e.g., items that only very high-functioning individuals would answer correctly). This suggests that the measures being fielded by these surveys do better at discriminating among individuals at lower cognitive performance levels than among individuals at higher cognitive performance levels. From a clinical or policy perspective, it may be appropriate to ensure good measurement properties for individuals with more severe cognitive impairment. However, the current set of measures is less useful for studying individuals with good to excellent cognition.
Our study has several limitations. First, findings are applicable only for the cognitive items from the surveys examined in the study. The practical value of the three scoring strategies may change under different sample distributions and balance of common and survey-specific items with different item parameters. Adding more difficult items, which discriminate among people at less impaired levels of cognitive performance, for example, could improve the precision of the scale among community-living persons. Second, although evidence of differential item functioning suggests that pooling data for the common cognitive items from HRS and ELSA without accounting for DIF would produce non-comparable scores, our analysis does not provide an explanation for the observed DIF. Specifically, although the DIF observed suggests response differences between U.S. and U.K. populations, whether the cause is cultural or due to methodological differences cannot be determined. Finally, to improve our ability to link across the two surveys, we analyzed a broad set of cognitive assessments, including several numeracy measures, as a unidimensional trait. Therefore, the latent trait in our study may not reflect the more complex structure of cognitive performance reported in other studies (Herzog & Wallace, 1997; McArdle, Fisher, & Kadlec, 2007). However, our analyses indicate that the items we used have sufficient unidimensionality for IRT modeling and are adequate for investigating the value of an IRT approach to linking cognitive measures across surveys. Furthermore, findings from the bifactor model that account for the numeracy items and the unidimensional model were similar, suggesting that dimensionality issues did not substantially affect study findings and conclusions.
International surveys represent an important opportunity for cross-national investigations into issues associated with aging and cognitive performance. Our study demonstrated the feasibility and value of using IRT methods to improve comparability and precision of cognitive performance scores by accounting for DIF and utilizing all measures available in each survey. This approach may be useful for harmonizing cognitive measures across other nationally and internationally representative surveys, including surveys conducted in other languages, such as the Mexican Health and Aging Study (MHAS), one of the sister surveys of the HRS. Future investigations should also examine the contexts, such as the number of common items and the distribution of common and unique items, where these methods would produce the greatest measurement benefits. However, our study provides the methodological underpinnings for conducting valid cross-national research on cognitive performance and the needs of the cognitively impaired.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the National Institute on Aging (Grant AG032502).
