Abstract
Background:
Cognitive test-retest reliability measures can be used to evaluate meaningful changes in scores.
Objective:
This analysis aimed to develop a comprehensive set of test-retest reliability values and minimal detectable change (MDC) values for a cognitive battery for community-dwelling older individuals in Australia and the U.S., for use in clinical practice.
Methods:
Cognitive scores collected at baseline and year 1 in the ASPirin in Reducing Events in the Elderly (ASPREE) clinical trial were used to calculate intraclass correlation coefficients (ICCs) for four tests: Modified Mini-Mental State examination (3MS), Hopkins Verbal Learning Test-Revised (HVLT-R), single-letter Controlled Oral Word Association Test (COWAT-F), and Symbol Digit Modalities Test (SDMT). A total of 16,956 participants aged 70 years and over (65 years and over for U.S. minorities) were included. ICCs were used to calculate MDC values for eight education and ethno-racial subgroups.
Results:
All four cognitive tests had moderate (ICC > 0.5) to good (ICC > 0.7) test-retest reliability. ICCs ranged from 0.53 to 0.63 (3MS), 0.68 to 0.77 (SDMT), 0.56 to 0.64 (COWAT-F), 0.57 to 0.69 (HVLT-R total recall), and 0.57 to 0.70 (HVLT-R delayed recall) across the subgroups. MDC values ranged from 6.60 to 9.95 (3MS), 12.42 to 15.61 (SDMT), 6.34 to 8.34 (COWAT-F), 8.13 to 10.85 (HVLT-R total recall), and 4.00 to 5.62 (HVLT-R delayed recall).
Conclusion:
This large cohort of older individuals provides test-retest reliability and MDC values for four widely employed tests of cognitive function. These results can aid interpretation of cognitive scores and decline instead of relying on cross-sectional normative data alone.
INTRODUCTION
Established cognitive tests are important diagnostic and prognostic tools for clinicians to assess cognitive function in patients with neurodegenerative disease across different cognitive domains, such as verbal anterograde episodic memory, language, executive function, or psychomotor speed. Scores from a single time point can be measured relative to age, gender, or education-adjusted norms and may be adequate for diagnosis of mild to moderate disease. However, the demonstration of cognitive decline over multiple clinic visits is a more revealing outcome, particularly if serial testing began from a high baseline. The ability to differentiate between a meaningful change in test scores on repeated administration and the typical variability in test performance seen in cognitively healthy individuals is essential for detection of such mild to very mild incident disease. To this end, test-retest reliability measures and minimal detectable change (MDC) scores can be used to evaluate the significance of changes in scores.
For the evaluation of cognitive decline and potential dementia in older individuals, it is relevant to explore the test-retest reliability and MDC scores of a number of widely used cognitive function tests, including the Modified Mini-Mental State examination (3MS), Symbol Digit Modalities Test (SDMT), Hopkins Verbal Learning Test-Revised (HVLT-R), and single-letter Controlled Oral Word Association Test (COWAT-F). Previous studies have assessed the variation in scores for a single cognitive test and are consequently somewhat limited in scope and clinical utility. Moreover, they have usually focused on specific populations, such as patients with multiple sclerosis or a recent history of stroke [1, 2]. Furthermore, such studies have often been based on small sample sizes (n < 50) and/or employed test-retest intervals that were either shorter (days to weeks) or longer (several years) than the interval of approximately 6 to 18 months that commonly occurs in the real-world setting. Although these small, brief studies suggest moderate to good test-retest reliability for the tests examined in this study (Pearson’s correlation coefficient or intraclass correlation coefficient (ICC) greater than 0.7 for all tests), these findings are not directly applicable to clinical practice [3–5]. Their clinical utility is further limited by the absence of MDC scores, which complement the reliability measures. MDC scores provide a threshold to distinguish random change from real change in a participant’s cognitive function, rather than describing the reliability of a test in general [6].
Because of sample size limitations, previous test-retest reliability studies have not conducted subgroup analyses to account for factors that are likely to influence test-retest reliability. Normative data generated from the ASPirin in Reducing Events in the Elderly (ASPREE) clinical trial have shown that 3MS, HVLT-R, and SDMT scores vary across gender, education, age, and ethno-racial group [7–9]. Other studies examining individual trajectories of cognitive decline also found differences related to age and race [10, 11]. Normative data for COWAT-3 letter (FAS) show differences by gender, education, and age [12]. Furthermore, many of these studies report Pearson correlation coefficients as their measure of reliability rather than ICCs. Pearson coefficients ignore systematic bias between time points and only measure how well the relationship can be described in linear terms [13, 14]. Additional factors that can influence test-retest reliability include the effects of learning (practice effects) and variations in test administration [15].
The ASPREE clinical trial provided an opportunity to assess the test-retest reliability of four commonly employed cognitive tests, each assessing different domains of cognition, and to overcome some of the limitations of prior studies. The large cohort of 19,114 initially healthy community-dwelling older people in Australia and the U.S. underwent repeat administration of a cognitive assessment battery allowing for assessment of test-retest reliability over a 1-year interval, a time frame that is both clinically relevant and helps reduce any learning effects [5]. The bi-national study’s size and recruitment also allowed for stratification based on ethno-racial category and education level. Moreover, the administration of the tests within the context of a clinical trial ensured a standardized procedure, reducing variation between each visit’s conduct.
The aim of this analysis was to determine the test-retest reliability and MDC for four commonly used cognitive tests in community-dwelling older individuals over a one-year period across race and ethnicity, gender, and education subgroups. The results will help clinicians interpret observed cognitive scores in individual patients by better identifying true decline against a background of expected variation.
METHODS
Study design and participants
This study utilized data collected as part of the ASPREE clinical trial, a randomized double-blinded placebo-controlled trial that assessed the effects of low dose (100 mg) daily aspirin in community-dwelling people aged 70 years or over (or 65 years and over for U.S. African-American and Hispanic populations). The trial was conducted in accordance with the principles of the Declaration of Helsinki and approved by local institutional review boards at each study site, with all participants providing written informed consent. Men and women were recruited through general practices in Australia and through academic and clinical centers in the U.S. Key exclusion criteria included: diagnosed dementia, a score below 78 on the 3MS examination at baseline, evidence of cardiovascular disease, major physical disability, and serious illness likely to cause death within the next five years. A total of 19,114 participants across Australia (n = 16,703) and the U.S. (n = 2,411) were recruited and assessed at baseline visits between 2010 and 2014, then completed annual in-person visits and quarterly telephone calls. Participants were followed for a median of 4.7 years. Full details describing the study design and sampling procedure, including the full exclusion criteria, have been previously published [16].
The baseline cognitive testing, described in more detail under Assessments and measures below, consisted of the 3MS at an initial screening visit followed by other tests (SDMT, HVLT-R, and COWAT-F) at an eligibility visit 4 weeks later. Unlike the 3MS scores at the screening visit, performance levels on the other cognitive tests were not potentially exclusionary at the eligibility visit. The four cognitive tests were then re-administered at study visits at years 1, 3, 5, and a close-out visit. All cognitive tests were conducted by accredited staff during face-to-face assessments that occurred at general practices, community venues, clinical trial study sites, or at the participant’s home. Staff were instructed in Standard Operating Procedures by senior research staff, including two of the authors (SGO and ES, a behavioral neurologist). ES then assessed their performance on mock administration and accredited those who reached a satisfactory standard with no major errors in administration or scoring. Those who did not meet this standard were re-trained in the problem aspects of administration and/or scoring by senior research staff. Research staff were re-accredited annually to ensure data collection consistency and quality [16]. Questions or concerns regarding difficult scoring (e.g., of distorted pentagons in the 3MS) were decided by one of the senior staff or authors (ES; AM; SGO).
Cognitive test scores from baseline and the first annual visit (year 1) were used for this reliability analysis. From the original 19,114 participants in the ASPREE study, participants were excluded if they did not complete the four cognitive tests or depression screening at both baseline and year 1 (n = 1,748), did not fit into one of four pre-specified country and race/ethnicity subgroups (n = 364), or completed cognitive testing in Spanish (n = 46). The analysis therefore included 16,956 participants.
Assessments and measures
Test-retest reliabilities were estimated in eight subgroups based on the cross-classification of country, race and ethnicity (White Australian, White U.S., African American U.S., and Hispanic/Latino U.S.) by education level (≤12 years and >12 years). The four cognitive tests were as follows:
3MS, to assess global cognitive function. This is a commonly used screening tool for cognitive impairment and dementia [17]. It consists of 34 questions and tasks that cover a number of cognitive domains including verbal recall and fluency, language, visual construction, and attention. The total score can range from 0 to 100, with participants in ASPREE required to have a score of 78 or higher at baseline [7].
SDMT, to measure psychomotor speed [18]. This is a substitution task which involves writing down the corresponding number for each of a random sequence of 10 symbols using a coding key [8]. An individual’s score reflects the number of correct substitutions within 90 s. Version 2 was used with the author’s permission [19].
COWAT-F, to assess language and executive function [20]. Over 60 s, participants were required to generate as many words beginning with the letter ‘F’ as they could spontaneously, excluding proper nouns and words generated by addition of the suffixes ‘-ed’ or ‘-ing’. A single letter rather than the usual 3 letters (‘FAS’ or ‘CFL’) was used to minimize time and expense. There is a high internal consistency between the individual letters [21].
HVLT-R (PAR Inc., Lutz, FL), to assess verbal learning and memory [22]. Participants were read a list of 12 nouns at 2-s intervals from three semantic categories (four words from each) and then asked to recall them freely. The trial was then repeated twice to yield the total immediate recall score, which ranged from 0 to 36. A delayed recall component was administered after a short delay of 20–25 min, during which other (non-verbal) tasks were completed, providing a measure of episodic memory (scores ranging from 0 to 12) [8]. A recognition task was then administered, but unlike the total and delayed recall measures, the six different available recognition forms were not fully equivalent. Form 6 was used throughout, and only the total recall and delayed recall were analyzed for this study [22].
Analysis
Intraclass correlation coefficients, ICC(A,1), were calculated using a two-way mixed effects model with an absolute agreement type [23]. ICC values were interpreted qualitatively using published cut-offs [24]. The minimal detectable change (MDC) values were calculated for each of the measures using the following formulae [6, 25]:
SEM = SD × √(1 − r)
MDC = 1.96 × √2 × SEM
where
SEM = the standard error of measurement;
SD = the standard deviation of the test (SD of baseline test scores used for this analysis); and
r = the reliability coefficient of the test (ICC used for this analysis).
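As an illustration of the calculation described above, the following minimal sketch computes ICC(A,1) from paired scores using the standard mean-squares formula for a two-way, absolute-agreement, single-measurement ICC, and then derives SEM and MDC. It assumes the conventional 95% MDC formula (1.96 × √2 × SEM). The function names and the example values (SD = 4.5, scores 1 to 6) are hypothetical and are not taken from the ASPREE data or code.

```python
import math


def icc_a1(baseline, followup):
    """ICC(A,1): two-way model, absolute agreement, single measurement.

    Uses the mean squares for participants (rows), occasions (columns)
    and residual error, with k = 2 occasions.
    """
    if len(baseline) != len(followup) or len(baseline) < 2:
        raise ValueError("need two equal-length score lists with n >= 2")
    n, k = len(baseline), 2
    rows = list(zip(baseline, followup))
    grand = (sum(baseline) + sum(followup)) / (n * k)
    col_means = [sum(baseline) / n, sum(followup) / n]

    ss_rows = k * sum((sum(r) / k - grand) ** 2 for r in rows)
    ss_cols = n * sum((c - grand) ** 2 for c in col_means)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-participant mean square
    msc = ss_cols / (k - 1)               # between-occasion mean square
    mse = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def sem(sd, r):
    """Standard error of measurement from baseline SD and reliability r."""
    return sd * math.sqrt(1 - r)


def mdc95(sd, r):
    """Minimal detectable change at the 95% level (assumed conventional form)."""
    return 1.96 * math.sqrt(2) * sem(sd, r)


# Hypothetical scores: every participant improves by exactly 1 point.
# Pearson's r would be 1.0, but the systematic shift lowers ICC(A,1).
print(round(icc_a1([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]), 3))  # 0.833
```

Because the absolute-agreement ICC penalizes a systematic shift between visits, it is a stricter measure than Pearson's correlation for data in which practice effects are expected.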
Paired t-tests and effect size measures (Cohen’s d) were used to assess systematic bias between baseline and year 1 measures, using the following formula:
d = (M1 − M0) / s_pooled
where
M1 = mean of test scores at year 1 visit;
M0 = mean of test scores at baseline visit; and
s_pooled = pooled standard deviation of scores at baseline and year 1 visits.
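The effect size and paired t statistic described above can be sketched as follows. This is an illustration only: the pooled standard deviation is assumed here to be the square root of the average of the two visit variances (the usual form when both visits have the same number of participants), which the text does not specify, and the function names and scores are hypothetical.

```python
import math
from statistics import mean, stdev


def cohens_d(baseline, year1):
    """d = (M1 - M0) / s_pooled; positive values indicate higher year 1 scores.

    Assumption: s_pooled = sqrt((s0^2 + s1^2) / 2).
    """
    s_pooled = math.sqrt((stdev(baseline) ** 2 + stdev(year1) ** 2) / 2)
    return (mean(year1) - mean(baseline)) / s_pooled


def paired_t(baseline, year1):
    """Paired t statistic and degrees of freedom for year 1 minus baseline."""
    diffs = [y - b for b, y in zip(baseline, year1)]
    t = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical scores for four participants.
base = [10, 12, 14, 16]
yr1 = [11, 14, 15, 18]
print(cohens_d(base, yr1), paired_t(base, yr1))
```

Converting the t statistic to a p-value requires the t distribution, which is not in the Python standard library; a statistics package (for example SciPy's `ttest_rel`) would normally be used for that step.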
Bland-Altman plots were used to visualize agreement between the measures taken at the two time points. The Bland-Altman limits of agreement were obtained by first calculating within-participant differences between baseline and year 1 scores, and then calculating the limits as the mean difference ± 1.96 times the standard deviation of the differences [26].
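A minimal sketch of the limits-of-agreement calculation described above is given below. The direction of the difference (year 1 minus baseline) is an assumption, as are the function name and the example scores.

```python
from statistics import mean, stdev


def bland_altman_limits(baseline, year1):
    """Return (lower limit, mean difference, upper limit).

    Differences are taken as year 1 minus baseline (assumed direction);
    limits are the mean difference +/- 1.96 SD of the differences.
    """
    diffs = [y - b for b, y in zip(baseline, year1)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return bias - half_width, bias, bias + half_width


# Hypothetical scores for four participants.
lower, bias, upper = bland_altman_limits([10, 12, 14, 16], [11, 14, 15, 18])
print(lower, bias, upper)
```

In a Bland-Altman plot, each participant's difference is plotted against the average of their two scores, with horizontal lines drawn at the three values returned here.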
A sensitivity analysis was performed to assess the effect of depression and dementia on the results. Dementia was adjudicated by consensus of an expert committee of geriatricians, neurologists, and neuropsychologists, from both the U.S. and Australia, according to DSM-IV criteria [27]. Participants who reached an adjudicated dementia endpoint within the first 3.25 years of follow-up in the ASPREE trial were excluded from the sample for the sensitivity analysis; the time frame was set to allow for flexibility in the precise timing of the year 3 follow-up visit [28]. Participants who had elevated depression scores, defined as ≥10 on the Center for Epidemiologic Studies Depression Scale (CES-D-10) at baseline and/or year 1, were also excluded for the sensitivity analysis, as depression can confound performance on cognitive tests [29]. Overall, 2,207 participants were excluded for the sensitivity analysis (n = 232 with dementia, n = 1,936 with depression, and n = 39 with dementia and depression). Additionally, analyses were performed for men and women separately to assess gender differences.
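The exclusion rule for the sensitivity analysis can be expressed as a simple filter. The sketch below is illustrative only: the `Participant` record and its field names are hypothetical, and treating "within the first 3.25 years" as an inclusive cut-off is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Participant:
    # Hypothetical field names, not the ASPREE data dictionary.
    cesd10_baseline: int
    cesd10_year1: int
    years_to_dementia: Optional[float]  # None if no adjudicated dementia endpoint


def keep_for_sensitivity(p, cesd_cutoff=10, dementia_window_years=3.25):
    """True if the participant stays in the sensitivity-analysis sample."""
    depressed = (p.cesd10_baseline >= cesd_cutoff
                 or p.cesd10_year1 >= cesd_cutoff)
    early_dementia = (p.years_to_dementia is not None
                      and p.years_to_dementia <= dementia_window_years)  # inclusive: assumption
    return not (depressed or early_dementia)


cohort = [
    Participant(3, 4, None),   # retained
    Participant(10, 2, None),  # excluded: elevated CES-D-10 at baseline
    Participant(1, 1, 2.0),    # excluded: dementia within the window
    Participant(1, 1, 4.0),    # retained: dementia after the window
]
retained = [p for p in cohort if keep_for_sensitivity(p)]
print(len(retained))  # 2
```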
Analyses were performed in R version 4.0.2 (R Core Team, 2020) and Stata version 16 (Stata Corporation; College Station, TX).
RESULTS
Demographic characteristics of the 16,956 participants included in the analysis are presented in Table 1. Mean values, standard deviations, ICCs, and effect sizes are presented in Tables 2 through 6 for each cognitive test. For each test, ICCs and MDC values were calculated for all included participants and within the eight specified subgroups, determined by country, race/ethnicity, and education level. Figure 1 presents Bland-Altman plots for each test along with 95% limits of agreement.
Table 1. Description of baseline demographics and comorbidities in 16,956 ASPREE participants
SD, standard deviation; n, number. *Individuals excluded in sensitivity analyses. **3MS limited to > 77 by ASPREE inclusion criteria at baseline.
Table 2. Summary statistics and 1-year test-retest reliability measures for Modified Mini-Mental State Examination (3MS)
SD, standard deviation; ICC, intraclass correlation coefficient; SEM, standard error of measurement; MDC, minimal detectable change. *From a paired t-test comparing Baseline and Year 1 scores.

Figure 1. Bland-Altman plots and limits of agreement for cognitive test scores at baseline and year 1 visits. All cognitive scores are integer values, which would result in overplotting; random noise (uniformly distributed in X-Y coordinates, with a maximum of ±0.4) has therefore been added to better show frequency at given average/difference combinations. 3MS, Modified Mini-Mental State examination; SDMT, Symbol Digit Modalities Test; COWAT-F, single-letter Controlled Oral Word Association Test; HVLT-R, Hopkins Verbal Learning Test-Revised. Solid line is the mean difference. Dashed lines are the 95% limits of agreement.
3MS
A summary of the baseline and year 1 3MS test scores is presented in Table 2. The high mean and small standard deviation at baseline reflect the inclusion criteria for ASPREE, requiring participants to have a score of at least 78/100 to be eligible. For change from baseline to year 1, the overall ICC value was 0.64 with a corresponding MDC value of 7.53. Within the defined subgroups, the ICC values ranged from 0.53 to 0.63, while the MDC values ranged from 6.60 to 9.95. Paired t-tests showed a statistically significant increase between the baseline and year 1 assessments in all subgroups except White U.S./≤12 Years Education (effect size of 0.09) and African American/≤12 Years Education (effect size of –0.03). The effect sizes in the other six subgroups ranged from +0.10 to +0.40.
SDMT
A summary of the baseline and year 1 SDMT scores is presented in Table 3. The overall ICC value was 0.78 with a corresponding MDC value of 13.14. Within the subgroups, the ICC values ranged from 0.68 to 0.77, while the MDC values ranged from 12.42 to 15.61. Paired t-tests showed a statistically significant decrease between baseline and year 1 only in the White U.S./> 12 years Education subgroup, with an effect size of –0.05. The effect sizes for all other subgroups ranged from –0.03 to +0.06.
Table 3. Summary statistics and 1-year test-retest reliability measures for Symbol Digit Modalities Test (SDMT)
SD, standard deviation; ICC, intraclass correlation coefficient; SEM, standard error of measurement; MDC, minimal detectable change. *From a paired t-test comparing Baseline and Year 1 scores.
COWAT-F
A summary of the baseline and year 1 COWAT-F scores is presented in Table 4. The overall ICC value was 0.63 with a corresponding MDC value of 7.72. Within the subgroups, the ICC values ranged from 0.56 to 0.64, while the MDC values ranged from 6.34 to 8.34. Paired t-tests showed a statistically significant increase between baseline and year 1 in all subgroups. The effect sizes ranged from +0.13 to +0.36.
Table 4. Summary statistics and 1-year test-retest reliability measures for single-letter Controlled Oral Word Association Test (COWAT-F)
SD, standard deviation; ICC, intraclass correlation coefficient; SEM, standard error of measurement; MDC, minimal detectable change. *From a paired t-test comparing Baseline and Year 1 scores.
HVLT-R, total recall
A summary of the baseline and year 1 HVLT-R total recall scores is presented in Table 5. The overall ICC value was 0.68 with a corresponding MDC value of 8.66. Within the subgroups, the ICC values ranged from 0.57 to 0.69, while the MDC values ranged from 8.13 to 10.85. Paired t-tests showed a statistically significant increase between baseline and year 1 in all subgroups except African American/≤12 years Education (effect size was –0.07). The effect sizes for the other seven subgroups ranged from +0.12 to +0.29.
Table 5. Summary statistics and 1-year test-retest reliability measures for Hopkins Verbal Learning Test-Revised (HVLT-R), total recall score
SD, standard deviation; ICC, intraclass correlation coefficient; SEM, standard error of measurement; MDC, minimal detectable change. *From a paired t-test comparing Baseline and Year 1 scores.
HVLT-R, delayed recall
A summary of the baseline and year 1 HVLT-R delayed recall scores is presented in Table 6. The overall ICC value was 0.70 with a corresponding MDC value of 4.23. Within the subgroups, the ICC values ranged from 0.57 to 0.70, while the MDC values ranged from 4.00 to 5.62. Paired t-tests showed a statistically significant increase between baseline and year 1 in all subgroups. The effect sizes ranged from –0.12 to +0.33.
Table 6. Summary statistics and 1-year test-retest reliability measures for Hopkins Verbal Learning Test-Revised (HVLT-R), delayed recall score
SD, standard deviation; ICC, intraclass correlation coefficient; SEM, standard error of measurement; MDC, minimal detectable change. *From a paired t-test comparing Baseline and Year 1 scores.
Results from the sensitivity analyses, which excluded participants with dementia diagnoses or depression, were similar to the primary analyses, with ICC values ranging from 0.52 to 0.66 (3MS), 0.66 to 0.77 (SDMT), 0.59 to 0.63 (COWAT-F), 0.51 to 0.69 (HVLT-R total recall), and 0.56 to 0.70 (HVLT-R delayed recall). Detailed results from the sensitivity analyses are presented in Supplementary Tables 1–5. Similar patterns were also observed when analyses were performed separately in men and women (see Supplementary Tables 6–10).
DISCUSSION
The purpose of these analyses was to develop a comprehensive set of test-retest reliability measures and minimal detectable change scores for four commonly used cognitive assessments across race/ethnicity and education subgroups, administered 1 year apart. Each of the cognitive assessments evaluated is used to study a different domain: 3MS, to assess global cognitive function; SDMT, to measure psychomotor speed; COWAT-F, to assess language and executive function; and HVLT-R, to assess verbal learning and memory (both immediate and delayed/episodic). All test-retest measures were calculated in the same population of community-dwelling adults aged 70 years and over (65 years and over for U.S. minorities), using the one-year interval between baseline and the year 1 annual visit in the ASPREE clinical trial. The ICC values for the five cognitive test scores and sub-scores ranged from 0.51 to 0.77 across the different scores and subgroups, indicating moderate to good test-retest reliability for all cognitive tests assessed. The highest ICC values, indicating the highest test-retest reliability, were seen for SDMT (0.68 to 0.78) and HVLT-R delayed recall (0.57 to 0.70). Within the subgroups, White Australian and White U.S. populations had higher ICC values than U.S. African American or Hispanic/Latino populations on every cognitive test except the SDMT.
The ICC values were used for the calculation of MDC values, which can serve as a reference to assess true change beyond the expected variability. A difference in score between two time points that exceeds the MDC indicates real change, rather than expected variation, and should prompt further examination of cognitive decline even if the scores remain above defined threshold values. MDC values for all four tests varied within the subgroups, providing further motivation for the subgroup analyses and allowing for more specific MDC values to be used to identify incident cognitive impairment.
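To make the use of these thresholds concrete, the following sketch compares an individual's change in score against the overall MDC values reported in the Results. It is an illustration rather than a clinical tool: the function name, labels, and example scores are hypothetical, and in practice the subgroup-specific MDC values from the tables would be the more appropriate reference.

```python
# Overall MDC values as reported in the Results section of this paper.
OVERALL_MDC = {
    "3MS": 7.53,
    "SDMT": 13.14,
    "COWAT-F": 7.72,
    "HVLT-R total recall": 8.66,
    "HVLT-R delayed recall": 4.23,
}


def classify_change(test, earlier_score, later_score, mdc_table=OVERALL_MDC):
    """Label a change in score relative to the MDC for the given test."""
    change = later_score - earlier_score
    if abs(change) <= mdc_table[test]:
        return "within expected variation"
    return "decline beyond MDC" if change < 0 else "improvement beyond MDC"


# Hypothetical patient: 3MS falls from 95 to 86 (a 9-point drop, above 7.53).
print(classify_change("3MS", 95, 86))   # decline beyond MDC
# Hypothetical patient: SDMT falls from 40 to 30 (a 10-point drop, below 13.14).
print(classify_change("SDMT", 40, 30))  # within expected variation
```

A change labelled as within expected variation is not evidence of stability; it only means the change cannot be distinguished from measurement variability at the 95% level.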
There are publications examining the test-retest reliability and MDC values of these four cognitive tests, although many report only a correlation coefficient. Two prior publications have assessed 3MS test-retest reliability in older adults (65 years and older) but did not use the same measures as this analysis, preventing direct comparison [5, 30]. Bassuk et al. calculated the test-retest reliability for a 3-year interval (range 0.9 to 4 years) in a sample of 228 individuals and reported a correlation of 0.78. The ASPREE cohort’s ICC value was 0.64, indicating lower test-retest reliability but using a much larger sample size and a more consistent time interval [30]. Tombaugh calculated reliable change index-difference scores (RCI-diff) in a sample of 160 participants without cognitive impairment at any time point from the Canadian Study of Health and Aging; the reported RCI-diff was 8.76–13.22 for a 5-year interval [5]. The comparable MDC value for the ASPREE cohort was 7.53, but over the shorter time interval of 1 year.
Test-retest analyses for the other cognitive tests have not been reported in equivalent populations to our analysis. For SDMT, a study in patients with multiple sclerosis had a correlation coefficient of 0.89 to 0.97, while a second study in stroke patients had an ICC of 0.89. Both estimates are substantially higher than the reliability seen in our analyses, but both studies were conducted in much younger, health-challenged populations over intervals of only 1 or 4 weeks [1, 2]. Similarly, for COWAT, Ross et al. reported an ICC of 0.84, but conducted the analysis in healthy undergraduate students (mean age 21 years) and with an interval of approximately 45 days. Ruff et al. reported a correlation coefficient for the COWAT of 0.74 in a sample of 120 participants with an average age of 40.5 years and an interval of 6 months. A study of HVLT-R test-retest reliability examined both total and delayed recall (along with several other domains) and reported RCIs of 8.09 and 3.47, respectively. However, the average participant age was 35 years, and the total sample size was just 41 participants, limiting its comparability to the ASPREE population [3–5, 31].
Overall, there are few studies of test-retest reliability in any large older adult population, highlighting the importance of our analyses to provide a reference set of test-retest reliability measures. For the clinician and researcher, these reliability measures can be used to monitor patients or study participants for early signs of cognitive decline which could warrant further investigation. In a patient, a difference in scores between two visits that exceeds the corresponding MDC would indicate a real difference with 95% certainty. Identifying early cognitive changes in the prodromal phase of dementia provides opportunities for symptom management and treatments to slow the progression of the disease. Beyond an individual level, MDCs can be used to determine the proportion of a population that has had a significant change, which is a useful metric in clinical studies [6].
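At the population level, the proportion of participants whose change exceeds the MDC can be computed directly from paired scores. The sketch below is illustrative; the function name and the example scores are hypothetical, and the MDC passed in (7.53) is the overall 3MS value reported in the Results.

```python
def proportion_changed(pairs, mdc):
    """Share of participants whose score declined or improved by more than the MDC.

    pairs: list of (baseline_score, follow_up_score) tuples.
    """
    n = len(pairs)
    declined = sum(1 for base, follow in pairs if base - follow > mdc)
    improved = sum(1 for base, follow in pairs if follow - base > mdc)
    return {"declined": declined / n, "improved": improved / n}


# Hypothetical 3MS scores for four participants.
scores = [(95, 86), (90, 91), (88, 97), (92, 90)]
print(proportion_changed(scores, mdc=7.53))
```

Reporting decline and improvement separately keeps practice-related gains from masking the proportion of participants with a detectable decline.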
Strengths and limitations
The large size of the ASPREE trial enabled subgroup analyses, providing more meaningful MDC values across different ethno-racial backgrounds and education levels. Prior research has indicated that cognitive scores differ significantly based on these factors, but analyses of test-retest reliability and MDC normally do not account for them due to lack of sample size or diversity in the study populations. Furthermore, the time interval of 1 year used for the trial is more congruent with the time frame operative in clinical practice (usually 6 to 18 months in Australian clinical practice), compared with the very short test-retest intervals seen in many other studies. Finally, because data collection occurred as part of a clinical trial, there is extensive baseline demographic information and follow-up beyond the test-retest interval, allowing us to confirm that our results were robust to factors that can significantly impact cognitive test scores, indicated by the results when participants with diagnosed dementia or depression were excluded.
Limitations include the restriction in our analyses to individuals with complete covariate data and cognitive data from baseline and the year 1 study visit, although those participants with these missing data points (n = 1,794) comprised a relatively small group (9.4%) relative to the full sample. Individuals who did not fall into one of four ethno-racial groups (n = 364) were also excluded from all analyses. Furthermore, the ASPREE participants are a comparatively healthy cohort because of the trial’s exclusion criteria requiring participants to be free of cognitive impairment and conditions that would likely lead to death within five years of study recruitment. As a result, a ceiling effect was observed as many participants achieved the maximum scores on 3MS and HVLT-R. This analysis also did not adjust for practice effects, which were likely the cause of the statistically significant increases in scores between baseline and year 1 tests. Finally, because education was recorded as a categorical variable with the lowest category being < 9 years, additional analyses to explore test-retest reliability in this lowest education group were not possible. Also, as none of the participants in ASPREE were illiterate, further study is needed to examine the properties of these tests in communities where illiteracy is more prevalent.
There are also assumptions inherent to the traditional calculation of MDC scores used for these analyses that are important to note. The MDCs assume that reliability is constant for the entire range of values observed on each cognitive test, when floor and ceiling effects can reduce precision at either extreme. They also provide only a limited interpretation of a meaningful change, providing a threshold value that indicates the minimum score change that can be distinguished from expected variation with 95% certainty. However, changes below this threshold should not be definitively attributed to measurement error alone and could be caused by very early stages of cognitive decline. Despite these limitations, this is still the established method in many disciplines.
Conclusion
Within the ASPREE study, assessment of the test-retest reliability of common cognitive tests in a large community sample over a clinically relevant interval (1 year), and across country, race and ethnicity, and education subgroups, found all tests to have moderate to good test-retest reliability (ICC > 0.5). Additionally, the calculation of minimal detectable change scores enables application in clinical practice by providing indices that distinguish real change in score from changes resulting from normal ageing. For patients who may have high scores relative to age or gender-adjusted norms, these MDC scores can be especially useful to detect early signs of cognitive decline.
ACKNOWLEDGMENTS
Supported by grants (U01AG029824 and U19AG062682) from the National Institute on Aging and the National Cancer Institute at the National Institutes of Health, by grants (334047 and 1127060) from the National Health and Medical Research Council of Australia, and by Monash University and the Victorian Cancer Agency.
