Abstract
Background:
The CERAD Word List Memory Test (WLMT) is widely used in the assessment of older adults with suspected dementia. Although normative data of the WLMT exist in many different regions of the world, normative data based on large population-based cohorts from the Scandinavian countries are lacking.
Objective:
To develop normative data for the WLMT based on a large population-based Norwegian sample of healthy older adults aged 70 years and above, stratified by age, gender, and education.
Methods:
A total of 6,356 older adults from two population-based studies in Norway, HUNT4 70+ and HUNT4 Trondheim 70+, were administered the WLMT. Only persons with normal cognitive function were included. We excluded persons with a diagnosis of mild cognitive impairment (MCI) and dementia, and persons with a history of stroke and/or depression. This resulted in 3,951 persons aged between 70 and 90 years, of whom 56.2% were females. Regression-based normative data were developed for this sample.
Results:
Age, gender, and education were significant predictors of performance on the WLMT list-learning subtests and the delayed recall subtest, i.e., participants of younger age, female sex, and higher education level attained higher scores compared to participants of older age, male sex, and lower level of education.
Conclusion:
Regression-based normative data from the WLMT, stratified by age, gender, and education from a large population-based Norwegian sample of cognitively healthy older adults aged 70 to 90 years are presented. An online norm calculator is available to facilitate scoring of the subtests (in percentiles and z-scores).
INTRODUCTION
Memory decline is common in old age but may also be an initial sign of dementia [1]. Differentiating between normal and abnormal memory decline can be challenging. Human memory is complex and comprises several distinct memory systems that involve different networks in the brain [2–4]. Episodic memory, i.e., the conscious recollection of a personal experience that contains information on what has happened and where and when it happened [5], is particularly vulnerable to the effects of aging. Decline in episodic memory performance among older adults is not fully understood but has been linked to volume reductions in the medial temporal lobe, including the hippocampus, and the prefrontal cortex [6, 7]. Impairment in episodic memory is also frequently seen in psychiatric and physical disorders and is a hallmark of dementia due to Alzheimer’s disease (AD) [8–10].
The consequences of severe impairment in episodic memory may be serious for a person’s basic and complex activities of daily living (ADL) functioning, as seen in AD. However, amnestic mild cognitive impairment (aMCI) should also be taken seriously by the clinician as it is a strong risk factor for subsequent progression to dementia [11–13]. Although most definitions of a diagnosis of aMCI exclude impairment in ADL, a substantial number of studies have shown that complex ADL tasks are negatively influenced by impaired memory function [14–16]. Furthermore, aMCI may also affect health-related quality of life [17, 18].
Hence, accurate assessment of episodic memory functioning becomes important and should be included in the examination of older patients when there is suspicion of cognitive impairment or dementia. Word list tests are probably the most commonly used measure of episodic memory performance, and several alternatives exist, e.g., the California Verbal Learning Test-II (CVLT-II) [19], the Auditory Verbal Learning Test (AVLT) [20], the Hopkins Verbal Learning Test-Revised (HVLT-R) [21], and the word list for the Repeatable Battery for the Assessment of Neuropsychological Status (RBANS) [22].
Another alternative is the Word List Memory Test (WLMT) of the Consortium to Establish a Registry for Alzheimer’s Disease (CERAD) neuropsychological test battery [23], which was developed especially for assessment of patients when AD is suspected. The battery is widely used in both clinical and research settings and has been translated into at least 24 different languages worldwide according to the official CERAD homepage (https://sites.duke.edu/centerforaging/cerad/). Its clinical utility for detecting AD in elderly patients is well documented, and the test battery may also be useful for identifying aMCI, which may be a precursor to AD [24–27]. Among the tests included in the CERAD neuropsychological test battery, the WLMT has been shown to be the most sensitive for screening for aMCI and early AD. The WLMT consists of five subtests (list-learning trials 1–3, delayed recall, and delayed recognition) and, of these, the sensitivity of the delayed recall subtest is best documented [25–28].
A prerequisite for distinguishing normal from pathological performance on a cognitive test is the availability of representative normative data; this also holds true for the CERAD WLMT. Preferably, the normative data should be collected in the country or culture in which the test will be administered, as there are substantial differences in cognitive performance throughout different regions and countries [29–31] and also among cultures within a country, e.g., the ethnic majority versus minority groups [32–35]. Furthermore, the normative data should be relatively new as studies based on single countries or regions have shown improvements in cognitive test scores from generation to generation [36, 37]. Several reasons for this positive trend, often referred to as “the Flynn effect” [38], have been proposed, including better schooling and education systems, improved hygiene and nutrition status, reduction in cardiovascular diseases, more complex and stimulating working lives, and increased working life participation among women [39]. Normative data should be stratified by age, sex, and education, as all three variables have a strong effect on different cognitive test measures, including episodic memory, at least in Western populations [40, 41].
Related to the representativeness issue is the question of sample composition, i.e., which inclusion/exclusion criteria should be applied to make the normative data most useful in clinical settings? Broadly, there are two methodological approaches: either to include cognitively healthy older adults only and exclude individuals with conditions known to affect cognitive performance, or to include typical older adults and use less stringent or no exclusion criteria, which implies that persons with cognitive impairment (and sometimes also dementia) are included in the normative sample [42–44]. As cognitive impairment is common among older people, typical aging normative data will necessarily lower the reference range for cognitive health and increase the risk of under-diagnosing cognitive impairment [44]. Therefore, most authors consider healthy aging normative data the most relevant for clinical questions related to determining the severity and etiology of cognitive impairment, even though this approach may carry a risk of over-diagnosing cognitive impairment [42–45].
We searched the literature for normative CERAD WLMT data and found a total of 17 studies published in English and one study in German from 2000 up to 2021. The majority of the studies are from North America [46–54], but normative data for the CERAD WLMT have also been developed in other regions of the world, including countries in Europe [26, 55–58], Asia [59, 60], Africa [29], the Middle East [61], and Latin America [59]. Besides involving different regions/countries, the studies differ with respect to study setting (population, community, clinic), study populations (e.g., age groups, education level, language, ethnicity), the applied inclusion/exclusion criteria, and sample sizes. Thus, direct comparisons of the normative data are difficult to make. Although the studies, with some exceptions [29, 59], exclude participants with cognitive impairment and/or dementia, relatively few studies comprise participants from unselected samples [26, 59], and the normative samples are often small. In fact, nine out of the 18 studies have fewer than 325 participants [26, 61]; only five are population-based with more than 1,500 participants [29, 59]. Of these, the normative data comprise older adults, aged 60 years and above, with the exception of Hankee et al. [47], which predominantly includes younger and middle-aged adults.
In general, the studies report a significant effect of age and education on test performance for total list learning and delayed recall of the WLMT; i.e., persons of younger age and/or higher education attain better scores than persons of older age and/or lower education on both. For sex, the results are less consistent, but the tendency is that women perform better than men, at least on total list learning [29, 56–60].
It is worth noting that there can be large differences in the reported normative scores between regions/countries, as exemplified by the studies of Luck et al. (2018) from Leipzig, Germany [56] and Gray et al. (2021) from villages in Nigeria and Tanzania [29]. For a female aged 70 to 74 with formal education, the mean delayed recall score differs by more than four words (7.7 versus 3.6) between the two studies. Albeit an extreme example, this discrepancy underlines the importance of applying normative data with caution, i.e., the use of cognitive tests, including the CERAD WLMT, should be restricted to populations comparable to where they were developed, as differences in culture, language, and the quality of education systems between countries/regions will affect the inhabitants’ performance.
Among the European studies, only one is from Scandinavia [55]. This study comprises demographically adjusted normative data for the CERAD WLMT in a Norwegian sample of 227 healthy persons aged 40 to 80 years, mean age 63.1 years (standard deviation (SD) 8.6). Although the normative data cover an important age interval, the sample is very small and does not include older adults above 80 years of age. Since individual differences in cognitive performance increase with age, normative data of older adults should preferably be based upon large population-based samples to make reliable assessments. Ideally, such data should also include persons older than 80 years. The latter point is particularly important as age is the strongest risk factor for dementia, with an estimated prevalence of 5.6% in the age group 70–74 years in Norway in 2019, increasing to 48.1% in the age group 90+ [62].
In this study, we aim to present normative data for the CERAD WLMT from a large population-based Norwegian sample of healthy older adults aged 70 years and above, stratified by age, sex, and education.
MATERIALS AND METHODS
Data were obtained from two related population-based studies, the Trøndelag Health Study, fourth wave (HUNT4 70+) and the HUNT4 Trondheim 70+ [62, 63]. Both studies included older adults aged 70 years and above. While the design and methods of the two studies are identical, the geographical recruitment areas differ. The HUNT4 70+ study covers the northern part of Trøndelag county in Norway. This region is characterized by rural areas and small towns with fewer than 25,000 inhabitants. In many ways, the population in the northern part of Trøndelag is representative of the Norwegian population in general, but the region has no larger cities and a low share of immigrants, and the number of people with higher education is below the national average [64, 65]. To compensate for these limitations, the HUNT4 Trondheim 70+ study was initiated in 2018. The HUNT4 Trondheim 70+ study covers Trondheim, the third largest city in Norway, with 207,000 inhabitants as of 2021, situated in the southern part of Trøndelag county. Data collection for the HUNT4 70+ study and the HUNT4 Trondheim 70+ study was carried out from September 2017 to March 2019 and from October 2018 to June 2019, respectively. Details of the HUNT study have been described previously [62, 65].
Participants
In the HUNT4 70+ study, all adults aged 70 years and above with a registered address in the municipalities of the northern part of Trøndelag county (n = 19,403) were invited by mail to participate (and reminded once by phone if needed). Of these, 9,930 persons (51.2%) took part in the study.
In the HUNT4 Trondheim 70+ study, invitations were sent to all adults aged 70 years and above with a registered address in three zones of the city, Lade, Strindheim, and Stranda (n = 5,087). These city zones were picked as they were judged to be relatively representative of Trondheim in general. A total of 1,745 persons (34.3%) participated in the study.
Hence, our sample initially comprised a total of 11,675 persons aged 70 years and above. All underwent a clinical evaluation with the purpose of determining the prevalence of MCI and dementia (see Assessments). However, only participants with a score ≥ 22 on the Montreal Cognitive Assessment (MoCA), i.e., persons scoring in the borderline/normal range on the MoCA test, were offered the CERAD WLMT. The reason for offering the CERAD WLMT was to obtain further information about their memory performance to facilitate the diagnostic evaluation. In a recent publication [66], we have shown that the normative MoCA score varies between 22 and 27 points in the same sample depending on age, sex, and education level, supporting the relevance of the chosen cut-off for this purpose.
A total of 7,229 participants had a MoCA score ≥ 22 and were eligible for inclusion, but only 6,356 of those were administered the CERAD WLMT. Reasons for not administering the test were lack of capacity at the test station, participant’s refusal to take the test, or participant judged unsuited by the test personnel, e.g., due to hearing loss or lack of motivation. As we wanted to include only persons with normal cognitive functioning in the normative data, we excluded persons diagnosed with mild cognitive impairment (MCI) (n = 1,535) or dementia (n = 40) according to DSM-5 criteria (minor or major neurocognitive disorders) [67] or cognitive impairment without an etiological diagnosis (n = 5). After excluding these, we also excluded persons with a total score ≥ 8 on the depression subscale of the Hospital Anxiety and Depression Scale (HADS) [68] (n = 299) or a history of stroke (n = 303). These groups had significantly lower scores on both total list learning and delayed recall of the CERAD WLMT compared to the included sample, except for persons with a history of stroke, for whom the score was significantly lower on total list learning only. Finally, as we wanted to include only persons with valid results on all three list-learning trials and the delayed recall part of the CERAD WLMT in the normative data, we excluded persons with incomplete or invalid test scores on any of the CERAD WLMT subtests (n = 192). This left us with a total of 4,007 persons who were included in the statistical analysis.
Assessments
Assessments were done at the test station or at the participant’s home and were carried out by health personnel and nursing students, who had all undergone a two-day training program on administering the tests and questionnaires, including the CERAD WLMT.
The CERAD WLMT
The Norwegian version of the CERAD WLMT developed by Liv Barnett in 2004 was applied [23]. The CERAD WLMT assesses learning and memory for new verbal information and consists of a list of ten words (nouns). According to the instructions, the participants were asked to read each word out loud as it was presented in writing. The list was presented three times in total (trial 1, trial 2, and trial 3), with the words in a different order each time. After each trial, the participants were asked to recall as many words as possible, irrespective of the order. After a delay interval of 10 min, the participants were again asked to recall the words from the list. Scoring was done for each list-learning trial (trial 1, trial 2, trial 3) and for the delayed recall trial, based on the number of words correctly recalled (maximum score 10 for each trial). A total list-learning score was calculated based on the total number of words correctly recalled from trials 1 to 3 (maximum score 30). In addition, we also calculated the word list savings score ((delayed recall score / list-learning trial 3 score) × 100), i.e., delayed recall score as a percent of learning trial 3 score, which is a measure of the relative amount of information remembered over the delay interval. The recognition trial, which normally forms part of the CERAD WLMT, was not administered as part of our study due to time/capacity constraints.
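The two derived scores described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and returning `None` when the trial 3 score is zero is an assumption, since the text does not state how that case was handled.

```python
from typing import Optional


def total_list_learning(trial1: int, trial2: int, trial3: int) -> int:
    """Total words correctly recalled across trials 1 to 3 (maximum 30)."""
    return trial1 + trial2 + trial3


def savings_percent(delayed_recall: int, trial3: int) -> Optional[float]:
    """Delayed recall expressed as a percentage of the trial 3 score."""
    for score in (delayed_recall, trial3):
        if not 0 <= score <= 10:
            raise ValueError("each trial score must lie between 0 and 10")
    if trial3 == 0:
        # Assumption: the ratio is undefined when no words were recalled on trial 3.
        return None
    return delayed_recall / trial3 * 100
```

For example, a participant who recalls 8 words on trial 3 and 6 words after the delay has a savings score of 75%.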
Other assessments
The process and assessment tools/questionnaires used for diagnosing participants with MCI and dementia, which led to exclusion from our normative study, have been described in detail elsewhere [62]. In summary, all participants underwent a clinical evaluation that included a structured interview for assessment of subjective cognitive decline and cognitive testing with the MoCA [69]. MoCA is a cognitive screening test that covers different cognitive domains, including short-term memory, visuospatial abilities, executive functions, attention, concentration and working memory, language, and orientation to time and place. It has a maximum score of 30. MoCA has been shown to be sensitive for detecting MCI and early dementia, but adjustments for age, sex, education, and cross-cultural considerations are necessary [66, 70]. As mentioned earlier, participants with a MoCA score ≥ 22 were also administered the CERAD WLMT to obtain information about possible memory deficits not detected by MoCA.
In cases where the participant reported substantial subjective cognitive decline and/or obtained a delayed recall score on the WLMT in the lower range (70–79 years: < 4 words; 80–89 years: < 3 words; and 90+: < 2 words), a structured interview with a next of kin was conducted by telephone. Information about cognitive changes (symptoms, debut, course), ADL, and neuropsychiatric symptoms was collected. Instrumental and personal ADL functioning were assessed by the Instrumental Activities in Daily Living (IADL) and the Physical Self-Maintenance Scale (PSMS) [71]. The presence of neuropsychiatric symptoms was evaluated with the Neuropsychiatric Inventory Questionnaire (NPI-Q) [72].
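The age-dependent delayed recall thresholds that triggered a next-of-kin interview can be expressed as a simple rule. The sketch below is illustrative; the function name is hypothetical, and it covers only the delayed recall criterion, not the separate criterion of reported subjective cognitive decline.

```python
def low_delayed_recall(age_years: float, delayed_recall: int) -> bool:
    """True if the delayed recall score falls in the lower range for the age band."""
    if age_years < 70:
        raise ValueError("thresholds are only given for ages 70 and above")
    if age_years < 80:
        threshold = 4  # 70-79 years: fewer than 4 words
    elif age_years < 90:
        threshold = 3  # 80-89 years: fewer than 3 words
    else:
        threshold = 2  # 90+ years: fewer than 2 words
    return delayed_recall < threshold
```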
Symptoms of depression were assessed with the Hospital Anxiety and Depression Scale (HADS) [68] based on self-report. The HADS consists of 14 items, each scored from 0–3 points, seven items related to anxiety (maximum score 21) and seven items related to depression (maximum score 21). The validity of HADS is well documented, and we used the recommended cut-off score of 8 points on the depression subscale (HADS-D) to categorize participants with and without significant symptoms of depression [73, 74]. According to the systematic review by Wu et al. (2021) [74], the sensitivity and specificity to screen for major depression are 74% and 84%, respectively, for our applied cut-off of 8 points on the HADS-D.
A history of stroke was based on each participant’s self-report. We judge that as stroke is such a dramatic life event, most people will recall it without any problems. A risk, though, may be false positives [75].
The diagnostic process of MCI and dementia
A diagnosis of MCI (minor neurocognitive disorder) or dementia (major neurocognitive disorder) according to DSM-5 criteria [67] was made independently by two clinicians belonging to a diagnostic work-up group of nine clinicians (geriatricians, old age psychiatrists, neurologists). The diagnoses were based on all the information collected from the interviews and cognitive testing of the participants, and the structured interviews with next of kin (that included questions of cognitive changes, ADL functioning and neuropsychiatric symptoms)—see above. All clinicians were experienced with comprehensive scientific and clinical expertise in the dementia field. All collected data were available for both clinicians, and in cases where there was disagreement about the diagnosis, a third clinician was consulted.
According to DSM-5, a diagnosis of neurocognitive disorder requires evidence of a modest (MCI) or significant (dementia) cognitive decline in one or more of the following cognitive domains: complex attention, executive function, learning and memory, language, perceptual–motor, or social cognition. In our study, evidence of decline in learning and memory was based on the attained results of the CERAD WLMT and relevant subtests of MoCA (orientation, memory, and delayed recall), while evidence of decline in any of the other cognitive domains was based on the results of the other subtests of MoCA (visuospatial/executive, naming, attention, language, abstraction), complemented by the results from interviews with participant and next of kin.
We did not apply specific cut-off scores on the cognitive tests to determine the level of cognitive decline, because recommended cut-offs vary considerably between studies and depend on several factors, including age, educational level, and sex. Instead, z-scores for our sample were generated and normalized by using means and SDs from other recent normative data for MoCA [76, 77] and the CERAD WLMT delayed recall [56, 57]. These z-scores were applied by the clinicians to guide the diagnostic evaluations. As for the DSM-5 criterion of functioning in everyday activities, interference with independence was defined as having problems with at least one of the activities described in the IADL or PSMS assessment scales, which were administered to next of kin. In addition, the clinicians had available data from the interviews with the participants regarding subjective cognitive decline and perceived effect on activities in daily living.
To optimize the reliability of the diagnostic evaluations, the clinicians participated in a seminar in advance where the rules and criteria for diagnosis were reviewed, and diagnostic training was undertaken. As part of the training all clinicians independently evaluated 50 cases and classified them into four categories according to the DSM-5 criteria: no cognitive impairment, MCI (minor neurocognitive disorder), dementia (major neurocognitive disorder), or “could not be classified”. Overall, the reliability of agreement between the clinicians was substantial (Fleiss’ kappa 0.70, 95% confidence interval (CI) 0.66–0.74), and almost perfect for dementia (Fleiss’ kappa 0.90, 95% CI 0.79–0.90). The cases were subsequently discussed by the clinicians in a plenary session to harmonize the classification further. For more details of the diagnostic process, see Gjøra et al. (2021) [62].
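For readers unfamiliar with the agreement statistic reported above, the following Python sketch computes Fleiss' kappa from a table of rating counts using the standard textbook definition. It is not the authors' code, the function name is hypothetical, and the small tables in the usage note are made up for illustration rather than taken from the study.

```python
def fleiss_kappa(table):
    """Fleiss' kappa; table[i][j] = number of raters assigning case i to category j."""
    n_cases = len(table)
    n_raters = sum(table[0])
    if any(sum(row) != n_raters for row in table):
        raise ValueError("every case must be rated by the same number of raters")
    n_categories = len(table[0])
    total_ratings = n_cases * n_raters

    # Proportion of all ratings that fall in each category.
    p_category = [
        sum(row[j] for row in table) / total_ratings for j in range(n_categories)
    ]
    # Observed agreement per case, averaged over cases.
    agreement = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_observed = sum(agreement) / n_cases
    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_category)
    return (p_observed - p_expected) / (1 - p_expected)
```

With three raters and two categories, perfect agreement on two cases (`[[3, 0], [0, 3]]`) gives a kappa of 1, while split ratings on both cases (`[[2, 1], [1, 2]]`) give a kappa below zero.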
Statistics
Participant characteristics and outcome variables were presented as frequencies and percentages, or means and SDs, as appropriate. To assess associations between CERAD test scores and age (categorized as 70.0–74.9, 75.0–79.9, 80.0–84.9, and 85.0+), sex, education (categorized as compulsory (≤10 years), secondary (11–13 years) and tertiary (≥14 years)), and marital status, linear regression models were estimated. To build the model for assessing normative data for CERAD test scores, the same model with age as a continuous variable was estimated. A backward elimination approach was applied. For each score, a model including all possible higher-order interactions (maximal model) was estimated first. Potential non-linearity in continuous age was assessed through higher-order components. Next, all possible models were estimated by eliminating interactions one at a time and applying the Bayesian Information Criterion (BIC), where a smaller value means a better model, at each step. Thorough residual diagnostic testing was performed. Normality was assessed by inspecting histograms and Q–Q plots. Heteroscedasticity was assessed by inspecting boxplots and by Levene’s test. In most models, the residuals were symmetrically distributed. Only residuals from models for CERAD learning trial 3 and % savings showed slight skewness. Minor heteroscedasticity issues were identified in some of the models. However, these were caused only by a few larger values contributing to greater variance. Several transformations of scores were considered; however, these did not improve the models. Models with robust standard errors did not alter the results. Hence, the original scale was kept. No non-linearities with respect to age were identified.
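For reference, the general definition of the criterion used in the elimination steps is given below. This is the standard textbook form, not a formula taken from the study, and the symbols k, n, and \hat{L} are introduced here for illustration only.

```latex
\mathrm{BIC} = k \ln(n) - 2 \ln(\hat{L})
```

Here k is the number of estimated parameters, n is the number of observations, and \hat{L} is the maximized likelihood of the model. Removing an interaction lowers k, so the simpler model is preferred whenever the resulting loss in fit is smaller than the reduction in the penalty term k ln(n).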
Only participants with no missing values on relevant covariates were included in the regression analyses, resulting in a slightly smaller sample size. Results with p-values below 0.05 were considered statistically significant. The statistical analyses were performed in Stata v17.
Ethics
The study uses data from the HUNT4 70+ and the HUNT4 Trondheim 70+ studies. Both were approved by the Regional Committee for Medical and Health Research Ethics in Norway (REK South East D 82985), the Norwegian Center for Research Data (NSD 791342) and the Norwegian Data Inspectorate. Oral and written consent was obtained from all participants. In cases where the participant was judged to have reduced capacity to consent, informed consent was obtained from the closest next of kin.
RESULTS
Characteristics of the sample
Demographic characteristics and cognitive scores of the sample are presented in Table 1. The mean age of the participants was 75.8 years (SD 4.7) with a range of 70.0 to 96.7 years, 56.2% were females, and 66.1% were married. The mean MoCA score of the participants was 25.8 (SD 2.0) with a range of 22 to 30. For comparison, the mean MoCA score of our initial sample before exclusion was 22.4 (SD 5.0) with a range of 0 to 30 (n = 11,675 of whom 778 had missing MoCA score).
Table 1. Demographic characteristics and cognitive scores of the sample, N = 4,007
*Total N = 3,976; **Total N = 3,994; Compulsory, ≤10 years; Secondary, 11–13 years; Tertiary, ≥14 years.
There were no statistically significant differences in demographic characteristics between the HUNT4 70+ sample and the HUNT4 Trondheim 70+ sample, except for a higher number of participants in the HUNT4 Trondheim 70+ sample with tertiary education (≥14 years). Participants in the HUNT4 Trondheim 70+ sample also attained a significantly higher score on all CERAD WLMT subtests (p < .001), except for CERAD WLMT savings %. No significant difference in total MoCA score between the two samples was found.
Results of linear regression models
The results of the linear regression models assessing the effects of gender, education level, and (categorized) age on performance on the CERAD subtests are presented in Table 2. All three demographic variables were significantly associated with performance on all the WLMT list-learning subtests, i.e., list-learning trials 1, 2, 3, and total list-learning (trials 1–3), and on the WLMT delayed memory subtest in both the bivariate and the multiple models. The results showed that female sex, youngest age (age group 70.0–74.9 years) as compared to older, and highest education level (tertiary) as compared to secondary education were associated with higher scores on all the WLMT list-learning subtests. Marital status was not associated with performance on the WLMT list-learning (and memory) subtests and could be excluded from the models according to BIC.
Table 2. Results of linear regression models, N = 3,994
For the WLMT memory variables, i.e., WLMT delayed memory subtest and % savings, the results of the linear regression analysis were much the same as for the WLMT learning subtests. Again, female sex and youngest age as compared to older were associated with better scores on both variables. Education level was also significantly associated with performance on the WLMT delayed recall subtest, but for the % savings variable, a significant difference was found only between participants with higher (tertiary, ≥14 years) and participants with medium (secondary, 11–13 years) education levels.
For the purpose of developing normative data, i.e., z-scores, the same models were estimated with age entered as a continuous variable. Participants aged 90 years and above were excluded from the models due to the relatively small sample size (n = 56). The z-score can be calculated as the difference between the observed subtest score and the subtest score estimated from the regression models in Table 3, divided by the model’s root mean square error (RMSE).
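The z-score calculation described above can be sketched as follows. The function names are hypothetical, and the intercept, coefficients, and RMSE in the example are invented placeholders, not the values from Table 3. The sketch also assumes a purely additive model, whereas the published models may include interaction terms.

```python
def predicted_score(intercept, coefficients, covariates):
    """Linear prediction: intercept plus the sum of coefficient * covariate."""
    return intercept + sum(
        coefficients[name] * covariates[name] for name in coefficients
    )


def z_score(observed, predicted, rmse):
    """(observed - predicted) / RMSE, as described in the text."""
    if rmse <= 0:
        raise ValueError("rmse must be positive")
    return (observed - predicted) / rmse


# Hypothetical values for illustration only; NOT the coefficients from Table 3.
EXAMPLE_INTERCEPT = 35.0
EXAMPLE_COEFFICIENTS = {"age": -0.25, "female": 1.5, "tertiary": 2.0}
EXAMPLE_RMSE = 3.75

person = {"age": 75, "female": 1, "tertiary": 1}
expected = predicted_score(EXAMPLE_INTERCEPT, EXAMPLE_COEFFICIENTS, person)
z = z_score(16, expected, EXAMPLE_RMSE)
```

With these placeholder values the predicted score is 19.75, so an observed score of 16 corresponds to a z-score of -1.0.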
Table 3. Results of linear regression models, N = 3,951 (continuous age ≤ 90)
¹ Regression coefficient and standard error presented due to interaction.
Table 4 presents age-, sex-, and education-specific normative scores of the WLMT subtests derived from the regression models in Table 3, rounded to the nearest integer. The scores are presented as means (0 SD) and deviations of –1 SD, –1.5 SD, and –2 SD. As a rule of thumb, a score between –1 SD and +1 SD is usually considered within the normal/average range, a score between –1 SD and –2 SD indicates below average range, and a score below –2 SD implies significantly below average range. However, caution in the interpretation of test scores is always required. Evaluation of a patient’s test score should be done on an individual basis, taking into consideration the patient’s estimated premorbid cognitive functioning level, among other factors.
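The rule of thumb above maps a z-score onto a descriptive range. A minimal Python sketch follows; the function name is hypothetical, and both the treatment of scores falling exactly on a boundary and the "above average" label for scores over +1 SD are assumptions that the rule of thumb itself does not specify.

```python
def classify_z(z: float) -> str:
    """Map a z-score to the descriptive ranges of the rule of thumb."""
    if z < -2.0:
        return "significantly below average"
    if z < -1.0:
        return "below average"
    if z <= 1.0:
        # Assumption: the boundary values -1 and +1 count as normal/average.
        return "normal/average"
    # Assumption: label for scores above +1 SD, not part of the stated rule.
    return "above average"
```

Such a label is only a starting point and does not replace the individual clinical judgment the text calls for.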
* Age-, sex-, and education-specific normative scores of the
C, Compulsory; S, Secondary; T, Tertiary.
*Comment on grey markings in Table 4: Marked values indicate slightly underestimated limits. If the results were presented with decimals, these values would lie somewhere between the value above and the value below the marked area. As the test result is an integer, the limits are presented as integers, but they should be interpreted as a grey zone.
DISCUSSION
In this paper, we present population-based normative CERAD WLMT data from a large sample of cognitively healthy Norwegian older adults aged 70 years and above. We trust that these normative data will fulfill a need among Norwegian clinicians involved in assessments of older adults they suspect may have cognitive impairment or dementia. The normative data will also likely be relevant for Swedish and Danish colleagues as education and social welfare systems are quite similar in the Scandinavian countries. Although the WLMT is widely used, normative data from the Scandinavian region have been available only for ages up to 80 years based on a small sample [55]; for patients above 80 years of age, clinicians have had to rely on normative data from other countries/regions.
As expected, age, education level, and sex had a significant effect on performance on the learning and recall subtests of the WLMT in our data. Younger age, female sex, and higher education level were associated with higher scores on all the WLMT subtests. These effects are relatively profound, as illustrated with an example: A male patient, aged 90 years, compulsory education (≤10 years), with a raw score of 18 words on the total learning (trials 1–3) subtest and 5 words on the delayed recall subtest obtains results in the above-average/superior range in our normative data, i.e., 90th percentile (z-score 1.26) on total learning and 72nd percentile (z-score 0.58) on delayed recall. Correspondingly, a female patient, aged 70 years, with tertiary education (≥14 years) obtains results in the below average range with the same raw scores, i.e., 13th percentile (z-score –1.12) on total learning and 8th percentile (z-score –1.40) on delayed recall, indicative of mild amnestic cognitive impairment.
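The percentiles quoted in this example are consistent with reading each z-score off the standard normal distribution. The sketch below illustrates that conversion with Python's standard library; it assumes normally distributed residuals, the function name is hypothetical, and it is not a description of how the online norm calculator is implemented.

```python
from statistics import NormalDist


def percentile_from_z(z: float) -> int:
    """Percentile rank of a z-score under a standard normal distribution."""
    return round(NormalDist().cdf(z) * 100)


# z-scores quoted in the worked example above.
example_z_scores = [1.26, 0.58, -1.12, -1.40]
example_percentiles = [percentile_from_z(z) for z in example_z_scores]
```

Applied to the four z-scores in the example, this returns the 90th, 72nd, 13th, and 8th percentiles, matching the values reported in the text.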
The robust effects of age and education are consistent with previous normative studies of the WLMT [26, 54–61]. We suggest that the age-related effects seen in our normative data primarily reflect normal aging processes, as all participants with a diagnosis of MCI and dementia, and participants with symptoms of depression and/or a history of stroke, were excluded from the analyses. Although there are inter-individual differences, decline in episodic memory performance in normal aging is well documented, probably begins, on average, at around age 60, and typically involves both encoding (learning) and retrieval (recall) processes [6, 78]. The decline is associated with several structural and functional changes in the brain, including volume reductions in the gray and white matter, neuronal shrinkage, reductions in synaptic contacts, decreases in the concentrations of neurotransmitter substances, and neuronal loss. Such changes are typically most prominent in the frontal and the medial temporal cortex, including the hippocampus, and in the putamen, thalamus, and accumbens [78, 79].
Education level is established as a significant factor predicting cognitive performance, including episodic memory functioning, throughout the entire adult lifespan, from younger adulthood to older age, independent of gender, race, society, and birth cohort [80]. Higher education level not only predicts better episodic memory functioning but may also delay age-related cognitive change and the onset of neurodegenerative illnesses like AD, possibly as a consequence of a larger brain and cognitive reserve, which may act as a buffer against cognitive decline [81–83]. In fact, a meta-analysis found a dose-dependent effect of education; i.e., every additional year of education was associated with a 7% reduced risk of all-cause dementia [84].
The significant effect of sex in our study also accords well with other normative studies of the WLMT, at least in Western populations [55–57]. Interestingly, the female advantage is reported to be weaker or even absent in low- and middle-income countries, e.g., Nigeria and Tanzania [29], Cuba, Dominican Republic, Venezuela, Peru, Mexico, China, and India [59]. What could be the reasons for these differences? In general, a female advantage in verbal memory tasks is well documented. In a meta-analysis by Asperholm et al. (2019), which included 612 studies from 54 countries across 40 years (n = 587,691), females outperformed males in 42 out of 45 countries [85]. Also, the magnitude of the female advantage was positively associated with time- and country-specific social progress indicators. In their bivariate analysis, gender equality, population education and employment, and gross domestic product (GDP) per capita were all significant predictors of the female advantage, but in the multiple model, only population education and employment remained significant. The authors suggest that women’s verbal episodic memory performance may benefit more than men’s from education and labor market participation. Norway is among the most gender-equal countries in the world, and education and labor market participation are largely independent of gender [86]. Organized education also has a long history in Norway. In 1889, a seven-year primary school education became compulsory for all children. From the 1960s, more women started earning university degrees and taking part in occupational life [87]. In line with the arguments of Asperholm et al., these are factors that may have fostered the female advantage in verbal episodic memory functioning seen in our data.
How well do our normative data fit with other normative data from comparable samples in Western countries/cultures? As mentioned, direct comparisons are difficult to make due to differences in study settings, study populations, inclusion/exclusion criteria, and sample sizes. These are all factors that may explain the noticeable variations in the normative data observed among the studies [26, 55–57]. For our purposes, the most relevant are probably the two population-based studies by Luck et al. (2018) and Luck et al. (2009) from Germany [56, 57], which both present normative data of a large number of home-dwelling older adults free of dementia (aged 60–79 years and 75–98 years, respectively), stratified by age, sex, and education; these norms are used in many Norwegian memory clinics today. We notice that the normative data of Luck et al. (2018) are “stricter” than ours in the sense that they report higher mean values (but with a tendency toward somewhat larger SDs) for the majority of the WLMT variables in relevant age groups, independent of sex and education level. As an illustrative example, the mean (SD) of total list-learning (trials 1–3) and delayed recall for males with no university degree in the age group 75–79 is 19.1 (3.5) words and 6.7 (1.8) words in Luck et al. (2018), compared to 17.1 (3.1) words and 5.2 (1.4) words in our material. Interestingly, our data seem to be more in line with Luck et al.’s earlier study (2009). Still, the tendency is that Luck et al. (2009) report higher means (and larger SDs) on most variables in both age groups 75–79 and 80+ years. We see no obvious reason for the observed differences between Luck et al. (2018) and our study but suggest that they could possibly be explained by several factors. Differences in the participation rates could be one. In Luck et al., 33% of the invited persons agreed to take part in the study, while the corresponding figure in our study was 51.2% in the HUNT4 70+ sample (9930/19403) and 34.3% in the HUNT4 Trondheim 70+ sample (1745/5087). This could imply that the Luck et al. sample comprised a cognitively healthier group of older adults from the outset. Also, the participants in their study were recruited from one large German city only, Leipzig, while our study mainly comprises participants recruited from rural areas and small towns (<25,000 inhabitants) in central Norway (northern part of Trøndelag county). Finally, our applied strategies for inclusion/exclusion of participants were different. In particular, we excluded participants with MCI (minor neurocognitive disorder) according to DSM-5, while Luck et al. did not apply MCI specifically as an exclusion criterion, but excluded participants with serious medical, neurological, or psychiatric disorders/conditions that could affect cognitive performance. Initially, one would believe that our strategy should lead to higher average scores, but without a closer comparison of the data it is hard to know exactly how these differences in selection strategy may have affected the results. In any case, the differences underline the importance of applying normative data representative of the population in which the CERAD WLMT is used.
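To illustrate what “stricter” means in practice, the sketch below standardizes the same raw scores against the two sets of group means and SDs quoted above (males, no university degree, age 75–79). The raw scores of 15 and 4 words are hypothetical, and the simple (raw − mean)/SD formula is an assumption made for illustration; it is not the regression-based procedure used to derive our norms.

```python
# Group means and SDs quoted in the text (males, no university degree, 75-79 years).
NORMS = {
    "Luck et al. (2018)": {"total_learning": (19.1, 3.5), "delayed_recall": (6.7, 1.8)},
    "present study": {"total_learning": (17.1, 3.1), "delayed_recall": (5.2, 1.4)},
}

def z_score(raw: float, mean: float, sd: float) -> float:
    """Simple standardization against a group mean and SD."""
    return (raw - mean) / sd

def compare(raw_scores: dict) -> dict:
    """Return z-scores for each subtest under each set of norms."""
    return {
        source: {
            subtest: z_score(raw_scores[subtest], mean, sd)
            for subtest, (mean, sd) in subtests.items()
        }
        for source, subtests in NORMS.items()
    }

# Hypothetical raw scores, chosen only for illustration.
result = compare({"total_learning": 15, "delayed_recall": 4})
for source, scores in result.items():
    print(source, {name: round(z, 2) for name, z in scores.items()})
```

With these hypothetical raw scores, the z-scores are lower under the Luck et al. (2018) figures than under ours on both subtests, which is the sense in which those norms are “stricter”.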
As regards the already existing Norwegian normative data from Kirsebom et al. (2019) [55], these are fairly comparable to ours for the age interval 70 to 80 years, both for total list-learning and delayed recall, although there is a trend that the Kirsebom et al. norms are slightly “stricter”. A reason could be that Kirsebom et al. partly recruited participants through advertisements in media, which may have attracted a higher share of cognitive “super-performers” to their study. We also note that our normative data seem to differentiate better than those of Kirsebom et al. at the lower and upper ends of the scales.
Composition of the normative sample is a crucial point in the development of normative data. In this respect, the question of whether persons with MCI should be included in or excluded from the normative data is a tricky one. We chose to exclude persons with a diagnosis of MCI (and dementia, symptoms of depression, and a history of stroke). Other studies have chosen different strategies. For example, in the German study by Luck et al. (2018) [56], the researchers chose only to exclude persons with a selected number of serious medical, neurological, and psychiatric disorders/conditions known to affect cognitive performance (including dementia) but did not exclude persons with milder forms of cognitive impairment who may have fulfilled the criteria of MCI (minor neurocognitive disorder), as they wanted the normative data to reflect the “true” level of cognitive functioning in the general population of dementia-free older adults. In the study by Gray et al. (2021) [29] from sub-Saharan Africa, on the other hand, the researchers chose to include both persons with MCI and dementia in the normative data and argue that exclusion of these conditions would have made the normative data less representative. There are two reasons why we consider exclusion of MCI to be the best alternative. Firstly, as MCI is a common condition among older adults, inclusion of these persons in the normative data will unavoidably lead to lower mean scores and larger standard deviations, thereby increasing the likelihood that a person’s test score will be interpreted as being within the normal range when, in fact, the person is cognitively impaired. This argument has been emphasized by several authors previously (e.g., Stricker et al. (2019) [88], Thomann et al. (2017) [45], Martin et al. (2017) [42], Green (2000) [44]). Secondly, MCI is a condition which is important to diagnose. 
Although some persons with MCI may revert to a normal condition, the large majority will remain stable (with possible effects on their complex ADL functioning and health-related quality of life) or subsequently convert to dementia. In a systematic review [89], the authors reported an overall reversion rate of 18% from MCI to normal condition. Hence, an objection to our choice of strategy could be that persons with MCI who may later revert to a normal condition were excluded from the normative data. Still, we would argue that the number of such potential “reverters” is most likely relatively small, and that most of our participants with MCI cannot be considered cognitively normal.
As regards the proportion of participants with MCI in our sample, 24%, it is quite high compared to several other studies, but not all [90, 91]. In our case, it is important to take into consideration the age interval of the sample, 70–90 years. It is documented that the prevalence of MCI increases with older age, from around 10% in the age group 70–74 years to around 37% in the age group 85+ years [91]. Another reason for the high prevalence of MCI is probably that we applied a threshold value of –1.0 SD for MCI to guide the diagnostic evaluations, which is the value suggested by DSM-5 [92]. In other studies, threshold values of –1.5 or –1.0 SD have been used. Clearly, the choice of threshold value will affect the prevalence; e.g., in the Swedish normative data for the MoCA test [76], 51% and 39% had z-scores ≤ –1.0 and ≤ –1.5, respectively. Still, we do not believe that our choice of threshold value has led to too strict reference values. Our view is supported by the fact that the normative data of Luck et al. (2018) [56], Luck et al. (2009) [57], and Kirsebom et al. (2019) [55] are all somewhat “stricter” than ours, as discussed above.
A strength of our study is the large number of older adults included in the development of the normative data and that they were recruited from a non-selected community sample. The exclusion criteria that were applied should ensure that our sample predominantly comprises older adults with normal cognitive functioning who were free of MCI or dementia. However, our study also has some limitations. First, we do not know how representative our sample is for the Norwegian or Scandinavian population of cognitively healthy older adults as a whole; the sample was drawn from the northern part of Trøndelag county and three city zones of Trondheim city only, and the participation rate was moderate (51.2% and 34.3%, respectively). Second, the CERAD WLMT was included among the tests/examinations for diagnosing MCI and dementia, which may have affected the clinicians’ diagnostic evaluations; preferably, another memory test should have been used for this purpose. Third, the cognitive tests that we administered did not fully cover all the cognitive domains that may be affected in neurocognitive disorders according to the DSM-5 criteria. In particular, this objection applies to the social cognition domain. Fourth, although the diagnoses of MCI and dementia were made by experienced clinicians based on a comprehensive clinical evaluation and according to the DSM-5 criteria for minor and major neurocognitive disorder, data on the diagnostic accuracy (sensitivity/specificity) of our procedure are lacking. Also, we did not have access to brain imaging and fluid biomarkers, which most likely would have increased the diagnostic accuracy. In addition, we did not have access to blood tests to rule out serious physical disorders that could cause MCI. However, the incidence of such disorders in home-dwelling people is likely to have been limited and thus unlikely to have biased the results. 
Fifth, it would have been desirable to have in-depth information about the health status of the participants, and how they compare to individuals with a similar demographic background, particularly regarding serious medical, neurological, and psychiatric conditions that are known to affect cognitive performance. Unfortunately, we did not have access to such data. However, by applying our strategy of excluding persons with MCI, a history of stroke, symptoms of depression, and dementia, we believe that the majority of participants with conditions affecting cognitive performance have been excluded from our sample. Sixth, the recognition subtest of the CERAD WLMT was not administered in our study, and normative data for this subtest are therefore unfortunately lacking. However, we would argue that this is not a critical point, as a marked ceiling effect of the recognition subtest is well documented among cognitively healthy older adults [55–57]. For assessments of persons with suspected dementia, the recall subtest of the CERAD WLMT is the most sensitive, while the recognition subtest may be less affected in the earliest phases of AD [93]. Seventh, our normative data comprise persons in the age range 70–90 years only. Considering the number of younger people who are examined by GPs or memory clinics for memory complaints and/or suspected dementia, we suggest that population-based normative data for persons younger than 70 years, representative of the Scandinavian region, should also be developed.
To facilitate the scoring of the WLMT subtests for clinicians, we have developed a scoring calculator that is available in both English and Norwegian at http://www.aldringoghelse.no/ceradwordlist. By inserting the raw scores of the WLMT subtests and entering the age, gender, and education level of the patient, the calculator will produce z-scores and percentiles from the normative means.
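One common way to implement a regression-based norm calculator of this kind is to predict the expected raw score from the demographic variables and standardize the residual. The sketch below illustrates that general idea only. It assumes a linear model with normally distributed residuals, and every coefficient in it is a placeholder invented for illustration; none of the numbers, names, or the model form are taken from our normative equations or from the online calculator.

```python
from dataclasses import dataclass, field
from statistics import NormalDist

@dataclass
class NormModel:
    """Hypothetical linear norm model; all values are placeholders, not study results."""
    intercept: float = 30.0
    age_slope: float = -0.15          # change in expected score per year of age
    female_effect: float = 1.5        # added to the expected score for females
    education_effect: dict = field(
        default_factory=lambda: {"C": 0.0, "S": 1.0, "T": 2.0}  # compulsory/secondary/tertiary
    )
    residual_sd: float = 3.0

    def expected(self, age: int, female: bool, education: str) -> float:
        """Expected raw score for a person with the given demographics."""
        return (
            self.intercept
            + self.age_slope * age
            + (self.female_effect if female else 0.0)
            + self.education_effect[education]
        )

    def score(self, raw: float, age: int, female: bool, education: str):
        """Return (z-score, percentile) for a raw score, assuming normal residuals."""
        z = (raw - self.expected(age, female, education)) / self.residual_sd
        return z, NormalDist().cdf(z) * 100

model = NormModel()
z, pct = model.score(raw=18, age=90, female=False, education="C")
print(f"z = {z:.2f}, percentile = {pct:.0f}")
```

The point of the design is that the same raw score yields different z-scores for different demographic profiles, because the expected score shifts with age, sex, and education.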
ACKNOWLEDGMENTS
Data for this study were obtained from the Trøndelag Health Study, fourth wave (HUNT4 70+) and the HUNT4 Trondheim 70+. The Trøndelag Health Study (HUNT) is a collaboration between HUNT Research Center (Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology NTNU), Trøndelag County Council, Central Norway Regional Health Authority, and the Norwegian Institute of Public Health.
The study was commissioned by the Norwegian Directorate of Health and financed by the Norwegian Health Association. There were no restrictions in regard to the research conduct.
