Abstract
Background:
Language impairment in Alzheimer’s disease (AD) has been widely studied, but due to limited data availability, relatively few studies have focused on longitudinal change in the language of individuals who later develop AD. Significant differences in speech have previously been found by comparing the press conference transcripts of President Bush and President Reagan, who was later diagnosed with AD.
Objective:
In the current study, we explored whether the patterns previously established in the single AD-healthy control (HC) participant pair apply to a larger group of individuals who later receive AD diagnosis.
Methods:
We replicated previous methods on two larger corpora of longitudinal spontaneous speech samples of public figures, consisting of 10 and 9 AD-HC participant pairs. As we failed to find generalizable patterns of language change using previous methodology, we proposed alternative methods for data analysis, investigating the benefits of using different language features and their change with age, and compiling the single features into aggregate scores.
Results:
The single features that showed the strongest results were moving average type:token ratio (MATTR) and pronoun-related features. The aggregate scores performed better than the single features, with lexical diversity capturing a similar change in two-thirds of the participants.
Conclusion:
Capturing universal patterns of language change prior to AD can be challenging, but the decline in lexical diversity and changes in MATTR and pronoun-related features act as promising measures that reflect the cognitive changes in many participants.
INTRODUCTION
Changes in language and speech have been established as one of the earliest manifestations of cognitive decline in Alzheimer’s disease (AD) [1–3], with AD sufferers experiencing word finding difficulties [4–6], impoverished lexical diversity [7–9], and changes in syntax [10, 11], semantics and discourse [5, 12–14], and fluency and acoustics [15–18].
As the language-based classification tasks between the patients already diagnosed with AD and the healthy population have achieved high accuracy [19], current research is primarily focused on understanding the earliest changes in language that may signal AD and could contribute to developing a non-invasive tool that could aid disease prediction and early detection [20].
Several studies suggest that language features could act as biomarkers for the early detection of the disease [10, 21–28]. Detecting dementia early would allow for timely interventions that could slow disease progress (for example, a recent study on lecanemab shows some slowing of cognitive decline [29]) and lessen the financial and emotional burden of the patients and their families [1, 31]. Considering the aging population, research into early markers is becoming increasingly relevant [2, 31].
Although changes in language have been seen as promising indicators for early detection, and a reflection of the regression of cognitive abilities, fewer studies have focused on the earliest changes in language compared to the later stages [1, 32]. Similarly, longitudinal changes in spontaneous speech have received relatively little attention [31, 32]. The available longitudinal datasets (such as ADReSSo Challenge corpus [32] and Pitt corpus [32]) mostly span a short time period and are based on picture description or semantic verbal fluency. While these datasets are extremely valuable and allow for controlling the content of the speech data by using structured speech tasks, collecting spontaneous speech data that spans over a longer period of time and that is not based on visual cues also has its advantages. For example, it has been proposed that the earliest manifestations of cognitive decline can appear years or even decades before the diagnosis [34–36]; having a longitudinal corpus that consists of speech data from several decades can contribute to detecting these earliest pre-diagnostic changes. Using spontaneous conversational speech also allows data collection in a more naturalistic setting, reflecting the everyday challenges that the AD sufferers face in communicative situations [37], while causing less stress for the participants which is often experienced during cognitive testing [31, 38]. Although collecting spontaneous speech data in an everyday situation is becoming more feasible as speech can easily be gathered using mobile devices, and the process requires minimal instructions and equipment [32, 38–40], there is a lack of longitudinal datasets spanning over a longer time period as these are time consuming and expensive to collect [3, 39].
Previous studies have tackled the issue of limited availability of longitudinal language data by comparing the writings of authors, some of whom eventually developed AD [41], or the transcripts of press conferences [2, 42]. Comparing the writings of Iris Murdoch, Agatha Christie, and Phyllis Dorothy James, Le and colleagues [41] found that Murdoch showed signs of impoverished vocabulary and syntax in her novels associated with her later dementia diagnosis and proposed that, based on the change in Christie’s writings, she too was likely to have suffered from dementia. Berisha and colleagues [42] showed that President Reagan, who was later diagnosed with AD, used fewer unique words and more low imageability verbs than President Bush, and that the number of unique words, fillers and nonspecific words changed significantly over time in President Reagan’s press conferences, but not in Bush’s, suggesting that the early signs of dementia were evident in Reagan’s speech prior to the diagnosis. Fang and colleagues [2] found differences in the length of sentences, unique, non-specific, and special words, and the ratio of depth to width of the sentences’ parsing tree when comparing the news conference transcripts of Reagan and Bush.
While these studies provide an insight into the individual cases of language changes in dementia, they still represent small, specialized samples, and the need to identify the changes that are generalizable and representative of many cases remains [37, 43].
To tackle this issue, we introduce two independent corpora that consist of the transcripts of interviews with public figures, half of whom eventually developed AD. The interviews span several decades in both corpora and include 10 and 9 AD-healthy participant pairs. While we acknowledge that there have been recent developments in detecting AD from acoustic speech signals, such as using the extraction of higher order spectra features [44], non-linear features [45], and emotional factors [31], there are several reasons why we have focused only on the linguistic features in the current study. First, our dataset is based on YouTube videos with extensively varying audio quality, making the extraction of the acoustic features challenging. Second, sophisticated acoustic methods can be hard to interpret or relate back to clinically observable phenomena, which makes clinicians hesitant to use them. Third, this work aims to replicate the methods from [41, 42], who used written text and focused on linguistic features in their analysis.
The dataset based on the first corpus includes 34 linguistic features that have been identified as the most informative by previous research, and the dataset based on the second corpus consists of 300 extracted linguistic features from a wide range of language areas. We aim to understand the extent to which the longitudinal changes in language that manifest prior to AD diagnosis are generalizable in a group of individuals with AD, and which linguistic features, if any, show consistent patterns across the individuals who later in life receive AD diagnosis compared to the matched healthy controls. We first replicate the methods from Berisha and colleagues [42], and then explore several different approaches to improve the generalizability of the patterns of language change: using an alternative set of language features, age correlations, and compiling single features into aggregate scores.
MATERIALS AND METHODS
We used two separate corpora of interview transcripts, featuring public figures who eventually develop AD, and their paired controls. We first replicated the methods from Berisha and colleagues [42], and second, proposed alternative approaches for analyzing longitudinal speech data. The two corpora were collected independently, first by one of the authors of the paper (UP), and second by Winterlight Labs, the organization of another author of this paper (JR). As the corpora were collected separately by different researchers at different times, the methods used to gather, process, and analyze the data vary. We carried out the same experiments on both corpora to maximize the validity of the results. The methods used to construct the two corpora as well as the design of the current study are explained in the next sections.
Participants
We used two separate corpora consisting of speech samples from famous people who eventually developed AD, and paired controls. The recordings were completed over several decades (on average, samples from 4 different decades of the participant’s life were included). All data were publicly available on YouTube. The two corpora had 5 overlapping AD participants, resulting in the use of some of the same speech samples. See an overview of participants’ demographic information in Supplementary Table 1.
Corpus 1
Corpus 1 included 10 individuals with a later AD diagnosis, 5 male and 5 female, and their paired controls. The participant selection process followed two steps: first, public figures who have died of AD were identified using internet search and Wikipedia pages such as “Deaths from Alzheimer’s Disease”; second, the availability of public speech data in English was considered for each participant, and those with more available data were selected. The control participant for each individual was matched based on demographic data to account for a similar background and reduce the number of factors affecting language use. We matched for the year of birth (within a 5-year range), place of residence or growing up, and occupation where possible. The collection of the data in Corpus 1 was approved by the University of Cambridge Humanities and Social Sciences Research Ethics Committee.
Corpus 2
Corpus 2 included 9 individuals and their matched controls (paired based on the same criteria as in Corpus 1). The individuals were identified through internet search, and clips of them speaking (e.g., interviews, public appearances, press conferences) that were found on YouTube were used as a data source. Two participant pairs were female and seven were male. Corpus 2 was collected by Winterlight Labs.
Materials
Corpus 1
All available interviews from the included participants were manually transcribed. The final dataset consisted of the transcripts of 135 public YouTube interviews that were recorded over a period starting from 37 years before the diagnosis and up to 2 years after the diagnosis, and in the matched age-range of the control participants.
Corpus 2
Corpus 2 consisted of 405 manually transcribed interviews and monologues that were recorded over a period starting from 46 years before the diagnosis and up to 13 years after the diagnosis.
See Supplementary Fig. 1 for an overview of the distribution of available recordings over time in the two corpora.
Design
Corpus 1
A total of 34 linguistic features that had been identified as the most informative by previous literature were extracted from the transcripts. The motivation behind the feature selection was to investigate whether the features that have shown the clearest difference between the AD and the control group in prior research also produce measurable and generalizable patterns of language change in longitudinal speech data.
As some feature values were dependent on the transcript length (significant Pearson correlation between the feature value and the number of tokens), we capped the transcripts at an equal number of words when extracting the text-length-sensitive features, a method also used by Le and colleagues [41]. During the capping process, we established a separate word limit for each participant pair that would allow keeping the longer samples while also maintaining at least 3 samples per participant to allow investigating longitudinal change. As a result, the length of the capped transcripts differed between participant pairs. The capped transcripts were used only for calculating the text-length-sensitive features, and they allowed within-pair comparison because the HC and the AD participant samples in a pair were the same length.
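The capping step can be illustrated with a minimal Python sketch. The rule used here for choosing the word limit (the largest cap that still leaves each participant in the pair with at least three samples) is an assumption made for illustration, since the exact selection procedure is not specified above. The function names `pair_word_limit` and `cap_transcripts` and the example lengths are hypothetical.

```python
from typing import List

MIN_SAMPLES = 3  # minimum number of samples kept per participant (from the text)


def pair_word_limit(lengths_ad: List[int], lengths_hc: List[int]) -> int:
    """Largest word cap that keeps at least MIN_SAMPLES transcripts per participant.

    Assumed rule: take the length of each participant's MIN_SAMPLES-th longest
    transcript and use the smaller of the two values as the pair's word limit.
    """
    if min(len(lengths_ad), len(lengths_hc)) < MIN_SAMPLES:
        raise ValueError("each participant needs at least MIN_SAMPLES transcripts")
    kth_ad = sorted(lengths_ad, reverse=True)[MIN_SAMPLES - 1]
    kth_hc = sorted(lengths_hc, reverse=True)[MIN_SAMPLES - 1]
    return min(kth_ad, kth_hc)


def cap_transcripts(transcripts: List[List[str]], limit: int) -> List[List[str]]:
    """Drop transcripts shorter than the limit and truncate the rest to the limit."""
    return [tokens[:limit] for tokens in transcripts if len(tokens) >= limit]


# Hypothetical token counts for one AD-HC pair.
ad_lengths = [1200, 800, 650, 300]
hc_lengths = [900, 700, 500, 450, 200]
limit = pair_word_limit(ad_lengths, hc_lengths)

ad_capped = cap_transcripts([["w"] * n for n in ad_lengths], limit)
hc_capped = cap_transcripts([["w"] * n for n in hc_lengths], limit)
```

In this example both participants keep exactly three transcripts of equal length, which mirrors the situation described for many pairs in Corpus 1, where the capped features rest on very few data points.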
See Table 1 for details of the extracted features, including the studies that have previously reported these features to be informative, and the information about whether the feature was text-length-sensitive and therefore extracted using the capped transcripts.
Features extracted from Corpus 1
Corpus 2
300 linguistic features were extracted from the speech samples, including lexical, syntactic, and semantic features, such as the proportions of different parts-of-speech (POS) tags, vocabulary richness statistics, syntax tree features, coherence features measured using cosine distances, and sentiment scores. The length of the speech samples in this corpus was not limited.
Procedure
We conducted 5 experiments. The first two experiments replicated the methods from Berisha and colleagues (2015) [42] on two larger datasets to understand whether their findings are generalizable to a larger group of people who develop AD. The following three experiments proposed alternative ways of data analysis by looking at different language features, age instead of transcript index, and compiling single language features into aggregate scores.
Experiment 1
The aim of this experiment was to investigate whether the AD and HC groups show differences in the use of the language features reported by Berisha and colleagues [42], as well as to find out how many participant pairs independently show differences between the AD and the HC participant. We first conducted the independent t-tests in both datasets using all the available samples to compare the performance of the AD and HC groups in the features similar to unique words, low imageability verbs, fillers, and non-specific words that were used by Berisha and colleagues. In Corpus 1, these features were type:token ratio (TTR), frequent verbs, fillers, and indefinite nouns and pronouns respectively. In Corpus 2, these features were TTR and moving average TTR (MATTR), light verbs, interjections, and pronouns respectively. Then, we looked at the differences at the participant pair level.
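As a rough illustration of the two lexical diversity measures named above, the following Python sketch computes TTR and MATTR on a list of tokens. It assumes the tokens are already lowercased and tokenized, and the fallback to plain TTR for texts shorter than the window is an assumption rather than part of the described pipeline.

```python
from typing import List


def ttr(tokens: List[str]) -> float:
    """Type:token ratio: number of unique word forms divided by number of words."""
    if not tokens:
        raise ValueError("cannot compute TTR of an empty transcript")
    return len(set(tokens)) / len(tokens)


def mattr(tokens: List[str], window: int = 20) -> float:
    """Moving average TTR: mean TTR over every run of `window` consecutive tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(tokens) < window:
        # Assumption: fall back to plain TTR when the text is shorter than the window.
        return ttr(tokens)
    scores = [ttr(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)


# Repeating the same three words makes the text longer without adding vocabulary.
short_text = ["the", "cat", "sat"]
long_text = short_text * 10

ttr_short, ttr_long = ttr(short_text), ttr(long_text)
mattr_short, mattr_long = mattr(short_text, window=3), mattr(long_text, window=3)
```

In this toy example TTR falls from 1.0 to 0.1 as the text grows, while MATTR with a 3-word window stays at 1.0, which is the text-length sensitivity that motivates using a moving window.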
Experiment 2
In the second experiment, we replicated Berisha and colleagues’ approach of analyzing the Pearson correlation between the transcript index (assigned based on the order in which the samples were recorded) and the feature value, using the same language features as in Experiment 1 and both group and single participant pair analyses. The aim of this experiment was to explore the patterns of change in the language features over time.
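The transcript index correlation can be sketched in Python as follows. This is a minimal standard-library illustration, not the analysis code used in the study: the helper names and example values are hypothetical, and the p-value is left out because it requires a t distribution with n - 2 degrees of freedom.

```python
import math
from typing import List, Tuple


def pearson_r(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length with at least 3 points")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r: float, n: int) -> float:
    """t value for testing r against zero, with n - 2 degrees of freedom."""
    denom = 1.0 - r * r
    if denom <= 0:
        return math.copysign(math.inf, r)  # perfect correlation
    return r * math.sqrt((n - 2) / denom)


def index_correlation(recording_years: List[int],
                      feature_values: List[float]) -> Tuple[float, float]:
    """Correlate a feature with transcript index (1 = earliest recording)."""
    order = sorted(range(len(recording_years)), key=lambda i: recording_years[i])
    index = [float(k) for k in range(1, len(order) + 1)]
    values = [feature_values[i] for i in order]
    r = pearson_r(index, values)
    return r, t_statistic(r, len(index))


# Hypothetical MATTR values for one participant, listed in arbitrary order.
r, t = index_correlation([1990, 1980, 2000, 1985, 1995],
                         [0.71, 0.78, 0.66, 0.74, 0.70])
```

With SciPy available, `scipy.stats.pearsonr(index, values)` returns the coefficient together with a two-sided p-value. Note that with only three observations the test has a single degree of freedom, so such correlations can only serve as an indication.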
Experiment 3
In Experiment 3, we used the same methodology as in Experiment 2, but introduced new language features that have been identified as informative in previous studies: average word length, noun:pronoun ratio, particle frequency, noun frequency, word frequency and constituency average depth. The aim was to understand whether using different language features improves the generalizability of the patterns of language change.
Experiment 4
In Experiment 4, we investigated the Pearson correlations between participants’ age and the values of the language features that had been used in the previous experiments. The aim was to investigate whether using participants’ age instead of transcript index improves the results by accounting for the different time intervals of the recordings.
Experiment 5
In Experiment 5, we introduced an approach of compiling single language features into aggregate scores based on the previous literature. The aim of this experiment was to tackle the issue of single language features failing to show highly generalizable patterns of language change across the participants, and to understand whether there is a benefit to using a sum score of a group of features relating to a certain type of impairment instead of single language features. Using aggregate scores also improves the interpretability and accessibility of the results as the single language features are often technical, specific, and difficult to interpret. The following five aggregate scores were computed: 1) lexical diversity, 2) word finding difficulty, 3) discourse, 4) syntactic, and 5) POS scores. The lexical diversity aggregate consisted of vocabulary richness indices (Brunet index [55], Honoré statistic [56]), pronoun proportions, TTR features, hapax legomena, indefinite nouns, total number of words, speech rate, and maximum utterance length (based on [2, 58]). The word finding difficulty aggregate score consisted of the proportion of function and open class words, fillers, repetitions and stutters, pronouns, indefinite nouns, word frequency, filled and unfilled pauses, and their ratio to words, and hesitations (based on [4, 59–62]). The syntactic aggregate consisted of utterance length and constituency features (based on [47, 63]). The discourse aggregate score consisted of local coherence features calculated using cosine distances, TTR features, adpositions, pronouns, conjunctions, hapax legomena, indefinite nouns and open:closed class ratio (based on [13, 58]). The parts-of-speech aggregate score consisted of the proportions and ratios of nouns, pronouns, verbs, and adjectives (based on [4, 57]).
To calculate the aggregate scores, all single features belonging to the aggregate were z-scored, the polarity of the features was considered, and the average of the polarized z-scores was used as the aggregate score. The syntactic aggregate score was not constructed for Corpus 1 due to the nature of the data and the extracted features.
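The aggregation procedure can be sketched in Python as below. Several details are assumptions made for illustration: the use of the sample standard deviation, z-scoring over whatever set of transcripts is passed in, and the feature names and polarity values in the example, which are hypothetical and not the study's actual feature set.

```python
from statistics import mean, stdev
from typing import Dict, List


def z_scores(values: List[float]) -> List[float]:
    """Standardize values to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    if s == 0:
        return [0.0] * len(values)  # a constant feature carries no information
    return [(v - m) / s for v in values]


def aggregate_scores(features: Dict[str, List[float]],
                     polarity: Dict[str, int]) -> List[float]:
    """One aggregate score per transcript: the mean of polarity-aligned z-scores.

    `features` maps a feature name to its values, one per transcript and in the
    same transcript order for every feature. `polarity` is +1 when a higher
    feature value means a higher aggregate score and -1 when it means a lower one.
    """
    aligned = [
        [polarity[name] * z for z in z_scores(values)]
        for name, values in features.items()
    ]
    return [mean(column) for column in zip(*aligned)]


# Hypothetical lexical diversity aggregate over three transcripts.
features = {
    "mattr_20": [0.80, 0.70, 0.60],       # higher value = more diverse
    "pronoun_ratio": [0.10, 0.20, 0.30],  # higher value = less diverse (assumed)
}
polarity = {"mattr_20": +1, "pronoun_ratio": -1}
lexical_diversity = aggregate_scores(features, polarity)
```

Aligning the polarity before averaging ensures that features pointing in opposite raw directions, such as a falling MATTR and a rising pronoun proportion, reinforce rather than cancel each other in the aggregate.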
We compared the AD and HC groups as well as the independent participant pairs in both corpora using independent t-tests. Transcript index and age correlations with aggregate scores to analyze longitudinal change were conducted only in Corpus 2, as in Corpus 1, some of the features in each aggregate were text-length-sensitive and therefore calculated based on capped transcripts, resulting in often having only three data points per participant for which the aggregate score was available.
RESULTS
Experiment 1
In this experiment we replicated the independent t-tests from [42], focusing on the features relating to unique words, low imageability (LI) verbs, fillers, and nonspecific words. We conducted t-tests at the group level, comparing all samples from the AD and HC groups, and at the participant pair level, comparing the samples of the matched individuals in each pair. To follow the methods from [42], we did not lower the significance threshold when comparing results at the participant pair level, although multiple pairs were tested. Therefore, the potential occurrence of false positives must be acknowledged. Details of results are given in Table 2.
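To give a sense of the scale of the multiple-comparison issue, the short Python sketch below computes the chance of at least one false positive when one feature is tested in every participant pair at an uncorrected threshold. It assumes a threshold of 0.05, that the tests are independent, and that no pair truly differs; these are simplifying assumptions for illustration only.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Expected number of false positives when every null hypothesis is true."""
    return alpha * n_tests


ALPHA = 0.05  # assumed uncorrected significance threshold

fwer_9_pairs = familywise_error_rate(ALPHA, 9)    # 9 pairs, as in Corpus 2
fwer_10_pairs = familywise_error_rate(ALPHA, 10)  # 10 pairs, as in Corpus 1
bonferroni_9 = ALPHA / 9                          # a stricter per-test threshold
```

Under these assumptions the chance of at least one spurious pair-level difference per feature is roughly 0.37 with 9 pairs and 0.40 with 10 pairs, so a count of 0 or 1 significant pairs is difficult to distinguish from chance.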
Independent samples t-test between AD and HC participants
AD, Alzheimer’s disease; HC, healthy controls; SD, standard deviation; dataset B, Berisha et al. (2015) results comparison; TTR, type:token ratio; MATTR_20, moving average type:token ratio in 20-word window; LI, low imageability.
We found no significant differences between the HC and AD participant groups in either corpus when looking at unique words, measured in TTR and MATTR features, or LI verbs, measured in frequent and light verbs.
When looking at the participant pairs separately, we found that 3 out of 9 AD participants differed significantly in the expected direction from the matched HC participants in the MATTR values in Corpus 2 (MATTR was measured in 10-, 20-, 30-, 40-, and 50-word windows, but as all MATTR features behaved similarly, we only report the findings of the 20-word window in Table 2 to avoid repetition). For TTR and LI verb features, 0 or 1 participant pair differed in the expected direction in both corpora.
While the use of fillers and nonspecific words did not differ significantly in [42], both datasets in the current study showed significant differences when comparing the AD and HC groups in fillers and pronouns, but not in indefinite nouns. However, fillers in Dataset 1 differed in the unexpected direction in the group comparison and no differences were found in the participant pair comparison, and only 1 pair differed significantly in the expected direction in Dataset 2.
The use of pronouns was significantly higher in the AD group in both datasets, with 2 and 3 participant pairs showing expected patterns in Dataset 1 and 2 respectively.
Experiment 2
In Experiment 2, we replicated the transcript index and feature value correlations from [42] using Pearson correlation. We compared the number of AD and HC participants that showed significant correlations.
While the use of unique words differed significantly in the AD participant in [42], TTR values only differed in the expected direction in 1 or 2 AD participants in the current datasets, and in 1 HC participant in both datasets. However, MATTR values showed promising results with more than half of the AD participants showing significant correlation between the feature value and transcript index in Dataset 2 (as MATTR features behave similarly, only the 20-word window results are presented in Table 3 to avoid repetition).
Transcript index correlations between AD and HC participants
AD, Alzheimer’s disease; HC, healthy controls; dataset B, Berisha et al. (2015) results comparison; TTR, type:token ratio; MATTR_20, moving average type:token ratio in 20-word window; LI, low imageability.
In line with [42], features related to LI verbs did not show significant correlation in the expected direction with transcript index in the AD participants. While fillers showed significant change with transcript index in the single tested participant pair in [42], only one AD participant differed in the expected direction in the two larger corpora of the current study.
From the nonspecific word category, pronoun use changed significantly in the expected direction in 3/9 participants in Dataset 2. See Table 3 for details.
When interpreting these results, it must be kept in mind that the participant-level correlations of the text-length-sensitive features (shown in Table 1) in Dataset 1 can be based on very few (minimum 3) observations, and only serve as an indication. We have not included the r value for these features in Table 3 as due to the low number of observations, the correlation can be exaggerated.
Experiment 3
As the features used in Experiments 1 and 2 did not show consistent patterns across all pairs, in this experiment we looked at an alternative set of features that have been informative in previous studies [64] (average word length, noun:pronoun ratio, particle frequency, noun frequency, word frequency and constituency average depth) and tested their correlation with the transcript index. The results were as promising as those of the previous experiments, with the strongest results emerging in the ratio of pronouns to the sum of pronouns and nouns in Dataset 2, where 4 out of 9 AD participants changed significantly in the expected direction. In Dataset 2, 3 out of 9 AD participants showed significant correlation in the expected direction between the transcript index and the average word length, word frequency and the constituency average depth. See Table 4 for details.
Alternative features correlating with transcript index in AD and HC
AD, Alzheimer’s disease; HC, healthy control.
As the average word length and noun:pronoun ratio in Dataset 1 were text-length-sensitive, resulting in very few observations on participant-pair-level due to the capped transcripts, the r values of these correlations have not been included in the table, and the number of correlations in these two features should be treated with caution.
Experiment 4
To account for the fact that the interviews were not recorded with consistent time intervals, in this experiment we used participants’ age instead of transcript index to measure Pearson correlation with the features from previous experiments. However, no significant improvements in generalizability were achieved. See Table 5 for details.
All features correlating with participants’ age in AD and HC
AD, Alzheimer’s disease; HC, healthy control; TTR, type:token ratio; MATTR_20, moving average type:token ratio in 20-word window.
As the correlations for TTR, fillers, pronouns, average word length, noun:pronoun ratio and noun frequency were calculated on very few data points in Dataset 1 due to the text-length sensitivity of these features, the r values for these features have not been included in Table 5, and the number of correlations for these features in this dataset should be interpreted with caution.
Experiment 5
To address the issue of generalizability demonstrated by the single features, we compiled aggregate scores of lexical diversity, word finding difficulty, discourse building, syntactic complexity, and POS-related features in this experiment.
First, we looked at the differences between the AD and HC groups, and the single participant pairs using independent t-tests as we did in Experiment 1 with single features. We found that word finding difficulty scores differed significantly between the AD and the HC group in both datasets; however, only Dataset 2 showed greater word finding difficulty in the AD group, and this tendency did not carry over to the individual participant pair level. A maximum of 2 participant pairs per dataset showed significant difference in the expected direction across aggregate scores (lexical diversity, discourse, and POS aggregates). See Table 6 for details.
Independent t-test between AD and HC participants using aggregate scores
AD, Alzheimer’s disease; HC, healthy control; SD, standard deviation; POS, part-of-speech.
Second, we looked at Pearson correlations between the aggregate scores, and age and transcript index in the participant pairs in Dataset 2. Dataset 1 was excluded from this experiment due to the limited data availability because of transcript capping.
We found that lexical diversity scores correlated with age and transcript index in the majority of the AD participants. While the number of AD participants that showed a significant correlation between the aggregate score and both transcript index and age was higher in all aggregates compared to the number of HC participants, it did not exceed half the group for any other aggregate score, with between 1 and 4 significant correlations. See Table 7 for details.
Transcript index and age correlations with aggregate scores
AD, Alzheimer’s disease; HC, healthy control; POS, part-of-speech.
DISCUSSION
The current study focused on the generalizability of longitudinal language change in AD. We started by replicating the methods from [42], who compared the speech of President Bush (HC) and President Reagan (AD) looking at unique words, LI verbs, fillers, and nonspecific words. Instead of using just one participant pair, we included two similar corpora of spontaneous speech recordings with public figures, consisting of 10 and 9 AD-HC participant pairs. As we could not replicate the results of [42] or find other generalizable patterns using these methods, we proposed three alternative approaches to data analysis: 1) using different single features, 2) using age instead of transcript index, and 3) compiling single features into aggregate scores. While the universal patterns of language change representative of many cases were challenging to capture, MATTR, pronoun-related features, and the lexical diversity aggregate score showed the most promising results.
The flowchart in Fig. 1 presents an overview of the methods and results of the study.

Flowchart of the overview of the study. AD, Alzheimer’s disease; MATTR, moving average type:token ratio.
While [42] found that the use of unique words and LI verbs was significantly different between the two participants, we failed to replicate these results in larger groups in either corpus. Looking at the number of individual pairs where the AD and the HC participant’s performance differed significantly in the expected direction, moving average type:token ratio in a 20-word window (MATTR_20) in Dataset 2 performed best, but still only 3 out of 9 participant pairs showed the pattern captured by [2, 42], suggesting that the results are generalizable to one third of the participants. When considering the potential effect of multiple comparisons and the chance of false positives, this number could be even smaller.
While [42] did not find a significant difference in the use of non-specific words and fillers when comparing the speech of Bush and Reagan, [2] reported a difference in the use of non-specific words. The current study found significant group differences in the use of fillers and pronouns in both datasets, but the use of fillers differed in opposite directions in the two datasets. The number of individual pairs with significant differences in the expected direction in these features was the highest in the pronoun category in Dataset 2, with one third of the participants with AD using significantly more pronouns than their matched controls. These findings suggest that although group differences between the HC and the AD participants may appear, the differences are often not seen on individual participant pair level in more than half of the pairs. These findings support the critique by [65], who argues that the statistically significant mean differences between the groups of individuals with AD and healthy controls should not be looked at as representative of the individual participants, as the standard deviations often overlap. [65] argues that there is no “typical” AD sufferer and that focusing on the average AD participant’s performance can overlook the actual abilities of the individual.
Next, we replicated the transcript index and language feature correlations from [42] who found that the use of unique words, fillers, and nonspecific words changes significantly over time in Reagan’s, but not in Bush’s speech. Looking at the individual pairs, we failed to replicate these results in the majority of the pairs. The feature showing the most promising results was the MATTR_20, changing in 5 out of 9 AD participants in Corpus 2. However, this feature also changed similarly in two healthy participants. MATTR features were proposed by [66] to overcome the text-length-sensitivity of TTR.
Based on Experiment 1 and 2, the findings of the current study failed to replicate the results of the previous study, suggesting that either a) change in speech in AD is heterogeneous, or b) different approaches to capture the change are needed. Next, we explored three alternative approaches: 1) using different language features, 2) using age instead of transcript index to analyze longitudinal change, and 3) grouping the features into aggregate scores instead of using single features.
Switching to different single language features known as informative from previous studies or using age instead of transcript index to track longitudinal change did not improve the universality of the patterns of language change compared to the methods used in Experiments 1 and 2, with less than half of the AD participants showing statistically significant results. The best performing feature in these experiments was the number of pronouns divided by the sum of nouns and pronouns in Dataset 2 where 4 out of 9 AD and 2 out of 9 HC participants changed in the expected direction. The lack of generalizability in language patterns across participants is also addressed in [65] who stresses that there is a great variability from one AD sufferer to another, resulting from the differences in personal history, pathological process and social relationships potentially affecting the individual markers of cognitive abilities.
Because single language features did not consistently show generalizable changes in the speech of the AD participants, we compiled the single features into aggregate scores based on the literature on known language dysfunctions in AD and their manifestations in speech. We focused on lexical diversity, word finding difficulty, discourse building, syntactic complexity, and POS-related aggregates. We found that only word finding difficulty showed significant differences between the AD and the HC groups in both datasets, and even then, the direction of the change was inconsistent. The correlations between the transcript index or age and the aggregate scores were more promising, with all aggregates changing in the expected and consistent direction in a larger number of AD than HC participants. The lexical diversity aggregate score showed the best results, with two-thirds of the AD participants declining significantly over time.
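One simple way to compile single features into an aggregate score is sketched below. It is not necessarily the procedure used in the study: it assumes that each feature is standardized (z-scored) across a participant's transcripts, oriented by an expected direction of change, and summed with equal weights. The feature names and values are hypothetical.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to mean 0 and unit (population) standard deviation."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        return [0.0] * len(values)  # a constant feature carries no information
    return [(v - m) / s for v in values]


def aggregate_score(features, directions):
    """Sum oriented z-scores per transcript.

    features:   dict mapping feature name -> list of per-transcript values
    directions: dict mapping feature name -> +1 or -1, so that a higher
                aggregate always means the same thing for every feature
    """
    lengths = {len(v) for v in features.values()}
    if len(lengths) != 1:
        raise ValueError("all features need one value per transcript")
    total = [0.0] * lengths.pop()
    for name, values in features.items():
        for i, z in enumerate(zscores(values)):
            total[i] += directions[name] * z
    return total


# Made-up values for three transcripts of one hypothetical participant.
features = {"feature_a": [1.0, 2.0, 3.0], "feature_b": [3.0, 2.0, 1.0]}
directions = {"feature_a": 1, "feature_b": -1}
print(aggregate_score(features, directions))
```

Standardizing before summing keeps a feature with a large numeric range from dominating the aggregate, and the resulting per-transcript scores can then be correlated with transcript index or age in the same way as a single feature.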
One of the limitations of this study is the uncontrolled nature of the speech data. While the included samples consisted of free, naturalistic speech, offering a more realistic reflection of the participants’ condition, speech content and the potential scriptedness of interviews or speeches could not be controlled for as the samples were pre-recorded. One way to control the speech content in the future is to use tasks like picture description or to conduct structured interviews. However, this approach would compromise the naturalistic nature of the speech data and its resemblance to an everyday conversational situation. Structured speech tasks are also known to cause more stress for participants with dementia. Similarly, as the data were collected from YouTube videos, it was not possible to include the participants’ medical history and clinical characteristics, or to control for potential confounders. The recordings were also made at irregular time intervals, contributing to the data sparsity issue illustrated in Supplementary Figure 1. The time period covered in the current study was also notably longer than in the original study, potentially decreasing the comparability of the results. However, the longer timespan can also be seen as an improvement, as it allows investigating long-term changes in speech. The recordings also varied in audio quality due to differences in set-up and the technology available at the time, as some of the samples were recorded decades ago. This is one of the reasons we did not include acoustic features in the analysis and, like the study we were replicating, focused only on transcript-based linguistic features, despite recent developments and promising classification accuracy in studies using acoustic speech signal analysis [31, 45].
While acoustic features have been promising, they are often more difficult to understand and raise the question of interpretability in a clinical setting. Another limitation was the sample size: although we included more participants than the original study, the datasets were still relatively small, with 10 and 9 AD-HC participant pairs in the two corpora. The length of the transcripts also differed between the current and the original study; while [42] used 1400-word transcripts, the transcript lengths in the current study varied and were mostly shorter due to data availability. Although the data in both corpora were challenging and at times noisy, the findings were consistent across the two corpora, which supports the reliability of the results.
In the future, a controlled longitudinal dataset with consistent time intervals, length, and audio quality could be collected. Larger and higher-quality datasets could help detect more universal patterns of language change in people who eventually develop AD, which would aid the development of a non-invasive, cheap, and accessible language-based screening tool.
As the current study found little generalizability in the patterns of language change in individuals with AD, different analysis methods for longitudinal change could also be developed. The current study suggests that significant group differences in language features are often not evident at the level of individual participant pairs, with the majority of AD participants not showing the difference suggested by group comparisons. Future studies should take this into account when interpreting results based on mean values, as the mean may not be representative of the individual cases. Similarly, future work could focus on participant-level analysis to understand language change in AD in more detail, and on finding more generalizable language features or feature groupings. In the current study, some aggregate scores showed promising results, and future studies could focus on developing comprehensive measures for tracking change in lexical diversity and word finding difficulty in spontaneous speech. Compiling the features into group scores could also be improved by a more detailed analysis of the importance of the individual features in the sum score calculation, for example, by placing more weight on MATTR_20, which showed the most promising results in the current study, when calculating the lexical diversity aggregate score. It would also be relevant to analyze how the longitudinal nature of the data affects the computation of aggregate scores and feature weights, for example, how long before the diagnosis the importance of a single feature becomes apparent.
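The suggestion of weighting individual features in the sum score can be illustrated with a small sketch. It assumes that the features have already been standardized and that the weights are chosen by the analyst; the weight of 2.0 for MATTR_20 is an arbitrary illustrative value, not a recommendation derived from the data, and the function name and the values are hypothetical.

```python
def weighted_aggregate(z_features, weights):
    """Weighted mean of already standardized features, per transcript.

    z_features: dict mapping feature name -> list of z-scored values
    weights:    dict mapping feature name -> non-negative weight
    """
    lengths = {len(v) for v in z_features.values()}
    if len(lengths) != 1:
        raise ValueError("all features need one value per transcript")
    weight_sum = sum(weights[name] for name in z_features)
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    n = lengths.pop()
    return [
        sum(weights[name] * values[i] for name, values in z_features.items())
        / weight_sum
        for i in range(n)
    ]


# Made-up standardized values for two transcripts (not data from the study).
z_features = {"MATTR_20": [1.0, -1.0], "TTR": [0.0, 2.0]}
weights = {"MATTR_20": 2.0, "TTR": 1.0}  # arbitrary: MATTR_20 counts double
print(weighted_aggregate(z_features, weights))
```

Dividing by the sum of the weights keeps the aggregate on the same scale regardless of how the weights are chosen, so that scores computed with different weighting schemes remain comparable.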
In sum, this paper contributes to understanding the long-term change in language use in individuals who develop AD later in life by using longitudinal spontaneous speech datasets spanning several decades. It investigates whether the longitudinal linguistic change previously identified in a single participant generalizes to the 10 and 9 participant pairs in the two corpora, or whether any other generalizable patterns can be detected. We highlight the issues of generalizability of language change by comparing the individual trajectories of the participants and show that language decline in AD is not homogeneous. We propose that different methods to detect more universal patterns should be developed, and demonstrate the potential of using aggregate scores, especially lexical diversity. We also show that, out of the single language features used, the most universal patterns of language decline are captured by MATTR and pronoun-related features. As little generalizability was found, this study encourages acknowledging that the manifestations of cognitive decline in language can vary from one individual to another. As classification approaches are already quite accurate, understanding the longitudinal change and the uniqueness of each individual’s language change is a crucial research direction: it contributes to developing tools for detecting the earliest signs of cognitive decline and for disease prediction and tracking that take individual differences into account.
All in all, while the change in language in AD seems to be heterogeneous across participants, the current study found that the most informative single features were MATTR and pronoun-related features. Aggregate scores captured change over time better than the single features, with lexical diversity scores showing the most promising results.
Footnotes
ACKNOWLEDGMENTS
We thank the employees of Winterlight Labs who helped with the data collection and processing in Corpus 2 of the current study.
FUNDING
This work was supported by the Economic and Social Research Council (ESRC) Cambridge Doctoral Training Partnership (DTP), grant number ES/P000738/1. For the purpose of open access, the author has applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising from this submission.
CONFLICT OF INTEREST
Jessica Robin is an employee of Winterlight Labs, Inc. Ulla Petti has previously been an intern at Winterlight Labs, Inc.
DATA AVAILABILITY
The data supporting the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.
