Abstract
Background:
Previous studies explored the use of noninvasive biomarkers of speech and language for the detection of mild cognitive impairment (MCI). Yet most of them employed a single task, which might not have adequately captured all aspects of participants’ cognitive functions.
Objective:
The present study aimed to achieve state-of-the-art accuracy in detecting individuals with MCI using multiple spoken tasks and to uncover task-specific contributions with a tentative interpretation of features.
Methods:
Fifty patients clinically diagnosed with MCI and 60 healthy controls completed three spoken tasks (picture description, semantic fluency, and sentence repetition), from which multidimensional features were extracted to train machine learning classifiers. With a late-fusion configuration, predictions from multiple tasks were combined and correlated with the participants’ cognitive ability assessed using the Montreal Cognitive Assessment (MoCA). Statistical analyses on pre-defined features were carried out to explore their association with the diagnosis.
Results:
The late-fusion configuration could effectively boost the final classification result (SVM: F1 = 0.95; RF: F1 = 0.96; LR: F1 = 0.93), outperforming each individual task classifier. Besides, the probability estimates of MCI were strongly correlated with the MoCA scores (SVM: –0.74; RF: –0.71; LR: –0.72).
Conclusion:
Each task tapped predominantly into distinct cognitive processes and made a specific contribution to the prediction of MCI. Specifically, the picture description task characterized communication at the discourse level, while the semantic fluency task was more specific to controlled lexical retrieval processes. With greater demands on working memory, the sentence repetition task uncovered memory deficits through modified speech patterns in the reproduced sentences.
INTRODUCTION
With the aging population worldwide, prevalence of Alzheimer’s disease (AD) keeps increasing. Yet, to date, an effective treatment for AD is still lacking. Over the years, researchers have attempted to mitigate the severe impact of AD through early and accurate diagnosis, in the hope of delaying its onset and progression. This endeavor germinated an active area of research—exploring the use of noninvasive biomarkers such as cues lying in speech and language to reliably identify the presence of mild cognitive impairment (MCI), an “intermediate state of cognitive function between the changes seen in aging and those fulfilling the criteria for dementia and often AD” [1].
Development of such a clinically feasible tool to detect MCI is justified by the fact that language is strongly linked to a wide range of cognitive functions, including working memory, attention, and executive control [2], which are telltale signs of one’s cognitive status. Given that “cognitive-linguistic function is a strong biomarker for neuropsychological healthy in many dimensions” [3], the production of spoken language can provide insight into broader aspects of cognitive status. In fact, previous studies have indicated that even though memory loss may seem to be a typical symptom of MCI, deficits in speech and language are apparent at this stage [4]. Earlier studies reported declining performance on naming and verbal fluency tasks in MCI individuals [5–8]. Recent studies examined deficits in connected speech production and observed alterations in all aspects of language production: phonetic-phonological, lexico-semantic, morpho-syntactic, and discourse-pragmatic. It is well documented that speech produced by patients with MCI is characterized by reduced productivity [9], limited lexical variety [10, 11], decreased syntactic complexity [12], and greater signs of disfluency [13]. The impairments have been qualitatively associated with the widespread atrophies distributed in the anterior temporal lobes, left temporoparietal junction, and frontal premotor circuits of the MCI brains, which are actively engaged during speech production [14]. To complement the findings based on Indo-European languages, a pioneering study examining Chinese patients with MCI revealed similar patterns in their connected speech, with decreasing linear trends in semantic contents and syntactic complexity, and increasing linear trends in disfluency [15].
As language impairment would in turn affect the spoken output, another Chinese group working on MCI-related language impairment suggested reduced utterance and protracted silence in connected speech as a possible biomarker of cognitive decline, which could well predict participants’ performance in neuropsychological tests [16].
A burgeoning stream of studies has attempted to translate these observations into simple and specific features and made use of machine learning techniques to distinguish MCI individuals from their healthy counterparts [17–19]. In a recent study by Tóth et al. [20], acoustic features including hesitation ratio, speech tempo, utterance length, and the number of silent and filled pauses were extracted from connected speech elicited by immediate and delayed recalls of two videoclips to differentiate between 48 clinically diagnosed MCI patients and 38 healthy controls. Based on the features, an F1-score of 78.8% using a Random Forest (RF) classifier was obtained. Asgari et al. [21] examined the speech content from the unstructured conversation in 14 MCI and 27 cognitively intact participants and constructed 68-dimensional feature vectors based on words’ occurrence in each of the Linguistic Inquiry and Word Count (LIWC) categories. Using LIWC-based linguistic features, they obtained an accuracy of 84% with a support vector machine (SVM) classifier. A recent study [22] elicited spontaneous (describing the previous day) and semi-spontaneous (immediate and delayed recall of a one-minute animated film) connected speech to differentiate across 25 patients with MCI, 25 with mild AD, and 25 healthy controls. The best performing MCI-versus-control classification (accuracy = 86%, F1-score = 85.7%) was obtained when combining temporal parameters, automatic acoustic and speech-related markers, and semantic linguistic features to train the classifier. Automatic extraction of information content in the picture description task was realized based on topic models trained on word embeddings [23]. Using features defined by the clustering and number of information content units to tell apart MCI from healthy individuals, Fraser and colleagues obtained classification accuracies up to 63% and 72% on English (19 MCI and 97 healthy controls) and Swedish (31 MCI and 132 healthy controls) corpora, respectively.
Themistocleous et al. [24] applied a series of fully connected deep neural networks with hidden layers ranging from one to 10 to learn the acoustic properties (including vowel duration, vowel formants, and fundamental frequency) of connected speech produced by 25 MCI patients and 30 healthy controls in a reading task. The best performing architecture reached an accuracy of 83%. To summarize, existing literature shows the state-of-the-art development of cognitive assessments through connected speech production and reveals the feasibility of using speech and language features to detect early symptoms of cognitive decline. Even though it is difficult to draw direct comparison across different studies, combining lexico-semantic features reflecting complex language production with their physical manifestation in speech signal may reinforce their strong points and enhance the performance of machine learning classifiers [22].
It should be noted that studies of this kind typically employed a single task or similar types of spoken tasks, which might not adequately capture all aspects of language function. Such limitations may become especially obvious when dealing with MCI, as patients with MCI do not exhibit specific patterns of language deficits, possibly due to the lack of a characteristic distribution of brain atrophy [14, 25–28]. This strongly contrasts with primary progressive aphasia (PPA), which has selective impairment of language and clear distinctions between subtypes [29]. A previous study [30] reported heterogeneous language decline in early-stage AD with more predictable patterns of decline in PPA. Hence, to accurately predict MCI status from speech and language, it is essential to reduce reliance on a single measurement.
In light of these findings, a number of studies employed more than one spoken task and concatenated speech and language features in an early-fusion approach to detect patients with MCI. For instance, König et al. [31] selected the optimal subset of vocal features extracted from speech recordings associated with four spoken tasks (countdown, picture description, sentence repetition, and semantic fluency), which were used to train a single SVM classifier. A cross-validation accuracy of 79% ± 5% was achieved when combining continuity-reflecting vocal features (e.g., duration of voice and silence segments and their relative proportion) in the countdown and picture description tasks. In a more recent study, König et al. [32] extended the study to differentiate across a larger group of participants (27 AD, 44 MCI, 56 healthy control, and 38 mixed type dementia) using up to nine spoken tasks (sentence repetition, picture naming of three animals together with a description of a photograph showing one of them in the natural environment, phonetic verbal fluency, semantic verbal fluency, countdown, story narration of a pleasant and an unpleasant event, as well as a spontaneous description of what happened yesterday) selected or modified from conventional test batteries. The optimal MCI-versus-control classification accuracy was 86% with continuity-reflecting vocal features selected only from connected speech production tasks.
The study by Fraser et al. [18] represented one of the very few which investigated how information from various sources could be optimally combined to achieve the best classification result. In the study, features across multiple modalities (audio, text, eye-tracking, and comprehension questions) were extracted from three different tasks (picture description, reading silently, and reading aloud), which were combined in several possible architectures: In the “early fusion” configuration, feature vectors were concatenated into a long vector to train a single classifier, while with the “late fusion” approach, separate classifiers were trained at the mode or task level and were further incorporated to produce the final prediction. Encouraging results were obtained in the task-fusion configuration (AUC = 0.88, accuracy = 0.83), which outperformed the one merging features at the early stage (AUC = 0.79, accuracy = 0.84). Besides, reading silently with eye-movement features had the most predictive ability compared with other tasks and modes. Fraser and colleagues’ study has several implications for advancement in this field of research. From an engineering perspective, it sees improvement in system performance with a modular architecture that can be easily generalized to incorporate different tasks and features. More importantly, since classification accuracies are available at all levels of analysis, it is possible to assess which task and types of features are more sensitive to predict cognitive status.
The present study is a follow-up work of our pilot study [15] which reported MCI-induced speech and language alterations in connected speech production. In that study, we examined the applicability of the widely-used language features to distinguish between Chinese speaking patients with MCI and their healthy counterparts. We also compared our results with studies examining English and other Indo-European language speaking population. In this follow-up work, we extended our investigation to the pathological speech and language patterns in different spoken tasks, attempting to link them with the decline of specific cognitive functions and discover the most salient features that could distinguish between MCI and cognitively intact individuals based on converging evidence across different tasks. Besides, following Fraser et al. [18], we implemented a cascaded approach to investigate the predictive power of each spoken task as well as their combination, as our ultimate goal is to design an assessment tool that could faithfully and accurately detect the early signs of cognitive decline.
The three tasks used in the present study were: 1) picture description, 2) semantic fluency, and 3) sentence repetition. Each measured a different aspect of speech production and was expected to tap into nonoverlapping cognitive processes [2, 33]. Specifically, the picture description task was used to elicit connected speech production, which approximated everyday communication. While not constrained to a certain linguistic level, this task had the natural advantage of providing a greater number of measurable dimensions, capturing deficits in lexical retrieval and in semantic and syntactic processing. Altogether, these features may reflect a net sum of changes among diverse cognitive functions. The semantic fluency task, on the other hand, is more specialized in assessing the controlled retrieval of lexical items. Unlike the picture description task, which allowed alternative ways to convey the same information, the semantic fluency task limited lexical retrieval to the single-word level and restricted word selection to a certain superordinate semantic category. This constraint places additional demands on executive function, as participants have to monitor and inhibit non-target words and keep track of previous responses while generating new ones. In addition, since there is no external visual cue to guide speech production, it may place different cognitive demands on participants compared with the picture description task [34, 35]. A sentence repetition task was also included here, which is highly reliant upon working memory and language comprehension. To successfully reproduce the sentence, participants had to memorize the segments of sentences, understand their meaning, and integrate them into a meaningful sequence. More importantly, compared with the other two tasks, the sentence repetition task was the only one that required memory recall during speech production, which might capture the memory deficits commonly found in patients with MCI.
In addition to the selection of spoken tasks, one more aspect differentiates our study from Fraser et al. [18]: all features extracted in the present study were based entirely on audio signals and transcriptions, which will enable us to develop portable diagnostic tools to facilitate universal screening in the future.
We tested the following hypotheses: 1) combining predictions from single-task classifiers would boost the final classification accuracy, as each task provides complementary information to depict the participant’s cognitive status; 2) the probability estimates would be correlated with cognitive test battery scores (i.e., Montreal Cognitive Assessment, MoCA), as our model captures multidimensional cognitive functions based on multiple spoken tasks.
In addition, to illustrate the task-specific contributions, we provided further analysis of the extracted features, which will not only facilitate the understanding of speech and language alterations associated with MCI but also inspire engineers towards building an automatic and knowledge-based diagnostic system.
METHODS
Participants
Approval by the Institutional Review Board of Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences was obtained before the study commenced. Informed consent was signed by each participant upon his/her arrival for speech data collection.
Fifty-four patients with MCI (35 F and 19 M) were recruited from the Memory Clinic of Beijing Tiantan Hospital, Capital Medical University. All individuals were diagnosed and confirmed by a neurologist who was blind to the study. Categorizing individuals as patients with MCI was based on an array of extensive assessments including cognitive test (MoCA-Beijing Version), neuropsychological tests, psychiatric evaluation and magnetic resonance imaging (MRI) of the brains. Details of criteria for inclusion and exclusion are provided in the Supplementary Material. Since categorization was based on comprehensive indices, there is no cut-off MoCA score for MCI patients.
A cohort of 60 cognitively intact elderly (45 F and 15 M) was recruited from local communities and served as healthy controls. They were required to undergo a series of screenings before data collection. Only those with a MoCA score over the education-corrected cutoff (26 out of 30 points) were included.
Other inclusion criteria for participants in both pathological and healthy control groups included: 1) they were native speakers of Mandarin Chinese, 2) they had received at least primary school education (6 years education), 3) they had not suffered from psychiatric/neurological disorders or head injuries, and had no history of taking neuroleptic medicine.
For subsequent analyses, only participants with audio recordings available in all three tasks (i.e., picture description, semantic fluency, and sentence repetition) were included, which led to the final corpus of 50 participants in the MCI group (31 F and 19 M) and 60 in the healthy control group. Demographic information, including age, gender, and education level, as well as the education-corrected MoCA score, is outlined in Table 1. However, due to the limited pool of participants, the two groups were not matched for education level, which can be viewed as a limitation of the study. To determine whether alterations of speech and language patterns resulted from the disease despite the presence of other possible factors such as literacy skills and education level, further analyses were carried out (see Section “Statistical analyses on pre-defined features”).
Demographic information and MoCA scores by group (mean and standard deviation)
*p < 0.05, **p < 0.01, ***p < 0.001. ¹MoCA scores of two MCI patients are not available.
Task design and procedures
All participants had their speech recorded during the execution of three spoken tasks (see Fig. 1), which were chosen based on previous studies using speech and language to detect early symptoms of MCI (e.g., [31]). Each task relied more heavily on distinct cognitive processes and provided a unique contribution to predicting cognitive status. During the experiment, speech samples produced by the participants were recorded in a noise-attenuated room using an external sound mixer connected to a professional microphone, which was placed 10 cm from the lips. The samples were digitized with a 16,000 Hz sampling rate and 16-bit/sample resolution. Instructions, prompts, and stimulus images were presented on the screen using a Python-based toolkit (https://www.psychopy.org/) [36].

Schematic diagram of the task-fusion configuration.
In the picture description task, connected speech was obtained by asking the participants to describe everything they saw in the picture with as many details as possible. The three stimulus images were widely used in studies of aphasic and other impaired speech, which are “Cookie Theft” picture description from the Boston Diagnostic Aphasia Examination [37], “Cat Rescue” used in [38], and “Picnic Scene” from the Western Aphasia Battery Revised [39]. Unlike neuropsychological tests where participants are allowed to describe the picture for as long as they prefer, a one-minute time constraint was in place, which would allow us to compare from a temporal perspective, thereby making feature extraction uniform and smooth. Besides, speaking under time pressure may increase the sensitivity of MCI detection. At the same time, we also noticed that the modification may make it difficult to compare our results with other studies. Yet, based on our preliminary findings [15, 40] that most participants were able to complete the picture description within one minute, the influence should be minimal. During speech production, participants were encouraged to continue if they produced very limited speech output. All instructions and interventions were included in the one-minute recording session.
In the semantic fluency task, participants were asked to produce as many words as they could within one minute for each of the three semantic categories: fruits, animals, and provinces in China. Before experimental trials, participants completed a two-trial block for practice, which was not used for further analysis. To avoid the potential memory effect, we selected two different semantic categories (vehicles and sports) for practice. Since participants were familiar with the task requirement after this two-trial block, no more instructions were given during the experimental trials.
In the sentence repetition task, participants were asked to repeat 18 sentences one at a time, of which the length increased from three to 25 Chinese characters (three to 15 words). The level of difficulty ranged from 1 to 5, depending on the number of words, semantic hierarchies, and dependency distance. Out of the 18 sentences, five were chosen from the Chinese Rehabilitation Research Center Standard Aphasia Examination (CRRCAE) [41], one from the Chinese version of Western Aphasia Battery (WAB) [39], and four from Aphasia Battery of Chinese (ABC) [42]. The remaining sentences were designed by two Chinese linguists in the laboratory. Each trial of the task proceeded as follows. The participant was presented with a sentence on the screen for a certain duration, which was determined by the level of difficulty. During this period of time, the participant read the sentence silently and needed to memorize it as well as possible. After its disappearance, a written verbal cue appeared, prompting the participant to repeat the sentence they saw earlier. The task was self-paced, with the examiner pressing a key to start the next trial. The participant completed a four-trial practice block with a different set of sentences before the experimental trials. In total, the three tasks lasted 20–25 minutes.
Manual transcription
Recorded audio files were segmented and further transcribed verbatim using TextGrid in Praat (http://www.fon.hum.uva.nl/praat/). The transcription strictly followed the preestablished guidelines. For the picture description task, narratives were first transcribed at utterance and sentence levels. Unrelated utterances (e.g., questions about the task) and non-verbal phenomena (e.g., coughing, throat clearing, and giggling) were annotated with special markers but were excluded from the current analyses. Specifically, lexical, syntactic, filled pause, and temporal features were extracted from utterances directly related to the picture stimuli, which could truly reflect participants’ performance. Filled pauses were re-annotated on a third tier following utterances and sentences. For the semantic fluency task, manual transcriptions were conducted in a similar way, producing parallel tiers at the utterance, key word (i.e., words of a certain semantic category), and filled pause levels. Speech samples associated with the sentence repetition task were transcribed according to the participant’s reproduction, maintaining elements that were inserted and substituted. In all cases, temporal information was aligned to the start and end of each segment.
Feature extraction
After the speech recordings were processed, features covering the task-specific manifestation of MCI in speech production were extracted; they are briefly described below. A compact list of features is provided in Table 2. The complete version with definitions of extracted features is reported in the Supplementary Material.
Summary of speech and language features (see the Supplementary Material for details)
Features extracted from connected speech
Following previous investigation of connected speech production, a total of 53 features corresponding to the lexical, syntactic, fluency, and temporal level of speech production was extracted from transcribed speech obtained from the picture description task. The levels were defined based on where the feature was situated, i.e., words, sentences, filled pauses, and the amount of time elapsed in each utterance.
Lexical features. The lexical features of a speech sample are concerned with the quantity and kinds of words people use to describe the picture. A simple approach is to count the number of words and unique words (vocabulary) in narratives. Words can further be classified into ‘open class’ words, through which content is conveyed, and ‘closed class’ words, serving as functional or grammatical particles. The lexical distribution of words is derived from the proportion of a specific part of speech (POS) (e.g., noun, verb, pronoun, quantifier, etc.) among all words produced, which may indicate difficulties in accessing certain classes of words. Lexical properties also include the richness of vocabulary in a speech sample, which can be assessed using the type-token ratio (TTR), Brunet’s index, and Honoré’s statistic. The frequency of words according to their norms in a large corpus is another metric that quantifies how informative a discourse is. Failing to access specific terms usually results in an overuse of high-frequency words.
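The three richness measures named above can be illustrated with a short sketch. It uses the commonly cited definitions (TTR = V/N, Brunet's W = N^(V^-0.165), Honoré's R = 100·log N / (1 − V1/V)); the exact variants used in this study are given only in the Supplementary Material, so the formulas, the function name `lexical_richness`, and the toy token list are assumptions for illustration.

```python
import math
from collections import Counter

def lexical_richness(tokens):
    """Return TTR, Brunet's index, and Honoré's statistic for a list of word tokens.

    Uses the textbook definitions; the study's own variants may differ.
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    n = len(tokens)                                  # N: total number of tokens
    counts = Counter(tokens)
    v = len(counts)                                  # V: number of distinct words
    v1 = sum(1 for c in counts.values() if c == 1)   # V1: words used exactly once
    ttr = v / n
    brunet = n ** (v ** -0.165)
    # Honoré's statistic is undefined when every word occurs exactly once.
    honore = 100 * math.log(n) / (1 - v1 / v) if v1 < v else float("inf")
    return {"ttr": ttr, "brunet": brunet, "honore": honore}

# Hypothetical segmented utterance (already tokenized into words).
tokens = ["男孩", "拿", "饼干", "女孩", "拿", "饼干"]
feats = lexical_richness(tokens)
```

Lower TTR and Honoré values, and higher Brunet values, are conventionally read as a less varied vocabulary; all three are sensitive to sample length, which is one reason fixed-size segments are convenient.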
Syntactic features. Syntactic features capture how independent words are arranged to create sentences. Given that Chinese is highly word order dependent, sentence structures can be measured based on the syntactic and grammatical relationships between words. Following constituency parsing, the number of noun phrases consisting of a determiner (DT) and a noun (N) was counted, which could presumably be used to describe the local complexity of a sentence. In addition, the number of grammatical coordination that forms compound sentences, as well as the dependency distance that measures the linear distance between words within a sentence were also extracted based on dependency parsing. These two measures were designed to describe the global complexity of sentences.
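As an illustration of the dependency-distance measure, the sketch below computes a mean dependency distance for one sentence from a list of head indices, as a dependency parser would provide. Averaging over non-root words is an assumption; the source states only that the measure is the linear distance between words. The function name and the example parse are hypothetical.

```python
def mean_dependency_distance(heads):
    """Mean absolute distance between each word and its syntactic head.

    heads[i] is the 1-based position of the head of word i+1; 0 marks the root,
    which has no head and is skipped.
    """
    distances = [abs(h - (i + 1)) for i, h in enumerate(heads) if h != 0]
    return sum(distances) / len(distances) if distances else 0.0

# Hypothetical parse of "我 喜欢 苹果": both "我" and "苹果" depend on "喜欢" (the root).
heads = [2, 0, 2]
mdd = mean_dependency_distance(heads)
```

Longer distances indicate that related words are further apart in the linear string, which is why the measure is often treated as a proxy for sentence-level complexity and working memory load.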
Fluency features. Speech fluency refers to the smoothness of the production, which may indicate hesitation due to word-finding difficulties or deficits in discourse planning. Following related work, speech fluency was measured based on the number and duration of filled pauses within the utterance. Specifically, the pure number of filled pauses, as well as its proportion to the total words, were tallied. Similarly, the duration of filled pauses was normalized by the phonation time.
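A minimal sketch of these fluency measures, assuming filled pauses are available as (start, end) intervals in seconds from the annotation tier. The function name, argument names, and example values are hypothetical.

```python
def fluency_features(filled_pauses, n_words, phonation_time):
    """Count filled pauses and normalize by word count and phonation time.

    filled_pauses: list of (start, end) times in seconds.
    n_words: total number of words in the sample.
    phonation_time: total speaking time in seconds.
    """
    count = len(filled_pauses)
    total_duration = sum(end - start for start, end in filled_pauses)
    return {
        "fp_count": count,
        "fp_per_word": count / n_words if n_words else 0.0,
        "fp_duration_norm": total_duration / phonation_time if phonation_time else 0.0,
    }

# Two hypothetical filled pauses in a 10-word sample with 7.5 s of phonation.
feats = fluency_features([(1.0, 1.5), (4.0, 4.25)], n_words=10, phonation_time=7.5)
```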
Temporal features. Continuity of speech was quantified by deriving the time taken to produce words, as well as the length of silent segments. Taking these raw measures together, the ratio of silence to phonation time was calculated, which could be used to reflect the productivity of narratives and may indicate deficits at other linguistic levels (i.e., lexical and syntactic). Temporal features also include speech rate, which was defined as the number of words produced per second. In addition, articulation rate (utterance divided by syllable count) was also added to the feature set in order to reflect phonetic-motor planning.
In the present study, both POS tagging and sentence parsing were completed using a Chinese NLP toolkit (https://github.com/hankcs/HanLP). Lexical and syntactic features were then extracted based on these automated annotations. Feature extraction was carried out through multiple iterations over the entire speech corpus, each of which covered a window of five sentences (segment-based feature extraction). For each iteration, the window slid one step to the right, adding additional information to the new feature vector. As a result, each participant contributed multiple input vectors from each task, significantly increasing the size of the feature set for subsequent training.
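The silence-to-phonation ratio and speech rate can be sketched as follows. Two choices here are assumptions not stated in the source: silence is taken as session time not covered by any utterance interval, and speech rate divides the word count by the full session duration. Articulation rate is left out because its exact definition is not clear from the text.

```python
def temporal_features(utterances, n_words, session_duration):
    """Derive phonation time, silence time, their ratio, and speech rate.

    utterances: list of non-overlapping (start, end) spoken intervals in seconds.
    n_words: number of words produced in the session.
    session_duration: total recording length in seconds.
    """
    phonation = sum(end - start for start, end in utterances)
    silence = session_duration - phonation
    return {
        "phonation_time": phonation,
        "silence_time": silence,
        "silence_to_phonation": silence / phonation if phonation else float("inf"),
        "speech_rate": n_words / session_duration,
    }

# Hypothetical one-minute recording with two spoken stretches and 60 words.
feats = temporal_features([(0.0, 10.0), (15.0, 35.0)], n_words=60, session_duration=60.0)
```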
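The segment-based extraction can be pictured as a sliding window over a participant's sentences. The sketch below produces overlapping five-sentence windows with a step of one; returning a single short window when fewer than five sentences are available is an assumption, as the source does not say how such cases were handled.

```python
def sliding_windows(sentences, size=5, step=1):
    """Split a list of sentences into overlapping windows of `size` sentences."""
    if len(sentences) <= size:
        # Assumption: a short sample yields one window containing everything.
        return [list(sentences)]
    return [sentences[i:i + size] for i in range(0, len(sentences) - size + 1, step)]

# Seven hypothetical sentences yield three overlapping five-sentence segments.
sentences = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]
windows = sliding_windows(sentences)
```

Each window would then be passed to the feature extractors to yield one input vector. Because neighbouring windows share four of their five sentences, vectors from the same participant are highly correlated and should be kept on the same side of any train/test split.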
Features extracted from timed word retrieval
Even though semantic fluency task is widely used as a subsection of neuropsychological test, only simple measures, such as the total number of words provided, are considered in the clinical setting. Here, we extended these superficial measures to multidimensional features to capture the fine-grained speech and language patterns in this task. Specifically, a total of 25 features based on the number of words relevant to a certain category, as well as word position information within the one-minute recording session, were included for each of the three subtasks of semantic fluency.
Key word features. Performance of semantic fluency task is commonly evaluated based on how many words an individual could retrieve within a restricted time (such as 60 seconds). Besides, time on the task is another relevant factor that needs to be considered, as the cognitive process underlying semantic fluency task varies as a function of time (automatic processing in the first 15 seconds and controlled processing in the remaining 45 seconds) [34, 35]. It follows that features were defined as the number of words produced within the 15, 30, 45, and 60 second block.
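The time-block counts can be sketched from word onset times taken from the key word tier. Whether the study counted words per 15-second block or cumulatively up to each boundary is not stated; the sketch assumes separate, non-overlapping blocks, and the onset values are invented for illustration.

```python
def words_per_block(onsets, edges=(15, 30, 45, 60)):
    """Count category words whose onset falls in each consecutive time block.

    onsets: word onset times in seconds from the start of the trial.
    edges: upper bounds of the blocks; each block is [previous edge, edge).
    """
    counts = []
    lower = 0
    for upper in edges:
        counts.append(sum(1 for t in onsets if lower <= t < upper))
        lower = upper
    return counts

# Hypothetical onsets for one category: retrieval is fastest in the first block.
onsets = [1.2, 3.0, 5.5, 9.1, 14.8, 18.0, 26.4, 33.3, 47.9, 58.0]
counts = words_per_block(onsets)
```

A cumulative version follows directly by taking running sums of the returned list.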
Fluency features. In addition to the fluency measures extracted to analyze narrative speech, the duration of silence between filled pauses and the words immediately followed were obtained. This period of time may indicate cognitive activities such as semantic activation, lexical preparation, phonological encoding, and articulation.
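One way to obtain the silence between a filled pause and the following word is sketched below, assuming both tiers provide (start, end) intervals in seconds. The function name and the toy intervals are hypothetical, and filled pauses with no later word are simply skipped.

```python
def pause_to_word_latencies(filled_pauses, words):
    """Silent gap between the end of each filled pause and the next word onset.

    filled_pauses, words: lists of (start, end) intervals in seconds.
    """
    latencies = []
    for _, fp_end in filled_pauses:
        next_onsets = [w_start for w_start, _ in words if w_start >= fp_end]
        if next_onsets:
            latencies.append(min(next_onsets) - fp_end)
    return latencies

filled_pauses = [(2.0, 2.5), (10.0, 10.5)]
words = [(0.5, 1.0), (3.0, 3.5), (12.0, 12.5)]
latencies = pause_to_word_latencies(filled_pauses, words)
```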
Temporal features. Features at this level were replicated entirely from those extracted from connected speech production, which were used to reflect speech continuity. The only difference is that temporal features were obtained over the whole task session (i.e., one-minute duration).
Features extracted from sentence immediately recalled
In the clinical setting, the performance of sentence repetition is scored based on the correctness of reproduction. Here, we captured the finer-grained pattern with a variable depicting how closely the repeated sentence follows the one presented on the screen, combined with a number of acoustic parameters for intensity, intonational contour, and power spectrum of the speech signal. These features were extracted for each sentence and were concatenated into a single long vector with the number of words, semantic hierarchies, and dependency distance that define the sentence’s level of difficulty. The rationale is straightforward: more difficult sentences have stronger power to predict the output class compared with the simpler ones that are prone to the ceiling effect. Therefore, sentences should be treated differently when sentence-based feature vectors are taken together to train a single classifier.
Edit distance. Edit distance is defined as the minimum number of edit operations (deletion, insertion, and substitution) needed to transform one sentence into another, which is used to quantify the dissimilarity between the repeated sentence and the standard one. By definition, poorly repeated sentences that require more edit operations result in a larger edit distance.
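The definition above corresponds to the standard Levenshtein distance, which can be computed with dynamic programming. The sketch below compares sentences character by character; whether the study operated on characters or on segmented words is not stated, so the unit is an assumption (passing lists of words works equally well). The example sentences are invented.

```python
def edit_distance(source, target):
    """Minimum number of deletions, insertions, and substitutions
    needed to turn `source` into `target` (Levenshtein distance)."""
    prev = list(range(len(target) + 1))   # distances from the empty prefix of source
    for i, s in enumerate(source, start=1):
        curr = [i]                        # deleting the first i items of source
        for j, t in enumerate(target, start=1):
            cost = 0 if s == t else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

# Hypothetical stimulus and reproduction: two characters dropped, one added.
stimulus = "我今天去公园"
repetition = "我去公园了"
distance = edit_distance(stimulus, repetition)
```

Because longer stimuli allow more errors, a raw distance is not directly comparable across sentences of different lengths. Pairing it with the sentence's difficulty descriptors, as the feature vector described above does, is one way to give the classifier that context.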
Acoustic features. Following Meilán et al. [43], an array of acoustic features were extracted, which were grouped into the following categories: (i) intensity measured by the difference of maximum and minimum amplitude in each utterance and their standard deviation (SD); (ii) intonational contour measured on the basis of alterations in fundamental frequency (i.e., F0 SD); (iii) power spectrum measured by the bandwidth of the third formant (F3), and the long-term average spectrum (LTAS); and (iv) temporal features depicting the utterance and silence in sentence production.
It should be noted that temporal features were considered in all three tasks, while acoustic features measuring the change of intonational contour and power spectrum were limited to the sentence repetition task. Our primary consideration is that Chinese is a tonal language, where words are associated with specific variations in vocal pitch. Because of that, the differential intonational contour of two sentences may result from the combination of words with different tones, going beyond the cognitive function itself. Hence, in an attempt to eliminate the confounding factor of speech content, we considered prosodic features only in the sentence repetition task, where relatively constant sentence-level utterances could be obtained with standard stimuli guiding the production.
Classification
To predict the mental states of a participant (MCI versus normal), separate classifiers were trained for each of the three tasks, from which the probability estimates at the individual level were averaged to produce the final prediction. The schematic diagram of this task-fusion configuration is shown in Fig. 1.
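The task-fusion step can be sketched as a simple average of task-level probabilities for each participant. This is an illustration only: the participant IDs, task keys, and probability values are hypothetical, and applying a 0.5 cut-off (with ties assigned to MCI) to the fused score is an assumption rather than a detail stated for the final prediction.

```python
from statistics import mean


def fuse_tasks(task_probabilities, threshold=0.5):
    """Average the per-task MCI probabilities of each participant.

    Returns {participant: (fused_probability, label)}, where label 1 = MCI
    and label 0 = normal control.
    """
    fused = {}
    for participant, probs in task_probabilities.items():
        p_mci = mean(probs.values())  # unweighted average over available tasks
        fused[participant] = (p_mci, int(p_mci >= threshold))
    return fused


# Hypothetical task-level probabilities for two participants.
scores = {
    "P01": {"picture_description": 0.75, "semantic_fluency": 0.5, "sentence_repetition": 1.0},
    "P02": {"picture_description": 0.25, "semantic_fluency": 0.5, "sentence_repetition": 0.0},
}
for participant, (p_mci, label) in fuse_tasks(scores).items():
    print(participant, p_mci, label)
```

Because the average is taken over whichever tasks are present, this sketch still returns a score for a participant with a missing task; whether that matches the original pipeline is not specified.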
We had the following considerations when designing the task-level classifiers. In the picture description task, features extracted from each subtask were used to train a single classifier. The rationale is to capture the universal changes in connected speech production that are not constrained to a certain picture stimulus. Following the same rationale, features extracted from each of the 18 sentences were used to train a single classifier in the sentence repetition task. Such an architecture could predict the cognitive status in a situation where participants failed to reproduce or accidentally omitted some of the sentences, thereby improving the generalization ability of the model. In the semantic fluency task, separate classifiers were trained for each of the subtasks, from which predictions were combined through a voting process. The major consideration is that the three superordinate word categories are intrinsically different, and there is a lack of generalized patterns compared with the picture description task. Neither is there a universal level of difficulty, as the performance of production depends on the participant’s familiarity with a given word category. Hence, we designed three independent classifiers to capture subtask-specific patterns. Combining their predictions, the task-level fusion may reflect participants’ overall status of semantic memory and retrieval ability.
For picture description task, principal component analysis (PCA) was first implemented to identify the latent factors from the 53 shortlisted features, which were extracted at the segment level in each of the three subtasks. The amount of variance to be explained was set as 0.95. The aim of this step was to eliminate redundancy in the original features, as PCA converted the set of correlated variables to be uncorrelated using orthogonal transformation. The resulting principal components (PCs) were then taken as inputs to a machine learning classifier, which output the probability estimate that the feature vector came from a patient with MCI (coded as 1) or a normal control participant (coded as 0). Predictions at the individual level were obtained using probability scores averaged across all input samples of a participant with a decision threshold of 0.5.
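To make the 0.95 variance criterion concrete, the sketch below picks the smallest number of leading components whose cumulative share of variance reaches a target. The component variances are hypothetical, and the function name is illustrative; scikit-learn's `PCA` accepts a fractional `n_components` for a similar purpose, although the text does not say how the threshold was configured.

```python
def components_for_variance(variances, target=0.95):
    """Smallest number of leading components whose cumulative share of
    the total variance reaches `target`."""
    ordered = sorted(variances, reverse=True)  # largest component first
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k
    return len(ordered)


# Hypothetical variances (eigenvalues) of six principal components.
variances = [6.0, 2.0, 1.0, 0.6, 0.3, 0.1]
print(components_for_variance(variances))  # cumulative shares: 0.6, 0.8, 0.9, 0.96 -> 4
```

The retained components, not the original correlated features, would then be passed to the classifier.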
For semantic fluency task, separate classifiers were trained for each of the subtasks with latent speech and language variables transformed by PCA. The resulting probability estimates were subsequently averaged to produce the task-level prediction for each individual participant.
For sentence repetition task, sentence-based feature vectors were taken together to train a single classifier. PCA was not carried out in this case, as variables in the original feature set were independent from each other.
We were aware of other approaches for feature selection, such as Least Absolute Shrinkage and Selection Operator (LASSO) and many other filtering methods. Considering the multi-domain nature of the tasks, PCA was selected to investigate the latent pattern of speech and language features that could set the two groups apart after our classification experiments (see section “Identification of latent factors”).
To evaluate the performance of the strategy, three classifiers including SVM, RF, and logistic regression (LR) were implemented. Our expectation was to observe similar patterns with the three classifiers if the strategy is truly robust.
In this study, the scikit-learn toolkit [44] was used to train the SVM classifier with a radial basis function (RBF) kernel, which allowed for an output of probability scores with Platt scaling [45]. The RF and LR classifiers were also trained using this toolkit. In all cases, hyperparameters were left at their default values without optimization.
The criteria used to evaluate the performance of the current model include accuracy, precision, recall, and F1-score. Five-fold cross-validation was performed for each model, in which data from 80% of speakers (i.e., training set) were used to train the models, while the remaining 20% (i.e., testing set) were retained for evaluation. Within each fold, input vectors obtained from the same speaker could occur in either the training set or the testing set, but not both.
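A minimal sketch of this evaluation protocol follows, assuming each input vector is tagged with a speaker ID. The fold assignment shown here is random and not stratified by group, and MCI is treated as the positive class when computing precision, recall, and F1; both choices are assumptions, as are all names and data. scikit-learn's `GroupKFold` offers comparable speaker-wise splitting.

```python
import random


def speaker_folds(speaker_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which no speaker appears on both sides."""
    speakers = sorted(set(speaker_ids))
    random.Random(seed).shuffle(speakers)
    # Deal the shuffled speakers round-robin into folds of near-equal size.
    fold_of = {spk: i % n_folds for i, spk in enumerate(speakers)}
    for fold in range(n_folds):
        train = [i for i, s in enumerate(speaker_ids) if fold_of[s] != fold]
        test = [i for i, s in enumerate(speaker_ids) if fold_of[s] == fold]
        yield train, test


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical data: 10 speakers contributing 3 input vectors each.
ids = [f"S{n:02d}" for n in range(10) for _ in range(3)]
for train, test in speaker_folds(ids):
    # Each fold holds out 2 of the 10 speakers (20%); no speaker is shared.
    assert not {ids[i] for i in train} & {ids[i] for i in test}

print(classification_metrics([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 1]))
```

Splitting by speaker rather than by input vector prevents a model from being tested on speech from someone it has already seen during training.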
RESULTS
Classification of MCI versus healthy controls
As illustrated in the “Classification” section, we had three task-level classifiers, one for each of the spoken tasks (picture description, semantic fluency, and sentence repetition), and independent subtask-level classifiers for the semantic fluency task. As classification performances were directly compared across different tasks and the final-fusion configuration, we provide results at both the task level and the final-fusion level in Table 3. It should be noted that the classification metrics for the semantic fluency task were obtained from the average predictions across the three subtask classifiers (SF_01, SF_02, and SF_03) at the individual level. Details of the sub-classifiers’ performances are presented in the Supplementary Material.
Classification results (average 5-fold accuracy, precision, recall, and F1-score) of different classifiers in each single task and final fusion
Acc, accuracy; Pre, precision; Rec, recall; F1, F1-score. The best result for each classifier is indicated in bold.
Several observations can be made. First, the task-fusion configuration effectively boosted the final classification result (SVM: F1-score = 0.95; RF: F1-score = 0.96; LR: F1-score = 0.93), outperforming each of the single-task classifiers. These results indicated that combining predictions from different tasks with a late-fusion configuration improved the final classification, and this pattern was classifier-independent. It was also observed that the scores of the single-task classifiers and the task-fusion configuration were relatively consistent across the evaluation metrics. Comparing the single-task classifiers, the best performance was always obtained in the sentence repetition task (SVM: F1-score = 0.88; RF: F1-score = 0.90; LR: F1-score = 0.90), followed by the picture description (SVM: F1-score = 0.80; RF: F1-score = 0.82; LR: F1-score = 0.83) and semantic fluency tasks (SVM: F1-score = 0.64; RF: F1-score = 0.65; LR: F1-score = 0.65).
Correlation between classifier predictions and MoCA scores
To validate the classification result, we examined the relationship between data-driven predictions and participants’ cognitive abilities. Specifically, correlation coefficients (Pearson’s r) and root mean square error (RMSE) were computed between predicted probabilities of MCI (based on different classifiers) and MoCA scores. The results indicated that measurements from the two sources were strongly correlated (SVM: r = –0.74; RF: r = –0.71; LR: r = –0.72; all p < 0.001; see Fig. 2). It should be noted that a higher value in classifier predictions indicates a higher probability of MCI, whereas a higher score in MoCA indicates a lower probability of MCI. Therefore, negative correlations should be expected if our predictions agreed with the test scores.
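The expected sign of this relationship can be checked with a short sketch of Pearson's r. The probabilities and MoCA scores below are hypothetical values chosen only to show the direction of the association; the significance test and the RMSE computation are not reproduced here.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(var_x * var_y)


# Hypothetical fused probabilities of MCI and the matching MoCA scores.
p_mci = [0.9, 0.8, 0.6, 0.3, 0.2, 0.1]
moca = [18, 20, 23, 26, 28, 29]
r = pearson_r(p_mci, moca)
print(r < 0)  # higher predicted probability goes with lower MoCA score
```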

Correlation between predicted probability of MCI and MoCA score. Dark and light scatters represent individual participants in MCI and Normal group. ***p < 0.001.
Features interpretation
Identification of latent factors
Single-task classifiers for picture description and semantic fluency were trained on PCs, which were used to reduce the redundancy in the original features. Here, we investigate the underlying structure of these latent variables to establish whether the observed features were specific to a single linguistic dimension or also described patterns in other aspects of speech production. Following Hoffman et al. [46], the pairwise correlation coefficients (Pearson’s r) between each of the 53 observed features and the 21 resulting PCs (explaining 95% of variance in our dataset) in the picture description task were computed. Those greater than 0.3 or less than –0.3 are shown in Fig. 3A. For conciseness, only the top six PCs are shown here, as they were sufficient to account for the majority of observed features in speech and language (50 out of 53). Component 1 indexed the number and richness of words produced, the length of utterance and silence segments, and the complexity of sentences, and so appeared to reflect the productivity and the syntactic complexity of the narrative discourse. The frequency of words and vocabularies loaded exclusively on component 2, which reflected the preciseness of terms used to describe the concepts in narrative speech. Strong correlations with components 3 and 4 were associated with the proportion of words in a certain class, which were indicative of problems in accessing specific POS. Features describing the syntactic relationships of words also loaded on component 3. This could be expected, as POS explains how words are used in sentences. Finally, components 5 and 6 indexed two types of pauses, one unfilled and the other filled with verbal or nonverbal utterances, which suggested that silent and filled pauses might reflect different problems in speech production.

Interpretation of features in the picture description task. A) Principal components analysis identifying six latent speech and language factors. Features are clustered based on their correlation with each principal component. B) By-group comparison of representative features. Dark and light triangles represent individual participants in MCI and Normal group. The black dots and gray shades indicate the mean and 95% confidence interval of each group. Asterisks indicate significant effect of group based on multiple linear regression. Numerical values of the statistical analysis using multiple linear regression models are provided in the Supplementary Material. *p < 0.05, **p < 0.01, ***p < 0.001.
The 25 observed features and the eight resulting PCs (explaining 95% of variance in our dataset) in the three subtasks of semantic fluency were subjected to the same analysis, with their correlations separately shown in Fig. 4A. Since the underlying structures of latent speech variables were broadly similar in SF_01 (fruit) and SF_02 (animal), their pairwise correlation coefficients were averaged to generate a new matrix. Again, the top four PCs, accounting for 24 out of 25 features, are provided for the sake of conciseness. In the first two subtasks, the component capturing the maximum variance (i.e., component 1) indexed the length/number of utterances, filled pauses, and silent segments, which described speech production in the temporal domain. The duration of silence between filled pauses and the words that immediately followed loaded on the second component, together with the number of words generated within different blocks (i.e., 15, 30, 45, and 60 s). This component may indicate the ability of lexical retrieval. Finally, speech and articulation rates were exclusively represented by component 3. In the third semantic fluency subtask, the first component was a measure of productivity at both the key word level (i.e., number of words generated) and the temporal level (length of utterance and silent segments). The cognitive processes of semantic activation, lexical retrieval, phonological encoding, and articulation were broadly represented by the second component, which indexed the duration/number of filled pauses, as well as the length of the silent segment between filled pauses and the words that immediately followed. Finally, speech and articulation rates were strongly associated with scores of the third component.

Interpretation of features in the semantic fluency task. A) Principal components analysis identifying four latent speech and language factors. Features are clustered based on their correlation with each principal component. B) By-group comparison of representative features. Dark and light triangles represent individual participants in MCI and Normal group. The black dots and gray shades indicate the mean and 95% confidence interval of each group. Asterisks indicate significant effect of group based on multiple linear regression. Numerical values of the statistical analysis using multiple linear regression models are provided in the Supplementary Material. *p < 0.05, **p < 0.01, ***p < 0.001.
Statistical analyses on pre-defined features
Statistical analyses on pre-defined features were carried out, as it is of greater interest to explore their association with diagnosis. For each of the three tasks, we considered a number of factors that could predict speech and language patterns, including group (MCI versus normal), age, and years of education. A series of linear mixed-effect regression models and multiple regression models were applied for each of the pre-defined features, allowing for an interpretation of how each factor is related to the outcome measure of interest. For conciseness, partial results concerning the effect of group are reported, as our aim is to look for key features helping to differentiate patients from cognitively intact counterparts through these analyses. The slope estimates, interpreted as the average effect of cognitive decline, are presented as beta (B) with standard error (SE), where a positive value indicates an increase in such measure and a negative value indicates the opposite effect. Detailed procedures together with the full results of these analyses can be found in the Supplementary Material.
In picture description task, mixed effect modelling indicated that features describing the number and richness of words were significantly modified due to the cognitive decline. Compared with normal controls, patients with MCI produced significantly fewer words in their discourse (Total words: B = –14.63, SE = 4.80, p = 0.003) and their vocabularies were less diverse (Total vocabularies: B = –8.79, SE = 2.47, p = 0.001; Honoré’s statistic: B = –18.40, SE = 5.33, p = 0.001). Consistent evidence was provided by reduced phonation time (B = –7.06, SE = 1.47, p < 0.001) and increased proportion of silent segments (B = 0.73, SE = 0.17, p < 0.001), which described the productivity of discourse at the temporal domain. Besides, language produced by patients with MCI was syntactically impaired, as revealed by lower sentence complexity both locally (Elements linked to noun: B = –0.41, SE = 0.10, p < 0.001) and globally (Number of coordination: B = –0.24, SE = 0.04, p < 0.001; Dependency distance: B = –5.76, SE = 0.68, p < 0.001). While the detection of MCI based on POS rate has been well-established in the literature of Indo-European languages, the present study did not provide evidence for such a change (Proportion of open class words: B = 0.01, SE = 0.01, p = 0.346; Proportion of closed class words: B = –0.01, SE = 0.01, p = 0.353), which was consistent with what we previously observed in Chinese. Finally, patients with MCI did not exhibit a higher proportion of filled pauses in their discourse (B = 0.01, SE = 0.01, p = 0.482), which also coincided with our prior result [15].
Figure 3B presents the by-group comparison of representative features, with slope estimates obtained from multiple regression analyses for each subtask (PD_01, PD_02, PD_03). Similar patterns of alterations could be observed at the single-subtask level, except that patients with MCI tended to use more general terms to describe the scenes in PD_03 (B = 0.05, SE = 0.02, p = 0.002), whereas this feature did not reach significance more generally (B = 0.02, SE = 0.01, p = 0.096).
In the semantic fluency task, results from linear mixed-effect models revealed that the number of words generated by patients with MCI was comparable to that produced by their healthy counterparts in the initial 30 seconds (Words produced within 30 s: B = –0.66, SE = 0.43, p = 0.128), while their productivity significantly dropped after automatically activated words were exhausted (Words produced within 60 s: B = –2.00, SE = 0.61, p = 0.001). The current findings indicated that controlled lexical retrieval might be compromised due to cognitive decline. In addition, speech fluency was influenced by cognitive status, with patients producing a higher proportion of filled pauses (B = 0.15, SE = 0.07, p = 0.020) and silent segments (B = 0.13, SE = 0.05, p = 0.005). Finally, we observed a longer duration of silence between filled pauses and the words that immediately followed in the MCI group (B = 1.94, SE = 0.63, p = 0.003), which may result from declines in multiple cognitive domains. At the subtask level (SF_01, SF_02, SF_03), multiple regression analyses indicated similar patterns with task-specific idiosyncrasies.
In the sentence repetition task, the overall performance of patients with MCI was significantly worse than that of their healthy counterparts, as indicated by the larger edit distance (B = 1.36, SE = 0.29, p < 0.001). Importantly, this effect was not consistently shown in all sentences, as less difficult ones were prone to the ceiling effect. Sentence-specific manifestation of cognitive decline was also revealed in a number of acoustic variables measuring speech duration, intensity, and power spectrum, the results of which are partially shown in Fig. 5. Interestingly, the intonational contour was substantially flattened due to cognitive decline, as reduced alterations in fundamental frequency were observed in all sentences produced by patients with MCI.

Interpretation of features in the sentence repetition task. By-group comparison of representative features. Dark and light triangles represent individual participants in MCI and Normal group. The black dots and gray shades indicate the mean and 95% confidence interval of each group. Asterisks indicate significant effect of group based on multiple linear regression. Numerical values of the statistical analysis using multiple linear regression models are provided in the Supplementary Material. *p < 0.05, **p < 0.01, ***p < 0.001.
DISCUSSION
Early detection of cognitive decline has become a shared goal among researchers, with increasing interest directed toward the use of speech and language as a potential diagnostic biomarker. In the present study, predictions from multiple spoken tasks (picture description, semantic fluency, and sentence repetition) were combined through a voting process and were used to distinguish between patients with MCI and healthy controls. While it is difficult to compare across related works using different tasks and cohorts of participants, the present classification results are promising. The late-fusion configuration led to an improved accuracy compared with the single-task classifiers, as the three tasks more dominantly tapped into distinct areas of cognitive processes, and were complementary in capturing the speech and language alterations associated with cognitive decline. Importantly, the probability estimates of MCI were strongly correlated with the severity of the deficits in cognitive abilities revealed by MoCA scores, even though the inputs were drawn solely from speech and language functions. This is expected, as speech production incorporates the interplay of multiple cognitive domains, including working memory, attention, and executive control [2], and involves ongoing interactions among different brain regions [47, 48].
It is worth noting that the speech and language impairments in patients with MCI are not directly linked to any structural or functional loss of “language centers” but are primarily due to the “slip” in cognitive status [26]. This makes the changes more subtle compared with other forms of dementia [49]. Besides, unlike PPA patients who exhibited selective language impairment due to frontotemporal lobar degeneration [50], patients with MCI lack homogeneity of brain atrophy and specific patterns in linguistic deficits [14, 25]. This in a way reinforces the need to employ multiple speech tasks that require differential cortical involvement, and to extract as many potentially informative features as possible to capture the varied modifications of speech and language at the prodromal stage of cognitive decline.
Of particular interest is the fact that tasks may have their specific contributions in predicting the cognitive status, which allows us to pinpoint the underlying domains of language impairment. Compared with the semantic fluency and sentence repetition tasks that were confined to the level of words and sentences, the picture description task characterized communications in the everyday context and provided a greater number of measurable dimensions. Among all features extracted, those reflecting characteristics of semantically and syntactically impoverished speech represented the predominant features associated with MCI, which have been linked to the degeneration of the temporal lobe and temporal-parietal junction [51]. Critically, evidence from functional imaging studies indicated that a number of subregions of the temporal lobe, including the left superior temporal gyrus, middle temporal gyrus, and anterior temporal lobe, were actively engaged during narrative speech production [52, 53]. These classical perisylvian language areas were hypothesized to support semantic [54] and syntactic processing [55] at the discourse level. In fact, previous studies tended to attribute the activation in temporal cortices to the net sum of lexical and syntactic processing, as these two cognitive processes are not independent, but interact with each other while processing higher-level discourse information [53, 56]. Interestingly, this neural evidence was in line with our behavioral-level observation. Features describing the amount and richness of word use and those describing syntactic complexity were uniformly explained by the first principal component, since impoverished sentences were usually produced with fewer words. In addition to the temporal regions, activation during connected speech production also extended to the temporo-parieto-occipital junction.
This transcortical region has been functionally linked to the integration of information by a number of studies, as numerous areas of the brain are activated in response to the dynamic context in narratives [53, 57]. It also functions as a sensorimotor interface mediating self-monitoring according to the dual-stream model of speech production [58, 59]. Finally, Fleming and Harris [60] interpreted the reduction of discourse content as “shortcomings in executive skills related to semantic processing, which is responsible for retrieving, maintaining, monitoring, and manipulating semantic representation” [61]. These cognitive control processes have been linked to the prefrontal areas (e.g., inferior frontal gyrus), which were actively engaged at the level of discourse [62, 63], and were vulnerable to the disease [64]. Taken together, the characteristic of “empty speech” in patients with MCI may result from the deterioration in lexical retrieval, syntactic processing, and discourse planning. All these possible detrimental effects led to the temporal-level manifestation with increased silent pauses and reduced phonation time in narrative speech, even though these explicit features may not have direct underlying causes.
Although the picture description task provided multidimensional variables identifying language impairments associated with MCI, the semantic fluency task might have additional independent effects. This hypothesis was supported by Kave and Goral [33], who observed weak and nonsignificant correlations between measures of connected speech and semantic fluency scores. Their proposed explanation related partly to the difference in experimental paradigm: narrative speech production was guided by an external visual cue, whereas the semantic fluency task was not. Another possibility is that the two tasks have different retrieval demands: in the picture description task, lexical retrieval relies more on the semantic knowledge store, whereas in the semantic fluency task, lexical retrieval depends on both semantic memory and executive functions [2, 33]. Neuroimaging studies have found that the bilateral inferior frontal gyrus, precisely BA45 and BA47, was involved when retrieving task-relevant information from semantic memory [65–67], with coactivations observed in the lateral temporal cortex [68]. These regions, as elaborated earlier, are vulnerable targets of the disease, which may explain why patients with MCI performed significantly worse than their healthy counterparts in semantic fluency tasks [69]. In the present study, our results fit in well with these hypotheses. In particular, the number of words produced by the two groups was comparable during the initial 30 seconds, while after that the productivity of patients with MCI significantly decreased. The present results indicated that reduced semantic fluency may be due to the decline in executive functions, as lexical retrieval puts higher demand on a controlled process after automatically activated words become exhausted [34, 35]. In a similar vein, such dysfunction was manifested by a longer silence between filled pauses and the words that immediately followed in the MCI group.
It should be noted that even though executive functions were leveraged in the semantic fluency task, it was not as predictive as the other two tasks in the present study. This finding was also reported in two previous studies [31, 70]. The tentative explanation is that lexical and syntactic features, broadly represented by several principal components in the picture description task, may reflect a net sum of semantic and syntactic processing, while features in the semantic fluency task are more specific to the lexical retrieval processes. Nevertheless, it is still regarded as a unique contributor to the final classification, considering the heterogeneous cognitive profile in patients with MCI and the complementary effect brought by the task-fusion configuration.
Among all spoken tasks, the sentence repetition task had the greatest predictive power in detecting early cognitive decline, probably because it was the only task requiring memory recall during speech production. Consistent with previous findings [31, 71], sentences were less successfully reproduced by patients with MCI compared with their healthy counterparts, and the trend was more obvious in sentences with a higher level of difficulty. This could be expected, as more complex sentences were usually associated with more words and hierarchies, taking a longer time to be semantically and syntactically integrated. Accordingly, a greater working memory load [72, 73] was imposed, making it harder for patients with MCI to retain the information in the sentence stimuli. The current findings lend further support to the notion that in spite of the heterogeneous cognitive profile in patients with MCI, memory impairments seem to be a relatively typical pattern, thus justifying their potential as a robust predictor for early cognitive decline. In the present study, several acoustic features, including intensity, intonational contour, and power spectrum of speech signals, were also included as potential features. It was noted that the acoustic parameter F0 SD was particularly useful in discriminating the two groups of participants, which was in line with the findings of Roark et al. [74], though contradictory to Beltrami et al. [12]. The discrepancies may be due to the use of different experimental paradigms, which required memory recall in both sentence repetition (the present study) and delayed story narration [74], but not in picture description and spontaneous speech production [12]. With greater cognitive demands, patients were more likely to reproduce sentences in a staccato and monotone manner, making their voice less expressive and melodic.
This explanation appears to fit in quite well with numerous findings associating prosody with reading proficiency [75, 76], in which prosody was regarded as the “other side of speech fluency”.
Even though the message conveyed varied from task to task, there seemed to be a common set of audio features that could reflect early cognitive decline. Specifically, features depicting speech continuity exhibited a decreasing linear trend in utterance and an increasing linear trend in silence. Such changes in the temporal domain may be indicative of specific impairment in the cognitive-linguistic domain. In the picture description task, the reduced duration of utterances resulted from the impoverished sentences and the production of fewer words, which might be associated with word-finding difficulties in patients with MCI. Similar patterns in the semantic fluency task resulted from the production of fewer words within a certain superordinate semantic category, suggesting that patients may have difficulties in retrieving specific lexical items due to the decline of executive functions. In the sentence repetition task, the deletion of sentence elements compared with the standard sentence stimuli is the major cause of changes in temporal information. Such an observation might result from the MCI-induced memory loss, which is commonly present in patients with early cognitive decline.
While the present study revealed promising results in predicting MCI status based on multiple spoken tasks, there are still a number of limitations that warrant consideration. First and foremost, the present model was constructed based on three tasks (picture description, semantic fluency, and sentence repetition), and there is no reason not to employ more tasks to characterize the heterogeneous cognitive profile in patients with MCI. For example, recent findings indicated that cognitively impaired patients tended to produce tangential, off-topic utterances due to the dysfunction of executive control to properly activate the topic-related information [61]. The poor coherence might be well captured using participants’ responses to a set of questions probing different areas of semantic knowledge, combined with latent semantic analysis-based features that measure local and global coherence in connected speech [46, 63]. Second, although we attempted to link observed language impairments to the underlying neuropathological processes, explanations were provided in the light of normative accounts of brain functions. There is still a lack of direct evidence revealing the altered neural mechanisms of speech and language in patients with MCI. Further research is needed to validate these hypotheses. Third, it is also essential to confirm whether the proposed features could differentiate beyond cognitive screening tests, such as the MoCA and Mini-Mental State Examination, and whether they are more sensitive in detecting early symptoms of cognitive decline. To do so, predictions based on different tasks as well as their combinations will be correlated with a series of tests that have specificity in certain cognitive functions. Finally, the current analyses were purely based on manual transcriptions. To render the diagnostic system fully automatic, several challenges, such as robustness to automatic speech recognition errors, remain to be resolved.
CONCLUSION
Language is not only a window into the mind, but also a powerful lens to uncover the cognitive status. In this study, a framework was proposed to predict MCI status by combining probability estimates from multiple spoken tasks. It demonstrated that the task-fusion configuration boosted the discriminative performance compared with each single-task classifier, as the tasks provided complementary features that benefit the final classification. By linking features to their underlying neuropathological processes, the study provided valuable insights into the heterogeneous cognitive profiles of patients with MCI. The effort is a step closer toward the development of an automatic and knowledge-based system for detection of early signs of cognitive decline based on speech and language inputs.
Footnotes
ACKNOWLEDGMENTS
This study was supported by the grants from The National Key Research and Development Program of China (2018YFA0701405), National Natural Science Foundation of China (NSFC61771461, U1736202, 82071187, and 81870821), Science and Technology Planning Project of Guangdong Province (2019B090915002), Shenzhen Foundational Research Program (JCYJ20170818163505850 and JCYJ20170413161611534), Natural Science Foundation of Ningbo (2018A610408), and Beijing Youth Talent Team Support Program (2018000021223TD08). The authors would like to thank Dr. Runzhi Li, Dr. Xinying Zou, Chengchen Lv, Quanlei Yan, Jingshen Pan, and Yue Su for their assistance during data collection.
