Abstract
Background:
Recently, many studies have been carried out to detect Alzheimer’s disease (AD) from continuous speech by linguistic analysis and modeling. However, few of them utilize language models (LMs) to extract linguistic features and to investigate the lexical-level differences between AD and healthy speech.
Objective:
Our goals include obtaining state-of-the-art performance in automatic AD detection, emphasizing N-gram LMs as powerful tools for distinguishing AD patients’ narratives from those of healthy controls, and discovering the differences in lexical usage between AD patients and healthy people.
Method:
We utilize a subset of the DementiaBank corpus, including 242 control samples from 99 control participants and 256 AD samples from 169 “PossibleAD” or “ProbableAD” participants. Baseline models are built through area under curve-based feature selection and using five machine learning algorithms for comparison. Perplexity features are extracted using LMs to build enhanced detection models. Finally, the differences of lexical usages between AD patients and healthy people are investigated by a proportion test based on unigram probabilities.
Results:
Our baseline model obtains a detection accuracy of 80.7%. This accuracy increases to 85.4% after integrating the perplexity features derived from LMs. Further investigations show that AD patients tend to use more general, less informative, and less accurate words to describe characters and actions than healthy controls.
Conclusion:
The perplexity features extracted by LMs can benefit automatic AD detection from continuous speech. There exist lexical-level differences between AD and healthy speech that can be captured by statistical N-gram LMs.
INTRODUCTION
Dementia, subsumed under the major neurocognitive disorder, can cause significant cognitive decline and interfere with independence [1]. Patients affected by dementia are primarily above the age of 60, where the prevalence ranges between 5% – 7% in most parts of the world. And within the elderly group, the prevalence increases exponentially with age [2]. It is estimated that 46.8 million people worldwide were living with dementia in 2015, resulting in economic costs of 604 billion US dollars, and the number of patients is expected to almost double every 20 years [3].
Dementia is a clinical syndrome caused by neurodegeneration, with Alzheimer’s disease (AD) being the most common underlying pathology [4, 5]. In particular, AD accounts for 70% of all dementia cases, while other causes include frontotemporal lobar degeneration, vascular dementia, etc. [6]. According to the progressive degree of cognitive and functional impairment, the course of AD is usually divided into four stages: mild cognitive impairment (MCI), mild AD, moderate AD, and severe AD. During the course of AD, patients suffer from short-term memory loss at the beginning and depend completely upon caregivers near the end. However, recent studies suggest that the disease should be considered a continuous neuropathological process rather than several distinct clinically defined entities [7]. This idea was recognized and formalized respectively in the 2011 NIA-AA guidelines [8, 9] and the 2018 NIA-AA research framework [10].
The diagnosis of AD relies on evidence from cognitive tests, such as the Mini-Mental State Examination (MMSE) [11] and the Montreal Cognitive Assessment (MoCA) [12], biochemical markers, medical imaging, etc. Recent research has emphasized the role biomarkers play in the process of AD. As the 2018 NIA-AA research framework [10] states, the term “Alzheimer’s disease” refers to an aggregate of neuropathological changes and thus is defined in vivo by biomarkers and by postmortem examination, not by clinical symptoms. In clinical practice, however, combining several kinds of evidence from different aspects is still an appropriate diagnostic approach, and this is an expensive and time-consuming process. On the other hand, since there is no cure for AD so far, it is essential to provide latent and early-stage AD patients with proper prevention and intervention therapies. Thus, developing an effective and convenient method for AD detection becomes a valuable research topic.
AD and speech
The main symptom of AD is cognitive impairment, which involves executive function, logical inference, language skills, and so on. Language impairment generally appears at the early stages of the disease [13]. Murdoch et al. [14] demonstrated that AD patients performed significantly worse than controls on a standard aphasia test, specifically in the areas of verbal expression, auditory comprehension, repetition, reading, and writing.
Analysis of continuous speech provides an important source of information encompassing the phonetic, phonological, lexical-semantic, morpho-syntactic, and pragmatic levels of language organization [14]. The performance of AD patients on some language tasks, such as picture description and sentence repetition, was distinct from that of healthy people. Szatloczki et al. [15] summarized the studies that investigated language impairments at different AD stages and claimed the effectiveness of linguistic analysis for AD detection.
At the phonetic and phonological level, AD patients mostly had a low speech rate and frequent hesitations, as reported in Hoffmann et al. [16] and Sajjadi et al. [17]. Other variables, including acoustic features, were rarely mentioned as distinctions between the speech of AD patients and healthy controls.
At the lexical-semantic level, the investigation of part of speech revealed that AD patients produce more closed-class words [18]. Kavé et al. [19] showed that although persons with AD conveyed less information and made more semantic errors than did control participants, their language structures remained well preserved.
At the morpho-syntactic level, AD patients often suffered from more inflectional errors [20]. De Lira et al. [21] reported a simplification of syntax in language of persons with AD, characterized by reduced sentences and short utterances.
At the discourse and pragmatic levels, the language of patients with AD contained confounding and irrelevant information according to Carlomagno et al. [22]. Cuetos et al. [23] pointed out that in a picture description task, AD patients scored significantly lower than the control group on two important variables: 1) the total number of semantic units and 2) the total number of objective situations present in the picture, which led to uninformative and empty speech.
Related work
So far there have been many studies on automatic AD detection by extracting linguistic features from speech recordings and training classifiers by machine learning. Some of them focused on utilizing acoustic features. Wolters et al. [24] explored the correlation between the performance of semantic fluency tasks and semantic memory function, and pointed out that prosodic features such as pause durations and response delays had a great effect on the analysis of semantic fluency data. Satt et al. [25] collected recordings of each participant when taking several cognitive tasks, and carefully designed acoustic features for each task. They obtained an equal error rate (EER) of 87% in a two-category classification between AD and control. However, their dataset was relatively small, which contained only 15 healthy controls and 26 AD patients.
In addition to acoustic features, features corresponding to other linguistic levels have also been employed to detect AD from continuous speech. Fraser et al. [26] studied the speech and corresponding transcriptions from 240 AD patients and 233 healthy controls in the DementiaBank corpus. They extracted a total of 370 features covering part-of-speech (POS), syntax, acoustics, and other aspects of linguistics, and obtained a best average accuracy of 81.92% for binary classification using the 35 top-ranked features. Weiner et al. [27] achieved an F-score of 0.80 based on the German database ILSE, and later they applied speech recognition techniques to obtain the transcriptions of speech recordings automatically [28]. Sadeghian et al. [29] used the extracted linguistic features to assist MMSE for AD detection, and greatly improved the accuracy from 70.8% to 94.4%. Santos et al. [30] adopted word embeddings to improve the traditional lexical network and extracted topological features from the network. Field et al. [31] added area-dependent features based on the work of Fraser et al. [26], and improved positive predictive value (PPV) from 0.83 to 0.84 and negative predictive value (NPV) from 0.81 to 0.82.
Similar to Fraser et al. [26], a baseline detection model is also built in this paper using features at different linguistic levels. An area under curve (AUC)-based feature selection method is designed, which generates a rather small feature set compared with Fraser et al. [26]. The performances of five different classifiers are also compared by experiments in this paper.
Different from the studies introduced above, this paper focuses on utilizing language models (LMs) [32] for AD detection. The techniques of language modeling have been employed in previous studies on detecting mental diseases. Pakhomov et al. [33] used a perplexity index derived from language models to represent the degree of deviation in word patterns used by frontotemporal lobar degeneration patients, compared to patterns of healthy adults. Linz et al. [6] showed that the perplexity of a person’s production in the semantic verbal fluency (SVF) task predicted dementia well, with an F1 score of 0.83.
In this paper, N-gram LMs are used, which describe the probabilities of words given their contexts in utterances. These probabilities of N-gram LMs are estimated using the transcriptions of healthy and AD speech respectively. First, this paper proposes to extract perplexity features using the built N-gram LMs for AD detection. This idea is inspired by Wankerl et al. [34]. They built two trigram LMs using the speech transcriptions from AD patients and controls respectively, and derived a single feature by calculating the difference between the perplexities obtained by these two language models, resulting in a detection EER of 77.1% on the DementiaBank corpus. In this paper, two-dimensional perplexity features are designed and the effects of context length N are investigated by experiments. The perplexity features are further combined with baseline features and the highest detection accuracy of 85.4% is achieved by nested leave-one-person-out evaluation.
Second, a lexical-level analysis is conducted using the built LMs. Some previous studies [35–38] have revealed that AD patients suffer from word-finding and word-retrieval difficulties. They usually produce more closed-class words, such as pronouns, and more superordinate names instead of target names [39–41]. This paper adopts LMs as analytical tools for the lexical-level analysis of AD speech. A proportion test is carried out based on the unigram probabilities of all words in the corpus, and generates a list of 84 words with significant word probability differences between the AD and control classes. Examining this word list, we find that AD patients tend to use more general, less informative, and less accurate words to describe characters and actions than healthy controls. Furthermore, a simple Naïve Bayes classifier is constructed based on this lexical analysis, and it achieves an accuracy of 82.6% by leave-one-person-out evaluation.
MATERIALS AND METHODS
This paper utilizes the DementiaBank corpus, which is part of the TalkBank project [42]. The corpus contained the recordings from 292 participants taking a picture description task, which was originally designed for the Boston Diagnostic Aphasia Examination [43]. The task required each participant to say whatever happened in the picture (as shown in Fig. 1) as much as possible, and allowed encouragement from interviewers when participants had difficulties.
Among all participants in the corpus, 169 were diagnosed with “PossibleAD” or “ProbableAD”, and 99 were healthy controls. The diagnostic criteria used for “PossibleAD” or “ProbableAD” determination, as specifically described in McKhann et al. [44], included dementia symptoms, deficits in several areas of cognition, and laboratory results such as normal lumbar puncture. In order to be consistent with previous studies [26, 34], the samples with “PossibleAD” and “ProbableAD” labels are merged to compose the AD group in our study. The demographics of participants are shown in Table 1. Because participants may have taken the task more than once during the data collection of this corpus, as shown in Table 2, a total of 242 control samples and 256 AD samples were used in our study.

Fig. 1. The picture of “Cookie Theft”, adopted from Goodglass et al. [43].
Table 1. The demographics of participants in the DementiaBank dataset
Table 2. The numbers of participants with different participation times
Each sample contains a speech recording file and a corresponding transcription. Speech recordings were manually transcribed following the TalkBank CHAT (Codes for the Human Analysis of Transcripts) protocol [45]. Transcriptions also contain certain kinds of notations for abbreviated pronunciation, word repetition, word correction, filled pauses, and POS tags, which were used as linguistic features as introduced in the next section. For extracting other linguistic features from texts, we removed all the annotation codes and used the raw transcriptions of what participants and interviewers said.
Baseline features
Altogether 49 features corresponding to different linguistic levels are extracted to build our baseline AD detection models. Most of them are chosen by referring to previous studies [24–32]. The NLTK [46], StanfordNLP [47], and VOICEBOX [48] toolkits are used to extract these features in our study. The descriptions of all features can be found in the Supplementary Material. Table 3 summarizes the types of these features and some of them are briefly explained as follows.
Table 3. The types of extracted features for building baseline models
Phonetics-phonology level
The F0 and MFCC features contain the mean values of frame-level fundamental frequencies and 13-dimensional mel-frequency cepstral coefficients (MFCC) extracted from the speech of each participant. Abbreviated pronunciations are indicated by the CHAT annotation, such as o(f) and talkin(g), where the letters in parentheses are the ones omitted during pronunciation. For each participant, the speech rate is measured as the ratio between the speech duration and the total number of words.
Lexical-semantics level
The counts of some special words are adopted because Zhou et al. [49] showed that these counts had considerable difference between AD patients and healthy controls in the DementiaBank corpus. The words include “window”, “girl”, “curtain”, “sink”, “stool”, and “mother”.
Semantic idea density (SID) is measured by the ratio between the count of information units and the total number of tokens. Here, the set of information units given by Ahmed et al. [50] is utilized to calculate SID, and the details of the full set can be found in the Supplementary Material.
Syntax-level
The average parse tree height of all sentences from each participant is calculated as a syntax-level feature. The counts of some POS types and context-free grammar (CFG) rules or symbols are also used as features following previous studies [26].
Pragmatics-level
When participants were taking the picture description task, the interviewers may encourage them to give more comprehensive descriptions of the picture, especially when the participants had difficulties. This means that the amount of the speech from interviewers may be correlated with the performance of participants on the task.
Therefore, the number and mean duration of participant’s utterances, as well as the ones of interviewer’s utterances, are used as pragmatic-level features. The ratio between the total speech duration of the participant and the interviewer is also calculated to measure the information provided by the interviewer during the task.
Feature selection
An AUC-based method is designed to select useful features from the complete feature set $S_{all}$, which contains all 49 features introduced above. First, a small subset $S_{top}$ is generated from $S_{all}$ according to the AUC-based β values of all features. The definition of the β value is explained below. Second, all combinations of the features in $S_{top}$ are enumerated to obtain an initial solution set $S_{ini}$ which achieves the best classification performance. The resulting $S_{ini}$ can differ depending on the settings of the classifier. Then, all other features not contained in $S_{top}$ are added to $S_{ini}$ one by one in descending order of their β values, until the classification performance of the feature set begins to degrade. The resulting feature set $S_{fnl}$ is the final result of feature selection.
For each feature, AUC is defined as the area under the receiver operating characteristic (ROC) curve [52]. The value of AUC equals the probability that, when a pair of positive (AD) and negative (healthy control) samples is chosen at random, the positive sample’s feature value is larger than that of the negative one. Thus, if a feature has good discriminating capability, its AUC should be close to 0 or 1. On the contrary, the AUC of a random feature should be around 0.5. Accordingly, a variable β is defined in Equation (1) to describe the discriminating capability of each feature.
In our implementation, an empirical threshold on β is used to control the size of $S_{top}$.
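The selection procedure above can be illustrated with a minimal Python sketch. The exact form of β is not reproduced here, so the code assumes β = max(AUC, 1 − AUC), which is one definition consistent with the description (0.5 for a random feature, 1 for a perfectly discriminating one). The helper names `beta_scores`, `split_by_threshold`, and `select_features`, as well as the `score` callback, are hypothetical and are not taken from the authors' implementation.

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import roc_auc_score


def beta_scores(X, y):
    """Per-feature discriminability, ASSUMING beta = max(AUC, 1 - AUC)."""
    aucs = np.array([roc_auc_score(y, X[:, j]) for j in range(X.shape[1])])
    return np.maximum(aucs, 1.0 - aucs)


def split_by_threshold(betas, threshold):
    """Return (S_top, remaining) feature indices, each in descending beta order."""
    order = np.argsort(-betas, kind="stable")
    top = [int(j) for j in order if betas[j] >= threshold]
    rest = [int(j) for j in order if betas[j] < threshold]
    return top, rest


def select_features(top, rest, score):
    """Enumerate subsets of S_top, then greedily extend with the remaining features.

    `score(indices) -> float` is a hypothetical callback (higher is better),
    e.g. an inner cross-validation accuracy. `top` must be non-empty.
    """
    subsets = [list(c) for r in range(1, len(top) + 1) for c in combinations(top, r)]
    best = max(subsets, key=score)          # initial solution S_ini
    best_score = score(best)
    for j in rest:                          # descending beta order
        trial_score = score(best + [j])
        if trial_score < best_score:        # performance begins to degrade
            break
        best, best_score = best + [j], trial_score
    return best                             # final set S_fnl


# Toy data: feature 0 is larger for AD (label 1), feature 1 is smaller for AD,
# and feature 2 is unrelated to the label.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X = np.column_stack([
    [1, 2, 3, 4, 5, 6, 7, 8],
    [8, 7, 6, 5, 4, 3, 2, 1],
    [1, 5, 2, 6, 1, 5, 2, 6],
])
betas = beta_scores(X, y)
s_top, remaining = split_by_threshold(betas, threshold=0.7)
```

In this toy case the first two features enter the top subset and the third is left for the greedy step. Whether a tie in performance should stop the greedy extension is not specified in the text; the sketch stops only on a strict decrease.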
Detection model
For comparison purposes, five classic machine learning algorithms are employed to build the AD detection model: logistic regression, support vector machine, decision tree, random forest, and K-nearest neighbor. All algorithms are implemented using the Scikit-Learn toolkit [53].
Logistic regression: An L2 penalty term is added and the penalty factor C is a hyperparameter.
Support vector machine: A Gaussian kernel function is adopted and the penalty coefficient C is a hyperparameter.
Decision tree: The information gain criterion is applied and the minimum leaf sample number N is a hyperparameter. During the training process, a leaf node will not stop splitting until the number of training samples on that node is less than N.
Random forest: Gini impurity is used as the training criterion. The number of trees is fixed at 10 and the minimum leaf sample number N is a hyperparameter.
K-nearest neighbor algorithm: The value of K is a hyperparameter.
The hyperparameters of these classifiers are tuned in our implementation as introduced in the next section. The sets of candidates for tuning these hyperparameters are listed in the Supplementary Material.
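As a rough illustration, the five classifiers could be instantiated in Scikit-Learn as follows. This is a sketch under stated assumptions rather than the authors' configuration: the default hyperparameter values are placeholders, and mapping the "minimum leaf sample number N" to `min_samples_split` is an interpretation of the stopping rule described above (`min_samples_leaf` would be another plausible reading). The function name `make_classifiers` is hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def make_classifiers(C=1.0, n=5, k=5):
    """Build the five classifier types; C, n, and k are placeholder values."""
    return {
        # L2 regularization is Scikit-Learn's default for logistic regression.
        "logistic_regression": LogisticRegression(C=C, max_iter=1000),
        # "rbf" is the Gaussian kernel.
        "svm": SVC(kernel="rbf", C=C),
        # "entropy" corresponds to the information gain criterion.
        # ASSUMPTION: a node with fewer than n samples is not split further.
        "decision_tree": DecisionTreeClassifier(criterion="entropy", min_samples_split=n),
        "random_forest": RandomForestClassifier(
            n_estimators=10, criterion="gini", min_samples_split=n
        ),
        "knn": KNeighborsClassifier(n_neighbors=k),
    }
```

Each entry exposes the usual `fit` and `predict` methods, so the same evaluation loop can be reused for all five models.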
Evaluation strategy
Considering the relatively small size of the DementiaBank corpus, common evaluation strategies such as splitting the corpus into training and test sets, and adopting 10-fold cross-validation are not recommended here. Randomly splitting the dataset into training data and test data may lead to large variance of evaluation results. Thus, a nested leave-one-person-out strategy is employed here for evaluation. Specifically, we first conduct an outer leave-one-person-out loop with the number of folds corresponding to the number of participants in the dataset. For each fold, the samples belonging to one participant are used for test and the remaining samples are used for training. Then, an inner leave-one-person-out loop is conducted on the training set of each outer leave-one-person-out fold in order to both select features and tune model hyper-parameters for evaluating this fold. For each specific classifier determined by the classifier type and hyper-parameters, we obtain a feature set through the feature selection process. Then, the classifier with corresponding feature set which gives the best performance in the inner loop is applied to make the prediction for the test set. This evaluation strategy can help to avoid dataset splitting randomness and the overfitting risk of model training.
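The nested strategy can be sketched with Scikit-Learn's `LeaveOneGroupOut`, using participant identifiers as groups. For brevity, the inner loop below only tunes the penalty factor C of logistic regression; the feature selection step that the paper also performs in the inner loop is omitted. The candidate grid, the synthetic data, and the pooled sample-level accuracy are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut


def lopo_accuracy(X, y, groups, C):
    """Sample-level accuracy of logistic regression under leave-one-person-out."""
    correct = 0
    for train, test in LeaveOneGroupOut().split(X, y, groups):
        model = LogisticRegression(C=C, max_iter=1000).fit(X[train], y[train])
        correct += int((model.predict(X[test]) == y[test]).sum())
    return correct / len(y)


def nested_lopo_accuracy(X, y, groups, c_grid=(0.1, 1.0, 10.0)):
    """Outer loop holds out one participant; inner loop tunes C on the rest."""
    correct = 0
    for train, test in LeaveOneGroupOut().split(X, y, groups):
        # The inner loop sees only the outer training participants.
        best_c = max(
            c_grid,
            key=lambda c: lopo_accuracy(X[train], y[train], groups[train], c),
        )
        model = LogisticRegression(C=best_c, max_iter=1000).fit(X[train], y[train])
        correct += int((model.predict(X[test]) == y[test]).sum())
    return correct / len(y)


# Synthetic data: 8 participants with 2 samples each; participants 4-7 are "AD".
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(8), 2)
y = (groups >= 4).astype(int)
X = rng.normal(size=(16, 2)) + 5.0 * y[:, None]
acc = nested_lopo_accuracy(X, y, groups)
```

Grouping by participant keeps all samples of one person on the same side of every split, which is the point of leave-one-person-out as opposed to leave-one-sample-out.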
Language model
N-gram language models have been widely used in the area of natural language processing. In general, an N-gram model represents the conditional probability of a word given its $N-1$ preceding words, $P(w_k \mid w_{k-N+1}, \ldots, w_{k-1})$.
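When an N-gram model $\lambda$ is built, the metric of perplexity is adopted to evaluate how likely a test sequence is generated by the model. A lower perplexity corresponds to a higher likelihood. For a test word sequence $X = \{w_1, w_2, \ldots, w_K\}$, the perplexity is defined as

$$\mathrm{PPL}(X \mid \lambda) = P(X \mid \lambda)^{-\frac{1}{K}} = \exp\left(-\frac{1}{K}\sum_{k=1}^{K} \log P(w_k \mid w_{k-N+1}, \ldots, w_{k-1}; \lambda)\right).$$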
N-gram models describe the probabilities of words or word combinations in the training corpus, while perplexity evaluates the degree of match between the test text and the training corpus in terms of word occurrences. It is expected that a text should achieve a low perplexity if it is evaluated by an N-gram LM estimated using training data of the same genre. Otherwise, the perplexity should be high if the training corpus and the test text are from different genres. Therefore, this paper investigates the methods of utilizing the perplexities extracted by N-gram LMs of AD patients and healthy controls as features for AD detection.
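To make the relation between training genre and perplexity concrete, the following minimal Python sketch estimates add-one-smoothed unigram models and scores a sentence under each. It is a simplified stand-in for a toolkit such as SRILM: the toy sentences are invented, out-of-vocabulary words are simply skipped, and sentence-boundary tokens are ignored, all of which are assumptions of this sketch.

```python
import math
from collections import Counter


def train_unigram(texts, vocab):
    """Add-one-smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(w for t in texts for w in t.split() if w in vocab)
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def perplexity(text, model):
    """exp of the negative mean log-probability of the in-vocabulary words.

    Assumes the text contains at least one in-vocabulary word.
    """
    words = [w for w in text.split() if w in model]
    log_prob = sum(math.log(model[w]) for w in words)
    return math.exp(-log_prob / len(words))


# Invented toy "corpora" for the two classes.
control_texts = ["the boy is on the stool", "the mother is washing dishes"]
ad_texts = ["the kid is getting something", "she is doing that thing"]
vocab = {w for t in control_texts + ad_texts for w in t.split()}

lm_control = train_unigram(control_texts, vocab)
lm_ad = train_unigram(ad_texts, vocab)

sample = "the boy is washing dishes"
ppl_c = perplexity(sample, lm_control)
ppl_ad = perplexity(sample, lm_ad)
```

Because the sample reuses words that are frequent in the control texts, its perplexity under the control model is lower than under the AD model, which is the effect the perplexity features are meant to capture.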
One practical concern of extracting perplexity features is that a sample should not be used to train the language model which calculates its own perplexity. In our implementation, we first divide all samples in the training set into two groups, i.e., the control group and the AD group. For each training sample $X_i$ in the control group, all control samples except $X_i$ are used to estimate an N-gram LM for calculating its perplexity. Similarly, for each training sample $X_j$ in the AD group, all AD samples except $X_j$ are used to estimate an N-gram LM for calculating its perplexity.
For each test sample $T_i$, the LMs of the control and AD groups, $\lambda_C$ and $\lambda_{AD}$, are used to calculate the feature vector $\{\mathrm{PPL}_C, \mathrm{PPL}_{AD}\}$ as

$$\mathrm{PPL}_C = \mathrm{PPL}(T_i \mid \lambda_C), \qquad \mathrm{PPL}_{AD} = \mathrm{PPL}(T_i \mid \lambda_{AD}),$$

i.e., the perplexities of $T_i$ under the two LMs.
Table 4. Features with top β values averaged across all outer folds
In our implementation, unigram, bigram, and trigram models are built for comparison, which correspond to N = 1, 2, and 3, respectively. Kneser-Ney smoothing is applied when building bigram and trigram models, and add-one smoothing is applied to unigram models. The SRILM toolkit [54] is adopted to train N-gram LMs and to calculate perplexities.
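The held-out extraction of the two-dimensional perplexity features can be sketched as below for the unigram case. The sketch does not use SRILM; it relies on a small add-one-smoothed unigram model written in plain Python. It also assumes that the opposite-class LM for a training sample is estimated from all training samples of that class, which the text does not state explicitly. All function names are hypothetical.

```python
import math
from collections import Counter


def unigram_lm(texts, vocab):
    """Add-one-smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(w for t in texts for w in t.split() if w in vocab)
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def ppl(text, lm):
    """Perplexity over in-vocabulary words (out-of-vocabulary words skipped)."""
    words = [w for w in text.split() if w in lm]
    return math.exp(-sum(math.log(lm[w]) for w in words) / len(words))


def training_features(control, ad, vocab):
    """(PPL_C, PPL_AD) for every training sample, never scoring a sample
    with a same-class LM that was trained on that sample."""
    lm_c_all, lm_ad_all = unigram_lm(control, vocab), unigram_lm(ad, vocab)
    feats = []
    for i, x in enumerate(control):
        lm_c = unigram_lm(control[:i] + control[i + 1:], vocab)  # exclude x
        feats.append((ppl(x, lm_c), ppl(x, lm_ad_all)))  # ASSUMED: full AD LM
    for j, x in enumerate(ad):
        lm_ad = unigram_lm(ad[:j] + ad[j + 1:], vocab)  # exclude x
        feats.append((ppl(x, lm_c_all), ppl(x, lm_ad)))  # ASSUMED: full control LM
    return feats


def features_for_test_sample(t, control, ad, vocab):
    """(PPL_C, PPL_AD) of a test sample under LMs built from all training data."""
    return ppl(t, unigram_lm(control, vocab)), ppl(t, unigram_lm(ad, vocab))


control = ["the boy is on the stool", "the mother is washing dishes"]
ad = ["the kid is getting something", "she is doing that thing"]
vocab = {w for t in control + ad for w in t.split()}
train_feats = training_features(control, ad, vocab)
test_feat = features_for_test_sample("the boy is washing dishes", control, ad, vocab)
```

The resulting pairs can be passed directly to a classifier such as logistic regression, either alone or concatenated with the baseline features.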
Lexical analysis based on unigram LMs
In order to further investigate the lexical-level differences between AD patients and healthy controls, a proportion test [55] is adopted to find the words which had significantly different occurrence probabilities between these two classes.
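First, the rare and extremely common words are excluded from the lexicon for analysis according to their counts in the corpus and document frequencies. For each word w in the remaining word list, its z-score is calculated as

$$z = \frac{p_c - p_{ad}}{\sqrt{p\,(1-p)\left(\frac{1}{n_c} + \frac{1}{n_{ad}}\right)}}, \tag{11}$$

where $p_c$ and $p_{ad}$ are the unigram probabilities of w in the control and AD groups, $p$ is its occurrence probability in the entire corpus, and $n_c$ and $n_{ad}$ are the total numbers of tokens in the two groups.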
RESULTS
Baseline models
As introduced in the Evaluation strategy section, nested leave-one-person-out evaluation was adopted in our experiments. Thus, a feature selection procedure was conducted for each fold of the outer leave-one-person-out loop when building baseline models. In our implementation, the feature filtering threshold was set to 0.7 empirically to control the size of $S_{top}$. Then, the inner leave-one-person-out loop was conducted and used to produce the final feature set $S_{fnl}$ and to tune the hyper-parameters of classifiers for each outer fold. Table 4 shows the features with top β values averaged across all outer folds. The complete table corresponding to all features can be found in the Supplementary Material.
The five machine learning algorithms introduced in the Detection model section were adopted to build the baseline models for AD detection. The average accuracies of these models across all outer leave-one-person-out folds are summarized in Table 5 for comparison. We can see that logistic regression achieved the highest accuracy of 80.7% among all these 5 algorithms. Thus, it was adopted as the default classifier in following experiments.
Table 5. Performance of baseline models
Using perplexity features derived from LMs
The results of AD detection using perplexity features derived from N-gram LMs are shown in Table 6. The second column of this table presents the accuracies of using only the two-dimensional $\{\mathrm{PPL}_C, \mathrm{PPL}_{AD}\}$ features in logistic regression. We can see that the features derived using unigram models achieved the best performance. The unigram models describe lexical-level characteristics, i.e., word probabilities, of the utterances produced by AD patients or healthy controls, while the bigram and trigram models take 1 or 2 history words into account and are usually considered to capture syntactic knowledge of word combinations. The higher accuracy of using unigram LMs in Table 6 may be explained by the findings of previous studies that language impairment at the syntax level is less significant than that at the lexical-semantic level in AD patients [57, 58].
Table 6. Performance of using perplexity features derived from N-gram LMs
The two-dimensional perplexity features were then merged with the baseline features and the feature selection procedure was repeated. Here, the difference from the feature selection procedure for building baseline models was that the $\{\mathrm{PPL}_C, \mathrm{PPL}_{AD}\}$ features were always contained in $S_{ini}$. The accuracy results of using different LMs are shown in the last column of Table 6. Compared with Table 5, we can see that the perplexity features derived by all three N-gram LMs improved the accuracy of AD detection over using only baseline features. The unigram models outperformed the bigram and trigram models, despite their simplicity, and achieved the highest accuracy of 85.4%. This result was better than the AD detection accuracies reported in previous work [26, 34] on this corpus. We are not claiming that our proposed method is definitely better, because it is difficult to make a fair accuracy comparison with previous work due to different evaluation strategies.
Lexical analysis based on unigram models
Our dataset contained a total of 52,800 tokens of 1,726 word types. The AD group had 25,408 tokens of 1,280 types, while the control group had 26,392 tokens of 1,161 types. The lexicons of AD and control speech shared 715 word types, only 305 of which corresponded to at least 5 tokens in both groups. This implies that there may exist differences between AD and control groups in choosing proper words to describe the picture.
First, 1,388 rare words which had fewer than 5 occurrences in the AD or control group were filtered out. Then, the extremely common words were also removed by examining each word’s document frequency, i.e., the ratio between the number of samples containing this word and the total number of samples. A threshold of 0.90 was set and the words with document frequency higher than this threshold were removed. In fact, only 3 common words {the, is, and} were removed at this step. After word filtering, there were 335 remaining words. For each word w in this list, we calculated its z-score using Equation (11). By keeping the words with $|z| > 2.56$, which corresponded to a significance level of 0.01, a list of 84 words was finally derived. Every word in this list had significantly different proportions between the AD and control groups. The full list can be found in the Supplementary Material.
Figure 2 gives a visual presentation of these words. The words were grouped according to their POS tags. Here, the POS tag of each word was determined by its first tag in the Merriam-Webster dictionary if it had more than one POS tag. In each sub-figure, the horizontal axis denotes document frequency and the vertical axis is $\log_{10}(p_c/p_{ad})$, where $p_c$ and $p_{ad}$ are the unigram probabilities used in the z-score calculation. The area of each circle corresponds to the word’s occurrence probability in the entire corpus, i.e., $p$ in Equation (11).
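The filtering and testing steps described above can be sketched in Python as follows. The sketch assumes a standard pooled two-proportion z statistic, uses raw relative frequencies in place of smoothed unigram probabilities, and takes the group sizes to be total token counts; these are assumptions for illustration, and the function name `significant_words` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def significant_words(control_docs, ad_docs, min_count=5, max_df=0.90, z_crit=2.56):
    """Return {word: z} for words whose proportions differ between groups."""
    docs_c = [d.split() for d in control_docs]
    docs_ad = [d.split() for d in ad_docs]
    cnt_c = Counter(w for d in docs_c for w in d)
    cnt_ad = Counter(w for d in docs_ad for w in d)
    n_c, n_ad = sum(cnt_c.values()), sum(cnt_ad.values())
    all_docs = docs_c + docs_ad
    doc_freq = Counter(w for d in all_docs for w in set(d))

    result = {}
    for w in set(cnt_c) | set(cnt_ad):
        if cnt_c[w] < min_count or cnt_ad[w] < min_count:
            continue  # rare in at least one group
        if doc_freq[w] / len(all_docs) > max_df:
            continue  # extremely common word
        p_c, p_ad = cnt_c[w] / n_c, cnt_ad[w] / n_ad
        p = (cnt_c[w] + cnt_ad[w]) / (n_c + n_ad)  # pooled proportion
        z = (p_c - p_ad) / math.sqrt(p * (1 - p) * (1 / n_c + 1 / n_ad))
        if abs(z) > z_crit:
            result[w] = z
    return result


# Toy documents: "boy" dominates the control group, "kid" the AD group,
# and the filler token "x" occurs in every document.
control_docs = ["boy boy boy boy kid x"] * 6 + ["x x x"] * 4
ad_docs = ["kid kid kid kid boy x"] * 6 + ["x x x"] * 4
words = significant_words(control_docs, ad_docs)
```

With this sign convention, a positive z marks a word favored by the control group and a negative z marks a word favored by the AD group; the filler token is dropped by the document-frequency filter.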

Fig. 2. Words with significantly different proportions between AD and control groups. For better visualization, the words are grouped according to their POS tags. The horizontal axis denotes document frequency and the vertical axis is $\log_{10}(p_c/p_{ad})$, where $p_c$ and $p_{ad}$ are the unigram probabilities used in the z-score calculation. The area of each circle corresponds to the word’s occurrence probability in the entire corpus.
From Fig. 2a, we can see that most nouns were used more frequently by the control group than by the AD group. Only 5 nouns had a higher proportion in the AD group than in the control group. Among them, “kid” was frequently used by AD patients to describe “boy” or “girl” in this picture. “thing” and “way” were also common words. Besides, “ladder” and “chair” are inaccurate descriptions of the stool in this picture. These results show that AD patients may miss some details when describing pictures and may have trouble finding accurate words to describe objects and characters.
By examining Fig. 2b, we can see that the control group used more verbs with specific meanings, while the AD group used more common verbs, such as “get”, “gonna”, “got”, and “doing”. Figure 2c shows that pronouns took higher proportions in the AD group than in the control group. All of these patterns indicate that AD patients may have difficulty with word finding and tend to use less informative and more general words in their descriptions.
Finally, a multinomial Naïve Bayes classifier was built based on this lexical analysis. Leave-one-person-out evaluation was adopted. In each fold, we selected the words that had significantly different proportions between the AD and control groups, using the training set of the fold and the method mentioned above. Then, we used these selected words to build a Naïve Bayes classifier and evaluated its performance on the test set of the fold. In the Naïve Bayes models, equal prior probabilities were assigned to both classes. This model achieved an accuracy of 82.6% by leave-one-person-out evaluation, which was much higher than the accuracy of using $\{\mathrm{PPL}_C, \mathrm{PPL}_{AD}\}$ derived from unigram LMs with a logistic regression classifier as shown in Table 6. This was a surprisingly good result considering the simplicity of the model, and it also demonstrated the effectiveness of combining proportion tests with LMs to derive lexical-level features for AD detection.
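A multinomial Naïve Bayes classifier restricted to a selected word list, with equal class priors, can be sketched in Scikit-Learn as follows. The word list and the training sentences are invented for illustration, the smoothing parameter is left at the library default, and the helper name `build_nb` is hypothetical; none of these are taken from the authors' implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def build_nb(selected_words):
    """Count only the selected words, then apply multinomial Naive Bayes
    with equal prior probabilities for the two classes."""
    return make_pipeline(
        CountVectorizer(
            vocabulary=sorted(selected_words),
            token_pattern=r"\S+",   # whitespace tokenization
            lowercase=False,
        ),
        MultinomialNB(class_prior=[0.5, 0.5]),
    )


# Invented toy data: label 0 = control, label 1 = AD.
train_texts = [
    "the boy is on the stool",
    "the mother is washing dishes",
    "the kid is getting something",
    "she is doing that thing",
]
train_labels = [0, 0, 1, 1]
selected = ["boy", "mother", "stool", "kid", "thing", "she"]

model = build_nb(selected).fit(train_texts, train_labels)
pred = list(model.predict(["the boy is on the stool", "she saw the kid"]))
```

In a leave-one-person-out setting, the word list would be re-derived from the training portion of each fold before the pipeline is fitted, so that the held-out participant never influences word selection.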
DISCUSSION
Language impairment usually appears early in AD, and the symptoms deteriorate continuously during the course of AD [59, 60]. Based on DementiaBank, a public corpus of speech recordings from AD patients and healthy controls in a picture description test, this paper investigates the effects of applying N-gram LMs to AD detection. On one hand, by extracting perplexity features using N-gram LMs, a logistic regression AD detection model was built, which achieved a state-of-the-art detection accuracy of 85.4% on the DementiaBank dataset. On the other hand, a proportion test-based lexical-level analysis was conducted using unigram probabilities, which gave a list of 84 words with significantly different occurrence probabilities between the AD and healthy classes. The analysis results are consistent with the findings of previous studies, for example that AD patients usually produce more pronouns and tend to use more general and familiar words instead of accurate target words [39–41]. It should be noted that our lexical analysis was carried out in a statistical way using the lexicon and unigram probabilities of all words. No human-designed word selection or feature extraction was involved. Thus, this approach is also expected to be applicable to other similar tasks.
We understand that there are still some limitations to the studies in this paper. First, although the DementiaBank corpus has been widely used to investigate the speech characteristics of AD patients, it lacks autopsy-confirmed AD pathology. Considering the major differences between the diagnostic criteria applied when building the DementiaBank dataset and the newest ones, i.e., the 2018 NIA-AA research framework [10], the findings in this paper should be generalized with caution to AD patients diagnosed by a gold standard. Second, since our current methods rely on detailed and precise transcriptions of participants’ speech recordings, their use for fully automatic AD detection is constrained. In fact, several studies [61–65] have investigated the possibilities of applying ASR to speech-based AD detection. Evaluating the influence of ASR errors on LM training and feature extraction and developing a more automatic framework for AD detection will be the tasks of our future work. Third, this paper focuses on the DementiaBank dataset, which consists of subjects speaking English. Whether the conclusions drawn in this paper can be applied to other languages is also a topic worth further investigation.
ACKNOWLEDGMENTS
Yunxia Li acknowledges support from the National Key R&D Program of China (No. 2018YFC1314700), the National Science Foundation of China (No. 81671307), and the Priority of Shanghai Key Discipline of Medicine (No. 2017ZZ02020). Zhiqiang Guo and Zhenhua Ling acknowledge support from the National Science Foundation of China (No. 61871358).
