Abstract
Background:
Primary progressive aphasia (PPA) is a neurodegenerative syndrome with three main clinical variants: non-fluent, semantic, and logopenic. Clinical diagnosis and accurate classification are challenging and often time-consuming. The Mini-Linguistic State Examination (MLSE) has been recently developed as a short language test to specifically assess language in neurodegenerative disorders.
Objective:
Our aim was to adapt and validate the Spanish version of MLSE for PPA diagnosis.
Methods:
Cross-sectional study involving 70 patients with PPA and 42 healthy controls evaluated with the MLSE. Patients were independently diagnosed and classified according to comprehensive cognitive evaluation and advanced neuroimaging.
Results:
Internal consistency (Cronbach’s alpha) was 0.758. The influence of age and education was very low. The area under the curve for discriminating between PPA patients and healthy controls was 0.99. Effect sizes were moderate-large for the discrimination between PPA and healthy controls. Motor speech, phonology, and semantic subscores discriminated between the three clinical variants. A random forest classification model obtained an F1-score of 81% for the three PPA variants.
Conclusion:
Our study provides a brief and useful language test for PPA diagnosis, with excellent properties for both clinical routine assessment and research purposes.
INTRODUCTION
Primary progressive aphasia (PPA) is a neurodegenerative syndrome characterized by progressive language impairment. It may be the onset of several neurodegenerative diseases, especially several forms of frontotemporal lobar degeneration and Alzheimer’s disease. According to the consensus criteria by Gorno-Tempini et al. [1], three main variants of PPA are distinguished: non-fluent (nfvPPA), semantic (svPPA), and logopenic (lvPPA). In brief, non-fluent PPA is characterized by effortful and slow speech, agrammatism, and/or apraxia of speech; semantic PPA patients show compromised confrontation naming with impaired single-word comprehension; and the logopenic variant is characterized by anomia with frequent pauses during spontaneous language, and impaired repetition [1]. This categorization is relevant because it has demonstrated improved clinical-pathological correlation and outcome prediction [2–5].
The diagnosis of PPA and its variants remains a clinical challenge, in which language assessment is an essential tool for achieving an adequate classification. In this regard, several tasks are recommended, including the assessment of grammar, motor speech, confrontation naming, repetition, sentence comprehension, single-word comprehension, object/people knowledge, and reading [1]. However, language testing protocols vary among the research groups focused on PPA and are often time-consuming. Furthermore, the use of different assessment protocols reduces the comparability between studies in a disorder where worldwide collaboration between researchers is particularly necessary to obtain adequate sample sizes. Thus, a standardized and, preferably, brief test may be very useful in the setting of PPA [6].
The Mini-Linguistic State Examination (MLSE) has been recently developed [7, 8] as a brief test, specifically designed for a comprehensive assessment of PPA and other neurodegenerative disorders associated with language and/or speech impairment. The first study of validation of the English version has shown promising results for the diagnosis of PPA and its variants [7]. In this study, our aim was to develop and validate the Spanish version of the MLSE for the diagnosis of PPA and its variants.
METHODS
Study design and Participants
We conducted a cross-sectional study involving 112 participants: 70 patients with PPA and 42 healthy controls (HC). Patients were recruited consecutively among those under follow-up in a tertiary-care center. All patients met the current diagnostic criteria for PPA [1] and had an FDG-PET or MRI scan supporting the diagnosis. Patients were evaluated with a comprehensive language and neuropsychological protocol, which has been described elsewhere [9]. Staging of the disease was graded using the Frontotemporal Lobar Degeneration modified Clinical Dementia Rating (FTLD-CDR) [10].
Healthy participants were volunteers who met the following criteria: 1) age 50–99 years; 2) absence of cognitive impairment, confirmed with an MMSE score >27, a global Clinical Dementia Rating of 0, and a Functional Activities Questionnaire score of 0 [11, 12]; 3) absence of subjective cognitive complaints; 4) absence of any psychiatric, neurological, or neurodevelopmental disorder; 5) absence of a medical disorder with a potential impact on cognition; 6) sufficient visual and auditory capacity to complete the test.
MLSE
The Spanish version of the MLSE is an adaptation of the original MLSE test developed by Patel et al. [7] for English. It includes the following 11 subtasks: picture naming; syllable repetition; repeat and point; non-word repetition; semantic association; sentence comprehension; sentence comprehension with picture stimuli; word and non-word reading; sentence repetition; written description; and picture oral description. Scoring of the test aims to evaluate the characteristics of language impairment, and errors are categorized into five main linguistic domains: motor speech (maximum score: 30), phonology (maximum score: 30), semantic knowledge (maximum score: 20), syntax (maximum score: 10), and working memory (maximum score: 10). A specific score for each domain is obtained, as well as a global score, obtaining a maximum total of 100 points. A lower score indicates poorer performance and a greater severity of aphasia.
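The arithmetic of this scoring scheme can be sketched in a few lines of Python. This is an illustration only: the domain maxima come from the text, but the rule that each error removes exactly one point from its domain (with a floor at zero), and all function and variable names, are assumptions made for the example. The official administration and scoring guidelines take precedence.

```python
# Maximum score per MLSE domain, as listed in the text (they sum to 100).
DOMAIN_MAX = {
    "motor_speech": 30,
    "phonology": 30,
    "semantic_knowledge": 20,
    "syntax": 10,
    "working_memory": 10,
}


def mlse_scores(errors):
    """Return domain scores and the total from per-domain error counts.

    Assumption (not stated in the text): each error removes one point
    from its domain, and a domain score cannot fall below zero.
    """
    unknown = set(errors) - set(DOMAIN_MAX)
    if unknown:
        raise ValueError(f"unknown domains: {sorted(unknown)}")
    scores = {
        domain: max(0, maximum - errors.get(domain, 0))
        for domain, maximum in DOMAIN_MAX.items()
    }
    # The global score is the sum of the five domain scores.
    scores["total"] = sum(scores.values())
    return scores


# Hypothetical participant with 4 phonological and 3 working-memory errors.
example = mlse_scores({"phonology": 4, "working_memory": 3})
print(example["total"])  # 93
```

A participant with no errors obtains the maximum of 100 points under this sketch.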
We made changes to the following tasks. First, the reading task, which entailed the reading of irregular words, was adapted because there are no such words in Spanish owing to its transparent orthography [13]. Therefore, the five irregular words from the English MLSE were replaced by words in which the stress mark was deleted. This approach was implemented because, in Spanish, the majority of words are stressed on the penultimate syllable, and thus, words stressed on the antepenultimate syllable need a stress mark to indicate an exception to the rule (e.g., “húmero,” “médula”). If the stress mark is deleted in these words, readers who are not familiar with them will tend to stress the penultimate syllable (huMEro, meDUla). This strategy has proven to be useful in a previous study [13].
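For readers preparing similar stimuli, deleting the stress mark can be done programmatically. The sketch below is a hypothetical helper, not part of the test materials: it removes only the acute accent, so other Spanish diacritics (the tilde of “ñ” and the diaeresis of “ü”) are preserved. The two words are the examples given above.

```python
import unicodedata

ACUTE = "\u0301"  # Unicode combining acute accent


def strip_stress_mark(word):
    """Remove acute accents (stress marks) while keeping ñ and ü intact."""
    # NFD splits each accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", word)
    # Drop only the acute accent, then recompose the remaining characters.
    return unicodedata.normalize("NFC", decomposed.replace(ACUTE, ""))


for word in ["húmero", "médula"]:
    print(word, "->", strip_stress_mark(word))
```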
Second, the “Repeat and Point” task was modified. Specifically, the categories “types of flowers” and “breeds of dog” were replaced by “types of vegetables” and “large mammals” because, in the pilot phase, healthy participants had considerable difficulty discriminating those items. For the same reason, “stethoscope” was replaced by “telescope”. Third, in one of the sentence comprehension tasks, the two nouns were made to be of the same gender (i.e., both male, or both female, instead of a male and a female noun), because, in Spanish, the gender of the word is already indicated in the question (“¿Quién es el doctor?” “¿Quién es la doctora?”). The rest of the tasks and items were exactly the same as in the English version. Words and the names of the pictures were of low-medium frequency in the English version. Items were generally preserved in the Spanish version, and frequency of use was similar to the English version. We performed a pilot study in 20 HC to ensure the understandability and applicability of the test. Some of the above-mentioned amendments were made after the first pilot round.
The MLSE was administered by two trained neurologists with experience in PPA (VP and JAM-G). The test was administered independently of the other assessments used for the diagnosis and was not considered for the diagnosis. The mean time of administration was 20 minutes. The test and the administration and scoring guidelines are available at http://www.mlsexam.com and in the Supplementary Material.
Statistical analysis
Statistical analysis was conducted using SPSS® 20.0 for Mac and RStudio 1.2.5033. Internal consistency was measured using Cronbach’s alpha. Pearson’s correlation coefficient (r) was used to assess the effect of age and education on test scores. Mann-Whitney U and Kruskal-Wallis tests were used for comparisons between two groups and between more than two groups, respectively. A post hoc Dunn analysis with Bonferroni correction was used to evaluate differences between groups when comparing the 3 PPA variants. The effect size was estimated with Cohen’s d for comparisons of two means, considering the effect as small (d = 0.2), moderate (d = 0.5), or large (d = 0.8). The receiver operating characteristic (ROC) curve was used to evaluate the test’s capacity to discriminate between PPA patients and healthy controls. Youden’s index was calculated to define the optimal cutoff point. Statistical significance was set at a p-value < 0.05.
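Two of the statistics named above, Cronbach’s alpha and Cohen’s d, can be written out with the Python standard library. This is a minimal sketch under stated assumptions: the analyses in the study were run in SPSS and RStudio, so the functions below are not the authors’ code, the pooled-standard-deviation form of Cohen’s d is one common definition that the text does not specify, and the toy numbers are invented for illustration.

```python
from statistics import mean, variance


def cronbach_alpha(items):
    """Cronbach's alpha from a list of item columns (one list per item)."""
    k = len(items)
    # Total score per participant: sum across items, row by row.
    totals = [sum(row) for row in zip(*items)]
    sum_item_variances = sum(variance(column) for column in items)
    return k / (k - 1) * (1 - sum_item_variances / variance(totals))


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_variance ** 0.5


# Toy data: two perfectly consistent items, and two groups whose means
# differ by half a pooled standard deviation.
print(cronbach_alpha([[1, 2, 3], [1, 2, 3]]))
print(cohens_d([2, 4, 6], [1, 3, 5]))
```

In the toy data, the two identical items give the maximum alpha of 1, and the two groups give d = 0.5, the threshold labelled “moderate” above.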
Machine learning classification
A Random Forest supervised classification model was implemented with scikit-learn v.0.22.1 in Python v.3.6.9. The original dataset was randomly divided into training (70%, n = 78) and test (30%, n = 34) sets. This division was made by considering the distribution of each of the four groups (control, lvPPA, nfvPPA, and svPPA) in the original sample. To determine the best hyperparameters of the model, a grid search with 5-fold cross-validation was carried out on the training set. The best model was then evaluated on the test set. The procedure described here was applied to two different datasets: one containing only 5 features corresponding to the total scores of the five linguistic MLSE domains, and another containing all domains’ subscores, with 30 features in total. Furthermore, Random Forest models were also used to rank the different features in each of the two datasets according to their importance in the classification. Finally, we selected one of the decision trees from the domain-based Random Forest model. This tree was selected using the following criteria: first, the inclusion of the five MLSE domains; second, a balanced accuracy greater than 70%; and third, clinical meaningfulness.
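The workflow described above (stratified split, grid search with 5-fold cross-validation on the training set, evaluation on the held-out test set, and feature ranking) can be sketched with scikit-learn as follows. This is not the study’s code: the data are synthetic and generated only so that the example runs, the “most affected” domain per variant is a simplification chosen for the simulation, the hyperparameter grid is a small hypothetical one (the tuned values are in Supplementary Table 2), and the use of a stratified split and of balanced accuracy as the search criterion are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

DOMAINS = ["motor_speech", "phonology", "semantics", "syntax", "working_memory"]
MAXIMA = np.array([30, 30, 20, 10, 10], dtype=float)
GROUP_SIZES = {"control": 42, "lvPPA": 30, "nfvPPA": 27, "svPPA": 13}
# Hypothetical "most affected" domain index per variant (simulation only).
AFFECTED = {"lvPPA": 4, "nfvPPA": 0, "svPPA": 2}


def make_synthetic_scores(seed=0):
    """Simulate domain scores; these are NOT the study's data."""
    rng = np.random.default_rng(seed)
    rows, labels = [], []
    for group, n in GROUP_SIZES.items():
        errors = rng.integers(0, 3, size=(n, len(DOMAINS))).astype(float)
        if group in AFFECTED:
            errors[:, AFFECTED[group]] += rng.integers(3, 8, size=n)
        rows.append(np.clip(MAXIMA - errors, 0, None))
        labels += [group] * n
    return np.vstack(rows), np.array(labels)


def fit_and_evaluate(X, y, seed=0):
    # 70/30 split that preserves the proportion of each group.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed
    )
    # Small illustrative grid; the real search space is not given in the text.
    grid = {"n_estimators": [50, 100], "max_depth": [None, 3]}
    search = GridSearchCV(
        RandomForestClassifier(random_state=seed),
        grid,
        cv=5,
        scoring="balanced_accuracy",
    )
    search.fit(X_train, y_train)
    y_pred = search.predict(X_test)  # uses the refitted best model
    importances = search.best_estimator_.feature_importances_
    return {
        "n_train": len(y_train),
        "n_test": len(y_test),
        "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
        "weighted_f1": f1_score(y_test, y_pred, average="weighted"),
        "importances": dict(zip(DOMAINS, importances)),
    }


results = fit_and_evaluate(*make_synthetic_scores())
print(results["n_train"], results["n_test"])
```

With 112 participants and `test_size=0.30`, the split yields the 78 training and 34 test cases reported in the text. The performance figures from this sketch reflect the simulated data only and say nothing about the MLSE itself.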
RESULTS
Sample characteristics, reliability, and influence of age and education
There were no statistically significant differences between the HC and PPA groups regarding the main demographic characteristics (Table 1). Patients with PPA were classified as non-fluent (n = 27), semantic (n = 13), and logopenic (n = 30) variants. Mean time since the onset of symptoms was 3.37±2.14 years. Patients with semantic PPA were younger than those with the logopenic variant in the post-hoc analysis (p = 0.030).
Table 1. Main demographic characteristics. * χ2 test.
Internal consistency (Cronbach’s alpha) was 0.758. All correlations of age and years of education with MLSE scores were low and not statistically significant (Supplementary Table 1). The correlation between the MLSE total score and the FTLD-CDR (sum of boxes) was r = 0.429 (p < 0.001).
Diagnosis of PPA
PPA patients as a group showed lower performance in all MLSE scores (Table 2). Effect sizes were moderate for MLSE-motor speech and large for the other scores. The area under the ROC curve for the discrimination between HC and PPA patients with the MLSE total score was 0.99 (95% confidence interval: 0.98–1.00). According to the Youden index, the optimal cutoff point was 95.00 (Fig. 1). The area under the curve and the best cutoff points for the different domains were 0.70 and 30 for motor speech, 0.95 and 29 for phonology, 0.94 and 20 for semantic knowledge, 0.94 and 10 for syntax, and 0.85 and 10 for working memory.
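The way a Youden-based cutoff is derived can be illustrated with a short sketch. The scores below are invented toy values, not the study data, and the convention that a score at or below the cutoff counts as abnormal is an assumption: the text does not state whether the reported cutoffs are inclusive.

```python
def youden_cutoff(controls, patients):
    """Return (cutoff, sensitivity, specificity) maximizing Youden's J.

    Assumed convention: score <= cutoff is classified as a patient,
    because lower MLSE scores indicate poorer performance.
    J = sensitivity + specificity - 1.
    """
    best = None
    for cutoff in sorted(set(controls) | set(patients)):
        sensitivity = sum(s <= cutoff for s in patients) / len(patients)
        specificity = sum(s > cutoff for s in controls) / len(controls)
        j = sensitivity + specificity - 1
        # Keep the first cutoff that reaches the highest J.
        if best is None or j > best[0]:
            best = (j, cutoff, sensitivity, specificity)
    return best[1:]


# Toy total scores (hypothetical).
toy_controls = [100, 99, 98, 97, 96, 95]
toy_patients = [96, 94, 90, 88, 80, 72]
cutoff, sensitivity, specificity = youden_cutoff(toy_controls, toy_patients)
print(cutoff, round(sensitivity, 3), specificity)
```

In this toy sample the best cutoff is 94: five of six patients score at or below it and no control does. Because errors of the kind scored by the MLSE are rare in healthy participants, the optimal cutoff falls close to the maximum score.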
Table 2. Comparison of MLSE scores between HC and PPA patients

Fig. 1. ROC curve for the discrimination between HC and PPA using MLSE-Total Score.
Differential diagnosis between PPA variants
The Kruskal-Wallis test revealed a significant group effect for MLSE-motor speech (H = 15.56, p < 0.001), MLSE-phonology (H = 17.77, p < 0.001), MLSE-semantics (H = 21.93, p < 0.001), and MLSE-working memory (H = 26.27, p < 0.001), but not for MLSE-syntax (H = 1.77, p = 0.413) or MLSE-total score (H = 4.15, p = 0.125). Post-hoc analyses showed statistically significant differences between nfvPPA versus svPPA (adjusted p = 0.001) and nfvPPA versus lvPPA (adjusted p = 0.009) in Motor Speech; between lvPPA versus svPPA (adjusted p < 0.001) and nfvPPA versus svPPA (adjusted p = 0.040) in Phonology; between svPPA versus nfvPPA (adjusted p < 0.001) and lvPPA versus nfvPPA (adjusted p = 0.008) in Semantics; and between lvPPA versus nfvPPA (adjusted p < 0.001) and lvPPA versus svPPA (adjusted p < 0.001) in Working Memory (Fig. 2).
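The relation between the H statistics and the p-values reported above can be checked by hand. With three groups, H is conventionally referred to a chi-square distribution with 2 degrees of freedom, for which the upper-tail probability has the closed form exp(-H/2). The sketch below assumes this asymptotic approximation was the one used (the text does not say so explicitly); the Bonferroni helper simply multiplies a pairwise p-value by the number of comparisons, and all function names are illustrative.

```python
import math


def kruskal_wallis_p(h, n_groups=3):
    """Asymptotic p-value for H with n_groups - 1 degrees of freedom.

    Only the three-group case is handled: with 2 degrees of freedom the
    chi-square survival function reduces to exp(-H / 2).
    """
    if n_groups != 3:
        raise NotImplementedError("closed form shown only for three groups")
    return math.exp(-h / 2)


def bonferroni(p, n_comparisons=3):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_comparisons)


# H statistics as reported in the text.
reported_h = {
    "motor speech": 15.56,
    "phonology": 17.77,
    "semantics": 21.93,
    "working memory": 26.27,
    "syntax": 1.77,
    "total score": 4.15,
}
for name, h in reported_h.items():
    print(f"{name}: H = {h}, p = {kruskal_wallis_p(h):.3f}")
```

Under this approximation the four significant domains give p-values below 0.001, and syntax and total score give values that match the reported 0.413 and 0.125 up to the rounding of H.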

Fig. 2. Boxplots comparing HC and the three PPA variants in the main MLSE scores and the total score. A) Total score; B) Motor Speech; C) Phonology; D) Semantics; E) Syntax; F) Working memory. Gray: healthy controls; Green: semantic variant; Yellow: logopenic variant; Blue: nonfluent variant. Black lines in the middle of the boxes represent median values for each group; vertical sizes of the boxes represent the interquartile range; flattened arrows represent minimum and maximum values.
Machine learning classification
A Random Forest classifier was tuned for two different datasets, one with the five MLSE domains, and another with all domains’ subscores. The optimal number of estimators (trees) for the domain-based and subscore-based Random Forest models was 700 and 100, respectively. Other tuned hyperparameters and model specifications are shown in Supplementary Table 2. Supplementary Table 3 shows the balanced accuracy and the weighted averages of precision, recall, and F1-score obtained when evaluating each of the two models with their respective test sets. Both models performed well, with the subscore-based model obtaining a slightly higher balanced accuracy (78.10%) than the domain-based model (77.75%). In terms of precision, recall, and F1-score, the domain-based model performed better than the subscore-based one (Supplementary Table 3).
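The two models can rank differently on balanced accuracy and on weighted F1 because the metrics weight the classes differently: balanced accuracy averages per-class recall with equal weight, whereas the weighted F1 weights each class by its number of cases. The sketch below computes both from a confusion matrix. The matrix is hypothetical (it is not one of the matrices in Fig. 3), with row totals of 13, 9, 8, and 4 chosen to resemble a 34-case test set.

```python
def per_class_metrics(cm):
    """Per-class (support, precision, recall, F1).

    cm[i][j] is the number of class-i cases predicted as class j.
    """
    n = len(cm)
    metrics = []
    for i in range(n):
        true_positives = cm[i][i]
        support = sum(cm[i])  # cases that truly belong to class i
        predicted = sum(cm[row][i] for row in range(n))  # cases labelled i
        recall = true_positives / support if support else 0.0
        precision = true_positives / predicted if predicted else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        metrics.append((support, precision, recall, f1))
    return metrics


def balanced_accuracy(cm):
    """Unweighted mean of per-class recall."""
    metrics = per_class_metrics(cm)
    return sum(recall for _, _, recall, _ in metrics) / len(metrics)


def weighted_f1(cm):
    """Mean of per-class F1, weighted by the number of cases per class."""
    metrics = per_class_metrics(cm)
    total = sum(support for support, _, _, _ in metrics)
    return sum(support * f1 for support, _, _, f1 in metrics) / total


# Hypothetical confusion matrix; rows and columns are ordered as
# control, lvPPA, nfvPPA, svPPA.
toy_cm = [
    [13, 0, 0, 0],
    [0, 7, 2, 0],
    [1, 1, 6, 0],
    [0, 1, 0, 3],
]
print(round(balanced_accuracy(toy_cm), 3), round(weighted_f1(toy_cm), 3))
```

In this toy matrix the largest class (controls) is also the easiest to classify, so the weighted F1 comes out higher than the balanced accuracy.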
Confusion matrices were also plotted to show differences and similarities between the two models (Fig. 3). Lastly, we obtained the feature importance scores from the Random Forest models for each of the two datasets to identify which features contributed most to the classification. The three most important features in the domain-based model were Syntax (0.300), Semantics (0.262), and Phonology (0.236) (Fig. 4). The decision tree selected from the domain-based model is shown in Fig. 5.

Fig. 3. Confusion matrices for each of the two Random Forest models trained and tested using (A) five MLSE domains, and (B) thirty MLSE domain subscores. These plots were obtained using Python’s Scikit-Learn library.

Fig. 4. Random Forest feature importance in the domain-based model (5 features).

Fig. 5. Decision tree using the five MLSE domains to classify PPA variants. This tree yielded a balanced accuracy of 0.72.
DISCUSSION
Our study shows that the Spanish version of the MLSE is a reliable test with high diagnostic accuracy for PPA. Internal consistency was appropriate according to Cronbach’s alpha, and the influence of age and years of education was non-significant. This finding is noteworthy because the absence of a statistically significant correlation with age and level of education may allow the test to be interpreted independently of demographic factors. Furthermore, this is consistent with the error-based scoring approach used in the MLSE, a method that is less susceptible to the effect of age and level of education. In this regard, the MLSE examines the main language domains with several tasks that are particularly sensitive to each of the PPA variants, allowing the observation of several signs (phonological errors, motor distortions, etc.) that should not be present in healthy subjects. Thus, cutoff points to discriminate between patients with PPA and healthy controls were very close to the maximum score of the MLSE and its domains’ subscores. As in the original version, discrimination between PPA and HC is nearly complete in ROC curve analysis, which confirms the validity of the test as a brief assessment for the diagnosis of PPA. Similarly, effect sizes were moderate-large for all the domains.
One of the most interesting findings of our study is the discrimination between PPA variants. On the one hand, all domains, except for syntax, showed statistically significant differences between PPA variants, with large effect sizes. As expected, nfvPPA scored lower than svPPA and lvPPA in motor speech; lvPPA and nfvPPA scored lower than svPPA in phonology; svPPA and lvPPA scored lower than nfvPPA in semantics; and lvPPA scored lower than nfvPPA and svPPA in working memory. On the other hand, a random forest classification model obtained a balanced accuracy of 77–78% and an F1-score of 79–81% for the discrimination between subtypes. These findings are remarkable, especially considering the short duration of the test, the difficulties in the clinical diagnosis of PPA variants, and the controversies regarding the optimal PPA classification [9, 15]. Differences in balanced accuracy were minimal between the model using the MLSE domains and the model using all the individual subscores. Furthermore, precision, recall, and F1 were slightly better in the model using the main domains. Overall, this confirms the validity of the domains used for test scoring and categorization. Because the model using domains has fewer features, it is easier to interpret and to apply in everyday clinical practice.
Another remarkable finding was the importance of each feature in the classification. In this regard, syntax was the most relevant feature, followed by semantics and phonology. Although syntax did not show statistically significant differences between PPA groups in our study, we can infer that syntax is important, but not in isolation. In fact, the score of each domain relative to the others is crucial in the interpretation of results. For instance, comprehension of complex sentences is included in the syntax domain. However, patients with svPPA or lvPPA may also show difficulties with this task, because of impaired single-word comprehension or impaired working memory, respectively. In this case, comparing the score in this task with other items linked to semantics and working memory helps to disentangle the core deficit in each patient. Motor speech was a less important domain, probably because it is only impaired in a subgroup of patients with nfvPPA (i.e., those with apraxia of speech).
Our study has some limitations. First, not all patients were evaluated at the onset of the disease. Future studies should evaluate the diagnostic properties of the test in patients at the time of the first consultation. Second, we used a cross-sectional design. Longitudinal assessments using the MLSE are warranted to determine the test’s value in the monitoring and follow-up of patients.
In conclusion, this study reports the development of the Spanish version of the MLSE and validates this tool for the diagnosis of PPA. This test provides a brief language examination, which may serve as a first step in the screening of patients in whom a language neurodegenerative disorder is suspected, and as a first clinical orientation into the main variants of PPA.
ACKNOWLEDGMENTS
JAM-G is supported by Instituto de Salud Carlos III through the Project INT20/00079 (co-funded by European Regional Development Fund “A way to make Europe”). The work was supported by the Medical Research Council (MR/N025881/1), the Cambridge Centre for Parkinson-plus, and the NIHR Cambridge Biomedical Research Centre (BRC-1215-20014). The views expressed are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care.
