Abstract
Background:
Automated volumetry software (AVS) has recently become widely available to neuroradiologists. MRI volumetry with AVS may support the diagnosis of dementias by identifying regional atrophy. Moreover, automatic classifiers using machine learning techniques have recently emerged as promising approaches to assist diagnosis. However, the performance of both AVS and automatic classifiers has been evaluated mostly in the artificial setting of research datasets.
Objective:
Our aim was to evaluate the performance of two AVS and an automatic classifier in the clinical routine condition of a memory clinic.
Methods:
We studied 239 patients with cognitive complaints from a single memory center cohort. Using clinical routine T1-weighted MRI, we evaluated the classification performance of: 1) univariate volumetry using two AVS (volBrain and Neuroreader™); 2) a Support Vector Machine (SVM) automatic classifier, using either the AVS volumes (SVM-AVS) or whole gray matter (SVM-WGM); 3) reading by two neuroradiologists. The performance measure was the balanced diagnostic accuracy. The reference standard was consensus diagnosis by three neurologists using clinical, biological (cerebrospinal fluid), and imaging data and following international criteria.
Results:
Univariate AVS volumetry provided only moderate accuracies (46% to 71% with hippocampal volume). The accuracy improved when using the SVM-AVS classifier (52% to 85%), becoming close to that of SVM-WGM (52% to 90%). Visual classification by neuroradiologists ranged between SVM-AVS and SVM-WGM.
Conclusion:
In the routine practice of a memory clinic, the use of volumetric measures provided by AVS yields only moderate accuracy. Automatic classifiers can improve accuracy and could be a useful tool to assist diagnosis.
INTRODUCTION
The diagnostic criteria of Alzheimer’s disease (AD) and other dementias have evolved in the past decades from a clinical descriptive perspective to biomarker-supported definitions, mainly due to innovation in brain imaging and biological fluid markers [1]. Among neuroimaging biomarkers, MRI is the least invasive and most widely available, is cost-effective and systematically recommended in dementia, and can provide supportive criteria for many neurodegenerative conditions [2–4]. MRI can identify areas of atrophy that suggest particular types of dementia, such as atrophy of the medial temporal structures in late-onset AD [5, 6] or anterior atrophy in frontotemporal dementia [7]. Assessment of regional atrophy using MRI in dementia has been extensively studied using visual, semi-quantitative ratings [5–7], manual volumetry, and more recently Automated Volumetry Software (AVS) [8–11].
AVS such as Neuroreader™ [10] and volBrain [12] provide volumetric measures of anatomical structures. Unlike subjective visual analysis of atrophy, AVS provide objective, quantitative measurement of various regions of interest (ROI) volumes. These tools, which are progressively being implemented in clinical MRI software, have only been evaluated in research settings [10, 14]. Besides, due to their univariate nature, they cannot detect complex multivariate combinations of regional atrophies, essential to discriminate between different dementias.
Automatic classifiers, based on machine learning techniques, are able to automatically learn complex multivariate discriminative patterns without priors on specific anatomical structures. Automatic classifiers have also mainly been evaluated in research settings, with standardized MRI acquisition and focusing on a single type of dementia (most often AD) and age-matched healthy controls [15–19].
In this study, we evaluated the diagnostic classification performance of AVS volumetry (volBrain and Neuroreader™) and of automatic classifiers (based on whole gray matter or on AVS volumes) in a clinical routine cohort of patients presenting with various neurodegenerative dementia disorders, depression, or subjective cognitive decline.
MATERIAL AND METHODS
Participants
All subjects were recruited retrospectively in a tertiary academic expert memory center (Institute for Memory and Alzheimer’s disease – Department of Neurology, Pitié-Salpêtrière University Hospital) from the ClinAD cohort [20]. The ClinAD cohort consists of 992 consecutive patients who consulted from 2005 to 2014 for cognitive impairment and who underwent lumbar puncture. Data collection was planned before the index test and reference standard were performed. All patients had neurological, biological, and neuropsychological evaluations. Cerebrospinal fluid (CSF) Aβ1-42, tau, and phosphorylated tau were available for all participants. All clinical and biological data were generated during a routine clinical workup and were retrospectively extracted for the purpose of this study. Therefore, according to French legislation, explicit consent was waived. However, regulations concerning electronic filing were followed, and patients and their relatives were informed that anonymized data might be used in research investigations.
For each patient, the diagnosis was assessed by a group of three neurologists based on clinical, biological, and imaging data, following international consensus criteria for AD (IWG-2) [21], frontotemporal dementia (FTD) [2], primary progressive aphasia (PPA) of the logopenic (lv-PPA), semantic (SD) or non-fluent/agrammatic (nf-PPA) [22] variant, corticobasal syndrome (CBD) [4], progressive supranuclear palsy (PSP) [23], posterior cortical atrophy (PCA) [24], Lewy body dementia (LBD) [25], and depression [26]. This consensus diagnosis formed the reference standard. The classifier and volumetry (index tests) results were not available to assessors of the reference standard. As clinical presentations and atrophy patterns depend mostly on the age of onset of AD [27], the AD group was separated into early-onset AD (EOAD) and late-onset AD (LOAD), with age of onset respectively before and after 65 years. Of the 992 patients, 342 were excluded because they presented with mixed pathology, vascular disease (Fazekas score > 2 or significant stroke), or unclear diagnosis. Of the remaining 650 patients of the ClinAD cohort, 380 were excluded because the MRI was performed outside our center and was not available for our study, leaving 270 patients. We added 12 subjective cognitive decline (SCD) patients, defined as patients with cognitive complaint but with normal neuropsychological examination.
Among the 282 patients, 7 were excluded due to poor image quality or failure of image processing pipelines. Specifically, 6 had a very low MRI quality on visual analysis (missing slices or strong motion artifacts) and the image processing pipelines failed in one participant. The quality of the remaining MRI data was variable, reflecting the reality of clinical routine, but proved sufficient for reliable image processing. The quality of image segmentation results was visually assessed. Moreover, we excluded diagnostic groups with fewer than 15 patients (nf-PPA, PSP, PCA), as automatic classifiers cannot be trained robustly on very small groups of subjects. As a result, the analyses were performed on 239 patients belonging to the following eight diagnostic groups: CBD, EOAD, LOAD, bvFTD, LBD, lv-PPA, SD, and depression. The flow chart is shown in Supplementary Table 1. In this cohort, the only group without a degenerative condition was that of patients with depression. We aimed to compare the results obtained for depression with those obtained for SCD. For that purpose, we used the 12 patients with SCD, defined as patients with cognitive complaint but with normal neuropsychological examination. For this group, classifiers were trained using the depression group and applied to the SCD group, because training a classifier on 12 participants would not be robust enough.
Demographic data are summarized in Table 1. Differences between groups in demographic and clinical data were evaluated with ANOVA for continuous data and the χ2 test for binary data using XLStat Software (Addinsoft, http://www.xlstat.com). As expected, since we separated the AD group into LOAD and EOAD, age at diagnosis was significantly different between groups (in ANOVA and post-hoc test). The Mini-Mental State Examination (MMSE) score was also different, since the neurodegenerative conditions do not have the same cognitive profile. For example, language impairment in PPA usually leads to lower MMSE scores than frontal dysfunction in FTD. There was no difference between groups regarding gender and MRI magnetic field.
Demographic and clinical characteristics of the population. Group differences were assessed with ANOVA for continuous variables and χ2 test for discrete variables. Data are expressed as mean±SD
CBD, corticobasal syndrome; Depr., depression; EOAD, early-onset AD; FTD, frontotemporal dementia of the behavioral type; LBD, Lewy body dementia; LOAD, late-onset AD; lv-PPA, logopenic variant of primary progressive aphasia; SCD, subjective cognitive decline; SD, semantic variant of primary progressive aphasia.
MRI acquisition
All 239 patients had an available brain MRI performed in the Department of Neuroradiology at Pitié-Salpêtrière Hospital: 63 on a 3T MRI GE Signa HD, 9 on a 1.5T MRI GE Optima 450, 44 on a 1.5T MRI GE Signa Excite, and 123 on a 1T MRI Philips Panorama. All MRI included a 3D T1-weighted sequence with a spatial resolution ranging from 0.5 × 0.5 × 1.2 mm³ to 1 × 1 × 1.2 mm³. Since imaging was performed as part of clinical routine, MRI acquisition parameters were not homogenized. Sequence parameters are available in Supplementary Table 2. The 12 SCD patients had an MRI performed in our center: 8 on a 3T MRI GE Signa HD, 1 on a 1.5T MRI GE Optima 450, and 3 on a 1T MRI Philips Panorama.
Fully automated volumetry software
The Neuroreader™ software (http://www.brainreader.net) is a commercial clinical brain image analysis tool [10]. The system provides the volumes of the following structures: intracranial cavity, tissue categories (white matter (WM), gray matter (GM), and CSF), subcortical GM structures (putamen, caudate, pallidum, thalamus, hippocampus, amygdala, and accumbens), and lobes (occipital, parietal, frontal, and temporal). Processing times range from 3 to 7 minutes as a function of image size, irrespective of magnetic field strength.
The volBrain software (http://volBrain.upv.es) is an online freely-available academic brain image analysis tool [12]. The volBrain system takes around 15 min to perform the full analysis and provides the same volumes as Neuroreader™ except for the lobar volumes, only provided by Neuroreader™. However, the volBrain system provides hemisphere, brainstem and cerebellum segmentations which were not used in this study.
Automatic classification using SVM
Pre-processing: Extraction of whole gray matter maps
All T1-weighted MRI images were segmented into GM, WM, and CSF tissue maps using the Statistical Parametric Mapping unified segmentation routine with the default parameters (SPM12, London, UK http://www.fil.ion.ucl.ac.uk/spm/) [28]. A population template was calculated from GM and WM tissue maps using the DARTEL diffeomorphic registration algorithm with the default parameters [29]. The obtained transformations and a spatial normalization were applied to the GM tissue maps. All maps were modulated to ensure that the overall tissue amount remained constant and were normalized to MNI space. Smoothing of 12 mm was applied, as classification performed better with this parameter than with unsmoothed or less smoothed images.
SVM classification
Whole gray matter (WGM) maps were then used as input to a high-dimensional classifier based on a linear support vector machine (SVM). In brief, the linear SVM looks for the hyperplane that best separates two given groups of patients in a very high-dimensional space composed of all voxel values. In this approach, the machine learning algorithm automatically learns the spatial pattern (set of voxels and their weights) that discriminates between diagnostic groups. Importantly, the classifier does not use prior information such as anatomical boundaries between structures or the assumption that a specific anatomical structure (e.g., hippocampus) would be affected in a given condition. Please refer to Cuingnet et al. [15] for more details.
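For readers less familiar with the method, the hyperplane search can be written compactly. The expression below assumes the standard soft-margin formulation of the linear SVM; the exact variant used here is the one described in Cuingnet et al. [15]. The symbols are introduced for illustration only: x_i is the vector of voxel values of patient i, y_i ∈ {−1, +1} codes the two diagnostic groups, w holds one weight per voxel (the weight map), b is an offset, the ξ_i are slack variables, and C is the regularization hyper-parameter.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,\bigl(w^{\top}x_i + b\bigr) \ge 1-\xi_i,\qquad \xi_i \ge 0,
\qquad\text{with prediction}\qquad
\hat{y}(x) = \operatorname{sign}\bigl(w^{\top}x + b\bigr)
```

Under this formulation, voxels with large absolute weights in w contribute most to the decision, which is why w can be displayed as a brain map.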
SVM classification was performed for each possible pair of diagnostic groups (e.g., EOAD versus FTD, LOAD versus FTD, etc.). The performance measure was the balanced diagnostic accuracy, defined as (sensitivity + specificity)/2. Unlike standard accuracy, balanced accuracy allows the performance of different classification tasks to be compared objectively even in the presence of unbalanced groups [15].
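The pairwise design and the performance measure can be illustrated with a short sketch. Balanced accuracy is conventionally the mean of sensitivity and specificity. The function name, the label strings, and the toy labels below are assumptions made for illustration and are not study data.

```python
from itertools import combinations


def balanced_accuracy(y_true, y_pred, positive):
    """Mean of sensitivity and specificity for a two-class task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# The eight diagnostic groups analysed; each pair defines one binary task.
GROUPS = ["CBD", "EOAD", "LOAD", "FTD", "LBD", "lv-PPA", "SD", "Depression"]
PAIRS = list(combinations(GROUPS, 2))
print(len(PAIRS))  # 28

# Hypothetical unbalanced task: a classifier that always answers "LOAD".
y_true = ["LOAD"] * 90 + ["LBD"] * 10
y_pred = ["LOAD"] * 100
standard_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(standard_accuracy)                                   # 0.9
print(balanced_accuracy(y_true, y_pred, positive="LBD"))   # 0.5
```

In this toy case the standard accuracy looks high only because one group is much larger, whereas the balanced accuracy correctly sits at chance level.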
In order to compute unbiased estimates of classification performance, we used a 10-fold cross-validation: the dataset is split into ten parts, and each part (10% of the data) is used in turn for testing while the remaining 90% is used for training. This ensures that the patient currently being classified has not been used to train the classifier, thereby avoiding the problem known as “double-dipping”. Finally, the SVM classifier has one hyper-parameter to optimize. The optimization was done using a grid search. Again, in order to have a fully unbiased evaluation, the hyper-parameter tuning was done using a second, nested, 10-fold cross-validation procedure.
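The nested procedure can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: it assumes scikit-learn, synthetic data in place of patient images, stratified and shuffled folds, and an arbitrary grid of values for the hyper-parameter C.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for two unbalanced diagnostic groups (not study data).
X, y = make_classification(n_samples=120, n_features=50, n_informative=5,
                           weights=[0.6, 0.4], random_state=0)

# Two independent 10-fold splits: the inner one tunes C, the outer one evaluates.
inner_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

# Inner loop: grid search over C, run only on the outer training portion.
tuned_svm = GridSearchCV(
    estimator=SVC(kernel="linear"),
    param_grid={"C": [1e-3, 1e-2, 1e-1, 1.0, 10.0]},  # assumed grid
    scoring="balanced_accuracy",
    cv=inner_cv,
)

# Outer loop: each held-out fold is scored by a model that never saw it,
# neither for fitting nor for choosing C.
outer_scores = cross_val_score(tuned_svm, X, y, cv=outer_cv,
                               scoring="balanced_accuracy")
print(outer_scores.mean())
```

Because the grid search is wrapped inside the outer loop, the reported score is not inflated by hyper-parameter selection on the test patients.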
Finally, in order to have a fair comparison between WGM maps and AVS volumes, we also performed SVM classification using volumes of each AVS as input, all regional volumes (for a given AVS) being simultaneously used in a multivariate manner.
Radiological classification
Two neuroradiologists (AB, with 8 years of experience, and SS, with 4 years of experience), specialized in the evaluation of dementia, performed a visual classification of three diagnosis pairs on the same dataset: FTD versus EOAD, depression versus LOAD, and LBD versus LOAD. We chose FTD versus EOAD and depression versus LOAD for their relevance in clinical practice. We chose LBD versus LOAD because the SVM classifier yielded only moderate accuracies, and because the diagnosis of LBD based on MRI is difficult. The neuroradiologists were blind to all patient data except MRI.
RESULTS
Automated volumetry software: volBrain and Neuroreader™
We performed a univariate classification based on each AVS volume separately. Volumes were normalized to the measured total intracranial volume (mTIV) (using the formula: Volume/mTIV), as discrimination was slightly better than with absolute values. VolBrain and Neuroreader™ performed similarly on univariate classification, with balanced accuracy rates ranging from 46% to 71% based on hippocampal volumes. We show various volumes obtained with Neuroreader™ in Supplementary Table 3. We show results of classification based on hippocampal volume computed with Neuroreader™ in Table 2. In Supplementary Tables 4 to 9, we provide classification balanced accuracy based on volumes of other anatomical structures known to be of particular interest in various neurodegenerative conditions.
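A univariate classification of this kind can be sketched in a few lines. The cut-off rule below (pick the threshold with the highest balanced accuracy) is an assumption made for illustration, since the selection rule is not detailed here; the volumes are hypothetical values in mL, not study data.

```python
def normalize(volume_ml, mtiv_ml):
    """Express a regional volume as a fraction of the measured TIV (Volume/mTIV)."""
    return volume_ml / mtiv_ml


def classify(values, cutoff):
    """Predict 1 (group with expected atrophy) when the value is below the cut-off."""
    return [1 if v < cutoff else 0 for v in values]


def balanced_accuracy(labels, predictions):
    """Mean of sensitivity and specificity, with label 1 as the positive class."""
    pos = [p for l, p in zip(labels, predictions) if l == 1]
    neg = [p for l, p in zip(labels, predictions) if l == 0]
    sensitivity = sum(p == 1 for p in pos) / len(pos)
    specificity = sum(p == 0 for p in neg) / len(neg)
    return (sensitivity + specificity) / 2


def best_cutoff(values, labels):
    """Assumed rule: keep the candidate cut-off with the highest balanced accuracy."""
    candidates = sorted(set(values))
    return max(candidates,
               key=lambda c: balanced_accuracy(labels, classify(values, c)))


# Hypothetical (hippocampal volume, mTIV) pairs in mL.
patients = [(6.1, 1450.0), (5.8, 1500.0), (6.5, 1400.0),   # label 1
            (7.9, 1480.0), (7.4, 1390.0), (8.2, 1550.0)]   # label 0
labels = [1, 1, 1, 0, 0, 0]

values = [normalize(v, t) for v, t in patients]
cutoff = best_cutoff(values, labels)
score = balanced_accuracy(labels, classify(values, cutoff))
print(cutoff, score)
```

In a real evaluation the cut-off would have to be chosen on training folds only and applied to held-out patients, for the same reason that the SVM is cross-validated.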
Classification results for univariate classification from hippocampal volumes obtained with the Neuroreader™ AVS. For each pair of possible diagnoses, we report the balanced accuracy. Chance level classification is at 50%. Colder colors (green/blue) correspond to less accurate classifications while warmer colors (red/orange) correspond to more accurate classifications
CBD, corticobasal syndrome; Depr., depression; EOAD, early-onset AD; FTD, frontotemporal dementia of the behavioral type; LBD, Lewy body dementia; LOAD, late-onset AD; lv-PPA, logopenic variant of primary progressive aphasia; SCD, subjective cognitive decline; SD, semantic variant of primary progressive aphasia.
Automatic SVM classifier from whole gray matter maps
Table 3 provides the results of automatic SVM classification from WGM segmentation maps. Balanced accuracies ranged from 52% (LBD versus LOAD) to 90% (EOAD versus SCD). We present in Supplementary Figure 1 two examples of weight maps, which are graphic representations of the most relevant voxels for classification.
Classification results for SVM classification from Whole Gray Matter maps. For each pair of possible diagnoses, we report the balanced accuracy. Chance level is at 50%. Colder colors (green/blue) correspond to less accurate classifications while warmer colors (red/orange) correspond to more accurate classifications
CBD, corticobasal syndrome; Depr., depression; EOAD, early-onset AD; FTD, frontotemporal dementia of the behavioral type; LBD, Lewy body dementia; LOAD, late-onset AD; lv-PPA, logopenic variant of primary progressive aphasia; SCD, subjective cognitive decline; SD, semantic variant of primary progressive aphasia.
Automatic SVM classification from AVS volumes
To fully compare AVS with our SVM-WGM classification, we provide, in Supplementary Table 10, results of SVM classification from all volumes obtained with volBrain and Neuroreader™ in addition to SVM based on WGM. In general, results were slightly lower than with SVM classification from WGM. Overall, volBrain and Neuroreader™ performed similarly, even though one or the other tool achieved slightly higher performances in some specific cases.
Radiological classification
Classification by experienced neuroradiologists resulted in the following balanced accuracies: 77% (neuroradiologist 1) and 72% (neuroradiologist 2) for LOAD versus depression, 72% and 75% for FTD versus EOAD, and 57% and 63% for LBD versus LOAD (Table 4). Neuroradiological classification performed better than both SVM-AVS and univariate AVS except for LBD versus LOAD classification in which they performed equally. The performance of the SVM-WGM was in general comparable to that of neuroradiologists. However, it was superior to both radiologists for FTD versus EOAD classification.
Comparative performances of neuroradiologists, univariate AVS, and automatic classifiers. The three diagnostic classification tasks are Depression versus LOAD, FTD versus EOAD and LBD versus LOAD
AVS, Automated Volumetry Software; SVM-AVS, Support Vector Machine Automated Volumetry Software.
DISCUSSION
In this study, we assessed the diagnostic performance of AVS and SVM classifiers for various neurodegenerative conditions. The SVM classifier based on WGM provided accurate diagnostic classification for the majority of diagnoses and was far more accurate than univariate classification based on regional volumes, such as hippocampal volume, obtained through AVS. The performance of the SVM classifier was similar to or slightly higher than that of trained neuroradiologists on selected classification tasks.
The best accuracies were obtained with SVM classification from WGM maps. Balanced accuracy was above 70% in 64% of the available combinations and above 80% in 25% of them. Two studies evaluated SVM classification between AD and FTD in a research setting [16, 30]. In this setting, they obtained slightly higher diagnostic accuracy, with AD versus FTD classification ranging from 84% to 90% (in our study: FTD versus EOAD: 83% and FTD versus LOAD: 73%). This slightly superior accuracy might be explained by the more controlled setting of research studies, in particular less heterogeneous MRI acquisitions, and by the fact that our patients were at a slightly less advanced disease stage. Moreover, in Klöppel et al. [30], the use of neuropathology as the diagnostic criterion might have provided more homogeneous groups of patients, helping to better distinguish different diagnoses. To the best of our knowledge, only one study has previously evaluated SVM classifiers in clinical routine with various types of dementia [31]. The accuracies that we report are consistent with those reported in Koikkalainen et al. [31], in which diagnostic accuracy was 80% for FTD versus AD (in our study, FTD versus LOAD: 73% and FTD versus EOAD: 83%), 68% for LBD versus AD (in our study, LBD versus EOAD: 77% and LBD versus LOAD: 52%), and 77.5% for LBD versus FTD (in our study, LBD versus FTD: 67%). In that previous study, unlike ours, there were no patients with PPA or CBD. Furthermore, contrary to our study, diagnoses were not assessed with the latest diagnostic criteria, especially regarding AD CSF biomarkers. Finally, that study did not compare the performance of SVM to that of AVS tools, which are quickly becoming standard in radiological routine.
Therefore, to the best of our knowledge, we present the first study of whole-brain classifiers on clinical routine data based on the latest diagnostic criteria, and with comparison to AVS tools, the current standard of quantitative clinical radiology.
When focusing on some particularly difficult clinical situations, automatic classification results are particularly promising. For instance, SVM classification distinguished depression, EOAD, and FTD with an accuracy above 80%. In particular, SVM classification was more accurate than that of trained neuroradiologists for EOAD versus FTD. These situations often involve young patients with an atypical clinical presentation. In these cases, there is often a dramatic impact on professional and family life. Finally, the diagnosis determines the type of care, for instance choosing cholinesterase inhibitors in AD versus antidepressant drugs in depression, or pursuing a genetic diagnosis in FTD. Another challenging situation can be the disentanglement of PPA variants, which all include predominant language impairment but are associated with variable neuropathological lesions [32]. SD could be distinguished from lv-PPA with an accuracy of 77%. As expected, the classifier, as well as the neuroradiologists, performed better on dementias known to have a strongly specific atrophy pattern (such as SD or FTD) [7] and worse on dementias with less specific atrophy patterns (LBD, CBD) [33, 34]. Interestingly, the classifier distinguished SCD from the vast majority of neurodegenerative diseases with high accuracy. One can note that it performed better for SCD than for depression. One explanation could be the atrophy usually described in depression [35].
Compared to our SVM classifier, univariate classification based on AVS performed poorly. When analyzing the diagnostic accuracy based on each of the volumes obtained with AVS, balanced accuracies ranged between 53% and 84%. With the hippocampus alone, classification rates rarely exceeded 70%, which is relatively low. In previous studies, the role of the hippocampus has been mainly evaluated for the diagnosis of AD versus controls or in mild cognitive impairment populations to identify patients who will later progress to AD [8, 37]. In our study, we evaluated MRI measurements in AD versus other dementias (FTD for instance), where hippocampal volumetry alone is known to perform poorly [38, 39].
The poor performance of univariate classification and the improvement obtained with SVM classification of the volumes from both AVS (balanced accuracy ranging from 60% to 80%) emphasize that atrophy in dementia involves complex, distributed spatial patterns. The only study comparing univariate (hippocampus) and multivariate analysis in two AVS (NeuroQuant™ and Neuroreader™) reached different conclusions [13]. It did not find any additional prognostic performance with multivariate analysis compared to univariate analysis. Nevertheless, that study focused on prediction of progression to AD among mild cognitive impairment patients, an objective that differs from ours. Finally, the SVM classifier using WGM generally performed better than the multivariate analyses of both AVS. This is likely because the pattern of atrophy may not coincide with the boundaries of the anatomical regions delineated by AVS. This demonstrates the interest of letting the algorithm learn a discriminative pattern from the WGM, without priors, rather than using anatomical boundaries provided by AVS.
Neuroradiological classification was generally more accurate than hippocampal volumetry using AVS. The only exception was for LBD versus LOAD, a differential diagnosis for which anatomical MRI does not bring much relevant information and for which all approaches performed relatively poorly. Neuroradiological classification and SVM-WGM generally achieved similar performance. Nevertheless, the performance of SVM-WGM was superior for EOAD versus FTD. This indicates that an automatic classifier can be a useful tool to assist trained neuroradiologists for difficult situations.
Our study also demonstrates the feasibility of those techniques in the context of routine MRI data of varying image quality acquired at different magnetic field strengths. AVS segmentation and SVM classification were successful on almost every MRI.
One limitation of our study is the use of a binary classifier which does not totally correspond to the clinical practice where patients can have multiple diagnostic hypotheses. Further investigations could include multi-group classification instead of paired groups, in order to obtain a probability related to each potential diagnosis. Another limitation might be that we did not include healthy controls but rather used two control groups composed of patients with depression and SCD respectively. However, this situation is representative of the clinical routine: patients seen in a memory clinic are usually diagnosed with a neurological or a psychiatric condition, or present with subjective cognitive impairment, and are thus not “pure” control subjects.
As AVS start to be implemented in clinical routine, a final step in the analysis of raw AVS volumes could be a classification with an SVM based on all the AVS data. By analogy with AVS, our SVM-WGM classifier could be implemented in the post-processing of MRI in clinical routine. Thus, neuroradiologists could use the indication provided by the automatic classifier to refine their diagnosis. Also, in our study, neuroradiologists were operating in highly specialized centers and had considerable experience with different types of dementia (including rare diseases). It is thus conceivable that an automatic classifier would be of even greater help in less specialized centers.
Conclusion
Our study supports the applicability of computer-assisted diagnostic tools such as AVS and SVM classifiers to clinical routine data. When facing various dementia disorders, the accuracy of univariate volumetric analysis is too low to assist clinical decision making. In a clinical routine setting, automatic classifiers provide high diagnostic accuracy for distinguishing between several types of dementia. The implementation of advanced MRI-based computer-assisted diagnostic tools in clinical routine, such as SVM classification, could help to improve diagnostic accuracy.
ACKNOWLEDGMENTS
O.C. is supported by a “Contrat d’Interface Local” from Assistance Publique-Hôpitaux de Paris (AP-HP).
HH is supported by the AXA Research Fund, the Fondation Université Pierre et Marie Curie and the Fondation pour la Recherche sur Alzheimer, Paris, France.
The research leading to these results has received funding from the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute), reference ANR-10-IAIHU-06 (Agence Nationale de la Recherche-10-IA Institut Hospitalo-Universitaire-6), and reference ANR-11-IDEX-004 (Agence Nationale de la Recherche-11- Initiative d’Excellence-004, project LearnPETMR number SU-16-R-EMR-16), from the European Union H2020 program (project EuroPOND, grant number 666992), from the joint NSF/NIH/ANR program “Collaborative Research in Computational Neuroscience” (project HIPLAY7, grant number ANR-16-NEUC-0001-01), from Agence Nationale de la Recherche (project PREVDEMALS, grant number ANR-14-CE15-0016-07), from the ICM Big Brain Theory Program (project DYNAMO), from the Abeona Foundation (project Brain@Scale) and from the “Contrat d’Interface Local” program (to Dr. Colliot) from Assistance Publique-Hôpitaux de Paris (AP-HP).
