Abstract
Background:
Patients with the behavioral variant of frontotemporal dementia (bvFTD) may initially show only behavioral and/or cognitive symptoms that overlap with other neurological and psychiatric disorders. Diagnostic accuracy depends on progressive worsening of symptoms and on frontotemporal abnormalities in neuroimaging findings. Predictive biomarkers could facilitate the early detection of bvFTD.
Objective:
To determine the prognostic accuracy of clinical and structural MRI data using a support vector machine (SVM) classification to predict the 2-year clinical follow-up diagnosis in a group of patients presenting late-onset behavioral changes.
Methods:
Data from 73 patients were included and divided into probable/definite bvFTD (n = 18), neurological (n = 28), and psychiatric (n = 27) groups based on 2-year follow-up diagnosis. Grey-matter volumes were extracted from baseline structural MRI scans. SVM classifiers were used to perform three binary classifications: bvFTD versus neurological and psychiatric, bvFTD versus neurological, and bvFTD versus psychiatric group(s), and one multi-class classification. Classification performance was determined for clinical and neuroimaging data separately and their combination using 5-fold cross-validation.
Results:
Accuracy of the binary classification tasks ranged from 72% to 82% (p < 0.001) with adequate sensitivity (67–79%), specificity (77–88%), and area under the receiver-operator curve (0.80–0.90). Multi-class accuracy ranged from 55% to 59% (p < 0.001). The combination of clinical and voxel-wise whole brain data showed the best performance overall.
Conclusion:
These results show the potential for automated early confirmation of diagnosis for bvFTD using machine learning analysis of clinical and neuroimaging data in a diverse and clinically relevant sample of patients.
INTRODUCTION
The behavioral variant of frontotemporal dementia (bvFTD) is the second most common early-onset dementia and is characterized by the deterioration of personality, social cognition, and executive functions [1]. The progressive behavioral changes seen in bvFTD are typically accompanied by apathy and social disinhibition leading to isolation and the expression of a wide range of inappropriate behavior potentially causing a great amount of distress for family members and caretakers [2, 3].
In 2011, a set of revised diagnostic criteria was developed by the International Behavioral Variant FTD Criteria Consortium (FTDC) in order to improve the diagnostic sensitivity of bvFTD [1, 4]. Although the sensitivity of the FTDC diagnostic criteria has increased, the diagnostic process is challenging and several clinical predicaments remain. The behavioral and cognitive symptoms described in the FTDC criteria for possible bvFTD overlap considerably with several psychiatric disorders (e.g., depression, schizophrenia, and others) and other types of dementia [5–8], which affects the accuracy of the diagnostic criteria. To increase the level of diagnostic certainty for bvFTD, patients must show imaging abnormalities suggestive of frontotemporal degeneration along with clinical decline (probable bvFTD). Especially in early disease stages or in some specific pathogenic mutations, it is challenging for physicians to determine whether patients have bvFTD, as significant functional decline might not be detectable or imaging results might not be conclusive for bvFTD [1, 9]. Therefore, there is a crucial unmet clinical need for better tools to differentiate between bvFTD and clinically and radiologically overlapping disorders.
We have previously shown that the conversion to bvFTD can be predicted well by visual ratings using magnetic resonance imaging (MRI) and fludeoxyglucose (18F) positron emission tomography ([18F]FDG-PET) findings [9]. However, [18F]FDG-PET is costly, requires the injection of a radioactive tracer, is not available in the majority of hospitals, and has a high false-positive rate [9, 10], as psychiatric disorders may show frontotemporal hypometabolism as well [11]. In addition, the image analysis requires a highly specialized neuroradiologist and involves manual evaluation of the scans. Therefore, the focus of this study is on MRI only and its automatic evaluation without the need for an expert neuroradiologist.
Previous studies have shown that machine learning based pattern recognition techniques could be useful to automatically classify structural MRI scans and aid clinical diagnosis, as these methods can be used to classify patients at the single-subject level. This is a fundamentally different approach from typical neuroimaging studies that test differences between patients and controls at the group level [12]. It has been demonstrated that machine learning techniques can successfully distinguish FTD from Alzheimer’s disease (AD) and healthy controls (HC) [13–19], but little is known about their use in a more diverse sample. We chose to focus on grey-matter data since it is a standard way to analyze structural MRI data, grey-matter atrophy is used in the diagnosis of (probable) bvFTD, and it has shown consistent group differences between bvFTD patients and HC [1, 20]. In addition to neuroimaging data, we have previously shown that clinical data can also be informative for distinguishing between bvFTD and psychiatric disorders [21]. Predictive accuracy may therefore be improved by combining MRI data with clinical data.
Here, we aimed to predict the conversion to bvFTD in patients with early-stage behavioral complaints in a clinically representative sample. We used machine learning algorithms to predict the 2-year follow-up diagnosis based on baseline clinical and structural MRI data from patients who were included based on a set of behavioral complaints only. This approach is expected to resemble clinical practice, in which a wide range of patients present with symptoms fitting a differential diagnosis that ranges from various psychiatric and neurological disorders to neurodegenerative disorders including bvFTD. We also investigated whether limiting the classification to a set of a priori selected regions of interest (ROIs) associated with bvFTD atrophy could further improve predictive accuracy or whether a global brain pattern would be better suited for the task.
MATERIALS AND METHODS
Study sample
Seventy-three patients were selected from the Late Onset Frontal Syndrome (LOF) study, an observational, cross-sectional, and prospective follow-up study conducted at multiple medical centers [22]. Data from these patients were selected because of the availability of a high-resolution 3-dimensional (3D) T1-weighted MRI scan at baseline and a two-year follow-up diagnosis. Recruitment occurred between April 2011 and June 2013 through the memory clinic of the Alzheimer Center at the VU University Medical Center (VUmc) and the department of Old Age psychiatry of the GGZinGeest in Amsterdam, the Netherlands. From the initial cohort of 137 participants, 54 either lacked a good-quality whole-brain 3D T1-weighted MRI scan or had a baseline MRI scan that had been performed elsewhere, and 10 had subjective complaints which did not allow for a clinical diagnosis. The age of the included patients ranged from 48 to 75 years, and 78% were male. The average age (p = 0.961) and gender (p = 0.331) at baseline did not differ between the included and excluded patients in the current study.
The inclusion and exclusion criteria for the LOF study have been described in [22]. In short, the inclusion criteria consisted of a presentation of symptoms without underlying clinical diagnosis quantified by 1) behavioral changes consisting of apathy, disinhibition, and/or compulsive/stereotypical behavior arising in middle or late adulthood, observed by a clinician or reliable informant, and 2) a Frontal Behavioral Inventory (FBI) [23] score of ≥11 or a score of ≥10 on the Stereotypy Rating Inventory (SRI) [24]. Exclusion criteria were 1) a confirmed diagnosis of dementia or psychiatric disorder with overlapping clinical symptoms, 2) a Mini-Mental State Examination (MMSE) [25] score ≤18, 3) medical history including traumatic brain injury, mental retardation, drug or alcohol abuse, 4) absence of a reliable informant, 5) insufficient communication skills, 6) acute onset of behavioral problems, 7) clinically apparent aphasia or semantic dementia, and 8) contraindications for MRI. After inclusion, all patients underwent a standardized clinical assessment at baseline and at 2-year follow-up. The final consensus diagnosis between the neurologist and the psychiatrist was made based upon the relevant clinical information and additional investigations, including results of cerebrospinal fluid biomarkers, MRI, and [18F]FDG-PET at baseline ([18F]FDG-PET was only performed in case of normal MRI findings or doubt about whether the abnormalities explained the behavioral changes). All patients with a positive family history for early-onset dementia were referred for clinical genetic counseling. If deemed appropriate, genetic screening included the MAPT, GRN, PSEN1, and APP genes. In all subjects for whom DNA was available (n = 137), C9orf72 repeat expansion was tested. After two years of follow-up, the neuropsychiatric examination, neuropsychological examination, and brain MRI were repeated.
Diagnosis was based upon the National Institute on Aging-Alzheimer’s Association guidelines for AD, the National Institute of Neurological Disorders and Stroke-Association Internationale pour la Recherche et l’Enseignement en Neurosciences criteria for vascular dementia (VaD), the international consensus diagnostic criteria for dementia with Lewy bodies (DLB), the DSM-IV for psychiatric disorders, and the FTDC for bvFTD [1, 26–28]. Included patients were grouped based on their 2-year follow-up diagnosis as either bvFTD (including only probable and definite bvFTD as defined by the FTDC), primary psychiatric disorders (e.g., major depressive disorder, bipolar disorder, obsessive compulsive disorder, schizophrenia; further abbreviated to ‘psychiatric’), or other neurological diseases (e.g., AD, VaD, DLB; further summarized as ‘neurological’). Thus, behavioral symptoms at baseline eventually led to widely different diagnoses at 2-year follow-up. The study was approved by the Medical Research Ethics Committee of the VUmc.
Clinical data
To establish a baseline for the classification analysis we extracted demographic and clinical variables available at baseline. Those variables included: age at baseline, gender, education according to the Verhage scale [29], MMSE, total FBI and SRI score. This allowed us to compare the added value of MRI data for the automatic classification of bvFTD.
MR image acquisition
Baseline structural MRI data was acquired in the Alzheimer Center of the VU University Medical Center on two 3T whole-body MR systems (Signa HDxt, GE Medical Systems Milwaukee, WI, USA; Ingenuity TF PET/MR, Philips Healthcare, Cleveland, OH, USA) following a standard MRI protocol for dementia [30]. This protocol included either a sagittal 3D T1-weighted fast spoiled gradient-echo (FSPGR) sequence with oblique coronal reformats with the following parameters: time to repetition (TR) = 7.8 ms, time to echo (TE) = 3 ms, flip angle (FA) = 12°, sagittal sections = 180 and voxel size = 0.98×0.98×1 mm on the GE Signa HDxt, or a turbo field echo (TFE) sequence with the following parameters: TR = 7 ms, TE = 3 ms, FA = 12°, sagittal sections = 180, voxel size = 1×1×1 mm on the Philips Ingenuity TF.
MR image processing
The T1-weighted images of all 73 patients were preprocessed using the CAT12 toolbox (http://www.neuro.uni-jena.de/cat). Preprocessing consisted of tissue-segmentations into grey matter (GM), white matter, and cerebrospinal fluid followed by normalization to the Montreal Neurological Institute (MNI) space using DARTEL [31] registration based on a template derived from 555 HC subjects of the IXI-database (http://www.brain-development.org) provided by the CAT12 toolbox. The normalized images were modulated by the Jacobian determinant to preserve local tissue volume and spatially smoothed by an 8 mm isotropic Gaussian kernel. A GM mask was created by thresholding the individual GM images at 0.2 and ensuring that a voxel was only included if at least 51% of patients included the same voxel in their individual mask.
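The group mask rule described above can be illustrated with a minimal NumPy sketch. It assumes the smoothed, modulated GM images have already been flattened to a subjects × voxels array; the function name `group_gm_mask`, the strict `>` comparison at 0.2, and the toy values are illustrative assumptions, not details taken from the CAT12 pipeline.

```python
import numpy as np

def group_gm_mask(gm_images, threshold=0.2, min_fraction=0.51):
    """Return a boolean voxel mask shared by at least `min_fraction` of subjects.

    gm_images: array of shape (n_subjects, n_voxels) with GM values.
    """
    # Per-subject binary masks (strict inequality is an assumption)
    individual_masks = gm_images > threshold
    # Fraction of subjects whose individual mask contains each voxel
    fraction_included = individual_masks.mean(axis=0)
    return fraction_included >= min_fraction

# Toy example: 3 subjects, 4 voxels
gm = np.array([
    [0.90, 0.10, 0.30, 0.00],
    [0.80, 0.30, 0.25, 0.00],
    [0.70, 0.10, 0.10, 0.10],
])
mask = group_gm_mask(gm)
```

In this toy case the first voxel is kept (3 of 3 subjects), the third is kept (2 of 3), and the other two are dropped.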
Following the preprocessing, MRI data was either extracted as a whole brain voxel-wise map or confined to ROIs based on the literature [1, 32]. A set of bilateral ROIs associated with bvFTD was selected and all voxels within these ROIs were used for the classification. The ROIs were selected from the probabilistic cortical Harvard-Oxford atlas [33–36] from FSL [37] and included the orbitofrontal cortex, bilateral temporal pole, insular cortex, and prefrontal cortex.
Clinical and MRI data
In addition, we combined the voxel-wise and ROI GM volume data (separately) with the clinical data by concatenating the individual feature matrices. This provided a simple way to incorporate information from different data sources for the classification. Therefore, in total five different data views were considered: 1) clinical, 2) voxel-wise whole brain, 3) ROI localized, 4) clinical + ROI, and 5) clinical + voxel-wise.
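Concatenating feature matrices amounts to stacking the columns of each data source per patient. The sketch below shows this with NumPy; the helper name `combine_views` and the array sizes are hypothetical, and no rescaling is applied before concatenation because the text does not describe any.

```python
import numpy as np

def combine_views(clinical, imaging):
    """Concatenate two feature matrices column-wise (rows are patients)."""
    if clinical.shape[0] != imaging.shape[0]:
        raise ValueError("Both views must contain the same patients in the same order")
    return np.hstack([clinical, imaging])

# Toy example: 4 patients, 6 clinical variables, 10 GM voxels
rng = np.random.default_rng(0)
clinical = rng.normal(size=(4, 6))
imaging = rng.normal(size=(4, 10))
combined = combine_views(clinical, imaging)
```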
Support vector machine classification
Four different classification tasks were performed: 1) bvFTD versus psychiatric + neurological; 2) bvFTD versus psychiatric; 3) bvFTD versus neurological; and 4) bvFTD versus psychiatric versus neurological. In contrast to the other three, the last classification is not a binary but a multi-class classification. For this the ‘one-against-one’ approach was used [38]. In this framework three classifiers, one for each pair of classes, are created and a prediction is performed by taking a vote across the three classifiers whereby the majority prediction wins. We used a linear support vector machine (SVM) classifier [39] to perform the classification tasks with the C parameter set to 1 following general recommendations for the neuroimaging field [40]. To create the linear kernel, we removed the voxel-wise mean across subjects of the training set (see below), normalized the resulting feature vectors by their Euclidean norm and calculated the linear kernel.
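The kernel construction described above can be sketched as follows, assuming Python 3 with NumPy and a recent scikit-learn rather than the versions used in the study. The helper `linear_kernels` and the simulated data are illustrative assumptions; only the steps (centering with the training-set mean, scaling each subject to unit Euclidean norm, linear kernel, C = 1) follow the text.

```python
import numpy as np
from sklearn.svm import SVC

def linear_kernels(X_train, X_test):
    """Center with the training mean, scale rows to unit norm, return linear kernels."""
    mean = X_train.mean(axis=0)  # feature-wise mean from the training set only

    def center_and_normalize(X):
        Xc = X - mean
        norms = np.linalg.norm(Xc, axis=1, keepdims=True)
        norms[norms == 0] = 1.0  # guard against division by zero
        return Xc / norms

    A = center_and_normalize(X_train)
    B = center_and_normalize(X_test)
    # K_train: (n_train, n_train); K_test: (n_test, n_train)
    return A @ A.T, B @ A.T

# Simulated data: 20 training and 6 test subjects with 50 features each
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 1.0, (10, 50)), rng.normal(1.0, 1.0, (10, 50))])
y_train = np.array([0] * 10 + [1] * 10)
X_test = np.vstack([rng.normal(0.0, 1.0, (3, 50)), rng.normal(1.0, 1.0, (3, 50))])

K_train, K_test = linear_kernels(X_train, X_test)
clf = SVC(kernel="precomputed", C=1.0)
clf.fit(K_train, y_train)
predictions = clf.predict(K_test)
```

Taking the mean from the training set alone keeps information from the test fold out of the kernel. For problems with more than two classes, scikit-learn's `SVC` trains its classifiers in a one-against-one fashion.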
Since the sample sizes across our classes were imbalanced, we used randomized undersampling of the majority class(es) to create a balanced set for classification. This was done to prevent the classifier from focusing only on the majority class and to let it learn an overall distinguishing classification boundary. To this end, 18 patients were randomly selected (matching the sample size of the smallest class) from the other class(es) without replacement. To prevent the final result from depending on the specific 18 patients picked from the majority class, all classifications were repeated 500 times. For each iteration of the randomized undersampling, 5-fold cross-validation stratified for the individual classes was conducted. During cross-validation, four folds of the data are used for training the classifier while the fifth fold is used for testing. Accuracy, sensitivity, specificity, and area under the receiver-operator curve (AUC) were computed and averaged across the five test folds as an approximation of the out-of-sample performance of the classifier. In the multi-class case, sensitivity was defined as the accuracy of correctly classifying bvFTD and specificity was calculated for each of the other classes (neurological and psychiatric) independently. No AUC calculation was made in this case. To allow a direct comparison of the different data views for the same classification problem, we ensured that the random undersampling procedure selected the same patients and distributed them across the five cross-validation folds in exactly the same manner, so that differences in classification performance reflect the data views used and not the partitioning of the participants.
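The repeated undersampling and stratified fold assignment can be sketched with NumPy alone. This is a simplified stand-in for the imbalanced-learn and scikit-learn routines used in the study; the function names and the seed are hypothetical. Storing the partitions once and reusing them for every data view is one way to keep the views directly comparable.

```python
import numpy as np

def undersample_indices(y, rng):
    """Randomly keep as many patients per class as the smallest class has."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    picked = [rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
              for c in classes]
    return np.sort(np.concatenate(picked))

def stratified_folds(y, n_folds, rng):
    """Assign each sample a fold index so that classes are spread evenly."""
    fold_of = np.empty(len(y), dtype=int)
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        fold_of[idx] = np.arange(len(idx)) % n_folds
    return fold_of

def make_partitions(y, n_repeats=500, n_folds=5, seed=0):
    """Pre-compute (subset, fold assignment) pairs to reuse across data views."""
    rng = np.random.default_rng(seed)
    partitions = []
    for _ in range(n_repeats):
        subset = undersample_indices(y, rng)
        partitions.append((subset, stratified_folds(y[subset], n_folds, rng)))
    return partitions

# Group sizes as reported: 18 bvFTD, 28 neurological, 27 psychiatric
y = np.array([0] * 18 + [1] * 28 + [2] * 27)
partitions = make_partitions(y, n_repeats=3)
subset, folds = partitions[0]
```

For each stored pair, fold `k` serves as the test set (`folds == k`) and the remaining samples as the training set, identically for every data view.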
Label permutation tests [41] were used to assess whether the observed accuracy and AUC were significantly different from chance level with multiple comparisons correction for the four different data views. To that end, the whole procedure, including the 500 random undersamplings, was repeated 1000 times while randomly permuting the labels of the classes. All machine learning analyses were implemented using the scikit-learn toolbox (version 0.19.1) [42] for Python (version 2.7.15). Random undersampling was performed through the imbalanced-learn toolbox (version 0.3.3) [43].
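The logic of a label permutation test can be sketched as below. In the study the score would be the full cross-validated procedure, including all undersampling iterations; here a toy nearest-class-mean score stands in for it. The function names, the toy data, and the `+1` correction in the p-value are assumptions, and the multiple comparisons correction is not shown.

```python
import numpy as np

def permutation_p_value(score_fn, y, n_permutations=1000, seed=0):
    """One-sided p-value: how often permuted labels score at least as well."""
    rng = np.random.default_rng(seed)
    observed = score_fn(y)
    null_scores = np.array([score_fn(rng.permutation(y))
                            for _ in range(n_permutations)])
    # +1 in numerator and denominator counts the observed labelling itself
    p_value = (1 + np.sum(null_scores >= observed)) / (1 + n_permutations)
    return observed, p_value

# Toy stand-in for the classifier: nearest class mean on a single feature
x = np.concatenate([np.linspace(0.0, 1.0, 10), np.linspace(2.0, 3.0, 10)])
y = np.array([0] * 10 + [1] * 10)

def nearest_mean_accuracy(labels):
    m0, m1 = x[labels == 0].mean(), x[labels == 1].mean()
    predicted = (np.abs(x - m1) < np.abs(x - m0)).astype(int)
    return float(np.mean(predicted == labels))

observed, p_value = permutation_p_value(nearest_mean_accuracy, y, n_permutations=200)
```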
Statistical analysis
Statistical analyses of demographic and clinical data were performed in SPSS. Differences in characteristics between diagnostic groups of patients were assessed using one-way analyses of variance (ANOVA), χ², and Kruskal-Wallis H tests where appropriate.
To assess the difference between the five data usage approaches (clinical data, voxel-wise whole brain, ROI approach, combination of ROI and clinical data, and combination of voxel-wise and clinical data) for a specific classification task, a sign-flipping test was implemented using accuracy as the metric to be compared. Comparing two data views at a time, the average difference in accuracy across each test fold and each undersampling iteration was computed. To determine an empirical null distribution, the differences were randomly sign-flipped 5000 times and an average sign-flipped difference was computed each time. The observed average difference was compared to this null distribution and a p-value was estimated. The significance level was set to α = 5%, Bonferroni corrected for 20 one-sided tests. This procedure allowed us to investigate whether one data usage approach had an overall average advantage over another.
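A one-sided sign-flipping test on paired accuracy differences can be sketched as follows. The function name, the `+1` correction in the p-value, and the toy differences are assumptions; in the study, the differences would come from every test fold of every undersampling iteration.

```python
import numpy as np

def sign_flip_p_value(differences, n_flips=5000, seed=0):
    """One-sided test of whether the mean paired difference exceeds zero."""
    rng = np.random.default_rng(seed)
    differences = np.asarray(differences, dtype=float)
    observed = differences.mean()
    # Each row holds one random assignment of signs to the differences
    signs = rng.choice([-1.0, 1.0], size=(n_flips, differences.size))
    null_means = (signs * differences).mean(axis=1)
    p_value = (1 + np.sum(null_means >= observed)) / (1 + n_flips)
    return observed, p_value

# Toy accuracy differences (view A minus view B) for 20 test folds
differences = np.linspace(0.01, 0.05, 20)
observed, p_value = sign_flip_p_value(differences)

# Bonferroni-corrected threshold for 20 one-sided tests at alpha = 5%
alpha_corrected = 0.05 / 20
significant = p_value < alpha_corrected
```

Under the null hypothesis that neither view is better, each difference is as likely to be positive as negative, which is what the random sign flips simulate.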
Anatomical localization
To investigate which regions contributed most to the SVM classification for the voxel-wise whole brain approach on a group level, a novel approach to estimate p-values for the weights of the SVM was employed [44]. In this approach a statistic is computed which is a combination between the weight component value and the size of the margin. This provides a better metric than looking into the value of the weights alone. To obtain a p-value an analytical approximation to the null-distribution obtained through permutation tests was used [44]. It has been shown that this analytical approximation shows good correlation with the null-distribution obtained from empirical permutation tests.
RESULTS
Demographic and clinical characteristics
Based on the diagnosis at the two-year follow-up, patients were assigned to three different groups: 1) bvFTD (n = 18), 2) other neurological diseases (n = 28), or 3) psychiatric disorders (n = 27). Within the bvFTD group, two (11.1%) patients were diagnosed with definite bvFTD and 16 (88.9%) with probable bvFTD. Within the other neurological diseases group, five (17.9%) patients were diagnosed with AD, three (10.7%) with DLB, three (10.7%) with VaD, three (10.7%) with vascular mild cognitive impairment, eight (28.6%) with other dementias, and six (21.4%) patients had other neurological disorders. Within the psychiatric group, eight (29.6%) patients were diagnosed with major depressive disorder, three (11.1%) with minor depressive disorder, three (11.1%) with autism spectrum disorder, three (11.1%) with bipolar disorder, two (7.4%) with a personality disorder, one (3.7%) with obsessive compulsive disorder, one (3.7%) with schizophrenia, and six (22.2%) patients were diagnosed with other psychiatric disorders.
Demographic characteristics, clinical variables and corresponding statistics of differences between diagnostic groups are summarized in Table 1. A significant difference between groups was observed in the age of the patients (p = 0.03). Post hoc Tukey tests revealed that this difference was only significant between the neurological and psychiatric group (mean age difference of 4.84 years, p = 0.02), which was not part of our binary comparisons. In line with our previous reports, a significant difference was observed between the SRI test scores as estimated by a one-way Kruskal-Wallis ANOVA (p < 0.0001). Post hoc tests revealed a significant difference between bvFTD and psychiatric (mean difference of 11.3, p < 0.0001) and bvFTD and neurological (mean difference of 14.5, p < 0.0001) groups [21, 45].
Table 1. Demographic and clinical variables at baseline
Education in Verhage scale [29]; IQR, interquartile range; MMSE, Mini-Mental State Examination; FBI, Frontal Behavioral Inventory; SRI, Stereotypy Rating Inventory; TIV, total intracranial volume. a: one-way ANOVA test; b: χ² test; c: Kruskal-Wallis H test; *p < 0.05.
Support vector machine classification
The accuracy of all binary classification tasks ranged between 72% and 82% (Table 2 and Fig. 1), with the highest accuracy per classification always achieved using a combination of clinical and MRI data. The top accuracies per classification task were: 79% (p < 0.001) for bvFTD versus neurological + psychiatric, 81% (p < 0.001) for bvFTD versus neurological, and 82% (p < 0.001) for bvFTD versus psychiatric. In general, sensitivity (67–79%) was slightly lower than specificity (77–89%), implying better performance in identifying non-bvFTD patients than bvFTD patients. AUC values of 0.75–0.90 indicated a generally high performance of the classifiers.
Table 2. Cross-validated average accuracy, sensitivity, specificity and AUC of the four classification problems for the four data usage approaches across 500 undersampling iterations
p-value was calculated using a label permutation test (1000 iterations). SD: standard deviation; AUC: area under receiver-operator curve. a: Multi-class classification with 3 classes, chance level is at 33.33%. b: Accuracy of correctly classifying bvFTD. c: Accuracy of correctly classifying neurological (first line) and psychiatric (second line) groups. *p(Bonferroni) < 0.05.

Fig. 1. Boxplots of various performance metrics of classifiers across three binary (bvFTD versus Neurological + Psychiatric, bvFTD versus Neurological, and bvFTD versus Psychiatric) and one multi-class (bvFTD versus Neurological versus Psychiatric, ‘one-against-one’ approach) classifications. Boxplots summarize the average cross-validated performance of the metric across the 500 random undersampling iterations. Grey line indicates chance level (50% (0.5) for binary and 33.33% for multi-class). Colors indicate the different data usage approaches. Black dots show outliers, which are values larger (smaller) than the 3rd (1st) quartile by at least 1.5 times the interquartile range. A) Overall accuracy of the binary classifiers defined as average across sensitivity and specificity. B) Area under receiver-operator curve (AUC). C) Sensitivity and D) specificity of binary classification tasks. E) Metrics of the multi-class task: overall accuracy and class-specific (bvFTD, Neurological, or Psychiatric) accuracy.
In the multi-class case (bvFTD versus neurological versus psychiatric), accuracies of 54–59% were observed (chance level at 33.33%). The highest accuracy was again obtained for the combined voxel-wise and clinical data (59%, p < 0.001). While sensitivity for correctly identifying bvFTD was similar across all data views (65–71%), accuracies in correctly predicting the psychiatric or neurological group differed across approaches: clinical data did not provide a very high accuracy in identifying the psychiatric group (42%), while the purely MRI-based methods failed to identify patients with neurological conditions accurately (33–35%). Only the combined data led to reasonably balanced performance (accuracy for neurological group: 46%, psychiatric group: 60%).
To investigate which patients were consistently correctly (or incorrectly) classified, we quantified how often our classifiers correctly predicted their labels across the 500 random undersampling iterations (Supplementary Figure 1). This showed that classification errors were mainly due to consistent misclassification of the same few subjects.
Given the strong difference in SRI score between groups we also investigated the usage of the SRI score alone without the addition of other clinical or demographic variables. The obtained classification accuracy was comparable to the results reported for the clinical data (data not shown) which indicates that classification based on clinical data is driven by the difference in SRI score.
To investigate whether our choice of default value of the SVM hyperparameter C = 1 is reasonable we conducted a grid-search on the training set for different values of C (0.01, 0.1, 1, 10, 100). With optimized values of C, the performance of our classifiers remained qualitatively similar (Supplementary Table 1), even though C values were often different from 1 (Supplementary Table 2). This can be explained by the fact that the C hyperparameter in the case of the standard SVM does not have a strong influence on classification accuracy in the case of high-dimensional neuroimaging data [40].
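A grid search over C restricted to the training data can be sketched with scikit-learn as below, assuming Python 3 and a recent scikit-learn rather than the versions used in the study. The inner 4-fold split, the balanced-accuracy scoring, and the simulated training set are assumptions; only the candidate values of C come from the text.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Simulated training fold: 32 subjects, 30 features, two balanced classes
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 1.0, (16, 30)), rng.normal(0.8, 1.0, (16, 30))])
y_train = np.array([0] * 16 + [1] * 16)

param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
inner_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# The search only ever sees training data, so the outer test fold stays untouched
search = GridSearchCV(SVC(kernel="linear"), param_grid,
                      cv=inner_cv, scoring="balanced_accuracy")
search.fit(X_train, y_train)
best_C = search.best_params_["C"]
```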
Furthermore, we investigated how performance of the different classifiers varied if a different method for cross-validation was used, comparing the 5-fold cross-validation approach to a leave-one-patient-per-group out approach (Supplementary Table 3). Again, no qualitative difference in average performance was observed.
Method comparisons
The results of comparisons of classifications based on different data approaches are shown in Supplementary Table 4. The average difference in accuracy for each test set fold was computed and through a sign flipping test an empirical null distribution was obtained to derive a p-value. In all but one of the classification tasks, the combination between clinical and MRI (voxel-wise or ROI) data showed a small but consistent improvement in performance. The only case where the improvement was not consistent enough is in the case of the bvFTD versus psychiatric classification where clinical + voxel-wise MRI data performed as well as the voxel-wise MRI data alone. Generally, a combination of clinical + whole-brain voxel-wise MRI data yielded higher accuracies than the combination of clinical and ROI data except for the bvFTD versus neurological classification where also the ROI localized MRI data outperformed data from the whole-brain. For all other classifications, voxel-wise MRI data performed better than ROI data. MRI data outperformed clinical data in the bvFTD versus neurological + psychiatric and bvFTD versus psychiatric classification. However, in the bvFTD versus neurological and the multi-class classification clinical data alone performed equally well or better than MRI data.
Anatomical localization
Regions located in the anterior temporal, frontal, and cerebellar lobes of the brain contributed strongly to the classification tasks of the SVM classifier (Supplementary Table 5 and Fig. 2). These regions show mostly negative weights, which would imply a high chance of non-bvFTD classification given a low amount of atrophy in these regions. These results show that regions important for the classification also reside outside of the purely frontal and temporal regions, which indicates the need for a whole brain approach for classification.

Fig. 2. Unthresholded –log(p)-value maps characterizing the regions important for the binary classification of the support vector machine (SVM) classifier in the voxel-wise whole brain data usage case, obtained by the approach described in Gaonkar et al. [44]. Hot colors indicate positive weights and cold colors indicate negative weights of the SVM. The maps are shown for the bvFTD versus Neurological + Psychiatric (A), bvFTD versus Neurological (B), and bvFTD versus Psychiatric (C) classification tasks. The left side of the brain is shown on the left and the corresponding coordinates in Montreal Neurological Institute (MNI) space are z = –60, –50, –38, –26, –14, 0, 14, 26, 38, 50, 60. The figure was made with the nilearn package (http://nilearn.github.io). Region names of areas surviving Bonferroni correction can be found in Supplementary Table 5.
DISCUSSION
In this study we showed that multivariate analysis of regional GM volume estimates from baseline structural T1-weighted MRI scans and baseline clinical data enables the individual prediction of probable and definite bvFTD diagnosis at two-year follow-up in a clinically representative sample of patients presenting with late-onset behavioral changes. The approach yielded reasonably high accuracy and high AUC values. Among the four classification tasks, the one with the highest relevance for the clinical setting aimed to identify bvFTD patients among a combination of heterogeneous psychiatric and neurological disorders. This classification task yielded an average accuracy of 79%, with a sensitivity of 75% and specificity of 83%, using the combination of clinical and voxel-wise MRI data. The highest average accuracy of 82%, with a sensitivity of 79% and a specificity of 86%, was obtained for the bvFTD versus psychiatric disorders classification task when using the same data combination. This is encouraging since the differentiation of bvFTD from psychiatric disorders is very challenging in clinical practice [32, 46]. The bvFTD versus neurological classification achieved an accuracy of 81% with a sensitivity of 73% and specificity of 89%, using the combination of clinical and ROI data. To investigate whether we can distinguish all three classes at once, we also performed a multi-class classification. While the best performing approach (combined data with 59% accuracy) allowed for a classification which was better than chance (chance level 33.33%, p < 0.001), this performance cannot be considered potentially useful in a clinical setting because of its low absolute accuracy. We found that the overall specificity for bvFTD classification was slightly higher than the sensitivity for all binary classifiers.
This reflects the difficulty seen in clinical practice concerning the diagnosis of bvFTD in its early stages and corresponds with previous literature on the accuracy of neuroimaging techniques used to detect bvFTD [9, 48]. The higher specificity suggests that our methods may be particularly suitable to exclude bvFTD from clinically relevant differential diagnoses in patients reporting behavioral symptoms, thereby initiating further diagnostics for other conditions at baseline. The combination of MRI and clinical data led to a small but consistent improvement over using only MRI data in almost all of the classification tasks. This can also be seen in the literature, where the combination of different data sources leads to an improvement of classification performance [19, 50]. Accuracy was higher when using MRI data alone (ROI or voxel-wise) compared with using clinical data alone in two out of our four classification tasks: bvFTD versus neurological and psychiatric, and bvFTD versus psychiatric. In the other two classification tasks there was no significant difference between MRI data alone and clinical data alone. This is also reflected in the literature, where neuropsychological tests can be seen as a strong baseline [51]. Most of the current machine learning studies predicting bvFTD use an established diagnosis of bvFTD. These studies can be seen as an upper bound of the performance which can be achieved by these methods and therefore provide a reasonable comparison. Accuracies vary between 75–93% when classifying FTD against HC [14, 52] and between 68–94% when classifying FTD against AD [13–16, 53]. However, in most of these studies, the FTD group was not limited to patients with bvFTD but also included the language variant of FTD [13–15, 53], which makes them less directly comparable to the current study.
Many studies also included imaging modalities such as [18F]FDG-PET, single-photon emission computed tomography (SPECT), or arterial spin labelling perfusion MRI [16, 53], which are not readily available in many hospitals. Moreover, the machine learning classifiers in the aforementioned studies were trained on data obtained from patients whose diagnosis could already be confirmed clinically at the time of inclusion, i.e., typical structural abnormalities must already have been apparent in order to meet the diagnostic criteria for probable bvFTD. The classifiers trained in our study were assigned a more difficult classification task and were able to predict bvFTD using structural MR images acquired two years before diagnosis in a cohort that more closely resembles the population seen in clinical practice. An exception to this practice is the study by Feis et al. [19], which attempted the prospective prediction of pre-symptomatic FTD mutation carriers using an already established bvFTD classification model [50]. In a classification versus HC, the established model only achieved a moderate AUC of 0.57. Training a new model for their data, however, improved the performance to an AUC of 0.68. In addition to using patients with an already established diagnosis, all studies so far have only assessed the classification performance for distinguishing AD (or HC) from (bv)FTD and did not take a relevant psychiatric population into account. The classification performances achieved in our study are comparable to the accuracy of visual ratings of structural MRI images for predicting bvFTD diagnosis. Mendez et al. [47] found 67% accuracy with 63.5% sensitivity and 70.4% specificity for clinicians using visual ratings of frontotemporal atrophy to distinguish FTD from psychiatric diagnoses and other neurological disorders two years prior to final diagnosis.
The most relevant comparison of our classification performance can be made with the diagnostic accuracy reported by Vijverberg et al. [9], who investigated the same LOF dataset that was used in this study and found 81.5% accuracy for the diagnosis of bvFTD, with 70% sensitivity and 93% specificity for frontotemporal atrophy, using visual ratings of global cortical atrophy, medial temporal lobe atrophy, and white matter intensities scored by an experienced academic neuroradiologist. When combined with [18F]FDG-PET scans, the performance rose to 84.5% accuracy with 96% sensitivity and 73% specificity. The performance reported in the current study is therefore comparable with what can be achieved by visual ratings of an expert neuroradiologist specialized in neurodegeneration when using MRI data alone, and slightly lower than what can be achieved when MRI is combined with PET data. However, our current approach provides an automatic, less time-consuming, and less expensive method that can also be applied in hospitals without a PET scanner or a specialized neuroradiologist. It should be noted that the use of a machine learning algorithm in clinical practice is not without additional costs either. Specifically, there is considerable overhead in bringing an algorithm to the point where it is suitable for use in the clinic, because its performance must be constantly validated and confirmed [54]. But once the algorithm's generalization capability has been confirmed and replicated across multiple samples, a trained machine learning model can be run on any recent personal computer capable of reading MRI images.
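The accuracy, sensitivity, and specificity figures quoted above can be related with simple arithmetic. The sketch below is an illustration only, not the analysis code of this study or of the cited work. It shows that the accuracies quoted for visual ratings (81.5% and 84.5%) equal the mean of the corresponding sensitivity and specificity, i.e., the balanced accuracy. Whether the cited study defined accuracy this way is an assumption. The function names are hypothetical, and the group sizes in the last step (18 bvFTD versus 55 other patients) are borrowed from the current study purely to show how an unweighted accuracy depends on group balance.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of sensitivity and specificity, both given as fractions in [0, 1]."""
    return (sensitivity + specificity) / 2.0


def overall_accuracy(sensitivity: float, specificity: float,
                     n_positive: int, n_negative: int) -> float:
    """Fraction of all cases classified correctly for the given group sizes."""
    correct = sensitivity * n_positive + specificity * n_negative
    return correct / (n_positive + n_negative)


# Sensitivity and specificity quoted in the text for visual ratings [9].
mri_only = balanced_accuracy(0.70, 0.93)
mri_plus_pet = balanced_accuracy(0.96, 0.73)
print(f"Balanced accuracy, MRI only:     {mri_only:.3f}")
print(f"Balanced accuracy, MRI plus PET: {mri_plus_pet:.3f}")

# Assumption for illustration: apply the MRI-only figures to group sizes of
# 18 bvFTD and 55 non-bvFTD patients. With unequal groups, the unweighted
# accuracy is pulled towards the specificity of the larger group.
unweighted = overall_accuracy(0.70, 0.93, n_positive=18, n_negative=55)
print(f"Unweighted accuracy with 18 vs 55 cases: {unweighted:.3f}")
```

Because the bvFTD group is the smallest group in most comparisons, reporting sensitivity and specificity alongside accuracy, as done here, avoids overstating performance that is driven mainly by the larger group.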
Of the two approaches to using MRI data, the whole-brain voxel-wise approach generally outperformed the localized ROI classification. With the exception of the classification of bvFTD against neurological disorders, where the important information seems to be localized in our chosen ROIs (orbitofrontal cortex, bilateral temporal pole, insula, and prefrontal cortex), whole-brain data always significantly outperformed the localized ROI approach. A potential explanation is provided by the post hoc examination of the weight maps of the SVM classifier (Supplementary Table 5 and Fig. 2), which clearly shows regions outside of the prefrontal and temporal gyri to be beneficial for solving the classification tasks. The reason for this may be that brain regions that are important for the classification task but lie beyond the ROIs are not necessarily directly related to the disorder in question and may instead be used for denoising purposes in the classification process [55]. The cost of this additional benefit is that whole-brain data no longer allows for easy regional interpretation of the classification. It should be noted that an image of p-values of the weight maps (as provided in Fig. 2) does not correspond to a univariate voxel-wise group comparison as obtained through voxel-based morphometry but represents the multivariate pattern used by the SVM classifier.
There are several limitations to this study. We investigated a rather low number of subjects per group. The small number of bvFTD patients used for training limits the ability of the SVM to learn the complex multivariate pattern that separates this group from the other classes. Additionally, having more data would allow for a larger test set in each iteration of the cross-validation procedure, which would reduce the variance of the estimated generalization performance of the classifier [40]. A further limitation is the age difference that was found between the different patient groups.
Such a difference could, in an extreme case, lead to a high classification performance without the classifier learning anything about the underlying disorders. However, post hoc tests revealed that the difference is only present between the psychiatric and neurological groups. Therefore, it does not limit the primary analysis performed in this study (bvFTD versus the combination of psychiatric and other neurological disorders), since the binary classification tasks never involved a classification between the psychiatric and the neurological group. An additional consideration is the use of a single neuroimaging modality to detect brain abnormalities. Even though combining the strengths of different imaging modalities could improve diagnostic accuracy, imaging techniques such as SPECT or [18F]FDG-PET are not available in most hospitals and require the infusion of nuclear tracers. This procedure is expensive and exposes patients to radiation, which may not be necessary. However, the combination of additional neuroimaging modalities has been shown to perform strongly in distinguishing (bv)FTD from AD and HC and should therefore also be investigated in future studies of prospective diagnosis prediction [19, 50]. Future studies should also consider the added value of using cognitive tests as an additional data domain for classification, and investigate whether the addition of multiple parameters is of added value for the diagnostic process.
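The limitation concerning test-set size can be made concrete with a rough calculation. The sketch below is an illustration under stated assumptions and not part of the analysis in this study. It treats each test prediction as an independent Bernoulli trial and uses the binomial standard error of a proportion, which cross-validation folds satisfy only approximately. The assumed accuracy of 0.80 is an illustrative value within the reported 72–82% range, and the function name is hypothetical.

```python
import math


def accuracy_standard_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy estimated from n_test cases."""
    if n_test <= 0:
        raise ValueError("n_test must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


n_subjects = 73          # total sample size reported in this study
n_folds = 5              # 5-fold cross-validation
assumed_accuracy = 0.80  # assumption: illustrative value, not a reported result

# Approximate size of a single test fold (integer division, so 14 here).
fold_size = n_subjects // n_folds

# Compare a single fold, the pooled sample, and a hypothetical sample
# four times as large.
for label, n_test in [("single test fold", fold_size),
                      ("all subjects pooled", n_subjects),
                      ("hypothetical 4x larger sample", 4 * n_subjects)]:
    se = accuracy_standard_error(assumed_accuracy, n_test)
    print(f"{label:>30}: n = {n_test:3d}, standard error = {se:.3f}")
```

The standard error shrinks with the square root of the number of test cases, so quadrupling the sample only halves the uncertainty of the estimated accuracy under these assumptions.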
In summary, we demonstrated the potential feasibility of machine learning techniques to automatically and accurately predict the diagnosis of bvFTD at the single-subject level. We were able to predict a probable or definite bvFTD diagnosis that was confirmed after 2 years of multidisciplinary follow-up, using whole-brain structural and clinical data in a cohort that more closely resembles the population seen in clinical practice, with classification performances that are comparable to the diagnostic accuracy of expert neuroradiologists visually assessing MR images. Our results show how machine learning classifiers could be of value for the clinic and used to provide diagnostic certainty for bvFTD. Studies aiming to improve the diagnostic accuracy of bvFTD in its early stages are critical for research on new interventions and treatments, as well as for the timely development of management strategies for patients and their families.
