Machine Learning and MRI-based Diagnostic Models for ADHD: Are We There Yet?

Abstract

Objective:

Machine learning (ML) has been applied to develop magnetic resonance imaging (MRI)-based diagnostic classifiers for attention-deficit/hyperactivity disorder (ADHD). This systematic review examines this literature to clarify its clinical significance and to assess the implications of the various analytic methods applied.

Methods:

A comprehensive literature search on MRI-based diagnostic classifiers for ADHD was performed and data regarding the utilized models and samples were gathered.

Results:

Although most studies reported classification accuracies, they varied in their choice of MRI modalities, ML models, cross-validation and testing methods, and sample sizes. Accuracies estimated by cross-validation were inflated relative to those from a held-out test set, which overstates model generalizability. Test accuracies have increased with publication year but were not associated with training sample sizes. Improved test accuracy over time was likely due to the use of better ML methods along with strategies to deal with data imbalances.

Conclusion:

Ultimately, large multi-modal imaging datasets, and potentially the combination with other types of data, like cognitive data and/or genetics, will be essential to achieve the goal of developing clinically useful imaging classification tools for ADHD in the future.

Keywords

attention deficit hyperactivity disorder biomarkers classification machine learning MRI imaging classifier

Introduction

Clinicians diagnose ADHD by evaluating symptoms of hyperactivity, impulsivity, inattention, and impaired functioning across settings. The diagnosis of ADHD shows considerable levels of concurrent and predictive validity in its clinical features, course, neurobiology, and treatment response (Faraone, 2005; Faraone & Biederman, 2000). Nevertheless, concerns about diagnostic accuracy persist. Some suggest that the current method of diagnosing ADHD is too subjective and leads to over-diagnosing ADHD in the community (Bruchmüller et al., 2012; Visser et al., 2014). Psychiatric diagnoses have been called “subjective” because they rely on clinician evaluation of responses from patients, parents, and/or informants. Other studies have raised concerns about the under-diagnosis of ADHD (Ginsberg et al., 2014; The Express Scripts Lab, 2014), especially in girls and women, which suggests biases in applying the current diagnostic algorithm. Another issue is the misdiagnosis of ADHD as being another disorder. When this occurs, patients may be exposed to unnecessary treatments and will continue to struggle with the many impairments associated with ADHD. Those who have ADHD and are not diagnosed with the disorder will continue to have impaired functioning leading to increased risks for other health and social problems (Dalsgaard et al., 2015; Franke et al., 2018; Lambert & Hartsough, 1998; Lichtenstein et al., 2012; Reiersen & Todorov, 2013).

In response to these concerns, researchers have sought to develop objective measures to diagnose ADHD or to monitor the course of ADHD symptoms during treatment. Much research has examined peripheral biochemical markers for differentiating ADHD and control participants, such as norepinephrine (NE), 3-methoxy-4-hydroxyphenylethylene glycol (MHPG), monoamine oxidase (MAO), zinc, and cortisol (Faraone et al., 2014; Scassellati et al., 2012). NE, MHPG, MAO, β-phenylethylamine, and cortisol are also somewhat predictive of response to ADHD medications. Meta-analysis also shows that peripheral measures of oxidative stress differ between ADHD and control participants (Joseph et al., 2015). Electroencephalographic (EEG) (Snyder et al., 2015), actigraphic (Dane et al., 2000), and eye vergence measurements (Solé Puig et al., 2015), as well as interactive gaming behavior (Faraone et al., 2016), have also been examined as ADHD biomarkers. Neuropsychological tests (Ritsner, 2009), particularly continuous performance tests (CPTs) (e.g., Corkum & Siegel, 1993; Homack & Riccio, 2006; Riccio & Reynolds, 2001), have been evaluated in many studies. In recent years, genetic markers in the form of polygenic risk scores have also shown some ability to predict the diagnosis and prognosis of ADHD (Demontis et al., 2019; Hamshere et al., 2013; Riglin et al., 2016). Many of these prior studies show group differences but do not present diagnostic accuracy statistics. A clinically useful biomarker should have at least 80% sensitivity and 80% specificity. It should also be reliable, reproducible, inexpensive, non-invasive, easy to use, and confirmed by at least two independent studies. These criteria were defined by the task force on biological markers of the World Federation of ADHD (Thome et al., 2012). None of the measures examined by the task force met these criteria for clinical utility (Thome et al., 2012).

Prior structural MRI (sMRI) studies have consistently reported alterations in frontal, parietotemporal, cingulate, cerebellum, basal ganglia, and corpus callosum regions (Castellanos et al., 2002; Mackie et al., 2007; Seidman et al., 2005, 2006; Shaw et al., 2006, 2014; Valera et al., 2007). Studies of the largest ADHD sMRI dataset, from the Enhancing Neuro Imaging Genetics Through Meta-Analysis (ENIGMA) consortium’s ADHD Working Group, reported significant reductions in intracranial volume, in the volumes of the amygdala, caudate nucleus, nucleus accumbens, and hippocampus, and in cortical surface areas of many regions in children with ADHD (Hoogman et al., 2017, 2019). These regions have also been implicated in functional MRI (fMRI) studies showing altered brain connectivity and activation in the fronto-striatal, fronto-parietal and fronto-temporal-parietal circuits, as well as the dorsal anterior cingulate cortex, in ADHD brains (Dickstein et al., 2006; Smith et al., 2006; Tian et al., 2006). Studies have also examined the developmental trajectories of these anatomical and functional alterations across the lifespan, finding initial delays that are followed by apparent normalization (Castellanos et al., 2002; Shaw et al., 2006, 2014).

These findings encouraged efforts to develop objective diagnostic tools for ADHD using MRI data. Early studies used standard statistical methods such as discriminant analysis with very small sample sizes (Semrud-Clikeman et al., 1996; Zhu et al., 2005, 2008). For example, a discriminant analysis reported by Semrud-Clikeman et al. (1996) included 10 participants in each of three diagnostic groups: developmental dyslexia, ADHD or control. Zhu and colleagues’ discriminant analysis classifier assessed 9 ADHD and 11 typically developing boys. Although high predictive accuracies were reported in these studies (85%–87%), it is difficult to evaluate how well those models would generalize given the small samples and lack of replication samples.

The ADHD-200 Global Competition (The ADHD-200 Consortium, 2012) challenged researchers to develop an MRI-based diagnostic classifier for ADHD. It provided a dataset of 776 children, adolescents and young adults (7–21 years old, 63% healthy controls, 37% ADHD) from eight sites. Fifty teams from around the world joined the competition with 21 final submissions. Machine learning models predominated. Due to the large sample size, the consortium was able to set aside a test set that was not used for model selection and development. The competition results were judged by the performance on the test set only. This contrasts with previous studies with small sample sizes, where a held-out test set was not available. The ADHD-200 winning team used an ensemble model which achieved a 61% accuracy with 21% sensitivity and 94% specificity using both structural and resting state-fMRI data along with the demographic predictors (Eloyan et al., 2012). The accuracy, although considerably lower than previously reported high accuracies, was one of the first in an independent, held-out test set. Despite the modest accuracy, the ADHD-200 competition re-kindled enthusiasm for developing imaging-based diagnostic classifiers for ADHD. The publicly available dataset has become the main data source driving the machine learning model development for ADHD. Since the competition in 2012, we have seen a steady increase in the number of publications applying machine learning classifiers to ADHD. Thirty-one additional published studies have used either the whole or part of the ADHD-200 dataset (Supplemental Figure 1 and Table 1).

Table 1.

Machine Learning Literature on ADHD Neuroimaging Data.

Study	Training Sample Size	ADHD% (Training Set)	Test Sample Size	ADHD% (Test Set)	Data Source	Ages	Sex	Model	Features	Performance Metrics	Test Type	Accuracy	References	PMID/Conference
Aradhya, 2019	371	n.a	94	n.a	ADHD-200 subset (right-handed males)	Children and young adults (7–21)	M, F	CNN	rs-fMRI	Accuracy	K-Fold-CV(K = 10)	70%	Aradhya et al. (2019)	n.a
Ariyarathne, 2020	26	n.a	16	n.a	ADHD-200 subset	Children and young adults (8–21)	M, F	CNN	rs-fMRI	Accuracy, Sensitivity, Specificity	Held-out Test	85%	Ariyarathne et al. (2020)	n.a
Bohland, 2012	776	37%	171	45%	ADHD-200	Children and young adults (7–21)	M, F	SVM	sMRI and rs-fMRI	Accuracy, AUC	K-Fold-CV(K = 2); Held-out Test	74%; 67%	Bohland et al. (2012)	23,267,318
Brown MR 2012	668	36%	171	45%	ADHD-200	Children and young adults (7–21)	M, F	SVM	rs-fMRI	Accuracy	Held-out Test; K-Fold-CV(K = 10)	55%; 71%	Brown et al. (2012)	23,060,754
Chaim-Avancini, 2017	96	54%	n.a	n.a	Clinic and community	Adults (18–50)	M, F	SVM	sMRI and DTI	Accuracy, ROC AUC, Sensitivity, Specificity, PPV, NPV	K-Fold-CV(K = 10)	74%	Chaim-Avancini et al. (2017)	29,080,396
Chen, 2020	633	43%	n.a	n.a	ADHD-200	Children and young adults (7–21)	M, F	BHT	rs-fMRI	Accuracy, Sensitivity, Specificity	LOOCV	88%	Chen et al. (2020)	32,143,793
Chen, 2022	872	37%	n.a	n.a	ADHD-200 and CINEPS	Children and adolescents (6-14)	M, F	CNN	rs-fMRI	Accuracy, Sensitivity, Specificity, AUC	K-Fold-CV(K = 5)	79%	(Chen, Ming et al. 2022)	35,246,986
Cheng W, 2012	239	41%	n.a	n.a	ADHD-200	Children and young adults (7–21)	M, F	SVM	rs-fMRI	Accuracy, Sensitivity, Specificity	LOOCV	76%	Cheng et al. (2012)	22,888,314
Colby JB, 2012	776	37%	197	n.a	ADHD-200	Children and young adults (7–21)	M, F	SVM	sMRI and rs-fMRI	Accuracy, Sensitivity, Specificity	Held-out Test	59%	Colby et al. (2012)	22,912,605
Dai D, 2012	624	36%	n.a	n.a	ADHD-200	Children and young adults (7–21)	M, F	MKL	sMRI and rs-fMRI	Accuracy, Sensitivity, Specificity, J-statistic, F1-score, ROC AUC	K-Fold-CV(K = 10); Held-out Test	68%; 62%	Dai et al. (2012)	22,969,710
Deshpande G, 2015	1,177	37%	n.a	n.a	ADHD-200	Children and young adults (7–21)	M, F	FCCANN	rs-fMRI	Accuracy	LOOCV	90%	Deshpande et al. (2015)	25,576,588
Dey, 2014	776	37%	n.a	n.a	ADHD-200	Children and young adults (7–21)	M, F	SVM	rs-fMRI	Accuracy, Sensitivity, Specificity	Training samples; Held-out Test	71%; 74%	Dey et al. (2014)	24,982,615
Du J, 2016	216	55%	n.a	n.a	ADHD-200	Children and young adults (7–21)	M, F	SVM	rs-fMRI	Accuracy, Sensitivity, Specificity, ROC AUC	K-Fold-CV(K = 10)	95%	Du et al. (2016)	27,166,430
Eloyan A, 2012	776	37%	194	n.a	ADHD-200	Children and young adults (7–21)	M, F	Ensemble	sMRI and rs-fMRI and demographics	Accuracy, Sensitivity, Specificity	Held-out Test; K-Fold-CV (n = 184 randomly chosen internal test set)	61%; 78%	Eloyan et al. (2012)	22,969,709
Fair DA 2013	104	50%	n.a	n.a	ADHD-200	Children and young adults (7–21)	M, F	SVM	rs-fMRI	Accuracy, Sensitivity, Specificity	LOOCV	83%	Fair et al. (2013)	23,382,713
Ghiassian S, 2016	769	36%	171	45%	ADHD-200	Children and young adults (7–21)	M, F	MHPC	sMRI and rs-fMRI and demographics	Accuracy, Sensitivity, Specificity	Held-out Test	70%	Ghiassian et al. (2016)	28,030,565
Hao A, 2015	216	55%	41	71%	ADHD-200 (NYU subset)	Children and young adults (7–21)	M, F	DBN	fMRI	Accuracy, Sensitivity, Specificity	Held-out Test	49%	Hao et al. (2015)	n.a
	85	28%	50	46%	ADHD-200 (Peking subset)	Children and young adults (7–21)	M, F	DBN	fMRI	Accuracy, Sensitivity, Specificity	Held-out Test	54%
	83	27%	11	27%	ADHD-200 (KKI subset)	Children and young adults (7–21)	M, F	DBN	fMRI	Accuracy, Sensitivity, Specificity	Held-out Test	72%
Hart H, 2014	60	50%	n.a	n.a	Clinic and local community	Children and adolescents (10–17)	M, F	GPC	task-fMRI	Accuracy, ROC AUC, Sensitivity, Specificity, PPV, NPV	LOOCV	77%	Hart et al. (2014)	24,123,508
Iannaccone R, 2015	40	50%	n.a	n.a	Outpatient clinic and local schools	Adolescents (12–16)	M, F	SVM	task-fMRI	Accuracy, ROC AUC, Sensitivity, Specificity	LOOCV	78%	Iannaccone et al. (2015)	25,613,588
Igual L, 2012	78	50%	n.a	n.a	URNC database	Children and adolescents (6–18)	M, F	SVM	sMRI of the caudate nucleus	Accuracy, Sensitivity, Specificity	K-Fold-CV(K = 5)	73%	Igual et al. (2012)	22,959,658
Jie, 2016	216	55%	n.a	n.a	ADHD-200 (NYU subset)	Children and young adults (7–21)	M, F	SVM	rs-fMRI	Accuracy, ROC AUC, Sensitivity, Specificity	LOOCV	83%	Jie et al. (2016)	27,060,621
Johnston BA, 2014	68	50%	n.a	n.a	Clinic and local schools	Children and adolescents (8–17)	M	SVM	sMRI	Accuracy, Sensitivity, Specificity	LOOCV	93%	Johnston et al. (2014)	24,819,333
Kuang D, 2014	83	n.a	11	n.a	ADHD-200 (KKI subset)	Children and young adults (7–21)	M, F	DBM	rs-fMRI	Accuracy, Sensitivity, Specificity	Held-out Test	73%	Kuang et al. (2014)	n.a
	85	28%	50	46%	ADHD-200 (Peking subset)	Children and young adults (7–21)	M, F				Held-out Test	54%
	222	n.a	41	n.a	ADHD-200 (NYU subset)	Children and young adults (7–21)	M, F				Held-out Test	37%
Lanka, 2020	759	37%	171	45%	ADHD-200	Children and young adults (7–21)	M, F	ensemble and ELM	rs-fMRI	Balanced Accuracy	Held-out Test	61%	Lanka et al. (2020)	31,691,160
Lim L, 2013	48	60%	n.a	n.a	Clinic	Children and adolescents (10–18)	M	GPC	sMRI	Accuracy, AUC, Sensitivity Specificity, PPV, NPV	LOOCV	79%	Lim et al. (2013)	23,696,841
Mao, 2019	626	46%	162	45%	ADHD-200	Children and young adults (7–21)	M, F	4D CNN	rs-fMRI	Accuracy, ROC AUC	Held-out Test	71%	Mao et al. (2019)	n.a
McNorgan, 2020	80	69%	n.a	n.a	MTA 168	Adults (18-15)	M, F	MFC	fMRI	Accuracy, Sensitivity, Specificity	K-Fold_CV(K=5)	91%	(McNorgan, Chris et al. 2020)	33,391,011
Olivetti E, 2012	923	38%	n.a	n.a	ADHD-200	Children and young adults (7–21)	M, F	ERT	rs-fMRI	Accuracy, FP, FN, TP, TN, Log(B10)	K-Fold-CV(K = 10)	66%	Olivetti et al. (2012)	23,060,755
Olivetti E, 2015	923	38%	n.a	n.a	ADHD-200	Children and young adults (7–21)	M, F	ERT	rs-fMRI	Accuracy, MCC, J-statistic, F1-score, Log(B10)	K-Fold-CV(K = 10)	62%	Olivetti et al. (2012)	27,747,500
Peng X, 2013	110	50%	n.a	n.a	ADHD-200 (Peking subset)	Children and young adults (7–21)	M, F	ELM	sMRI	Accuracy, ROC AUC	LOOCV	90%	Peng et al. (2013)	24,260,229
Peng, 2021	876	39%	n.a	n.a	ADHD-200	Children and young adults (7-21)	M, F	3D CNN	sMRI and fMRI	Accuracy, Precision, Recall, ROC AUC	K-Fold-CV(K = 5)	73%	(Peng, Jian et al. 2021)	n.a
Qureshi MN, 2016	106	50%	n.a	n.a	ADHD-200	Children and young adults (7–21)	M, F	H-ELM	sMRI	Accuracy	K-Fold-CV(K = 10); K-Fold-CV(70/30 split)	80%; 85%	Qureshi et al. (2016)	27,500,640
Qureshi MN, 2017	106	50%	28	50%	ADHD-200	Children and young adults (7–21)	M, F	ELM	sMRI and rs-fMRI	Accuracy, Sensitivity, Specificity, F1-score, Precision, Recall	Held-out Test	93%	Qureshi et al. (2017)	28,420,972
Riaz, 2017	464	52%	65	44%	ADHD-200 (NeuroImaging, NYU and Peking subset)	Children and young adults (7–21)	M, F	CNN and SVM	rs-fMRI and demographics	Accuracy	Held-out Test	69%	Riaz et al. (2017)	n.a
Riaz, 2018a	442	43%	n.a	n.a	ADHD-200 (NeuroImaging, KKI, NYU and Peking subset)	Children and young adults (7–21)	M, F	SVM	rs-fMRI and demographics	Accuracy, ROC AUC, Sensitivity, Specificity	LOOCV	87%	Riaz, Asad, Alonso, et al. (2018)	29,137,838
Riaz, 2018b	226	54%	n.a	n.a	ADHD-200 (NYU subset)	Children and young adults (7–21)	M, F	CNN	rs-fMRI	Accuracy, Sensitivity, Specificity	Held-out Test	73%	Riaz, Asad, Arif, et al. (2018)	n.a
Sato, 2012	759	36%	171	45%	ADHD-200	Children and young adults (7–21)	M, F	AdaBoost	rs-fMRI	Sensitivity, Specificity, Balanced Accuracy	Held-out Test	55%	Sato et al. (2012)	23,015,782
Semrud-Clikeman, 1996	20	50%	n.a	n.a	Clinic and community	Children and adolescents (6–16)	M, F	PDA	sMRI	Accuracy	Training samples	87%	Semrud-Clikeman et al. (1996)	14,588,457
Sen, 2018	776	37%	171	45%	ADHD-200	Children and young adults (7–21)	M, F	SVM	sMRI and rs-fMRI	Accuracy, Sensitivity, Specificity, J-statistic	Held-out Test	67%	Sen et al. (2018)	29,664,902
Shao, 2018	50	36%	16	36%	ADHD-200 (KKI subset)	Children and young adults (7–21)	M, F	SVM	rs-fMRI	Accuracy, Sensitivity, Specificity, MCC	Held-out Test		Shao et al. (2018)	30,009,990
Sidhu, 2012	668	36%	171	45%	ADHD-200	Children and young adults (7–21)	M, F	SVM	rs-fMRI	Accuracy	training samples	76%	(Sidhu et al., 2012)	23,162,439
Sidhu, 2012	668	36%	171	45%	ADHD-200	Children and young adults (7–21)	M, F	SVM	rs-fMRI	Accuracy	Held-out Test	67%	(Sidhu et al., 2012)	23,162,439
Stanley, 2022	254	50%	n.a	n.a	ABCD	Adolescents (9-20)	M, F	CNN	rs-fMRI	Accuracy, AUC, Sensitivity	Held-out Test	71%	(Emma A. M. Stanley et al. 2022)	n.a
Tan, 2017	215	54%	n.a	n.a	ADHD-200 (NYU subset)	Children and young adults (7–21)	M, F	SVM	sMRI and rs-fMRI	Accuracy, AUC, Sensitivity, Specificity, Balanced Accuracy	K-Fold-CV(K = 10)	68%	Tan et al. (2017)	28,943,846
Tang, 2019	633	43%	n.a	n.a	ADHD-200	Children and young adults (7–21)	M, F	BHT	rs-fMRI	Accuracy, Sensitivity, Specificity	LOOCV	92%	Tang, Wang, et al. (2019)	30,938,224
Tang, 2020	633	43%	n.a	n.a	ADHD-200	Children and young adults (7–21)	M, F	BHT	rs-fMRI	Accuracy, Sensitivity, Specificity	LOOCV	98%	Tang, Li, et al. (2020)	n.a
Wang, 2013	46	50%	n.a	n.a	FCON_1000	Adults (18–50)	M, F	SVM	rs-fMRI	Accuracy, Sensitivity, Specificity	LOOCV	80%	Wang et al. (2013)	23,684,384
Wang, 2018	71	51%	n.a	n.a	ADHD-200 subset	Children and adolescents (6–18)	M, F	SVM	sMRI	Accuracy, Sensitivity, Specificity	LOOCV	75%	Wang et al. (2018)	30,031,733
Wang, 2021	470	25%	117	25%	ADHD-200	Children and young adults (7-21)	M, F	3D MVA-CNN	sMRI and fMRI	Accuracy, Sensitivity, Specificity	K-Fold-CV(K=5)	79%	(Wang, Zijan et al. 2021)	34,517,567
Xiao, 2016	47	68%	n.a	n.a	Clinic	n.a	n.a	Lasso	sMRI	Accuracy, Sensitivity, Specificity	LOOCV	81%	Xiao et al. (2016)	27,747,592
Yao, 2018	189	59%	n.a	n.a	Clinic	Adults (18–34)	M, F	Ensemble	rs-fMRI	Accuracy, Sensitivity, Specificity	K-Fold-CV(K = 10)	80%	Yao et al. (2018)	30,441,383
Yao, 2018	189	59%	n.a	n.a	Clinic	Children and adolescents (6–14)	M	Ensemble	rs-fMRI	Accuracy, Sensitivity, Specificity	K-Fold-CV(K = 10)	86%	Yao et al. (2018)
Yoo, 2020	94	50%			Clinic	Children and adolescents (6–17)	M, F	RF	sMRI and rs-fMRI and DTI	Accuracy, AUC, Sensitivity, Specificity, PPV, NPV	LOOCV	85%	Yoo et al. (2020)	31,321,662
Yoo, 2020	94	50%			Clinic	Children and adolescents (6–17)	M, F	RF	sMRI and rs-fMRI and DTI	Accuracy, AUC, Sensitivity, Specificity, PPV, NPV	Held-out Test	69%	Yoo et al. (2020)	31,321,662
Zhu CZ, 2008	20	45%	n.a	n.a	Community	Children and adolescents (11–17)	M	FDA	rs-fMRI	Accuracy, Sensitivity, Specificity	LOOCV	85%	Zhu et al. (2008)	18,191,584
Zhang-James, 2021	1194	50%	179	50%	ENIGMA	Children and adolescents (11-17)	M,F	MLP	sMRI	AUC ROC, PR curves	Held-out Test		(Zhang-James et al. 2021)	33,526,765
	2132	50%	320	50%		Adults(18-50)	M,F				Held-out Test
Zou, 2017	559	35%	171	45%	ADHD-200	Children and young adults (7–21)	M, F	3D CNN	rs-fMRI and sMRI	Accuracy	Held-out Test	69%	Zou et al. (2017)	n.a
Zu, 2019	216	55%	n.a	n.a	ADHD-200 (NYU subset)	Children and young adults (7–21)	M, F	STM	rs-fMRI	Accuracy	K-Fold-CV(K = 10)	65%	Zu et al. (2019)	29,948,906

Note. ABCD = Adolescent Brain Cognitive Development Study; AUC = the area under the ROC curve; BHT = Binary Hypothesis Testing; CINEPS = Cincinnati Early Prediction Study; CNN = convolutional neural network; CV = cross-validation; LOOCV = leave-one-out cross-validation; DBN = Deep Bayesian Network; DBM = Deep Belief Network; ELM = extreme learning machine; H-ELM = hierarchical extreme learning machine; ENIGMA = Enhancing Neuro Imaging Genetics Through Meta-Analysis consortium; ERT = extremely randomized tree; FCON_1000 = 1000 Functional Connectomes Project database (http://www.nitrc.org/projects/fcon_1000); FDA = Fisher discriminative analysis; fMRI = functional MRI; rs-fMRI = resting-state functional MRI; sMRI = structural MRI; DTI = diffusion tensor imaging; GBM = a gradient boosting method; GPC = Gaussian process classifiers; Log(B10) = the log of the Bayes factor for the hypothesis of dependence versus independence; MCC = Matthews correlation coefficient; MHPC = the histogram of oriented gradients (HOG)-feature-based patient classification; MKL = multi-kernel learning; MVA = multi-view attentional; FCCANN = fully connected cascade artificial neural network; PDA = predictive discriminant analysis; PPV = positive predictive value; NPV = negative predictive value; RF = random forest; SVM = support vector machine; STM = support tensor machine; TP = the number of true positive diagnoses; TN = the number of true negative diagnoses; FP = the number of false positive diagnoses; FN = the number of false negative diagnoses.

Balanced Accuracy = (sensitivity + specificity)/2.
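As a minimal sketch of the balanced accuracy formula above, the following Python snippet applies it to the sensitivity (21%) and specificity (94%) reported for the ADHD-200 winning model (Eloyan et al., 2012). The function names are illustrative, and the 45% ADHD fraction passed to the second function is an assumption borrowed from the test-set composition that Table 1 lists for several ADHD-200 studies, not a value reported for that model.

```python
def balanced_accuracy(sensitivity, specificity):
    """Mean of sensitivity and specificity, as defined in the table note."""
    return (sensitivity + specificity) / 2


def accuracy_from_rates(sensitivity, specificity, positive_fraction):
    """Overall accuracy implied by sensitivity, specificity and class mix."""
    return sensitivity * positive_fraction + specificity * (1 - positive_fraction)


# Values reported in the text for the ADHD-200 winning model (Eloyan et al., 2012).
sens, spec = 0.21, 0.94
print(f"balanced accuracy: {balanced_accuracy(sens, spec):.3f}")

# Assumption: 45% of test participants have ADHD (the figure Table 1 lists for
# several ADHD-200 held-out test sets); the exact mix for this model is not given.
print(f"implied accuracy: {accuracy_from_rates(sens, spec, 0.45):.3f}")
```

With these inputs the balanced accuracy is 57.5%, lower than the 61% raw accuracy, because raw accuracy rewards the high specificity on the majority control class.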

This systematic review examines the prior literature applying ML to MRI data in ADHD to clarify the clinical significance of findings and to assess the implications of the various analytic methods applied. We discuss the progress made over the years as well as lessons and methodological issues that we learned from this body of work. We hope to provide a roadmap for future studies that aim to overcome these issues and achieve clinically useful models for diagnosing ADHD.

Methods

A literature search on MRI-based diagnostic classifiers for ADHD using the key words (“ADHD” AND “MRI” AND (“Machine learning” OR “Classi*”)), together with an examination of the references of the retrieved articles, identified 55 studies in total (up to July 1, 2022; PubMed, Embase, and Google). Supplemental Figure 2 shows the article selection procedure in a PRISMA diagram. The eligible studies applied statistical or machine learning classifiers using MRI data to differentiate participants with ADHD from controls. Table 1 lists the selected studies along with the performance of their best models. If a study dealt with multi-class classification, for example, having ADHD, ASD and control groups, only the two-class classification accuracies involving ADHD versus the control groups were examined in this review. We used percent correct (accuracy) to compare results across studies because it was available for most of the papers. Studies that met the classifier criteria but did not report an accuracy statistic, or other metrics from which accuracy can be computed, were not included in our quantitative analysis. If a study reported multiple models, only the model with the highest accuracy was included in Table 1.

We extracted and examined study characteristics, including machine learning model type, MRI data modality, cross-validation and testing methods, training sample size, training set class ratio (the ratio of the numbers of ADHD versus control participants), data source, dataset age and sex composition, and publication year. We grouped machine learning models into three categories: support vector machine (SVM), convolutional neural networks (CNNs), and others. We labeled studies with a training set class ratio between 0.4 and 0.6 as “balanced” (i.e., nearly equal), and those with higher or lower ratios as “unbalanced.” Nine studies used various methods to balance demographic differences between the ADHD and control groups. These were labeled as “balanced,” even if their original class ratio was outside of the balanced range (Deshpande et al., 2015; Fair et al., 2013; Ghiassian et al., 2016; Qureshi et al., 2016; Riaz, Asad, Alonso, et al., 2018; Stanley et al. 2022; Wang et al., 2013; Zhang-James et al. 2021). We reported the age and sex groups, as well as the minimum and maximum ages of the dataset. For the ADHD-200 samples, the overall age range was used if a specific subset was used but age information was not provided. Minimum and maximum values of age were derived for studies that reported only the mean and standard deviation of the ages.
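The balance labeling rule described above can be sketched as a small Python function. This is an illustration only: it assumes the class ratio is expressed as the fraction of ADHD participants in the training set (as in the ADHD% column of Table 1) and that the 0.4 and 0.6 bounds are inclusive. The function name and the `rebalanced` flag are hypothetical.

```python
def label_class_balance(adhd_fraction, rebalanced=False, low=0.4, high=0.6):
    """Label a training set as "balanced" or "unbalanced".

    adhd_fraction: fraction of ADHD participants in the training set
                   (assumed interpretation of the class ratio).
    rebalanced:    True if the study applied its own balancing method,
                   in which case it counts as balanced regardless of ratio.
    """
    if rebalanced or low <= adhd_fraction <= high:
        return "balanced"
    return "unbalanced"


# Example: a 37% ADHD training set is unbalanced unless the study rebalanced it.
print(label_class_balance(0.37))
print(label_class_balance(0.37, rebalanced=True))
```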

We also classified studies based on the methods they used to evaluate model performance and generalizability. Two methods were used. The held-out test set method evaluates model performance on data that were set aside, that is, data that were not used during model estimation and training. Because this method requires a large sample, many studies resort to cross-validation (CV) methods to assess model performance. CV methods repeatedly re-sample the examples to be set aside during model fitting. The most commonly used versions are leave-one-out CV (LOOCV) (Deshpande et al., 2015; Fair et al., 2013; Hart et al., 2014; Iannaccone et al., 2015; Peng et al., 2013) and K-Fold-CV (where K is often 10, 5, or 2) (Brown et al., 2012; Dai et al., 2012; Du et al., 2016; Qureshi et al., 2016). For example, in 10-fold CV, the original dataset is partitioned into 10 equal sub-samples or “folds.” In each iteration of model estimation, nine of the folds are used to estimate model parameters and the left-out fold is used to estimate model accuracy. The left-out fold changes from iteration to iteration. For LOOCV, one sample is left out for testing while all the others are used for training or model fitting. In either situation, the process is repeated until every sample has served in both the training and test sets. The CV accuracy is estimated by averaging the accuracies over all iterations. Although the left-out samples are not used during model training/fitting in a given iteration, they are, nevertheless, used as training examples in other iterations.
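The difference between the two evaluation schemes can be sketched with index bookkeeping alone. The Python example below is an illustration under simple assumptions (random shuffling, no stratification by diagnosis or site), not a reconstruction of any reviewed study's pipeline; all function names are hypothetical.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def loocv_splits(n_samples):
    """LOOCV is K-fold CV with K equal to the number of samples."""
    return k_fold_splits(n_samples, n_samples)


def held_out_split(n_samples, test_fraction=0.2, seed=0):
    """Set aside a test set once; it is never used for model fitting."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


# In 10-fold CV every sample is validated once and trained on nine times.
for train, validation in k_fold_splits(20, 10):
    assert not set(train) & set(validation)
```

In the CV functions each sample eventually appears in a training set, whereas the indices returned as the test set by `held_out_split` never do. This is the distinction the paragraph above draws.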

Our main objective was to understand how study features influenced model accuracy. We used likelihood-ratio (LR) test-assisted variable selection in combination with multivariate linear regression to quantitatively evaluate whether these features predicted model accuracy. The variable selection algorithm implemented in Stata 16’s gvselect command computes both Akaike’s (1974) information criterion (AIC) and Schwarz’s (1978) Bayesian information criterion (BIC) (StataCorp, 2019). We performed the variable selection and linear regression modeling for all the studies combined, as well as separately for the K-Fold-CV, LOOCV, and held-out test groups. Training sample size was primarily examined as a continuous variable. However, we also classified sample sizes as small (<300) or large (>300) to compare the variability of their accuracy estimates using Levene’s (1960) robust test statistic. In addition to the quantitative analysis, we also qualitatively reviewed the relevant study characteristics where a quantitative analysis was not possible.
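For readers unfamiliar with Levene's test, the following Python sketch computes the mean-centered form of the statistic, which is the one Stata labels W0. It is a plain re-implementation for illustration, not the code used in this review: the statistic is a one-way ANOVA F ratio computed on the absolute deviations of each observation from its group mean. The function name is hypothetical and the input values in the usage line are made-up toy numbers, not study accuracies.

```python
def levene_w0(groups):
    """Levene's W0: one-way ANOVA F statistic on |y - group mean|."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)

    # Absolute deviation of each observation from its own group mean.
    deviations = []
    for g in groups:
        mean = sum(g) / len(g)
        deviations.append([abs(y - mean) for y in g])

    group_means = [sum(z) / len(z) for z in deviations]
    grand_mean = sum(sum(z) for z in deviations) / n_total

    between = sum(len(z) * (m - grand_mean) ** 2
                  for z, m in zip(deviations, group_means))
    within = sum((v - m) ** 2
                 for z, m in zip(deviations, group_means) for v in z)
    # Undefined (division by zero) if all deviations within each group are equal.
    return (n_total - k) / (k - 1) * between / within


# Toy data only: the second group is twice as spread out as the first.
print(levene_w0([[1, 2, 3], [2, 4, 6]]))
```

The statistic is compared against an F distribution with (k − 1, N − k) degrees of freedom, where k is the number of groups and N the total number of observations; a large value indicates unequal variances between groups.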

Results

Over half of the included studies (N = 31, 56%) reported only CV results without a held-out test set. Forty-two percent (N = 23) used a held-out test sample to evaluate classifier performance. All but one of these 23 studies used the ADHD-200 samples. Among the studies that reported held-out test results, six also reported CV results. Figure 1 shows that the 16 studies using K-Fold-CV and the 17 studies using LOOCV reported, on average, higher accuracies than studies using held-out tests (F_{(2, 55)} = 34.52, p < .001).

Figure 1.

Best prediction accuracies reported in each study for each type of the available tests: K-Fold-CV, LOOCV or held-out tests.

Excluding a 2021 outlier study with the largest sample size and low accuracy (Zhang-James et al. 2021), the accuracy estimates overall increased in later publication years (F_{(1, 54)} = 5.15, p = .027, Figure 2). There was no significant change in reported accuracies over the years in studies using the K-Fold-CV methods (Figure 2 Left). The increase was primarily driven by studies using LOOCV and held-out test sets, both of which showed a statistically significant increase over time (F_{(1, 17)} = 12.63, p = .0024 and F_{(1, 22)} = 4.61, p = .043, respectively; Figure 2 Middle and Right).

Figure 2.

Accuracy in studies published over the years. The Zhang-James et al. (2021) study is highlighted with red triangles to signify its outlier status due to its large sample size and low accuracy.

Training sample size, overall, was not significantly associated with accuracy (Figure 3 Left and Right), either as a continuous or categorical variable. However, it negatively predicted accuracies in the LOOCV group (F_{(1, 17)} = 10.15, p = .005, Figure 3 Middle). Studies with large samples had lower mean accuracies than studies with small samples (72% vs. 77% mean accuracies, t = 1.79, p = .038). Furthermore, the accuracy results from small studies were more variable than those from large studies in the held-out test group (Levene’s robust test statistic W0_{(1, 30)} = 6.58, p = .015). The variance differences between large and small samples were not statistically significant for either the K-Fold-CV or LOOCV group.

Figure 3.

Accuracy versus training sample size. Sample sizes <300 are labeled as triangles and sample sizes >300 as circles. The fitted line between accuracy and sample size is plotted for each test type.

Twenty-four studies (49%) used a training dataset that had severely imbalanced classes. Nine of those studies applied data balancing methods to compensate for the class imbalance and are grouped as balanced studies. Class-balanced studies reported higher accuracies in both the K-Fold-CV (F_{(1, 19)} = 6.55, p = .02) and LOOCV (F_{(1, 17)} = 36.02, p = .0001; Supplemental Figure 3A) groups. However, the balanced studies in the K-Fold-CV group were mostly small studies, with the exception of three studies (Supplemental Figure 3B), so we could not differentiate whether the higher accuracy was due to the negative relationship with sample size or the benefit of data balance. The higher accuracies in the balanced LOOCV group were related to sample size, as only the large group (>300 samples) showed a significant relationship between accuracy and the balanced criterion (F_{(1, 5)} = 44.81, p = .0011). No statistical difference was found for either accuracies or training sample size between the balanced and unbalanced studies in the held-out test group.

Because the ADHD-200 dataset was the main data source, most studies (N = 29) used resting-state fMRI data (rs-fMRI), or rs-fMRI in combination with sMRI data (multi-modal, N = 16). Only eight studies used sMRI data alone, and only two used task-based fMRI data. There were no significant differences in sample size across the different feature types (Supplemental Figure 4A). The two task-based fMRI studies, which both used LOOCV, reported significantly lower accuracies than studies of other MRI modalities (t = −23.3, p < .0001). Apart from these, there was no difference in reported classification accuracies among the sMRI, rs-fMRI, and multi-modal studies (Supplemental Figure 4B).

The ADHD-200 dataset has a mix of children, adolescents and young adults (age 7–21). Ten other studies focused only on children and/or adolescents (under age 18). Only five studies examined classification models for older adults (Chaim-Avancini et al., 2017; McNorgan et al. 2020; Wang et al., 2013; Yao et al., 2018; Zhang-James et al. 2021). Overall, the difference in accuracy across the three types of age compositions was not significant (F_{(2, 55)} = 1.74, p = .19).

Most studies used a mixture of male and female participants. Four studies only included boys (Johnston et al., 2014; Lim et al., 2013; Yao et al., 2018; Zhu et al., 2008). These four reported significantly higher classification accuracies than all other studies that used a mixture of males and females (F_{(1, 55)} = 32.14, p = .0001). However, all four were small studies (n = 20–189). Three reported LOOCV and one reported 10-Fold-CV accuracies.

Across all studies, the most frequently used model was the support vector machine (SVM). It was used in 20 (36%) studies. SVM and most other ML models cannot directly analyze images. Instead, they analyze some transformation of images such as regional volumes or cortical thickness. In contrast, convolutional neural networks (CNNs) can analyze images directly and thus have access to all the information available. Only in recent years (2017–2022) have studies applied CNN methods to MRI images (N = 11). We did not find any statistically significant differences between the accuracies reported with SVM, CNN, or other models for ADHD (F_{(2, 55)} = 0.01, p = .91, Supplemental Figure 5).

Discussion and Qualitative Review

Our quantitative analysis of prediction accuracy for ADHD revealed several significant findings. First, accuracies based on K-Fold-CV or LOOCV were significantly higher than those reported using held-out tests, which suggests that CV methods may over-estimate model performance. Second, we found greater variability of test accuracies in studies with small sample sizes than in those with larger sample sizes, and an inverse relationship between sample size and K-Fold-CV accuracies. Third, estimates of accuracy increased with publication year. This was driven primarily by the Hold-out and the LOOCV test-groups. Since sample size has been roughly the same since 2012, with the exception of a 2021 study (Zhang-James et al., 2021), the increasing accuracy over time could be due to several design features: (1) the use of more sophisticated models (such as deep neural networks and CNN models), (2) improved methods of data balancing and data augmentation, and (3) use of feature selection, feature space reduction methods, or different MRI data modalities. However, our analysis cannot conclusively clarify whether any or all of the above contributed to the increase in accuracy. We discuss the implications of these findings and provide further review of some study characteristics that were not examined in our quantitative analysis.

Cross-Validation Versus Held-out Test Set

In the CV approach, the validation samples used to estimate accuracy are not used during the model training/fitting at each iteration. They are, nevertheless, used as training examples in other iterations. Moreover, because there are many iterations, the validation set can influence parameter estimates. In contrast, the held-out test method uses a test set that was never used during model training. As a result, CV accuracies have been shown to overestimate test set accuracy when both are available (Brown et al., 2012; Dai et al., 2012). Our results confirm the inflation of accuracy by K-Fold-CV or LOOCV. Held-out test accuracy is a better indicator of model performance with unseen samples.
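The difference between the two evaluation schemes can be illustrated with a minimal sketch of the data splits alone. The helper names `kfold_indices` and `holdout_split` are hypothetical and are not taken from any reviewed study, and no model is fitted. The sketch only assumes a simple interleaved fold assignment and shows that, under K-fold CV, every validation sample is also a training sample in other iterations, whereas held-out test samples never enter training.

```python
import random

def kfold_indices(n_samples, k):
    """Yield (train, validation) position lists for K-fold CV (illustrative helper)."""
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]  # k disjoint, roughly equal folds
    for i, val in enumerate(folds):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, val

def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Return (development, test) index lists; the test set is set aside once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return sorted(indices[n_test:]), sorted(indices[:n_test])

n = 20
dev, test = holdout_split(n, test_fraction=0.2, seed=0)

# Run 5-fold CV on the development portion only.
used_for_training = set()
used_for_validation = set()
for train_pos, val_pos in kfold_indices(len(dev), k=5):
    used_for_training.update(dev[p] for p in train_pos)
    used_for_validation.update(dev[p] for p in val_pos)

# Samples that served as both validation and training data across iterations.
cv_overlap = used_for_training & used_for_validation
# Held-out samples that leaked into any training fold (should be none).
test_leak = used_for_training & set(test)
print(len(cv_overlap), len(test_leak))  # 16 0
```

In this toy split, all 16 development samples appear in both roles across iterations, while none of the 4 held-out samples is ever seen during training.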

More than half of the studies (N = 31, 56%) reported only CV results without a held-out test set. An earlier review reported that 13% of ADHD neuroimaging (including MRI and electroencephalographic) studies consisted of “circular analysis,” where independent test sets were not used (Pulini et al., 2019). Our results are more similar to the estimate of Kriegeskorte et al. (2009) that 42% to 56% of studies consisted of “circular analysis,” based on all fMRI studies published in five prestigious journals (Nature, Science, Nature Neuroscience, Neuron, Journal of Neuroscience) in 2008. Nevertheless, our review highlights the importance of building a large dataset through collaborations and open data sharing, because the majority of the studies that could afford a held-out test were those that used the ADHD-200 dataset.

Sample Size

Machine learning, particularly deep learning, often requires large sample sizes due to the large number of parameters and hyperparameters that a model needs to learn. However, many neuroimaging studies of ADHD used very small sample sizes. In our small sample group, the sample size ranged from 20 to 239 (average sample size 112). Small sample sizes can lead to model overfitting and overestimates of accuracy (Brain & Webb, 1999; Wolfers et al., 2015). In our review, this effect was reflected in the large variability of accuracies in the CV studies. Indeed, some of the highest and lowest test accuracies were reported in studies with extremely small sample sizes. None of the studies reviewed here used a learning curve analysis to assess overfitting. This method, which examines model performance as a function of the number of training samples, can help determine whether a model is overfit and whether it could benefit from more training examples (Zhang-James et al., 2020).
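A learning curve analysis of the kind described above can be sketched as follows. Everything in this example is an assumption made for illustration: the data are synthetic (one weakly informative feature plus 19 noise features), the classifier is a simple nearest-centroid rule, and the function names are hypothetical. It is not the procedure of any reviewed study.

```python
import random

def make_data(n, seed):
    """Synthetic two-class data: one weakly informative feature plus 19 noise features."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # alternate labels so both classes are always present
        informative = rng.gauss(0.8 * label, 1.0)
        noise = [rng.gauss(0.0, 1.0) for _ in range(19)]
        X.append([informative] + noise)
        y.append(label)
    return X, y

def fit_centroids(X, y):
    """Nearest-centroid classifier: store the per-class feature means."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class with the closest centroid (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def accuracy(centroids, X, y):
    return sum(predict(centroids, x) == t for x, t in zip(X, y)) / len(y)

def learning_curve(train_sizes, n_test=500, seed=0):
    """Return (n, training accuracy, test accuracy) for each training size."""
    X_test, y_test = make_data(n_test, seed=seed + 1)
    X_pool, y_pool = make_data(max(train_sizes), seed=seed)
    curve = []
    for n in train_sizes:
        model = fit_centroids(X_pool[:n], y_pool[:n])
        curve.append((n,
                      accuracy(model, X_pool[:n], y_pool[:n]),
                      accuracy(model, X_test, y_test)))
    return curve

for n, train_acc, test_acc in learning_curve([10, 20, 50, 100, 200, 400]):
    print(f"n={n:4d}  train={train_acc:.2f}  test={test_acc:.2f}  "
          f"gap={train_acc - test_acc:+.2f}")
```

A training accuracy that far exceeds the test accuracy at small n, with the gap narrowing as n grows, would be the typical signature of an overfit model that can still benefit from more training examples.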

We found a negative relationship between sample size and K-Fold-CV accuracies. Because increasing the number of training samples typically improves performance (Bengio et al., 2013), this suggests that the lower estimates of accuracy from the larger samples are more likely to be correct than the higher estimates from smaller samples. Those higher estimates were likely biased as described in the prior section. Pulini et al. (2019) also reported a negative relationship between sample size and accuracy in ADHD imaging studies, and Vabalas et al. (2019) observed a negative relationship between sample size and reported accuracies in machine learning classifiers of autism spectrum disorders. Both reviews were based on studies with sample sizes up to only ~1,000. Similar observations were also made by Wolfers et al. (2015) when reviewing neuroimaging-based diagnostics for a number of different psychiatric disorders.

Sample Heterogeneity and Data Imbalance

Although collaborative consortia, such as the ADHD-200, used relatively large sample sizes, such collaborations raise issues about sample heterogeneity and the use of imbalanced data. For example, like many other clinically referred samples, the ADHD-200 dataset had more boys than girls in the ADHD group compared with the control group. The ADHD group also had lower IQs than the control group. In addition, the demographic composition and sample acquisition methods differed across study sites. The problem of dataset imbalance was addressed by several participating teams. Brown et al. (2012) from the University of Alberta found that models using only demographic information including age, sex, handedness, and IQ had sufficient statistical power to achieve a test accuracy of 62.5%, higher than their models using fMRI features. In the work of Colby et al. (2012), a model using only demographic information had a higher accuracy (62.7%) than models using multimodal MRI features (55%). Both models using only the demographic features, although not meeting the requirements of the competition, outperformed the winning team that reported 61% accuracy using both structural and rs-fMRI data along with the demographic predictors in an ensemble model (Eloyan et al., 2012). An additional study by Sidhu et al. (2012) also reported better accuracy using demographic information than the rs-fMRI features using the ADHD-200 dataset. These observations highlight the concerns of data imbalance and suggest that, if not dealt with carefully, the classifiers could be learning the neural correlates of the demographic features rather than of the diagnostic groups.

Some studies used methods to address the problem of unbalanced data. One approach is random undersampling, that is, removing some research participants and creating a smaller sample that is balanced for confounding factors (Qureshi et al., 2016; Wang et al., 2013). This is in contrast to oversampling, where some random samples from the minority classes are duplicated to create a larger and balanced dataset. Others used regression to control confounding factors such as age, sex, and acquisition sites, and used adjusted MRI features (residuals) in the classification algorithms (Deshpande et al., 2015; Fair et al., 2013). Some studies mentioned data balancing, but did not provide details on how it was done (Ghiassian et al., 2016). Lim et al. (2013) used a Gaussian process classifier to discriminate 29 boys with ADHD from 19 control boys. The limited sample sizes prohibited subsampling to balance the data. They noted that, although the boys with ADHD had significantly lower IQ than the control boys, the model-generated probability of having ADHD was not correlated with IQ, age, and other clinical features (Lim et al., 2013). In more recent studies, more sophisticated methods such as the Synthetic Minority Over-sampling Technique (SMOTE; Chawla et al., 2002) were used to generate synthetic minority samples to combat the sample imbalance problem (Riaz, Asad, Alonso, et al., 2018).
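The three simplest strategies mentioned above (random undersampling, random oversampling, and regressing out a confound) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are hypothetical, balancing is done on the class label only, and the residualization handles a single confound with ordinary least squares. It does not reproduce the pipeline of any cited study, and SMOTE is not implemented here.

```python
import random

def _indices_by_class(labels):
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    return by_class

def undersample(labels, seed=0):
    """Random undersampling: keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    by_class = _indices_by_class(labels)
    n_min = min(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        kept.extend(rng.sample(idx, n_min))
    return sorted(kept)

def oversample(labels, seed=0):
    """Random oversampling: duplicate minority samples until all classes match the largest."""
    rng = random.Random(seed)
    by_class = _indices_by_class(labels)
    n_max = max(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        kept.extend(idx)
        kept.extend(rng.choice(idx) for _ in range(n_max - len(idx)))
    return sorted(kept)

def residualize(feature, confound):
    """Remove the linear effect of one confound from one feature (ordinary least squares)."""
    n = len(feature)
    mean_c = sum(confound) / n
    mean_f = sum(feature) / n
    var_c = sum((c - mean_c) ** 2 for c in confound)
    cov = sum((c - mean_c) * (f - mean_f) for c, f in zip(confound, feature))
    slope = cov / var_c
    return [f - (mean_f + slope * (c - mean_c)) for c, f in zip(confound, feature)]

# Hypothetical imbalanced sample: 6 ADHD participants and 10 controls.
labels = ["ADHD"] * 6 + ["control"] * 10
print(len(undersample(labels)), len(oversample(labels)))  # 12 20

# A feature that is fully explained by age leaves (near-)zero residuals.
age = [8, 9, 10, 11, 12, 13]
volume = [2.0 * a + 1.0 for a in age]
residuals = residualize(volume, age)
```

One design point worth noting: resampling and residualization are usually fitted on the training portion only, since applying them before the train/test split can leak information into the evaluation.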

Previous studies from other fields have shown that not all class balancing methods work equally well in reducing classifiers’ bias toward the majority class, and not all guarantee good performance (Blagus & Lusa, 2013; He & Ma, 2013). In the ADHD studies that we reviewed, we did not find higher held-out test accuracies in balanced studies than in unbalanced studies. Balanced studies in the K-Fold-CV group reported higher accuracies than those with unbalanced samples. However, they were mostly smaller studies than the unbalanced studies. It is not clear, at least for the K-Fold-CV group, if balanced designs led to higher accuracies, because sample size was a strong and negative predictor for the accuracies. Moreover, higher accuracies in the balanced LOOCV group were dependent on sample size (only significant for large studies with over 300 participants). This suggests that sample balancing may help performance in cases of large heterogeneous studies. As expected, the results indicate that small sample sizes are not compensated by data balancing. More studies and larger sample sizes will be needed to find the appropriate class balancing methods and assess the potential benefit.

Classification Performance Metrics

When test sets (or cross-validation sets) are also imbalanced, the overall accuracy may not be an ideal indicator of the performance of the classifier. A high accuracy can simply result from a classifier that classifies all samples into the class that has more participants. Most studies (N = 36, 75%) addressed this concern by also reporting sensitivity (True Positive rate, TP; the percentage of correctly identified ADHD cases) and specificity (True Negative rate, TN; the percentage of correctly identified controls). Three studies reported balanced accuracy, which is the arithmetic mean of the sensitivity and specificity, and three studies reported Youden’s J-statistic (sensitivity + specificity − 1).
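The point about overall accuracy can be made concrete with a small worked example. The test set below is hypothetical (80 controls and 20 ADHD cases), the classifier is a deliberately degenerate one that labels everyone as a control, and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, true negatives, false positives, and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def threshold_metrics(y_true, y_pred, positive=1):
    """Single-threshold metrics; assumes both classes occur in y_true."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "youden_j": sensitivity + specificity - 1,
    }

# Hypothetical imbalanced test set: 80 controls (0) and 20 ADHD cases (1).
y_true = [0] * 80 + [1] * 20
# Degenerate classifier: every participant is labeled as a control.
y_pred = [0] * 100

metrics = threshold_metrics(y_true, y_pred)
print(metrics)
# accuracy 0.8, sensitivity 0.0, specificity 1.0,
# balanced_accuracy 0.5, youden_j 0.0
```

The 80% accuracy here reflects only the class proportions; the balanced accuracy of 0.5 and J of 0 show that the classifier has no discriminating ability.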

Compared with percent correct, a better method to evaluate the overall performance of a classifier is the area under the Receiver Operating Characteristic curve (ROC; Fawcett, 2004). The ROC curve plots sensitivity over the full range of false positive rates (equivalent to 1 − specificity). The area under the ROC curve (AUC) measures the overall diagnostic accuracy of a classifier. Higher AUCs indicate better discriminating power (with 1 for the perfect classifier and 0.5 for the random non-discriminative classifier). The AUC is in general less sensitive to imbalance of a dataset compared with the percent correct measure, because the AUC is not biased toward models that perform well on the majority class at the expense of the minority class (He & Ma, 2013). Davis and Goadrich (2006) suggested that the area under the Precision-Recall curve (AUPRC) is superior for assessing extremely imbalanced datasets and more informative than the ROC curve. The Precision-Recall curve plots precision (the percentage of examples classified as positive that are true positives, also known as positive predictive value, PPV) over recall (sensitivity). Overall, in the body of literature that we examined, no studies reported the AUPRC, and only 13 reported the AUC.
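Both threshold-free summaries can be computed directly from classifier scores. The sketch below is illustrative only: the scores are hypothetical, the AUC is computed through its rank interpretation (the probability that a randomly chosen case receives a higher score than a randomly chosen control), and the AUPRC is approximated by average precision, a step-wise summary that assumes no tied scores.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of the precision values at each rank where a case is retrieved."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos

# Hypothetical scores: 2 cases (label 1) among 8 controls (label 0).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.8, 0.5, 0.3, 0.25, 0.2, 0.15, 0.1, 0.05]

auc = roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
print(auc, ap)  # 0.875 0.75
```

In this toy example, two controls outrank one of the two cases. The AUC remains fairly high (0.875) because most case-control pairs are still ordered correctly, whereas the precision-based summary drops to 0.75, which illustrates why precision-recall analysis can be more informative when cases are rare.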

Other popular metrics for machine learning models are the F1-score and the Matthews correlation coefficient (MCC). The F1-score is the harmonic mean of precision and recall. MCC is the correlation coefficient between the predicted and actual classes. Like the areas under the PRC or ROC, both the F1-score and MCC are better indicators of model performance than the percent correct statistic if test data classes are imbalanced. However, of the studies included in this review, only three reported F1-scores, and only two reported MCC.
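As a brief illustration of these two definitions, the sketch below computes both from the four cells of a confusion matrix. The counts are hypothetical, the function name is illustrative, and returning 0.0 when a denominator is zero is a convention assumed here rather than part of either definition.

```python
import math

def f1_and_mcc(tp, tn, fp, fn):
    """Return (F1-score, MCC) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc

# Hypothetical test set with 20 cases and 80 controls.
f1, mcc = f1_and_mcc(tp=10, tn=70, fp=10, fn=10)
accuracy = (10 + 70) / 100
print(accuracy, f1, mcc)  # 0.8 0.5 0.375
```

With these counts, percent correct is 80%, yet only half of the cases are detected and only half of the positive calls are correct, which the F1-score (0.5) and MCC (0.375) reflect.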

Because most studies used percent correct to measure accuracy, we could only analyze percent correct in the current review. This may not represent the true model performance due to the limitations of this metric. We recommend that future studies adopt ROC or PRC analysis methods. Furthermore, inspecting the curves visually can reveal more information about how well the model discriminates classes at different decision thresholds. We do not recommend metrics such as the F1-score, MCC, and J-statistic, because these scores only capture the diagnostic matrix at a single threshold level. Furthermore, the performance metrics are important, not only for properly interpreting test results, but also for model training. If a model is trained by maximizing a biased metric, it will not be fully optimized to generalize to other samples. Finally, metrics that are insensitive to class imbalance (such as AUC or AUPRC) do not protect against bias due to feature imbalance, as discussed in the prior section.

Age and Sex

Although the onset of ADHD occurs prior to age 12, two-thirds of children continue to have symptoms and functional impairments into adulthood (Faraone et al., 2006). Longitudinal data show that some ADHD-associated brain alterations diminish during adolescence and adulthood (Castellanos et al., 2002; Shaw et al., 2006, 2014). Consistent with this, the very large ENIGMA ADHD study reported significant ADHD versus control differences for children but not for adolescents and adults (Hoogman et al., 2017, 2019). Neuroimaging classifier studies have focused on younger populations; only five studies developed ML classifiers for adults. Our observation of a lack of difference in predictive accuracy between classifiers for children/adolescents versus adults is, therefore, inconclusive due to the small number of adult studies. Few longitudinal studies have been reported for imaging in ADHD (Castellanos et al., 2002; Shaw et al., 2006, 2014). No machine learning models have been applied to longitudinal data yet. More efforts are needed to overcome the shortage of adult ADHD samples, as well as of imaging data across the life span.

ADHD is more prevalent in boys than in girls (Faraone et al., 2015; Ottosen et al., 2019). As a result, although the majority of the studies included samples from both males and females, a high percentage of ADHD samples were from males (e.g., ~80% male in the ADHD-200 dataset). However, the control samples were often balanced (e.g., 52% male in the ADHD-200 dataset). If sex is left unbalanced, it could result in erroneous prediction results, as we described in the above sections. Furthermore, brain alterations have been found to differ between the sexes at different ages (Almeida Montes et al., 2013; Hoogman et al., 2019; Onnink et al., 2014). The low representation of females in available samples may prevent the classifiers from learning female-specific brain alterations. Our quantitative analysis showed significantly higher accuracies in four male-only studies than in other studies of sex-mixed samples. However, all four were studies with small sample sizes (<189), with three reporting LOOCV accuracies and the other reporting 10-Fold CV. Given the sample size effect and inflation by CV methods, it is inconclusive whether ML models predict ADHD better in boys than in girls.

MRI Modality

Although we found no significant difference in the accuracies reported for the sMRI and rs-fMRI studies, the small number of studies using sMRI data precludes any meaningful inferences regarding which MRI modality is the most informative for discriminating ADHD patients from controls. Some studies attempted to identify the most informative MRI data modality. Qureshi et al. (2017) found that sMRI features yielded the highest prediction accuracy. Colby et al. (2012) found that combined multi-modality features performed best compared with individual data modalities. However, all the MRI models performed worse than a classifier using only demographic features (Colby et al., 2012). In a later study using a three-dimensional CNN model, Zou et al. extracted higher-level features from the sMRI and rs-fMRI modalities separately. This design leveraged the relationship between the two MRI modalities, yet still was able to extract independent features that collectively were useful for classification (Zou et al., 2017). The authors also found that using multi-modal features outperformed either data modality alone (Zou et al., 2017). Despite these individual observations, the overall lack of statistically significant differences in accuracies across different modalities in our review suggests that more studies are needed before we can determine which MRI modalities or combinations thereof are most informative for diagnostic classification.

ML Classifiers

SVM was the ML model that was used most frequently, accounting for 36% of studies. SVM, however, is limited in handling images and relies on other preprocessing methods to extract a tabular representation of three-dimensional brain images. In more recent years, an increasing number of studies have used neural networks (Chen et al., 2022; McNorgan et al., 2020; Peng et al., 2021; Stanley et al., 2022; Wang et al., 2021; Zhang-James et al., 2021), particularly CNNs, which were developed for image analysis. We did not, however, observe statistically significant differences between the accuracies of the SVM and CNN models for ADHD. This finding is limited by the small number of studies using CNN classifiers. Nevertheless, because the use of CNNs will likely increase in the future, we here describe their current contributions to the field and their potential for the future.

Riaz et al. (2017) used a CNN-based method (FCNet) to extract the functional connectivity (FC) of brain regions and then trained an SVM classifier using the extracted features to discriminate ADHD from control participants. The classifier achieved a highest held-out test accuracy of 68.6% for the ADHD-200 Peking subset. In the follow-up study, the team built an end-to-end model system, DeepFMRI, which utilized multiple FCNets to extract features that were then fed into a deep neural network (Riaz, Asad, Arif, et al., 2018). DeepFMRI streamlined feature generation, feature selection, and classification in one framework, and achieved a highest test accuracy of 73.1% for the NYU subset. Using preprocessed rs-fMRI and sMRI features as independent inputs, Zou et al. (2017) used a two-branched three-dimensional CNN to learn hierarchical features from each unique modality in a joint learning task. The multi-modal joint learning CNN architecture was superior to CNNs using either data modality alone. Aradhya et al. (2019) also used a CNN classifier and extracted features using the Deep Transformation Method (DTM).

Most studies, including many CNN studies, used pre-processed MRI features, such as those anatomized to an AAL template. Mao et al. argued that rather than using hand-crafted features, one should use a CNN to directly learn discriminatory features from images. Their four-dimensional CNN classifier, designed to learn and extract spatial and temporal features from rs-fMRI images, discriminated ADHD from control participants with an accuracy of 71.3% (Mao et al., 2019). To increase their sample size and reduce overfitting, the authors augmented data by transforming rs-fMRI data into many short, fixed-length video clips. Despite their promising results, they acknowledged that much work is still needed to localize the most discriminative sequences. Interestingly, a CNN using activation correlations from individual brain regions of the Default Mode Network (DMN) outperformed those using whole brain features (Ariyarathne et al., 2020). Using only one relevant brain region substantially reduced feature space and complexity. The significantly improved model performance also suggests that current sample sizes, in relation to the number of features available, may be limiting the CNN models’ capacity. With more samples becoming available in the future, and with the growing number of publicly available raw MRI datasets, CNN methods will likely be seen in more and more studies and be explored to their full capacity for feature extraction and classification, as has been the case for computer vision (Arcadu et al., 2019; Bhanumathi & Sangeetha, 2019; Iqbal et al., 2018; Lin et al., 2018; Toyonaga et al., 2017).

Building Larger Datasets

Sample size has been a major bottleneck impeding the development of more accurate and clinically useful imaging classifiers for ADHD. The largest MRI dataset, to date, has been built by the Enhancing NeuroImaging Genetics through Meta-Analysis (ENIGMA) consortium. Under the umbrella of the ENIGMA consortium, many independent working groups for specific diseases or phenotypes have been established, including ADHD. By implementing standardized data processing protocols and pipelines, the ENIGMA consortium made it possible to share data across many sites to perform within-disorder and cross-disorder studies (Boedhoe et al., 2020; Thompson et al., 2014, 2017, 2020). The ENIGMA ADHD Working Group has obtained over 4,100 samples of ADHD participants and controls from 37 sites thus far. In the initial ENIGMA ADHD reports, Hoogman et al. (2017, 2019) reported that, for children, ADHD was associated with significant volumetric reductions in intracranial volume, amygdala, caudate nucleus, nucleus accumbens, hippocampus, and cortical surface areas from many regions. No significant differences were found for adolescents or adults. Furthermore, the estimated effect sizes for children were small, ranging from 0.11 to 0.19. Users of the ENIGMA ADHD dataset, however, face the same problems of data heterogeneity and imbalanced demographic groups as those using the ADHD-200 dataset. Significant challenges remain when using such data to build a machine learning classifier. Furthermore, the ENIGMA ADHD data is primarily preprocessed sMRI data in tabular form. Not all sites have data on other modalities, such as rs-fMRI or DTI, available for their samples. The ENIGMA ADHD sites have not yet pooled raw MRI images, which are needed for CNN models. Nevertheless, we encourage the research community to continue to contribute ADHD samples to this cohort and provide more open access to various data modalities. Future studies that attempt to advance ML-powered ADHD diagnostic classifiers should use the available ENIGMA ADHD dataset, as this resource has been under-utilized in AI and ML studies of ADHD.

Conclusions

Our review of ML studies of MRI-based ADHD diagnostic classifiers has important implications for methods development, but these studies have not yet led to clinically useful classifiers. Our review shows that the variability of results across studies is due, in part, to differences in methodology. Future work should use the largest samples possible and should rely on a held-out test set, rather than cross-validation, for estimating prediction accuracy. Future studies should not rely on percent correct as a measure of accuracy in unbalanced samples. Our analysis also highlighted the need for data from underrepresented groups, particularly females and adults. We hope that our review provides a better understanding of the efforts invested in developing ADHD imaging classifiers in the field and encourages more stringent model design and data processing for future studies. In the meantime, the initial results from the ENIGMA ADHD consortium should encourage more sites to participate. The lack of a very large multi-modal dataset that includes sufficient data from both sexes and all ages may be the biggest impediment to developing a clinically useful classifier for diagnosing ADHD.

Supplemental Material

Supplemental material, sj-docx-1-jad-10.1177_10870547221146256 for Machine Learning and MRI-based Diagnostic Models for ADHD: Are We There Yet? by Yanli Zhang-James, Ali Shervin Razavi, Martine Hoogman, Barbara Franke and Stephen V. Faraone in Journal of Attention Disorders

Footnotes

Declaration of Conflicting Interests

The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Drs. Zhang-James and Hoogman declare no conflict of interest. Dr. Franke has received educational speaking fees from Medice. In the past year, Dr. Faraone received income, potential income, travel expenses, continuing education support, and/or research support from Aardvark, Aardwolf, Tris, Otsuka, Ironshore, KemPharm/Corium, Akili, Supernus, Atentiv, Noven, Axsome, and Genomind. With his institution, he has US patent US20130217707 A1 for the use of sodium-hydrogen exchange inhibitors in the treatment of ADHD. In previous years, he received support from: Alcobra, Arbor, Aveksham, CogCubed, Eli Lilly, Enzymotec, Impact, Janssen, Lundbeck/Takeda, McNeil, NeuroLifeSciences, Neurovance, Novartis, Pfizer, Shire, and Sunovion. He also receives royalties from books published by Guilford Press: Straight Talk about Your Child’s Mental Health; Oxford University Press: Schizophrenia: The Facts; and Elsevier: ADHD: Non-Pharmacologic Interventions. In addition, he is the program director of .

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Dr. Zhang-James is supported by the European Union’s Horizon 2020 research and innovation program under grant agreements No 965381 and NIH R01 MH116037. Dr. Hoogman is supported by a personal Veni grant from the Netherlands Organization for Scientific Research (NWO, grant number 91619115). Dr. Franke is supported by a personal Vici grant (grant number 016-130-669) from the Netherlands Organization for Scientific Research (NWO). Dr. Faraone is supported by the European Union’s Horizon 2020 research and innovation program under grant agreement No 965381; NIMH grants U01AR076092-01A1, 1R21MH1264940, R01MH116037; 1R01NS128535 – 01; Oregon Health and Science University, Otsuka Pharmaceuticals, Noven Pharmaceuticals Incorporated, and Supernus Pharmaceutical Company.

ORCID iD

Stephen V. Faraone

Supplemental Material

Supplemental material for this article is available online.

Data Availability

All data and models used in the current paper are made publicly available in the manuscript and/or supplementary materials.

Author Biographies

Yanli Zhang-James, MD/PhD, is an associate professor at the Department of Psychiatry, SUNY Upstate Medical University. Her research focuses on predictive modeling analytics in psychiatry.

Ali Shervin Razavi is an MD/PhD student at SUNY Upstate Medical University with an interest in pursuing research in machine learning applications in medicine.

Martine Hoogman, PhD, is an assistant professor/junior Principal Investigator (jPI) on a tenure track at the Department of Psychiatry and the Department of Human Genetics, Radboud University Medical Center in Nijmegen. She leads the largest international neuroimaging collaboration on ADHD, the ENIGMA-ADHD.

Barbara Franke, PhD, is the Chair of Molecular Psychiatry at Radboud University. She is an expert in the molecular genetics of neurodevelopmental disorders like ADHD and autism. She is an elected member of the Royal Netherlands Academy of Arts and Sciences, the Royal Holland Society of Sciences and Humanities, and of Academia Europaea.

Stephen V. Faraone, PhD, is a Distinguished Professor and Vice Chair for Research at the Department of Psychiatry, SUNY Upstate Medical University. He is the president of the World Federation of ADHD and a leading expert in ADHD research.

References

Akaike

(1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.

Almeida Montes

L. G.

Prado Alcántara

Martínez García

R. B.

De La Torre

L. B.

Avila Acosta

Duarte

M. G

. (2013). Brain cortical thickness in ADHD: Age, sex, and clinical correlations. Journal of Attention Disorders, 17(8), 641–654.

Aradhya

A. M. S.

Joglekar

Suresh

Pratama

(2019). Deep transformation method for discriminant analysis of multi-channel resting state fMRI. Proceedings of the AAAI Conference on Artificial Intelligence, 33(1), 2556–2563.

Arcadu

Benmansour

Maunz

Willis

Haskova

Prunotto

(2019). Deep learning algorithm predicts diabetic retinopathy progression in individual patients. npj Digital Medicine, 2(1), 92.

Ariyarathne

Silva

S. D.

Dayarathna

Meedeniya

Jayarathne

(2020). ADHD identification using convolutional neural network with seed-based approach for fMRI data [Conference session]. Proceedings of the 2020 9th International Conference on Software and Computer Applications, Langkawi, Malaysia. Association for Computing Machinery (pp. 31–35).

Bengio

Courville

Vincent

(2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798–1828.

Bhanumathi

Sangeetha

(Eds.) (2019, March 15–16). CNN based training and classification of MRI brain images. 2019 5th International Conference on Advanced Computing & Communication Systems (ICACCS); 2019.

Blagus

Lusa

(2013). SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics, 14(1), 106.

Boedhoe

P. S. W.

van Rooij

Hoogman

Twisk

J. W. R.

Schmaal

Abe

Alonso

Ameis

S. H.

Anikin

Anticevic

Arango

Arnold

P. D.

Asherson

Assogna

Auzias

Banaschewski

Baranov

Batistuzzo

M. C.

Baumeister

. . . van Den Heuvel

O. A.

(2020). Subcortical brain volume, regional cortical thickness, and cortical surface area across disorders: Findings from the ENIGMA ADHD, ASD, and OCD Working Groups. Am J Psychiatry, 177(9), 834–843.

10.

Bohland

J. W.

Saperstein

Pereira

Rapin

Grady

(2012). Network, anatomical, and non-imaging measures for the prediction of ADHD diagnosis in individual subjects. Frontiers in Systems Neuroscience, 6, 78.

11.

Brain

Webb

G. I.

(Eds.) (1999). On the effect of data set size on bias and variance in classification learning. Proceedings of the fourth Australian Knowledge Acquisition Workshop (AKAW '99). The University of New South Wales.

12.

Brown

M. R.

Sidhu

G. S.

Greiner

Asgarian

Bastani

Silverstone

P. H.

Greenshaw

A. J.

Dursun

S. M.

(2012). ADHD-200 global competition: Diagnosing ADHD using personal characteristic data can outperform resting state fMRI measurements. Frontiers in Systems Neuroscience, 6, 69.

13.

Bruchmüller

Margraf

Schneider

(2012). Is ADHD diagnosed in accord with diagnostic criteria? Overdiagnosis and influence of client gender on diagnosis. Journal of Consulting and Clinical Psychology, 80(1), 128–138.

14.

Castellanos

F. X.

Lee

P. P.

Sharp

Jeffries

N. O.

Greenstein

D. K.

Clasen

L. S.

Blumenthal

J. D.

James

R. S.

Ebens

C. L.

Walter

J. M.

Zijdenbos

Evans

A. C.

Giedd

J. N.

Rapoport

J. L.

(2002). Developmental trajectories of brain volume abnormalities in children and adolescents with attention-deficit/hyperactivity disorder. The Journal of the American Medical Association, 288(14), 1740–1748.

15.

Chaim-Avancini

T. M.

Doshi

Zanetti

M. V.

Erus

Silva

M. A.

Duran

F. L. S.

Cavallet

Serpa

M. H.

Caetano

S. C.

Louza

M. R.

Davatzikos

Busatto

G. F.

(2017). Neurobiological support to the diagnosis of ADHD in stimulant-naïve adults: Pattern recognition analyses of MRI data. Acta Psychiatrica Scandinavica, 136(6), 623–636.

16.

Chawla

N. V.

Bowyer

K. W.

Hall

L. O.

Kegelmeyer

W. P.

(Eds.) (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.

17.

Chen

Fan

Dillman

J. R.

Wang

Altaye

Zhang

Parikh

N. A.

(2022). ConCeptCNN: A novel multi-filter convolutional neural network for the prediction of neurodevelopmental disorders using brain connectome. Medical Physics, 49, 3171–3184.

18.

Chen

Tang

Wang

Liu

Zhao

Wang

(2020). ADHD classification by dual subspace learning using resting-state functional connectivity. Artificial Intelligence in Medicine, 103, 101786.

19.

Cheng

Zhang

Feng

(2012). Individual classification of ADHD patients by integrating multiscale neuroimaging markers and advanced pattern recognition techniques. Frontiers in Systems Neuroscience, 6, 58.

20.

Colby

J. B.

Rudie

J. D.

Brown

J. A.

Douglas

P. K.

Cohen

M. S.

Shehzad

(2012). Insights into multimodal imaging classification of ADHD. Frontiers in Systems Neuroscience, 6, 59.

21.

Corkum

P. V.

Siegel

L. S.

(1993). Is the continuous performance task a valuable research tool for use with children with attention-deficit-hyperactivity disorder? Journal of Child Psychology and Psychiatry, and Allied Disciplines, 34(7), 1217–1239.

22.

Dai

Wang

Hua

(2012). Classification of ADHD children through multimodal magnetic resonance imaging. Frontiers in Systems Neuroscience, 6, 63.

23.

Dalsgaard

Østergaard

S. D.

Leckman

J. F.

Mortensen

P. B.

Pedersen

M. G.

(2015). Mortality in children, adolescents, and adults with attention deficit hyperactivity disorder: A nationwide cohort study. Lancet, 385(9983), 2190–2196.

24.

Dane

A. V.

Schachar

R. J.

Tannock

(2000). Does actigraphy differentiate ADHD subtypes in a clinical research setting? Journal of the American Academy of Child and Adolescent Psychiatry, 39(6), 752–760.

25.

Davis

Goadrich

(2006). The relationship between Precision-Recall and ROC curves [Conference session]. Proceedings of the 23rd international conference on Machine learning, Pittsburgh, PA (pp. 233–240). Association for Computing Machinery.

26.

Demontis

Walters

R. K.

Martin

Mattheisen

Als

T. D.

Agerbo

, et al. (2019). Discovery of the first genome-wide significant risk loci for attention deficit/hyperactivity disorder. Nature Genetics, 51(1), 63–75.

27.

Deshpande

Wang

Rangaprakash

Wilamowski

(2015). Fully connected cascade artificial neural network architecture for attention deficit hyperactivity disorder classification from functional magnetic resonance imaging data. IEEE Transactions on Cybernetics, 45(12), 2668–2679.

28.

Dey

Rao

A. R.

Shah

(2014). Attributed graph distance measure for automatic detection of attention deficit hyperactive disordered subjects. Frontiers in Neural Circuits, 8, 64.

29.

Dickstein

S. G.

Bannon

Castellanos

F. X.

Milham

M. P.

(2006). The neural correlates of attention deficit hyperactivity disorder: An ALE meta-analysis. Journal of Child Psychology and Psychiatry, and Allied Disciplines, 47(10), 1051–1062.

30.

Wang

Jie

Zhang

(2016). Network-based classification of ADHD patients using discriminative subnetwork selection and graph kernel PCA. Computerized Medical Imaging and Graphics, 52, 82–88.

31.

Eloyan

Muschelli

Nebel

M. B.

Liu

Han

Zhao

Barber

A. D.

Joel

Pekar

J. J.

Mostofsky

S. H.

Caffo

(2012). Automated diagnoses of attention deficit hyperactive disorder using magnetic resonance imaging. Frontiers in Systems Neuroscience, 6, 61.

32.

Fair

D. A.

Nigg

J. T.

Iyer

Bathula

Mills

K. L.

Dosenbach

N. U. F.

Schlaggar

B. L.

Mennes

Gutman

Bangaru

Buitelaar

J. K.

Dickstein

D. P.

Di Martino

Kennedy

D. N.

Kelly

Luna

Schweitzer

J. B.

Velanova

Wang

Y. F.

. . . Milham

M. P.

(2013). Distinct neural signatures detected for ADHD subtypes after controlling for micro-movements in resting state functional connectivity MRI data. Frontiers in Systems Neuroscience, 6, 80.

33.

Faraone

S. V.

(2005). The scientific foundation for understanding attention-deficit/hyperactivity disorder as a valid psychiatric disorder. European Child & Adolescent Psychiatry, 14(1), 1–10.

34.

Faraone

S. V.

Asherson

Banaschewski

Biederman

Buitelaar

J. K.

Ramos-Quiroga

J. A.

Rohde

L. A.

Sonuga-Barke

E. J.

Tannock

Franke

(2015). Attention-deficit/hyperactivity disorder. Nature Reviews Disease Primers, 1, 15020.

35.

Faraone

S. V.

Biederman

(2000). Nature, nurture, and attention deficit hyperactivity disorder. Developmental Review, 20, 568–581.

36.

Faraone

S. V.

Biederman

Mick

(2006). The age-dependent decline of attention deficit hyperactivity disorder: A meta-analysis of follow-up studies. Psychological Medicine, 36(2), 159–165.

37.

Faraone

S. V.

Bonvicini

Scassellati

(2014). Biomarkers in the diagnosis of ADHD–promising directions. Current Psychiatry Reports, 16(11), 497.

38.

Faraone

S. V.

Newcorn

J. H.

Antshel

K. M.

Adler

Roots

Heller

(2016). The Groundskeeper Gaming Platform as a diagnostic tool for attention-deficit/hyperactivity disorder: Sensitivity, specificity, and relation to other measures. Journal of Child and Adolescent Psychopharmacology, 26(8), 672–685.

39.

Fawcett

(2004). ROC graphs: Notes and practical considerations for researchers. Machine Learning, 31, 1–38.

40.

Franke

Michelini

Asherson

Banaschewski

Bilbow

Buitelaar

J. K.

Cormand

Faraone

S. V.

Ginsberg

Haavik

Kuntsi

Larsson

Lesch

K. P.

Ramos-Quiroga

J. A.

Réthelyi

J. M.

Ribases

Reif

(2018). Live fast, die young? A review on the developmental trajectories of ADHD across the lifespan. European Neuropsychopharmacology, 28(10), 1059–1088.

41.

Ghiassian

Greiner

Jin

Brown

M. R.

(2016). Using functional or structural magnetic resonance images and personal characteristic data to identify ADHD and autism. PLoS One, 11(12), e0166934.

42.

Ginsberg

Quintero

Anand

Casillas

Upadhyaya

H. P.

(2014). Underdiagnosis of attention-deficit/hyperactivity disorder in adult patients: A review of the literature. The Primary Care Companion for CNS Disorders, 16(3), eng1015475322155.

43.

Hamshere

M. L.

Langley

Martin

Agha

S. S.

Stergiakouli

Anney

R. J. L.

Buitelaar

Faraone

S. V.

Lesch

K. P.

Neale

B. M.

Franke

Sonuga-Barke

Asherson

Merwood

Kuntsi

Medland

S. E.

Ripke

Steinhausen

H. C.

Freitag

. . . Thapar

(2013). High loading of polygenic risk for ADHD in children with comorbid aggression. American Journal of Psychiatry, 170(8), 909–916.

44.

Hao

Yin

(2015). Discrimination of ADHD children based on Deep Bayesian Network [Conference session]. IET International Conference on Biomedical Image and Signal Processing (ICBISP 2015).

45.

Hart

Chantiluke

Cubillo

A. I.

Smith

A. B.

Simmons

Brammer

M. J.

Marquand

A. F.

Rubia

(2014). Pattern classification of response inhibition in ADHD: Toward the development of neurobiological markers for ADHD. Human Brain Mapping, 35(7), 3083–3094.

46.

(2013). Imbalanced Learning: Foundations, Algorithms, and Applications. Wiley-IEEE Press.

47.

Homack

Riccio

C. A.

(2006). Conners’ Continuous Performance Test (2nd ed.; CCPT-II). Journal of Attention Disorders, 9(3), 556–558.

48.

Hoogman

Bralten

Hibar

D. P.

Mennes

Zwiers

M. P.

Schweren

L. S. J.

(2017). Subcortical brain volume differences in participants with attention deficit hyperactivity disorder in children and adults: A cross-sectional mega-analysis. Lancet Psychiatry, 4(4), 310–319.

49.

Hoogman

Muetzel

Guimaraes

J. P.

Shumskaya

Mennes

Zwiers

M. P.

, et al. (2019). Brain imaging of the cortex in ADHD: A coordinated analysis of large-scale clinical and population-based samples. American Journal of Psychiatry, 176(7), 531–542.

50.

Iannaccone

Hauser

T. U.

Ball

Brandeis

Walitza

Brem

(2015). Classifying adolescent attention-deficit/hyperactivity disorder (ADHD) based on functional and structural imaging. European Child & Adolescent Psychiatry, 24(10), 1279–1289.

51.

Igual

Soliva

J. C.

Escalera

Gimeno

Vilarroya

Radeva

(2012). Automatic brain caudate nuclei segmentation and classification in diagnostic of Attention-Deficit/Hyperactivity Disorder. Computerized Medical Imaging and Graphics, 36(8), 591–600.

52.

Iqbal

Ghani

M. U.

Saba

Rehman

(2018). Brain tumor segmentation in multi-spectral MRI using convolutional neural networks (CNN). Microscopy Research and Technique, 81(4), 419–427.

53.

Jie

Wee

C. Y.

Shen

Zhang

(2016). Hyper-connectivity of functional networks for brain disease diagnosis. Medical Image Analysis, 32, 84–100.

54.

Johnston

B. A.

Mwangi

Matthews

Coghill

Konrad

Steele

J. D.

(2014). Brainstem abnormalities in attention deficit hyperactivity disorder support high accuracy individual diagnostic classification. Human Brain Mapping, 35(10), 5179–5189.

55.

Joseph

Zhang-James

Perl

Faraone

S. V.

(2015). Oxidative stress and ADHD: A meta-analysis. Journal of Attention Disorders, 19(11), 915–924.

56.

Kriegeskorte

Simmons

W. K.

Bellgowan

P. S.

Baker

C. I.

(2009). Circular analysis in systems neuroscience: The dangers of double dipping. Nature Neuroscience, 12(5), 535–540.

57.

Kuang

Guo

Zhao

(2014). Discrimination of ADHD based on fMRI data with deep belief network [Conference session]. Intelligent Computing in Bioinformatics (ICIC 2014).

58.

Lambert

N. M.

Hartsough

C. S.

(1998). Prospective study of tobacco smoking and substance dependencies among samples of ADHD and non-ADHD participants. Journal of Learning Disabilities, 31(6), 533–544.

59.

Lanka

Rangaprakash

Dretsch

M. N.

Katz

J. S.

Denney

T. S., Jr.

Deshpande

(2020). Supervised machine learning for diagnostic classification from large-scale neuroimaging datasets. Brain Imaging and Behavior, 14(6), 2378–2416.

60.

Levene

(1960). Robust tests for equality of variances. In Olkin

Ghurye

S. G.

Hoeffding

Madow

W. G.

Mann

H. B.

(Eds.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 278–292). Stanford University Press.

61.

Lichtenstein

Halldner

Zetterqvist

Sjölander

Serlachius

Fazel

Långström

Larsson

(2012). Medication for attention deficit-hyperactivity disorder and criminality. New England Journal of Medicine, 367(21), 2006–2014.

62.

Lim

Marquand

Cubillo

A. A.

Smith

A. B.

Chantiluke

Simmons

Mehta

Rubia

(2013). Disorder-specific predictive classification of adolescents with attention deficit hyperactivity disorder (ADHD) relative to autism using structural magnetic resonance imaging. PLoS One, 8(5), e63660.

63.

Lin

Tong

Gao

Guo

Yang

Guo

Xiao

(2018). Convolutional neural networks-based MRI image analysis for the Alzheimer's disease prediction from mild cognitive impairment. Frontiers in Neuroscience, 12, 777.

64.

Mackie

Shaw

Lenroot

Pierson

Greenstein

D. K.

Nugent

T. F., 3rd

Sharp

W. S.

Giedd

J. N.

Rapoport

J. L.

(2007). Cerebellar development and clinical outcome in attention deficit hyperactivity disorder. American Journal of Psychiatry, 164(4), 647–655.

65.

Mao

Wang

Huang

Yue

Sun

Xiong

(2019). Spatio-temporal deep learning method for ADHD fMRI classification. Information Sciences, 499, 1–11.

66.

McNorgan

Judson

Handzlik

Holden

J. G.

(2020). Linking ADHD and behavioral assessment through identification of shared diagnostic task-based functional connections. Frontiers in Physiology, 11, 583005.

67.

Olivetti

Greiner

Avesani

(2012). ADHD diagnosis from multiple data sources with batch effects. Frontiers in Systems Neuroscience, 6, 70.

68.

Onnink

A. M.

Zwiers

M. P.

Hoogman

Mostert

J. C.

Kan

C. C.

Buitelaar

Franke

(2014). Brain alterations in adult ADHD: Effects of gender, treatment and comorbid depression. European Neuropsychopharmacology, 24(3), 397–409.

69.

Ottosen

Larsen

J. T.

Faraone

S. V.

Chen

Hartman

Larsson

Petersen

Dalsgaard

(2019). Sex differences in comorbidity patterns of attention-deficit/hyperactivity disorder. Journal of the American Academy of Child and Adolescent Psychiatry, 58(4), 412–422.e3.

70.

Peng

Debnath

Biswas

A. K.

(2021). Efficacy of novel summation-based synergetic artificial neural network in ADHD diagnosis. Machine Learning with Applications, 6, 100120.

71.

Peng

Lin

Zhang

Wang

(2013). Extreme learning machine-based classification of ADHD using brain structural MRI data. PLoS One, 8(11), e79476.

72.

Pulini

A. A.

Kerr

W. T.

Loo

S. K.

Lenartowicz

(2019). Classification accuracy of neuroimaging biomarkers in attention-deficit/hyperactivity disorder: Effects of sample size and circular analysis. Biological Psychiatry Cognitive Neuroscience and Neuroimaging, 4(2), 108–120.

73.

Qureshi

M. N.

Min

H. J.

Lee

(2016). Multiclass classification for the differential diagnosis on the ADHD subtypes using recursive feature elimination and hierarchical extreme learning machine: Structural MRI study. PLoS One, 11(8), e0160697.

74.

Qureshi

M. N. I.

Min

H. J.

Lee

(2017). Multi-modal, multi-measure, and multi-class discrimination of ADHD with hierarchical feature extraction and extreme learning machine using structural and functional brain MRI. Frontiers in Human Neuroscience, 11, 292.

75.

Reiersen

A. M.

Todorov

A. A.

(2013). Exploration of ADHD subtype definitions and co-occurring psychopathology in a Missouri population-based large sibship sample. Scandinavian Journal of Child and Adolescent Psychiatry and Psychology, 1(1), 3–13.

76.

Riaz

Asad

Alonso

Slabaugh

(2018). Fusion of fMRI and non-imaging data for ADHD classification. Computerized Medical Imaging and Graphics, 65, 115–128.

77.

Riaz

Asad

Arif

S. M. M. R. A.

Alonso

Dima

Corr

Slabaugh

(Eds.) (2017). FCNet: A convolutional neural network for calculating functional connectivity from functional MRI. Connectomics in NeuroImaging CNI 2017 Lecture Notes in Computer Science. Springer.

78.

Riaz

Asad

Arif

S. M. M. R. A.

Alonso

Dima

Corr

Slabaugh

(2018). Deep fMRI: An end-to-end deep network for classification of fMRI data [Symposium]. 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), Washington, DC (pp. 1419–1422).

79.

Riccio

C. A.

Reynolds

C. R.

(2001). Continuous performance tests are sensitive to ADHD in adults but lack specificity. A review and critique for differential diagnosis. Annals of the New York Academy of Sciences, 931, 113–139.

80.

Riglin

Collishaw

Thapar

A. K.

Dalsgaard

Langley

Smith

G. D.

Stergiakouli

Maughan

O’Donovan

M. C.

Thapar

(2016). Association of genetic risk variants with attention-deficit/hyperactivity disorder trajectories in the general population. JAMA Psychiatry, 73(12), 1285–1292.

81.

Ritsner

M. S.

(Ed.) (2009). Neuropsychological endophenotypes and biomarkers. Springer Netherlands.

82.

Sato

J. R.

Hoexter

M. Q.

Fujita

Rohde

L. A.

(2012). Evaluation of pattern recognition and feature extraction methods in ADHD prediction. Frontiers in Systems Neuroscience, 6, 68.

83.

Scassellati

Bonvicini

Faraone

S. V.

Gennarelli

(2012). Biomarkers and attention-deficit/hyperactivity disorder: A systematic review and meta-analyses. Journal of the American Academy of Child and Adolescent Psychiatry, 51(10), 1003–1019.e20.

84.

Schwarz

(1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.

85.

Seidman

L. J.

Valera

E. M.

Makris

(2005). Structural brain imaging of attention-deficit/hyperactivity disorder. Biological Psychiatry, 57(11), 1263–1272.

86.

Seidman

L. J.

Valera

E. M.

Makris

Monuteaux

M. C.

Boriel

D. L.

Kelkar

Kennedy

D. N.

Caviness

V. S.

Bush

Aleardi

Faraone

S. V.

Biederman

(2006). Dorsolateral prefrontal and anterior cingulate cortex volumetric abnormalities in adults with attention-deficit/hyperactivity disorder identified by magnetic resonance imaging. Biological Psychiatry, 60(10), 1071–1080.

87.

Semrud-Clikeman

Hooper

S. R.

Hynd

G. W.

Hern

Presley

Watson

(1996). Prediction of group membership in developmental dyslexia, attention deficit hyperactivity disorder, and normal controls using brain morphometric analysis of magnetic resonance imaging. Archives of Clinical Neuropsychology, 11(6), 521–528.

88.

Sen

Borle

N. C.

Greiner

Brown

M. R. G.

(2018). A general prediction model for the detection of ADHD and autism using structural and functional MRI. PLoS One, 13(4), e0194856.

89.

Shao

(2018). Classification of ADHD with bi-objective optimization. Journal of Biomedical Informatics, 84, 164–170.

90.

Shaw

De Rossi

Watson

Wharton

Greenstein

Raznahan

Sharp

Lerch

J. P.

Chakravarty

M. M.

(2014). Mapping the development of the basal ganglia in children with attention-deficit/hyperactivity disorder. Journal of the American Academy of Child and Adolescent Psychiatry, 53(7), 780–789.e11.

91.

Shaw

Lerch

Greenstein

Sharp

Clasen

Evans

Giedd

Castellanos

F. X.

Rapoport

(2006). Longitudinal mapping of cortical thickness and clinical outcome in children and adolescents with attention-deficit/hyperactivity disorder. Archives of General Psychiatry, 63(5), 540–549.

92.

Sidhu

G. S.

Asgarian

Greiner

Brown

M. R.

(2012). Kernel principal component analysis for dimensionality reduction in fMRI-based diagnosis of ADHD. Frontiers in Systems Neuroscience, 6, 74.

93.

Smith

A. B.

Taylor

Brammer

Toone

Rubia

(2006). Task-specific hypoactivation in prefrontal and temporoparietal brain regions during motor inhibition and task switching in medication-naive children and adolescents with attention deficit hyperactivity disorder. American Journal of Psychiatry, 163(6), 1044–1051.

94.

Snyder

S. M.

Rugino

T. A.

Hornig

Stein

M. A.

(2015). Integration of an EEG biomarker with a clinician’s ADHD evaluation. Brain and Behavior, 5, e00330.

95.

Solé Puig

Pérez Zapata

Puigcerver

Esperalba Iglesias

Sanchez Garcia

Romeo

Cañete Crespillo

Supèr

(2015). Attention-related eye vergence measured in children with attention deficit hyperactivity disorder. PLoS One, 10(12), e0145281.

96.

Stanley

E. A.

Rajashekar

Mouches

Wilms

Plettl

Forkert

(2022). A fully convolutional neural network for explainable classification of attention deficit hyperactivity disorder. SPIE.

97.

StataCorp. (2019). Stata Statistical Software: Release 16. StataCorp LP.

98.

Tan

Guo

Ren

Epstein

J. N.

L. J.

(2017). A computational model for the automatic diagnosis of attention deficit hyperactivity disorder based on functional brain volume. Frontiers in Computational Neuroscience, 11, 75.

99.

Tang

Wang

Chen

Sun

Jiang

Wang

(2019). Identifying ADHD individuals from resting-state functional connectivity using subspace clustering and binary hypothesis testing. Journal of Attention Disorders, 25(5), 736–748.

100.

The ADHD-200 Consortium. (2012). The ADHD-200 Consortium: A model to advance the translational potential of neuroimaging in clinical neuroscience. Frontiers in Systems Neuroscience, 6, 62.

101.

The Express Scripts Lab. (2014). Turning attention to ADHD: U.S. medication trends for attention deficit hyperactivity disorder. Web.

102.

Thome

Ehlis

A. C.

Fallgatter

A. J.

Krauel

Lange

K. W.

Riederer

Romanos

Taurines

Tucha

Uzbekov

Gerlach

(2012). Biomarkers for attention-deficit/hyperactivity disorder (ADHD). A consensus report of the WFSBP task force on biological markers and the World Federation of ADHD. The World Journal of Biological Psychiatry, 13(5), 379–400.

103.

Thompson

P. M.

Andreassen

O. A.

Arias-Vasquez

Bearden

C. E.

Boedhoe

P. S.

Brouwer

R. M.

Buckner

R. L.

Buitelaar

J. K.

Bulayeva

K. B.

Cannon

D. M.

Cohen

R. A.

Conrod

P. J.

Dale

A. M.

Deary

I. J.

Dennis

E. L.

de Reus

M. A.

Desrivieres

Dima

Donohoe

. . . Ye

(2017). ENIGMA and the individual: Predicting factors that affect the brain in 35 countries worldwide. NeuroImage, 145, 389–408.

104.

Thompson

P. M.

Jahanshad

Ching

C. R. K.

Salminen

L. E.

Thomopoulos

S. I.

Bright

Baune

B. T.

Bertolín

Bralten

Bruin

W. B.

Bülow

Chen

Chye

Dannlowski

de Kovel

C. G. F.

Donohoe

Eyler

L. T.

Faraone

S. V.

Favre

. . . Grabe

H. J.

(2020). ENIGMA and global neuroscience: A decade of large-scale studies of the brain in health and disease across more than 40 countries. Translational Psychiatry, 10(1), 100.

105.

Thompson

P. M.

Stein

J. L.

Medland

S. E.

Hibar

D. P.

Vasquez

A. A.

Renteria

M. E.

Toro

Jahanshad

Schumann

Franke

Wright

M. J.

Martin

N. G.

Agartz

Alda

Alhusaini

Almasy

Almeida

Alpert

Andreasen

N. C.

. . . Aribisala

(2014). The ENIGMA Consortium: Large-scale collaborative analyses of neuroimaging and genetic data. Brain Imaging and Behavior, 8(2), 153–182.

106.

Tian

Jiang

Wang

Zang

Liang

Sui

Cao

Peng

Zhuo

(2006). Altered resting-state functional connectivity patterns of anterior cingulate cortex in adolescents with attention deficit hyperactivity disorder. Neuroscience Letters, 400(1-2), 39–43.

107.

Toyonaga

Shiga

Hirata

Yamaguchi

Takeuchi

Kudo

, et al. (2017). Convolutional neural network (CNN) of MRI and FDG-PET images may predict hypoxia in glioblastoma. Journal of Nuclear Medicine, 58(supplement 1), 699.

108.

Vabalas

Gowen

Poliakoff

Casson

A. J.

(2019). Machine learning algorithm validation with a limited sample size. PLoS One, 14(11), e0224365.

109.

Valera

E. M.

Faraone

S. V.

Murray

K. E.

Seidman

L. J.

(2007). Meta-analysis of structural imaging findings in attention-deficit/hyperactivity disorder. Biological Psychiatry, 61(12), 1361–1369.

110.

Visser

S. N.

Danielson

M. L.

Bitsko

R. H.

Holbrook

J. R.

Kogan

M. D.

Ghandour

R. M.

Perou

Blumberg

S. J.

(2014). Trends in the parent-report of health care provider-diagnosed and medicated attention-deficit/hyperactivity disorder: United States, 2003-2011. Journal of the American Academy of Child and Adolescent Psychiatry, 53(1), 34–46.e2.

111.

Wang

Jiao

Tang

Wang

(2013). Altered regional homogeneity patterns in adults with attention-deficit hyperactivity disorder. European Journal of Radiology, 82(9), 1552–1557.

112.

Wang

X. H.

Jiao

(2018). Diagnostic model for attention-deficit hyperactivity disorder based on interregional morphological connectivity. Neuroscience Letters, 685, 30–34.

113.

Wang

Zhu

Shi

Zhang

Yan

(2021). A 3D multiscale view convolutional neural network with attention for mental disease diagnosis on MRI images. Mathematical Biosciences and Engineering, 18, 6978–6994.

114.

Wolfers

Buitelaar

J. K.

Beckmann

C. F.

Franke

Marquand

A. F.

(2015). From estimating activation locality to predicting disorder: A review of pattern recognition for neuroimaging-based psychiatric diagnostics. Neuroscience and Biobehavioral Reviews, 57, 328–349.

115.

Xiao

Bledsoe

Wang

Chaovalitwongse

W. A.

Mehta

Semrud-Clikeman

Grabowski

(2016). An integrated feature ranking and selection framework for ADHD characterization. Brain Informatics, 3(3), 145–155.

116.

Yao

Guo

Zhao

Liu

Cao

Wang

Y., D

Calhoun

Sun

Sui

(2018). Discriminating ADHD from healthy controls using a novel feature selection method based on relative importance and ensemble learning. Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2018, 4632–4635.

117.

Yoo

J. H.

Kim

J. I.

Kim

B. N.

Jeong

(2020). Exploring characteristic features of attention-deficit/hyperactivity disorder: Findings from multi-modal MRI and candidate genetic data. Brain Imaging and Behavior, 14(6), 2132–2147.

118.

Zhang-James

Chen

Kuja-Halkola

Lichtenstein

Larsson

Faraone

S. V.

(2020). Machine-learning prediction of comorbid substance use disorders in ADHD youth using Swedish registry data. Journal of Child Psychology and Psychiatry, 61, 1370–1379.

119.

Zhang-James

Helminen

E. C.

Liu

ENIGMA-ADHD Working Group

Franke

Hoogman

Faraone

S. V.

(2021). Evidence for similar structural brain anomalies in youth and adult attention-deficit/hyperactivity disorder: A machine learning analysis. Translational Psychiatry, 11, 82.

120.

Zhu

C. Z.

Zang

Y. F.

Cao

Q. J.

Yan

C. G.

Jiang

T. Z.

Sui

M. Q.

Wang

Y. F.

(2008). Fisher discriminative analysis of resting-state brain function for attention-deficit/hyperactivity disorder. NeuroImage, 40(1), 110–120.

121.

Zhu

C. Z.

Zang

Y. F.

Liang

Tian

L. X.

X. B.

Sui

M. Q.

Wang

Y. F.

Jiang

T. Z.

(2005). Discriminative analysis of brain function at resting-state for attention-deficit/hyperactivity disorder. Medical Image Computing and Computer-Assisted Intervention, 8(Pt 2), 468–475.

122.

Zou

Zheng

Miao

Mckeown

M. J.

Wang

Z. J.

(2017). 3D CNN based automatic diagnosis of attention deficit hyperactivity disorder using functional and structural MRI. IEEE Access, 5, 23626–23636.

123.

Gao

Munsell

Kim

Peng

Cohen

J. R.

Zhang

(2019). Identifying disease-related subnetwork connectome biomarkers by sparse hypergraph learning. Brain Imaging and Behavior, 13(4), 879–892.

Supplementary Material

Please find the following supplemental material available below.

