Event-Related Potentials for Diagnosing Children and Adults With ADHD

Abstract

Objective: The diagnosis of ADHD is based on behavioral criteria, which allow for subjective variability and invite criticism regarding the reality of the disorder. In this situation, more objective criteria would be desirable. We review the scientific literature for diagnostic tests based on event-related potentials (ERPs). Method: Seven studies met the inclusion criteria of reporting the sensitivity and specificity of an ERP-based classifier discriminating participants with ADHD from healthy controls. Study quality was rated using the second version of the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) system. Results: Overall, study quality was acceptable. The largest biases were lack of representativeness and overfitting. Sensitivities and specificities ranged from 57% to 96%, and 63% to 92%, respectively. However, no two studies used the same diagnostic test. Conclusion: There is a serious lack of coordination in worldwide efforts to find more objective ERP-based criteria for the diagnosis of ADHD. Concerted action is needed.

Keywords

ADHD; ERP; diagnostic test; review; QUADAS-2

Introduction

ADHD affects about 4% of children and adults worldwide (Fayyad et al., 2007; Polanczyk & Rohde, 2007). It is diagnosed on the basis of psychological and behavioral criteria, as specified in the Diagnostic and Statistical Manual of Mental Disorders (DSM) and the International Classification of Diseases (ICD) and implemented in several rating scales (Snyder, Hall, Cornwell, & Quintana, 2006; Taylor, Deb, & Unwin, 2011). At present, there are no validated, reliable biomarkers for ADHD (Thome et al., 2012).

The fact that ADHD is still a controversial diagnosis among parts of the public and the medical establishment has made the search for such markers more pressing, as they might provide uncontentious evidence for a physical basis of the disorder. Electroencephalography (EEG) and event-related potentials (ERPs) seem particularly promising, as brain wave abnormalities have been widely observed in children with ADHD (Barry, Clarke, & Johnstone, 2003; Barry, Johnstone, & Clarke, 2003) and can be measured cheaply, noninvasively, and on-site.

This study reviews the literature on ERP-based markers for ADHD with the specific aim of assessing their usefulness as diagnostic tests.

Method

We searched MEDLINE, EMBASE, Web of Science Core Collection, BIOSIS, and Current Contents Connect (CCC) using the OVID SP web interface and the following search string:

(adhs or adhd or “attention deficit hyperactivity disorder” or “attention deficit with hyperactivity disorder”) AND (“evoked potential” or “evoked potentials” or “erp” or “event-related potential” or “event-related potentials” or p300 or p3 or p3a or p3b or “contingent negative variation” or “error-related negativity” or “early left anterior negativity” or “late positive component” or p600 or “syntactic positive shift” or “early posterior negativity” or “early directing-attention negativity” or “anterior directing-attention negativity” or “mismatch negativity” or “bereitschaftspotential” or “readiness potential”) AND (specificity or “receiver operating characteristic” or roc or classif* or discriminat*).

The abstracts of the search results were scanned for studies meeting, or appearing to meet, the following inclusion criteria:

Participants diagnosed with ADHD using DSM or ICD criteria;

Healthy control group;

ERP-based measure(s) used as index test to discriminate ADHD from controls; studies using the index test to discriminate between more than two groups (e.g., dyslexia, ADHD, and controls) were excluded;

Sensitivity and specificity of the index test, and the corresponding numbers of participants, reported or derivable from the report.

Full texts were obtained for all eligible or possibly eligible abstracts, and studies not meeting the criteria were then excluded. In addition, the reference lists of all full texts were checked for further studies meeting the requirements. This process was conducted independently by the two authors, and disagreements were resolved by consensus.

The first author (A.G.) assessed the methodological quality of the studies using criteria from the second version of the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool (Whiting, 2011). Seven items were rated, each with the three response options “yes,” “no,” and “unclear,” for which strict application criteria were defined. All items were phrased so that a positive answer (“yes”) always signaled the absence of bias. Because not all types of bias are equally serious for the present set of studies, we added an importance weighting to the QUADAS-2 items (see next section). Table 1 lists the quality items, item weightings, response options, and application criteria.

Table 1.

Items Used to Assess Study Quality.

Item	Signaling question	Application criteria	Item weighting*
Spectrum bias	Was the spectrum of patients representative of the patients who will receive the test in practice?	Yes: sample representative; No: sample not representative; Unclear: not enough information to determine representativeness	2
Verification bias	Is the reference standard likely to classify the target condition correctly?	Yes: all reference standards are valid diagnostic procedures for ADHD; No: at least one reference standard is unlikely to diagnose ADHD correctly; Unclear: reference standard unclear	2
Recovery bias	Is the time period between reference standard and index test short enough to be reasonably sure that the target condition did not change between the two tests?	Yes: interval shorter than 6 months; No: interval longer than 6 months; Unclear: interval not specified	1
Partial verification bias	Did the whole sample, or a random selection of the sample, receive verification using the intended reference standard?	Yes: all participants received a reference standard; No: not all participants received a reference standard; Unclear: not enough information	2
Differential verification bias	Did patients receive the same reference standard irrespective of the index test result?	Yes: same reference standard in all participants; No: not the same reference standard in all participants; Unclear: not enough information	1
Exclusions	Were withdrawals/exclusions from the study explained?	Yes: it is clear what happened to all participants entering the study (i.e., any dropouts are reported); No: dropouts or exclusions occurred but are not accounted for; Unclear: unclear how many participants entered and whether there were dropouts	1
Overfitting	Did the study involve an independent test sample to avoid overfitting?	Yes: an independent sample was used to test the classifier; No: no independent sample was used; Unclear: use of an independent test sample cannot be determined	2

* 2 = of primary importance; 1 = of secondary importance.

Item Weightings

Spectrum bias, verification bias, partial verification bias, and overfitting were rated as most important. Spectrum bias means that a test evaluated in an unrepresentative sample might not perform as well in the population in which it will actually be used. Verification bias means that the reference standard may not classify the target disorder correctly, so that the index test is judged against a flawed benchmark. Partial verification means that the disorder has not been assessed in all participants, making estimates of test accuracy unreliable. Overfitting means that the test’s performance in new samples may be much worse than implied by the nominal estimates of classification accuracy. These four kinds of bias, alone or in combination, can partly or even wholly invalidate a test.

Recovery bias, differential verification bias, and exclusions of participants after recruitment were rated as of secondary importance. Recovery bias is less important because ADHD is a chronic condition whose nature changes only slowly, if at all, so even intervals of several months between reference standard and index test are unlikely to bias index test accuracy. In the present set of studies, differential verification only arose when controls did not receive the same reference standards as ADHD patients. Assuming that confirming the absence of ADHD symptoms in controls is not very error-prone, even when done with less rigorous instruments than those used to evaluate ADHD patients, the effect of this bias should be relatively small. Excluding recruited participants after the fact may render samples less representative of the target population, but given that spectrum bias already looms large, the added effect of exclusions should be small and could even make the remaining participants more representative.

Results

The search yielded a total of 311 papers, whose abstracts were scanned by the authors. Forty-nine papers were selected for inspection of full texts, of which seven met all inclusion criteria and were selected for final review. These represented 284 ADHD patients and 219 controls. Figure 1 shows the details of the search strategy.

Figure 1.

Search strategy and results.

Five of the seven studies were published after 2000, and all used DSM-III/DSM-III-R/DSM-IV (American Psychiatric Association, 1980, 1987, 1994) criteria as the reference standard. Patient groups consisted of children in three studies, children and adolescents in two studies, and adults in two studies. ADHD and control participants were well matched for age and sex, but overall, male participants outnumbered female participants by about 2.5:1. In all studies, participants performed a task while ERPs were recorded. Importantly, however, the studies all differed in their index tests, that is, the experimental tests used to classify participants into the ADHD or control group. All studies used some kind of machine learning algorithm, most of them based on peak amplitudes and latencies of various ERP components, but the details of these methods differed, sometimes considerably. We therefore did not conduct a meta-analysis of the results and, for the same reason, did not assess study heterogeneity. An overview of the studies, including year of publication, country, patient and control groups, task condition used for classification, classification method (index test), and reference standard, is shown in Table 2.
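Because most of the reviewed classifiers start from peak amplitudes and latencies, a minimal sketch of that feature-extraction step may help. The Python function below finds the largest positive or negative deflection of an averaged waveform inside a latency window. The sampling rate, epoch start, window limits, and the synthetic N1-like and P3-like waveform are illustrative assumptions, not parameters taken from any reviewed study; real pipelines also filter, reject artifacts, baseline-correct, and average many trials before this step.

```python
import math


def peak_in_window(waveform, fs_hz, t0_ms, window_ms, polarity=1):
    """Return (latency_ms, amplitude) of the peak inside window_ms.

    waveform  -- averaged ERP samples (e.g., in microvolts)
    fs_hz     -- sampling rate in Hz
    t0_ms     -- time of the first sample relative to stimulus onset
    window_ms -- (start, end) latency window to search, inclusive
    polarity  -- +1 for a positive component (e.g., P3), -1 for a negative one
    """
    start, end = window_ms
    best = None
    for i, value in enumerate(waveform):
        t = t0_ms + 1000.0 * i / fs_hz
        if start <= t <= end and (best is None or polarity * value > polarity * best[1]):
            best = (t, value)
    if best is None:
        raise ValueError("latency window lies outside the recorded epoch")
    return best


# Hypothetical averaged waveform: 500 Hz, epoch from -200 to 800 ms,
# with an N1-like trough at 100 ms and a P3-like peak at 350 ms.
FS_HZ, T0_MS = 500, -200
times = [T0_MS + 1000.0 * i / FS_HZ for i in range(501)]
erp = [
    10.0 * math.exp(-0.5 * ((t - 350.0) / 60.0) ** 2)
    - 5.0 * math.exp(-0.5 * ((t - 100.0) / 20.0) ** 2)
    for t in times
]

p3_latency, p3_amplitude = peak_in_window(erp, FS_HZ, T0_MS, (250, 500), polarity=1)
n1_latency, n1_amplitude = peak_in_window(erp, FS_HZ, T0_MS, (50, 150), polarity=-1)
print(f"P3: {p3_amplitude:.2f} uV at {p3_latency:.0f} ms")
print(f"N1: {n1_amplitude:.2f} uV at {n1_latency:.0f} ms")
```

Searching for the extreme value within a fixed window is the simplest peak-picking rule; it is sensitive to noise in single subjects, which is one reason studies differ in how they define component measures.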

Table 2.

Characteristics of Included Studies.

Study	Year	Country	ADHD group	Control group	Test condition*	Index test	Reference standard
Castro-Cabrera, Gomez-Garcia, Restrepo, Moscoso, and Castellanos-Dominguez (2010)	2010	Colombia	23 children, 4-15 years, sex ?	23 children, 4-15 years, sex ?	Visual and auditory oddball tasks	SVM classification of independent components of wavelet, morphological, and nonlinear dynamic ERP features	DSM-IV criteria
Dickhaus and Heinrich (1996)	1996	?	25 children, age ?, 25 M	25 children, age ?, 25 M	Selective attention to high (vs. low) tones in right (vs. left) ear	Wavelet/neural network classification of time-frequency domain and scalp topography of ERPs	DSM-III-R criteria
Heinrich, Hoegl, Moll, and Kratz (2014)	2014	Germany	19 children and adolescents, 9-14 years, 16 M, 3 F	21 children and adolescents, 9-14 years, 16 M, 5 F	Visual Go/NoGo task (Go and NoGo responses) during TMS	LDA of P350 peak latency and prestimulus TMS-elicited SICI	DSM-IV combined subtype assessed with the German FBB-HKS, with scores ≥1 on the “inattention” and “hyp/imp” subscales
Müller et al. (2011)	2011	Switzerland	Training sample: 75 adults, 36 ± 8 years, 38 M, 37 F; test sample: 17 adults, 31 ± 9 years, 6 M, 11 F	75 adults, 36 ± 9 years, 38 M, 37 F	Visual Go/NoGo task (Go and NoGo responses)	SVM classification based on latency, peak amplitude, and average amplitude of independent ERP components	DSM-IV criteria using Barkley’s Current Symptoms Scale (self- and parent/partner-administered), Childhood Symptoms Scale, and structured clinical interview by trained psychologists
Robaey, Breton, Dugas, and Renault (1992)	1992	France	12 children, 6-8 years, 12 M	12 children, 6-8 years, 12 M	Pictorial and semantic classification tasks, geometric and numerical seriation tasks (correct responses to targets and nontargets)	LDA of mean amplitude and latency of ERP components N150-P250, N250-P350, and P500, separately for left and right montages	DSM-III-R criteria using CPRS and CTRS
Shahaf et al. (2012)	2012	Israel	13 adults, 25 ± 4 years, sex ?	13 adults, 25 ± 4 years, sex ?	Correct NoGo trials	Clustering based on similarity of frequency–amplitude–electrode patterns	DSM-IV combined subtype assessed with a translated version of Conners’ Adult ADHD Rating Scales
Smith, Johnstone, and Barry (2003)	2003	Australia	100 children and adolescents, 8-18 years, 80 M, 20 F	50 children and adolescents, 8-18 years, 40 M, 10 F	Auditory oddball task	Discriminant function analysis of ERP component amplitudes and latencies	DSM-IV criteria using CPRS-48, CTRS-48, CBCL, and developmental interview of parent(s) by an experienced psychologist

Note. ? = no information available; CBCL = Child Behavior Checklist; CPRS = Conners’ Parent Rating Scale; CTRS = Conners’ Teacher Rating Scale; DSM-III-R = Diagnostic and Statistical Manual of Mental Disorders (3rd ed., rev.); DSM-IV = Diagnostic and Statistical Manual of Mental Disorders (4th ed.); ERP = event-related potential; FBB-HKS = “Fremdbeurteilungsbogen für Hyperkinetische Störungen” (German third-party rating scale for hyperkinetic disorders); F = female; hyp = hyperactivity; imp = impulsivity; LDA = linear discriminant analysis; M = male; SICI = short-interval intracortical inhibition; SVM = support vector machine; TMS = transcranial magnetic stimulation.

* Condition on which classification is based.
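Two of the index tests in Table 2 rely on linear discriminant analysis (LDA) and one on discriminant function analysis. A minimal two-feature sketch of Fisher's linear discriminant is given below, assuming equal class priors and a covariance matrix shared by both groups. The function names are hypothetical, and the feature values (a P3-like latency and amplitude per participant) are invented for illustration; they do not come from any reviewed study.

```python
def mean2(points):
    """Mean of a list of (x, y) pairs."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def fit_fisher_lda(group_a, group_b):
    """Fit a two-feature Fisher linear discriminant.

    Returns (w, midpoint): a new point x is assigned to group A when
    w . (x - midpoint) > 0. Assumes equal priors and a shared covariance.
    """
    mu_a, mu_b = mean2(group_a), mean2(group_b)
    # Pooled within-group covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sxy = syy = 0.0
    for points, mu in ((group_a, mu_a), (group_b, mu_b)):
        for x, y in points:
            dx, dy = x - mu[0], y - mu[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    dof = len(group_a) + len(group_b) - 2
    sxx, sxy, syy = sxx / dof, sxy / dof, syy / dof
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("pooled covariance matrix is singular")
    # w = S^-1 (mu_a - mu_b), using the closed-form inverse of a 2x2 matrix.
    d0, d1 = mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]
    w = ((syy * d0 - sxy * d1) / det, (-sxy * d0 + sxx * d1) / det)
    midpoint = ((mu_a[0] + mu_b[0]) / 2, (mu_a[1] + mu_b[1]) / 2)
    return w, midpoint


def classify(point, w, midpoint):
    """Label a point by the sign of its projection relative to the midpoint."""
    score = w[0] * (point[0] - midpoint[0]) + w[1] * (point[1] - midpoint[1])
    return "ADHD" if score > 0 else "control"


# Invented (latency in ms, amplitude in microvolts) pairs, for illustration only.
adhd = [(390, 6), (410, 5), (400, 7), (420, 6)]
controls = [(330, 10), (350, 11), (340, 9), (345, 10)]
w, midpoint = fit_fisher_lda(adhd, controls)
print(classify((405, 6), w, midpoint))   # ADHD
print(classify((338, 10), w, midpoint))  # control
```

With only two features and a closed-form solution, LDA has little room to overfit; the risk grows as the number of candidate ERP features approaches the number of participants.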

Quality Assessment (QUADAS-2)

Overall, the quality of the studies was acceptable (Figure 2). The biggest problems, shared by the majority of studies, were spectrum bias and overfitting. Spectrum bias means that the study sample is not representative of the target population, defined here as any person who might present in primary care with suspected or diagnosed ADHD. Most studies restricted the range of eligible participants, some more severely than others. Age range was limited in all studies; gender was restricted to boys in two studies; medication was restricted to methylphenidate in four studies; a drug washout period was implemented in three studies; comorbid disorders were excluded or limited in four studies; and the presentation of ADHD was limited to the combined subtype in two studies. One study included only right-handers. Recruitment populations were also circumscribed: Patients often came from outpatient clinics, whereas controls were frequently recruited from local schools. Taken together, these studies sample a more varied population than any single study, but they still fall short of being fully representative of the target population.

Figure 2.

Quality ratings of included studies.

Overfitting refers to the situation in which a statistical model (here: a diagnostic test) fits the sample data very well, and therefore predicts class membership in that sample very well, but fails to generalize to new data. This happens when a method models not only the regularities that the sample shares with other samples from the same population but also the idiosyncrasies that are unique to that sample and therefore not generalizable. In other words, the model fits not just the “signal” (a systematic pattern in the population) but also the noise (chance deviations in a single sample). Overfitting has become particularly prominent with the advent of machine learning algorithms that search large numbers of variables for the combinations that best predict class membership (James, Witten, Hastie, & Tibshirani, 2013). A large pool of candidate predictors virtually guarantees a solution that fits the data at hand well, but such models usually do not generalize, and accuracy estimates computed on the same data used to build them are therefore exaggerated.
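The mechanism can be made concrete with a small simulation. The Python sketch below, with all settings (sample sizes, number of features, random seed) invented for illustration, generates 200 pure-noise "ERP features" for 20 hypothetical patients and 20 hypothetical controls, so no feature carries any real group information. It then searches every feature, cut-off, and direction for the single rule that best separates the two groups in that sample. Accuracy in the sample used for the search typically comes out well above chance, whereas accuracy in a fresh sample from the same population stays near 50%.

```python
import random


def noise_sample(n_per_group, n_features, rng):
    """Simulate participants whose features are pure noise.

    Labels: 1 = ADHD, 0 = control. No feature depends on the label.
    """
    labels = [1] * n_per_group + [0] * n_per_group
    features = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in labels]
    return features, labels


def predict(rule, row):
    """Apply a (feature index, threshold, direction) rule to one participant."""
    feature, threshold, sign = rule
    return 1 if sign * (row[feature] - threshold) > 0 else 0


def accuracy(rule, features, labels):
    hits = sum(predict(rule, row) == label for row, label in zip(features, labels))
    return hits / len(labels)


def best_single_feature_rule(features, labels):
    """Exhaustive search over features, cut-offs, and directions for the
    rule with the highest accuracy on the same data it is fitted to."""
    best_rule, best_acc = None, -1.0
    for j in range(len(features[0])):
        for threshold in sorted(row[j] for row in features):
            for sign in (1, -1):
                rule = (j, threshold, sign)
                acc = accuracy(rule, features, labels)
                if acc > best_acc:
                    best_rule, best_acc = rule, acc
    return best_rule, best_acc


rng = random.Random(2013)  # arbitrary seed for reproducibility
train_x, train_y = noise_sample(20, 200, rng)
rule, train_acc = best_single_feature_rule(train_x, train_y)

# An independent sample from the same population, where nothing is learnable.
test_x, test_y = noise_sample(500, 200, rng)
test_acc = accuracy(rule, test_x, test_y)
print(f"accuracy in the sample used to build the rule: {train_acc:.0%}")
print(f"accuracy in an independent sample: {test_acc:.0%}")
```

Real ERP features do carry some signal, so real classifiers will not collapse all the way to chance; the gap between the two numbers is the part of the reported accuracy that is due to the search itself.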

Our criteria for the absence of overfitting bias are strict but, we believe, appropriate. We consider that the best has been done to avoid overfitting only when a diagnostic test has been evaluated not only in the study sample but also in an independent sample that is not part of the same population as the original sample. While four of the seven studies used cross-validation to estimate classification error, only Müller et al. (2011) went on to test their classifier in an independent sample, a group of 17 participants recruited and tested in a different country. However, this group was very small, which raises another problem: whether it had sufficient power to test the classifier adequately.
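How little a 17-person test sample can pin down is easy to quantify. The sketch below computes a Wilson score 95% confidence interval for a classification accuracy, assuming, for illustration, that the reported 94% corresponds to 16 of 17 participants correctly classified (16/17 ≈ 94.1%). Under that assumption the interval runs from roughly 73% to 99%, which is compatible with anything from a mediocre to a near-perfect classifier. The Wilson interval is one standard choice among several; an exact Clopper-Pearson interval would be somewhat wider. The function name is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score confidence interval for a binomial proportion
    (z = 1.959964 gives approximately 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Assumption: 94% accuracy in the test sample = 16 of 17 correctly classified.
low, high = wilson_interval(16, 17)
print(f"16/17 = {16 / 17:.1%}, 95% CI {low:.1%} to {high:.1%}")

# For comparison: the same observed accuracy in a hypothetical sample of 170.
low_big, high_big = wilson_interval(160, 170)
print(f"160/170 = {160 / 170:.1%}, 95% CI {low_big:.1%} to {high_big:.1%}")
```

The tenfold larger hypothetical sample narrows the interval considerably, which illustrates why independent validation samples need to be sized with the required precision in mind.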

The only other bias found more than once was differential verification, where ADHD patients and controls do not receive the same reference standard. This could be an issue when the different reference tests lead to conflicting results and hence to misclassification of some participants. The resulting overlap in class membership would decrease the discriminatory power of the diagnostic test. We did not rate this item as very important, however, because the studies used adequate reference tests for ADHD participants, and we assume that correctly identifying participants without ADHD will generally be achievable even if a method other than the reference standard is used.

A more general problem, not limited to this item, was poor reporting. About one fourth of reportable items (12 of 49) could not be assessed because of insufficient information in the text. Fortunately, the most underreported items also tend to be the least important ones: Apart from differential verification, recovery bias was the item with the most missing ratings (only one study provided enough information). This item received a lower weight because, with a chronic condition like ADHD, significant changes in presentation are expected to happen very slowly, if at all, so even delays of several months between reference and index test should not introduce much bias.

Test Accuracy

Estimates of test accuracy are summarized in Table 3. Sensitivities in discriminating between participants with ADHD and healthy controls ranged from 57% to 96%, and specificities from 63% to 92%. Four studies achieved both a sensitivity and a specificity of ≥85%. Müller et al. (2011), with a sample size of 150 and the best control of overfitting, reached a sensitivity and specificity of 91% each, and a correct classification rate of 94% in an independent test sample. If 10% of people evaluated for ADHD with this test actually had the disorder, a positive test result would be correct in only about 50% of cases; if 25% had the disorder, a positive result would be correct in about 75% of cases.
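The positive predictive values quoted above follow from Bayes' theorem: PPV = Se × prevalence / [Se × prevalence + (1 − Sp) × (1 − prevalence)]. The short Python sketch below plugs in the Müller et al. (2011) counts from Table 3 (68 of 75 patients and 68 of 75 controls correctly classified); the two prevalence values are the hypothetical scenarios from the text, not estimates for any real clinic, and the function name is illustrative.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(ADHD | positive test), by Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Müller et al. (2011), Table 3: 68/75 patients and 68/75 controls correct.
se = sp = 68 / 75
for prevalence in (0.10, 0.25):
    ppv = positive_predictive_value(se, sp, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV = {ppv:.0%}")
```

With these inputs the function returns about 52% and 76%, in line with the approximate figures given in the text; the same test is far less informative in a low-prevalence setting than its sensitivity and specificity alone suggest.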

Table 3.

Diagnostic Accuracy of ERP-Based Tests for Discriminating ADHD From Control Participants.

Study	n (ADHD)	n (C)	TP	FN	FP	TN	Se%	Sp%	Remarks
Castro-Cabrera, Gomez-Garcia, Restrepo, Moscoso, and Castellanos-Dominguez (2010)	23	23	22	1	3	20	96	87
Dickhaus and Heinrich (1996)	25	25	19	6	2	23	76	92
Heinrich, Hoegl, Moll, and Kratz (2014)	19	21	17	2	2	19	90	91
Müller et al. (2011)	75	75	68	7	7	68	91	91	94% overall classification accuracy in an independent test sample
Robaey, Breton, Dugas, and Renault (1992)	12	12	10	2	3	9	83	75
Shahaf et al. (2012)	13	13	11	2	3	10	85	92
Smith, Johnstone, and Barry (2003)	49	26	35	14	6	20	71	77	Aged 8-12 years
Smith et al. (2003)	51	24	29	22	9	15	57	63	Aged 13-18 years

Note. ERP = event-related potential; n = number of participants; C = controls; TP = true positives; FN = false negatives; FP = false positives; TN = true negatives; Se = sensitivity; Sp = specificity.
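The Se and Sp columns follow directly from the four cell counts: sensitivity = TP/(TP + FN) and specificity = TN/(TN + FP). A minimal Python sketch reproducing three rows of Table 3 is given below; the type and function names are illustrative. Overall accuracy is included for comparison, but it depends on the ratio of patients to controls in the sample, which is one reason sensitivity and specificity are reported separately.

```python
from typing import NamedTuple


class Counts(NamedTuple):
    tp: int  # participants with ADHD classified as ADHD
    fn: int  # participants with ADHD classified as controls
    fp: int  # controls classified as ADHD
    tn: int  # controls classified as controls


def sensitivity(c: Counts) -> float:
    return c.tp / (c.tp + c.fn)


def specificity(c: Counts) -> float:
    return c.tn / (c.tn + c.fp)


def overall_accuracy(c: Counts) -> float:
    return (c.tp + c.tn) / (c.tp + c.fn + c.fp + c.tn)


table3_rows = {
    "Dickhaus and Heinrich (1996)": Counts(tp=19, fn=6, fp=2, tn=23),
    "Müller et al. (2011)": Counts(tp=68, fn=7, fp=7, tn=68),
    "Smith et al. (2003), 8-12 years": Counts(tp=35, fn=14, fp=6, tn=20),
}
for study, counts in table3_rows.items():
    print(f"{study}: Se {sensitivity(counts):.0%}, Sp {specificity(counts):.0%}, "
          f"accuracy {overall_accuracy(counts):.0%}")
```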

Discussion

Research on the use of ERPs to diagnose ADHD is severely underdeveloped. Despite a long tradition of exploring ERP deviations in childhood disorders, we found only seven studies that quantified the ability of such measures to discriminate between individuals with ADHD and healthy controls. These studies have shortcomings, such as small sample sizes, nonrepresentative samples, and inflated estimates of test accuracy due to overfitting. But their most important problem arises before these methodological details even start to matter: there is no standardization of the tests used. The performance of the different classification methods suggests that there is scope for valid and precise tests of ADHD, but this potential has so far not been harnessed in collaborative efforts to develop a single test to the point of clinical applicability. What is needed, therefore, is concerted action among research centers to develop test criteria that are not only diagnostically valid but can also be implemented for robust and simple use in clinical practice.

The lack of structure in diagnostic test research appears to reflect a similar disarray in the whole field of ERP studies in ADHD. Johnstone, Barry, and Clarke (2013) remarked that although there clearly exists a potential for improved understanding of the brain dysfunctions underlying ADHD, the field would benefit from a more rigorous approach to defining clinical groups and a stronger emphasis on replication studies. They noted in particular, in their literature review, that “no two studies use the same participant, task, and analysis parameters” (Johnstone et al., 2013, p. 655). These authors also recommended an increased focus on the diagnostic use of ERPs (Barry, Johnstone, & Clarke, 2003; Johnstone et al., 2013), but noted that not much progress had been made in the preceding decade.

We concur with this assessment. A successful ERP-based diagnosis requires a foundation of solid, reproducible research into the differences between the brains of ADHD patients and healthy individuals. If potential differences cannot be well established through systematic research programs, diagnostic markers of the disorder will not come to light or will not pass validation.

Particular attention will have to be given to the effects of age and gender, comorbidity, and the use of medication. These are some of the main determinants of the nature and size of alterations observed in ADHD, and combinations of different levels of these factors already produce a vast heterogeneity in potential study samples. To some extent, the search for diagnostic markers might therefore have to proceed on parallel tracks, with some researchers focusing on children, some on adolescents and/or adults, some on male, some on female participants, and so forth. It should not be assumed without further proof that one and the same diagnostic marker will be optimal for each subgroup. To assist in finding promising markers for different subgroups, a number of excellent literature reviews exist (e.g., Barry, Johnstone, & Clarke, 2003; Johnstone et al., 2013).

Recommendations

To advance ERP-based diagnosis of ADHD, the most pressing need is to establish collaborative networks that systematically explore test variants. Ideally, such tests should (a) be simple and easy to administer (e.g., use the signal from only one or a few electrodes), (b) be short, (c) produce results immediately, (d) be based on open source software, and (e) be implementable in a wide range of available EEG systems.

In the long run, (a) tests should be developed that tap into circumscribed biomarkers with known functions, so that (i) tests can contribute to an etiological understanding of the disorder, and (ii) test results are interpretable and communicable to patients, and (b) tests should be validated in a wide range of samples that map out as much as possible of the population of people who might show up in primary care with the need for a diagnostic evaluation of ADHD.

Footnotes

Declaration of Conflicting Interests

The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: A.G. is employed part-time by A. Müller, the first author of one of the reviewed studies. Both A.G. and O.K. are in active collaboration with A. Müller.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Author Biographies

Alex Gamma holds a PhD in neurobiology, with training in positron emission tomography (PET) imaging and electroencephalography/event-related potentials (EEG/ERP). He has used these methods to study the effects of psychotropic drugs in humans. For 10 years, he has also worked as an applied statistician in psychiatric epidemiology.

Olga Kara has an MSc in neurobiology and a PhD in psychophysiology. She specializes in quantitative EEG and ERP methods as well as in repetitive transcranial magnetic stimulation (rTMS), navigated brain stimulation (NBS), transcranial direct current stimulation (tDCS), and brain-computer interfaces.

References

American Psychiatric Association. (1980). Diagnostic and statistical manual of mental disorders (3rd ed.). Washington, DC: Author.

American Psychiatric Association. (1987). Diagnostic and statistical manual of mental disorders (3rd ed., rev.). Washington, DC: Author.

American Psychiatric Association. (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: Author.

Barry, R. J., Clarke, A. R., & Johnstone, S. J. (2003). A review of electrophysiology in attention-deficit/hyperactivity disorder: I. Qualitative and quantitative electroencephalography. Clinical Neurophysiology, 114, 171-183.

Barry, R. J., Johnstone, S. J., & Clarke, A. R. (2003). A review of electrophysiology in attention-deficit/hyperactivity disorder: II. Event-related potentials. Clinical Neurophysiology, 114, 184-198.

Castro-Cabrera, Gomez-Garcia, Restrepo, Moscoso, & Castellanos-Dominguez (2010, August-September). Evaluation of feature extraction techniques on event-related potentials for detection of attention-deficit/hyperactivity disorder. 32nd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Buenos Aires, Argentina.

Dickhaus, H. H., & Heinrich, H. H. (1996). EP parametrization and classification using wavelet networks: Theoretical concept and medical application. Studies in Health Technology and Informatics, 43(Pt. B), 541-545.

Fayyad, De Graaf, Kessler, R. C., Alonso, Angermeyer, Demyttenaere, . . . Jin (2007). Cross-national prevalence and correlates of adult attention-deficit hyperactivity disorder. The British Journal of Psychiatry: The Journal of Mental Science, 190, 402-409.

Heinrich, Hoegl, Moll, G. H., & Kratz (2014). A bimodal neurophysiological study of motor control in attention-deficit hyperactivity disorder: A step towards core mechanisms? Brain, 137(Pt. 4), 1156-1166.

James, Witten, Hastie, & Tibshirani (2013). An introduction to statistical learning. New York, NY: Springer.

Johnstone, S. J., Barry, R. J., & Clarke, A. R. (2013). Ten years on: A follow-up review of ERP research in attention-deficit/hyperactivity disorder. Clinical Neurophysiology, 124, 644-657.

Müller, Candrian, Grane, V. A., Kropotov, J. D., Ponomarev, V. A., & Baschera, G.-M. (2011). Discriminating between ADHD adults and controls using independent ERP components and a support vector machine: A validation study. Nonlinear Biomedical Physics, 5, Article 5.

Polanczyk & Rohde, L. A. (2007). Epidemiology of attention-deficit/hyperactivity disorder across the lifespan. Current Opinion in Psychiatry, 20, 386-392.

Robaey, Breton, Dugas, & Renault (1992). An event-related potential study of controlled and automatic processes in 6-8-year-old boys with attention deficit hyperactivity disorder. Electroencephalography and Clinical Neurophysiology, 82, 330-340.

Shahaf, Reches, Pinchuk, Fisher, Ben Bashat, Kanter, . . . Geva, A. B. (2012). Introducing a novel approach of network oriented analysis of ERPs, demonstrated on adult attention deficit hyperactivity disorder. Clinical Neurophysiology, 123, 1568-1580.

Smith, J. L., Johnstone, S. J., & Barry, R. J. (2003). Aiding diagnosis of attention-deficit/hyperactivity disorder and its subtypes: Discriminant function analysis of event-related potential data. Journal of Child Psychology and Psychiatry, 44, 1067-1075.

Snyder, S. M., Hall, J. R., Cornwell, S. L., & Quintana (2006). Review of clinical validation of ADHD behavior rating scales. Psychological Reports, 99, 363-378.

Taylor, Deb, & Unwin (2011). Scales for the identification of adults with attention deficit hyperactivity disorder (ADHD): A systematic review. Research in Developmental Disabilities, 32, 924-938.

Thome, Ehlis, A.-C., Fallgatter, A. J., Krauel, Lange, K. W., Riederer, . . . Gerlach (2012). Biomarkers for attention-deficit/hyperactivity disorder (ADHD): A consensus report of the WFSBP task force on biological markers and the World Federation of ADHD. World Journal of Biological Psychiatry, 13, 379-400.

Whiting, P. F. (2011). QUADAS-2: A revised tool for the Quality Assessment of Diagnostic Accuracy Studies. Annals of Internal Medicine, 155, 529-536.