Abstract
Evaluations of early screening tests for autism commonly rely on receiver operating characteristic (ROC) analysis and comparisons of area under the curve (AUC). Whether AUC differs significantly from chance or between test items is not always assessed. Two recent and independent evaluations of the Brief Autism Detection in Early Childhood (BADEC) constructed a short-form by selecting the five items with the highest AUC values, leading to inconsistencies regarding appropriate item content (Nah et al., 2018; Nevill et al., 2019). Using significance testing to compare AUC values for each test item from each dataset, we demonstrate which items justify inclusion in the BADEC, which items can be ruled out, and highlight key factors influencing AUC significance testing outcomes.
It is accepted that developmental outcomes for individuals with autism spectrum disorder (ASD) will benefit from diagnosis at a young age and subsequent early intervention (e.g., Schreibman, 2000). To facilitate widespread screening for ASD in children aged 12–36 months, researchers have developed Level 1 screeners for screening children in the general population who exhibit signs of autism (e.g., Dietz et al., 2006) and Level 2 screeners to distinguish ASD from other developmental disabilities (e.g., Stone et al., 2004; Young, 2007). Here we address an important issue regarding the evaluation of screeners: namely, the consequences of failing to determine whether items differ meaningfully from one another in their ability to detect the condition of interest (Nah et al., 2018; Nevill et al., 2019).
An important part of developing a screener is collecting data to evaluate its diagnostic accuracy; through this process the screener’s ability to differentiate between those with and without the target condition is established. It is common to evaluate the discriminative performance of early screening test items, or whole tests, using receiver operating characteristic (ROC) analysis. ROC analysis computes, at each possible cutoff score, the proportion of individuals with the relevant condition who screen positive on the item or test (i.e., sensitivity or SE) and the proportion of individuals without the condition who screen negative (i.e., specificity or SP). The result is a ROC curve which plots SE values against 1 − SP values (i.e., hits against false alarms). An area under the curve (AUC) statistic is then used to assess the test’s (or item’s) discriminative performance. AUCs of 0.50 and 1 indicate chance and perfect discrimination, respectively. A significance test is necessary to formally evaluate AUC (e.g., whether performance differs meaningfully from chance or from another AUC value); however, this approach is not commonly used in practice.
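To make the construction concrete, the following minimal Python sketch builds the ROC points and the AUC for a single item. The item scores, the 0–2 scoring range and the function names are hypothetical illustrations, not data or code from either study.

```python
def roc_points(scores, labels):
    """Return (1 - SP, SE) pairs, one per cutoff.

    A child screens positive when score >= cutoff; labels are 1 (condition
    present) or 0 (condition absent).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    points = [(0.0, 0.0)]  # strictest possible cutoff: nobody screens positive
    for cutoff in sorted(set(scores), reverse=True):
        se = sum(s >= cutoff for s in pos) / len(pos)           # sensitivity
        false_alarm = sum(s >= cutoff for s in neg) / len(neg)  # 1 - specificity
        points.append((false_alarm, se))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical item scores (higher = more concerning) for 5 + 5 children
scores = [2, 2, 1, 2, 0, 1, 0, 0, 1, 0]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

points = roc_points(scores, labels)
print(points)
print(round(auc_trapezoid(points), 2))  # 0.82
```

An item whose scores are identical in both groups yields only the points (0, 0) and (1, 1), and therefore an AUC of 0.50, which is the chance level referred to above.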
Concerns regarding the inconsistent use of significance testing to evaluate AUC in the context of autism screener development were raised in Brewer et al.’s (2020) review of Level 2 ASD screeners, using as an example two evaluations of brief versions of the Autism Detection in Early Childhood (ADEC; Young, 2007). Brief screeners (composed of key items, often from a more comprehensive assessment tool) allow quicker assessments that can be implemented more widely. Nah et al. (2018) and Nevill et al. (2019) constructed short-forms of the ADEC (i.e., the Brief Autism Detection in Early Childhood; BADEC) by administering the full ADEC to infants with or without ASD and selecting the five items with the highest AUC. Three items (response to name, reciprocity of smile and following verbal commands) were included in both BADEC versions, while two items differed across versions. A lack of meaningful AUC comparisons between items prevents firm conclusions regarding the most discriminating items. Consequently, we used significance testing to re-evaluate the evidence for BADEC item selection in the Nah et al. and Nevill et al. datasets and contrasted our results with the original conclusions reached by selecting the five items with the highest AUC.
Method
Participants
Nah et al.’s (2018) sample included 270 children aged 12–36 months (M = 25.4, SD = 7.0). Based on a best estimate clinical (BEC) DSM-5 diagnosis, 106 had ASD, 86 were non-typically developing (non-TD) and 78 were considered typically developing (TD). Characteristics of the non-TD children included language delay, hearing impairment and learning difficulty. As Nah et al.’s focus was the ability of the BADEC to discriminate ASD from other developmental disorders, we only used the ASD and non-TD children’s data in our re-analyses. Nevill et al.’s (2019) sample comprised 110 children aged 14–36 months (M = 28.8, SD = 5.4). Based on assessments from a multidisciplinary team at a hospital pediatric center, 49 children had a diagnosis of ASD and 61 had ASD ruled out. Missing data for a small number of participants in the dataset provided to us meant our evaluations of the ADEC items are based on the slightly smaller sample of 107 children, 48 and 59 with and without ASD, respectively.
The Autism Detection in Early Childhood
Table 1. Area Under the Curve and 95% Confidence Intervals for Individual Autism Detection in Early Childhood Items.
Note. AUC = Area Under the Curve. AUC values in bold font indicate the item’s inclusion in the BADEC.
Results and Discussion
Using the roc and ci.auc functions in the pROC package (version 1.16.2; Robin et al., 2011) in R (version 3.6.1; R Development Core Team, 2019), we recalculated the AUC values and 95% confidence intervals for each ADEC item in the Nah et al. (2018) and Nevill et al. (2019) datasets (see Table 1). Using the roc.test function, we compared AUC values with paired ROC tests using 10,000 bootstrap samples; the R code for constructing and comparing the AUCs, as well as the significance values for each AUC comparison, can be found in Supplemental Material (Tables S1–S3, pp. 1–3).
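For readers without access to R, the logic of a paired bootstrap comparison of two AUCs can be sketched in Python as follows. This is an illustrative approximation under stated assumptions, not the authors' code (which is in the Supplemental Material) and not a port of pROC: children are resampled with replacement within each diagnostic group, both items are re-scored on the same resampled children so that the pairing is preserved, and the observed AUC difference is divided by the standard deviation of the bootstrap differences and referred to a normal distribution. All names and data below are hypothetical.

```python
import random
import statistics
from math import erfc, sqrt


def auc(pos, neg):
    """Mann-Whitney form of the AUC: P(pos > neg) + 0.5 * P(pos == neg)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_auc_test(item_a, item_b, labels, n_boot=2000, seed=1):
    """Compare the AUCs of two items scored on the same children.

    Returns (auc_a, auc_b, z, two_sided_p).
    """
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]

    def auc_pair(p_idx, n_idx):
        a = auc([item_a[i] for i in p_idx], [item_a[i] for i in n_idx])
        b = auc([item_b[i] for i in p_idx], [item_b[i] for i in n_idx])
        return a, b

    auc_a, auc_b = auc_pair(pos_idx, neg_idx)
    diffs = []
    for _ in range(n_boot):
        # Resample within each group; using the same indices for both items
        # keeps the correlation between the items intact.
        bp = rng.choices(pos_idx, k=len(pos_idx))
        bn = rng.choices(neg_idx, k=len(neg_idx))
        a, b = auc_pair(bp, bn)
        diffs.append(a - b)

    sd = statistics.stdev(diffs)
    if sd == 0:  # e.g., identical items: no evidence of any difference
        return auc_a, auc_b, 0.0, 1.0
    z = (auc_a - auc_b) / sd
    p = erfc(abs(z) / sqrt(2))  # two-sided normal p value
    return auc_a, auc_b, z, p


# Hypothetical scores for two items on the same 10 children
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
item_a = [2, 2, 1, 2, 0, 1, 0, 0, 1, 0]
item_b = [1, 2, 0, 1, 2, 1, 0, 2, 0, 1]
print(paired_bootstrap_auc_test(item_a, item_b, labels))
```

The sketch defaults to 2,000 resamples only to keep the run time short; the analyses reported here used 10,000.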
In Nah et al. (2018), the items with the highest AUCs comprising the BADEC were response to name, reciprocity of smile, following verbal commands, joint attention/social referencing and use of gestures; in Nevill et al. (2019), the items were again response to name, reciprocity of smile and following verbal commands, accompanied by gaze monitoring and task switching instead of joint attention/social referencing and use of gestures. In both datasets, the five items comprising the BADEC had AUC values that did not differ significantly from one another (ps > .134 and > .128 in Nah et al. and Nevill et al., respectively). In the Nah et al. dataset, two additional items (functional play and gaze monitoring) had AUC values that were not significantly different from those of any of the BADEC items (ps > .119). Based on the strategy of selecting the most discriminating items, significance testing suggests that seven items could have been used interchangeably in Nah et al.’s BADEC. Moreover, since one of the additional candidates (i.e., gaze monitoring) was used in Nevill et al.’s BADEC, four rather than three items were consistent contenders for inclusion in a 5-item short-form.
ADEC Items Ranked from Strongest to Weakest BADEC Contenders on the Basis of AUC Comparisons in Nah et al. (2018) and Nevill et al. (2019).
Note. AUC = Area under the curve; BADEC = Brief autism detection in early childhood. For the Nah et al. (2018) dataset we have included in the “BADEC items” the two additional items (i.e., gaze monitoring and functional play) that were not part of Nah et al.’s BADEC but, on the basis of our significance testing, could have been used interchangeably with their BADEC items.
For both datasets, substantial AUC differences were necessary for the difference to be significant (≈.08 in Nah et al., 2018, and ≈.11 in Nevill et al., 2019), suggesting that further testing with more robust sample sizes might reduce the number of BADEC contenders. To appreciate the AUC difference that would be significant for different sample sizes, we compared AUC when doubling, tripling and quadrupling the Nah et al. and Nevill et al. datasets. In one example, increasing the N to ≈400, by doubling the Nah et al. dataset and quadrupling the Nevill et al. dataset, decreased the AUC difference required for significance to ≈.06 (see Tables S4 & S5 in Supplemental Material, pp. 4–5). However, one cannot simply use sample size to infer whether two AUC values are likely to differ significantly. The level of inter-item correlation affects whether a difference of the same magnitude will reach significance (Robin et al., 2011). To illustrate, we compared AUC values in mock datasets where the AUC values were the same as in the original dataset but inter-item correlations were optimized (see Supplemental Material Tables S10 & S11, pp. 10 & 11; for inter-item correlations in the original and mock datasets see Supplemental Material Tables S6–S9, pp. 6–9). When inter-item correlations were higher, smaller differences in AUC were necessary for the difference to be significant.
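Why sample size and correlation both matter can be seen from a textbook approximation, sketched below in Python. This is not the bootstrap procedure used in the re-analyses: it assumes the Hanley and McNeil (1982) formula for the standard error of an AUC and the usual variance of a difference between two correlated estimates, SE_diff = sqrt(SE1² + SE2² − 2·r·SE1·SE2). The AUC of 0.75 and the values of r (the correlation between the two AUC estimates, which is related to but not identical with the raw inter-item correlation) are hypothetical; the group sizes of 106 and 86 are those of the Nah et al. (2018) ASD and non-TD groups.

```python
from math import sqrt


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of an AUC (Hanley & McNeil, 1982)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (
        auc * (1 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return sqrt(var)


def min_significant_difference(auc, n_pos, n_neg, r, z_crit=1.96):
    """Smallest AUC difference reaching two-sided p < .05.

    Assumes both items have AUCs close to `auc` (so a common SE applies)
    and that the two AUC estimates correlate at `r`.
    """
    se = hanley_mcneil_se(auc, n_pos, n_neg)
    se_diff = sqrt(se ** 2 + se ** 2 - 2 * r * se * se)
    return z_crit * se_diff


# Hypothetical AUC of 0.75; group sizes from Nah et al. (2018), then doubled
for n_pos, n_neg in [(106, 86), (212, 172)]:
    for r in (0.0, 0.3, 0.6):
        d = min_significant_difference(0.75, n_pos, n_neg, r)
        print(f"n = {n_pos + n_neg}, r = {r:.1f}: difference needed = {d:.3f}")
```

Under these assumptions the required difference shrinks both as the groups grow and as r increases, which mirrors the pattern described above; the exact values will differ from the bootstrap results in the Supplemental Material.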
Conclusion
The current investigation reinforces the importance of evaluating the results of ROC analysis with significance testing. Our re-analyses of Nah et al.’s (2018) and Nevill et al.’s (2019) data revealed that seven and five of the most discriminating ADEC items, respectively, did not differ meaningfully from one another and that four items were in both of these groups across studies. Compared to the strategy of simply identifying the five items with the highest AUC, which suggested that three items were consistently among the most discriminative ADEC items, the approach of comparing AUC with significance testing showed that four items fell into this category. When developing a five-item instrument, the difference between identifying four instead of three items as consistently amongst the most discriminating across studies seems non-trivial.
The current example emphasizes the important information that can come to light when using significance testing to evaluate the most discriminating items in a diagnostic test. However, the results also have implications for future work involving other types of AUC comparison. For example, when evaluating which test items are appropriate to include in an assessment instrument, researchers may wish to establish that all of their items perform better than chance levels. The current work reinforces that it would be best to use a significance test rather than simply relying on the AUC values being higher than 0.50. Similarly, since the minimum acceptable AUC of a test is often considered to be ≈0.70 (e.g., Compton et al., 2006), the AUC of the combined items should be measured against this threshold with a significance test. And when comparing the discriminative performance of two different assessment instruments, a significance test is the only way to determine whether one test is performing better than the other.
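One simple way to check an AUC against a fixed benchmark such as 0.50 or 0.70 is to ask whether a confidence interval for the AUC lies entirely above that benchmark. The Python sketch below uses a percentile bootstrap interval for this purpose; it is an illustrative assumption rather than the method used in the present re-analyses, and the function names and data are hypothetical.

```python
import random


def auc(pos, neg):
    """Mann-Whitney form of the AUC: P(pos > neg) + 0.5 * P(pos == neg)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(pos, neg, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for the AUC.

    Scores are resampled with replacement within each diagnostic group.
    """
    rng = random.Random(seed)
    stats = sorted(
        auc(rng.choices(pos, k=len(pos)), rng.choices(neg, k=len(neg)))
        for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def exceeds_benchmark(pos, neg, benchmark, **kwargs):
    """True when the whole confidence interval lies above the benchmark."""
    lower, _ = bootstrap_auc_ci(pos, neg, **kwargs)
    return lower > benchmark


# Hypothetical total scores for children with and without the condition
with_condition = [7, 9, 6, 8, 5, 9, 4, 7, 8, 6]
without_condition = [3, 5, 2, 6, 4, 1, 5, 3, 2, 4]
print(auc(with_condition, without_condition))
print(bootstrap_auc_ci(with_condition, without_condition))
print(exceeds_benchmark(with_condition, without_condition, 0.70))
```

An observed AUC above 0.70 whose interval nevertheless includes 0.70 would not, on this criterion, be taken as evidence that the instrument meets the benchmark.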
Our results also reinforce that sample size needs to be considered when interpreting significance test outcomes. In addition to re-evaluating the most discriminating items in Nah et al. (2018) and Nevill et al. (2019), significance testing also facilitated delineation of which additional items could be considered contenders for inclusion in the BADEC. The large number of items that could not be definitively ruled out as candidates emphasized that, ideally, item discriminability should be compared with sufficiently robust samples for moderately sized AUC differences to be meaningful. Although it is challenging to obtain sufficiently powered samples in research involving many clinical populations, this should be an objective of future research concerning the development of ASD screeners.
Supplemental Material
Supplemental Material, sj-pdf-1-jpa-10.1177_07342829211067128 for Pitfalls When Using Area Under the Curve to Evaluate Item Content for Early Screening Tests for Autism by Carmen A. Lucas, Neil Brewer, and Robyn L. Young in Journal of Psychoeducational Assessment
Footnotes
Acknowledgments
We are very grateful to Yong-Hwee Nah and Rose Nevill for providing us with their data files. Nah et al.’s (2018) dataset is available online.
Declaration of Conflicts of Interest
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: The university with which the authors are affiliated receives royalties from the sale of the ADEC.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Australian Research Council [DP 190100162] and the Hamish Ramsay Fund.
