Abstract
The diagnostic validity of the new research algorithms of the Autism Diagnostic Interview–Revised and the revised algorithms of the Autism Diagnostic Observation Schedule was examined in a clinical sample of children aged 18–47 months. Validity was determined for each instrument separately and their combination against a clinical consensus diagnosis. A total of N = 268 children (n = 171 with autism spectrum disorder) were assessed. The new Autism Diagnostic Interview–Revised algorithms (research cutoff) gave excellent specificities (91%−96%) but low sensitivities (44%−52%). Applying adjusted cutoffs (lower than recommended, based on receiver operating characteristics) yielded a better balance between sensitivity (77%−82%) and specificity (60%−62%). Findings for the Autism Diagnostic Observation Schedule were consistent with previous studies showing high sensitivity (94%−100%) alongside lower specificity (52%−76%) when using the autism spectrum cutoff, but better balanced sensitivity (81%−94%) and specificity (81%−83%) when using the autism cutoff. A combination of both the Autism Diagnostic Interview–Revised (with adjusted cutoff) and the Autism Diagnostic Observation Schedule (autism spectrum cutoff) yielded balanced sensitivity (77%−80%) and specificity (87%−90%). Results favor a combined usage of the Autism Diagnostic Interview–Revised and Autism Diagnostic Observation Schedule in young children with unclear developmental problems, including suspicion of autism spectrum disorder. Evaluated separately, the Autism Diagnostic Observation Schedule (cutoff for autism) provides better diagnostic accuracy than the Autism Diagnostic Interview–Revised.
Introduction
In clinical practice and research, it has become increasingly important to diagnose toddlers and young preschoolers with autism spectrum disorder (ASD) before the age of 48 months (Bölte et al., 2013; Elsabbagh et al., 2012; Zwaigenbaum et al., 2005). Parental concerns regarding their child’s development often appear already during the second year of life (Baghdadli et al., 2003; Chawarska et al., 2007; De Giacomo and Fombonne, 1998; Hess and Landa, 2012), and a formal diagnosis is often a prerequisite for access to clinical service, such as early intervention, which may improve long-term outcomes for individuals with ASD (Dawson, 2008; Reichow, 2012; Rogers et al., 2012; Fein et al., 2013; Yirmiya, 2010).
A multitude of research from North America and the United Kingdom has provided evidence endorsing the possibility of early and reliably diagnosing ASD (Charman et al., 2005; Gonzalez et al., 1993; Lord et al., 2006; Piven et al., 1996; Sigman et al., 1999; Turner and Stone, 2007; Venter et al., 1992), particularly for experienced clinicians using standardized instruments (Corsello et al., 2013; Lord et al., 2006; Risi et al., 2006). Diagnostic and Statistical Manual of Mental Disorders (5th ed., DSM-5; American Psychiatric Association (APA), 2013) acknowledges the value of using behavioral standardized diagnostic instruments in the assessment of ASD by recommending questionnaires, caregiver interviews and observation measures to improve reliability of ASD diagnoses over time and between clinicians. Currently, the best evaluated standardized diagnostic instruments for evaluating suspected ASD are the Autism Diagnostic Interview–Revised (ADI-R; Rutter et al., 2003) and the Autism Diagnostic Observation Schedule (ADOS; Lord et al., 1999, 2012). The ADI-R is a comprehensive, structured caregiver interview that operationalizes the Diagnostic and Statistical Manual of Mental Disorders (4th ed., text rev.; DSM-IV-TR)/International Statistical Classification of Diseases and Related Health Problems 10th Revision (ICD-10) criteria for autism through an assessment of descriptions of an individual’s behavior, elicited by an experienced interviewer and a set of standardized questions. The ADOS is a structured observation protocol to collect a sample of an individual’s social-communication and stereotypic, restricted behavior patterns (Lord et al., 1999, 2012).
The main aim of this study was to examine the diagnostic validity of the ADI-R and the ADOS in an independent clinical sample consisting of children below 48 months of age. Until recently, neither instrument had been specifically validated for use with children below age 48 months, and few studies applied the tools’ revised algorithms. Most research has been conducted in US lab settings, and especially with regard to the ADOS, results have not been presented for all ASDs combined versus nonspectrum (NS), but often rather separately for autistic disorder or pervasive developmental disorder not otherwise specified (PDD-NOS) versus NS.
More precisely, the existing body of literature on the ADI-R and ADOS in young children with ASD can be summarized as follows: Previous studies indicate that the ADI-R might be either overinclusive (Lord et al., 1993; Risi et al., 2006) or underinclusive (Ventola et al., 2006; Wiggins and Robins, 2008) of autism in children with a nonverbal mental age below 24 months. Thus, the usage of the standard ADI-R in young and very low functioning individuals has not been encouraged in the past. Several attempts have been made to adapt the ADI-R for younger ages. For instance, studies used a “toddler” version of the ADI-R (Chawarska et al., 2007; Lord et al., 2004; Richler et al., 2007), which essentially represents an extended form of the ADI-R without diagnostic algorithm. Recently, Kim and Lord (2012b) developed and evaluated a new set of ADI-R research algorithms for toddlers and young preschoolers aged 12–47 months with mental ages of 10 months and above. Three novel algorithms depending on the child’s age and expressive language showed promising diagnostic validity: children aged 12–20 months and those aged 21–47 months without speech (NV), children aged 21–47 months using single words (SW), and children aged 21–47 months with developed phrase speech (PH). The validity of these algorithms was confirmed by Kim et al. (2013) in two independent research samples. Sensitivities for ASD versus NS disorders using the clinical cutoff were 85%−90% in the NV group, 94%−97% in the SW group, and 80%−89% in the PH group, while specificities ranged from 64% to 94% in the NV group, from 58% to 83% in the SW group, and from 70% to 94% in the PH group. When excluding children with ASDs other than autism, sensitivities against NS increased.
Compared to the standard diagnostic algorithm, the new ADI-R research algorithms for young children yielded better balanced sensitivities and specificities, that is, increased specificity especially for the NV group and decreased sensitivity in some cases (Kim and Lord, 2012b). Kim et al. (2013) concluded that the new research algorithms, more closely reflecting the conception of ASD in DSM-5, seemed to better correspond to clinical judgment, at least in the United States.
Despite a rich literature corroborating the diagnostic validity of the ADOS, it is known that chronological age and developmental level may substantially influence ADOS scores (Gotham et al., 2007). In particular, the ADOS module 1 was overly inclusive in children with nonverbal mental ages < 15 months (Gotham et al., 2007). Thus, more recently, the new ADOS toddler module (Luyster et al., 2009) was developed. The toddler module is part of an updated version of the ADOS, the ADOS-2 (Lord et al., 2012), which also includes new algorithms for modules 1–3, which are improved to better fit the child’s expressive language level and age and also comprise items from the repetitive and restricted behavior domain (Gotham et al., 2007). Aside from two large US research samples (Gotham et al., 2007, 2008), the validity of the new algorithms has been examined in two Dutch research samples (De Bildt et al., 2009; Oosterling et al., 2010). In Gotham et al. (2007, 2008), sensitivities for module 1 (no words and some words) and module 2 (<5 years of age algorithm) ranged from 86% to 98%, whereas specificities were 80% to 100% using the autism cutoff for autism only versus NS. The diagnostic validity for non-ASD versus NS was generally lower. The largest gain of accuracy was found in module 1 (no words) for children with nonverbal mental age below 15 months. Here, the specificity increased from 19% to 50% for autism versus NS. Thus, the problem of overinclusivity of the ADOS for very young and developmentally delayed children was reduced but remained a concern. In the Dutch samples, specificities and particularly sensitivities were generally lower than in the US studies (De Bildt et al., 2009; Oosterling et al., 2010). Oosterling et al. (2010) reported sensitivities (61%−93%) and specificities (70%−86%) for ASD versus NS for module 1 (no words and some words) and module 2.
The revised algorithms have also been used in clinical settings. Gray et al. (2008), in a clinical Australian sample, obtained sensitivities (89%−98%) and specificities (82%−86%) for module 1 (no words and some words) comparable to those found in US research samples using the autism cutoff for the comparison autism versus non-ASD plus NS. ASD versus NS using the autism spectrum cutoff yielded somewhat lower sensitivities (78%−92%) and specificities (86%−92%). In a US clinical sample, using the autism spectrum cutoff, Molloy et al. (2011) reported sensitivities for ASD versus NS of 76%−98% for module 1 (no words and some words) and module 2; specificities were 29%−60%. When the autism cutoff was applied, sensitivities were 63%−83% and specificities 65%−81%. In summary, the evidence generally endorses the usefulness of the new ADOS algorithms. Nevertheless, as the research described above examined children of a large age range (13–144 months) without an explicit separate analysis of young children, the diagnostic accuracy of the ADOS in children aged less than 48 months is unknown.
The ADI-R and the ADOS represent considerably different but equally important strategies and sources of data collection: interviewer-guided verbal parent report, including history taking (ADI-R), and expert rating of a prompted social interaction sample (ADOS). A moderate to substantial agreement of the ADOS with clinical diagnosis (κ > .59), but only a slight to moderate agreement between the ADI-R and clinical diagnosis (κ = .15–.46) has been reported (Gray et al., 2008; Ventola et al., 2006). Studies of the convergence of the ADI-R and ADOS have shown mixed results. Ventola et al. (2006) found that the agreement between the diagnostic classifications of the ADI-R and the ADOS was below chance (κ = −.07), while Gray et al. (2008) reported a fair agreement (κ = .35). Le Couteur et al. (2008) found significant correlations between different domain totals of the ADI-R and the ADOS (r = .51–.71). In young children aged below 36 months, Risi et al. (2006) found a correlation of r = .60 between ADI-R and ADOS domain.
Generally, a combined rather than a separate usage of the ADI-R and ADOS is recommended and has sometimes been labeled the “gold standard” of diagnosing ASD. Risi et al. (2006) reported well-balanced sensitivities and specificities (~80%) in most cases for both autism versus NS and ASD versus NS in children below 36 months of age, when diagnosis was based on combined ADI-R and ADOS use. In Le Couteur et al. (2008), the categorical agreement of the combination of the ADI-R and the ADOS with diagnosis was 67% for autism and 16% for ASD.
Studies investigating the value of the ADI-R/ADOS combined and the single use of the instruments have applied the standard algorithms of the ADI-R and the ADOS. They often presented results separately for children with different forms of ASD (not the whole autism spectrum group versus NS) and were in the majority conducted in research settings. Only one study examined the diagnostic validity of combined ADI-R/ADOS usage and their new algorithms in children below 48 months of age. In a sample of N = 604 children aged 12–47 months, recruited from both research projects and clinical practice, Kim and Lord (2012a) found well-balanced sensitivity and specificity (≥80%) for all ASD versus NS comparisons (ADI-R clinical and ADOS autism spectrum cutoff). The separate usage of the ADI-R and the ADOS showed lower and less well-balanced sensitivity and specificity compared to their combined usage.
As mentioned initially, and described in this overview, despite rich evidence generally supporting the diagnostic validity of the ADI-R and the ADOS, there is an apparent scarcity of studies on their accuracy using the new algorithms in young children with suspicion of ASD, particularly in purely clinical settings outside of the United States. In addition, only one study has yet examined the combined use of the instruments in this age group.
Methods
Participants
The current study comprised N = 268 (76% boys) children below 48 months of age assessed between 2006 and 2012 at the Neuropsychiatric Resource Team Southeast, Division of Child and Adolescent Psychiatry, Stockholm County Council. The unit is a multidisciplinary diagnostic specialist clinic and part of the public health care system. It serves preschool children aged below 48 months of age (~60 children/year). Children are either referred to the unit via other departments of the health care system or directly by their caregivers. The study was approved by the Regional Board of Ethical Vetting, Stockholm.
The mean age of the participating children was 37.9 months (standard deviation (SD) = 7.2 months, range: 18–47 months). Following assessment, 171 children were given a diagnosis of ASD (autism: 103, PDD-NOS: 68). Ninety-seven children were classified as NS: 67 children received a neurodevelopmental diagnosis other than ASD (intellectual disability: 9, attention deficit hyperactivity disorder (ADHD): 16, language disorder: 42), and 30 children received no psychiatric diagnosis. The majority of the children without psychiatric diagnosis exhibited different kinds of special needs due to developmental delays and adaptive and behavioral problems that were too subtle or vague to qualify for a diagnosis, and therefore, they were included in the NS group in all analyses. Gender distribution did not differ between the diagnostic groups (χ2(1) = 2.64, p = .10). The maternal country of origin was used as a proxy for ethnicity (it is illegal to register race or ethnicity in Sweden): 59% of the children were of Swedish origin, 11% of other European, and 30% of non-European origin (on average in Stockholm County, around 80% of the children are of Swedish origin). Ethnicity was not associated with diagnosis (χ2(2) = 2.25, p = .32). Sample characteristics by ADI-R algorithms and ADOS modules are presented in Tables 1 and 2.
Sample description by ADI-R developmental cells.
NVIQ: nonverbal IQ; VABC: Vineland-II Adaptive Behavior Composite; ADI-R SA/SC: Autism Diagnostic Interview–Revised Social Affect/Social Communication; RRB: Restricted and Repetitive Behavior; IGP/RPI: Imitation, Gestures, and Play/Reciprocal Peer Interaction; ADOS: Autism Diagnostic Observation Schedule; SA: Social Affect; RRB: Restricted and Repetitive Behavior; 12–20/NV21–47: all children 12–20 months + nonverbal children 21–47 months; SW21–47: children 21–47 months with single words; PH21–47: children 21–47 months with phrase speech; SD: standard deviation.
In the 12–20/NV21–47 group, three children were less than 21 months old and NVIQ was available for 43/59 children; SW21–47: NVIQ was available for 49/56 children; PH21–47: NVIQ was available for 37/47 children with ASD and VABC for 46/47 children.
Sample description by ADOS module and algorithm.
ASD: autism spectrum disorder; NS: nonspectrum; VIQ: verbal IQ; NVIQ: nonverbal IQ; VABC: Vineland-II Adaptive Behavior Composite; ADOS: Autism Diagnostic Observation Schedule; SA: Social Affect; RRB: Restricted and Repetitive Behavior; ADOS SA + RRB: total score; nine children in module 1 no words (NW) had a nonverbal mental age (NVMA) of 15 months or below.
Measures
ADI-R
The Swedish version of the ADI-R using the new algorithms (Kim and Lord, 2012b) was administered. The new research algorithms reflect the conceptualization of ASD in DSM-5 and are consistent with the ADOS-2 algorithms. Items compose the domains of Social Affect (SA)/Social Communication (SC), Restricted and Repetitive Behavior (RRB), and Imitation, Gestures, and Play (IGP)/Reciprocal Peer Interaction (RPI). Different item combinations are used in the algorithms to generate a single total score in each ADI-R developmental group. For the NV and SW groups, the SA and RRB domains, but not the items of the IGP domain, are combined to generate cutoff scores (NV: 13 items and SW: 16 items), while for the PH group, the total scores of the SC, RRB and RPI domains (20 items) are combined. There is a lower clinical cutoff, which optimizes sensitivity, and a higher research cutoff, which optimizes specificity, as well as ranges of concern (little-to-no, mild-to-moderate, and moderate-to-severe), comparable to the scoring and interpretation on the ADOS toddler module (Lord et al., 2012; Luyster et al., 2009). According to the new ADI-R research algorithm developmental group nomenclature (Kim and Lord, 2012b) (see introduction), 72 participants fell into the NV group (three of them were aged < 21 months), 88 into the SW group, and 94 into the PH group. In the present sample, in the NV group, verbal IQ (VIQ) correlated with the different ADI-R domain totals (r = −.45 to −.50, p < .005) and age correlated with both the total score and the SA domain score (r = −.34, p ≤ .004). No other correlations between participant characteristics (age, VIQ, and nonverbal IQ (NVIQ)) and domain totals were significant or exceeded r = .40.
ADOS
The Swedish version of the ADOS (Lord et al., 1999) was used applying the revised algorithms included in the ADOS-2 (Gotham et al., 2007; Lord et al., 2012). Depending on the child’s expressive language level and/or age, the observer chooses a specific set of activities operationalized through the modules in order to minimize the influence of expressive language and developmental level. Module 1 is conceived for children that have not developed any fluent phrase speech. Herein, depending on the expressive language level, two different algorithms are used for scoring: one if the child has no words (module 1 no words), the other if the child has some words (module 1 some words). For the youngest children without phrase speech, between 12 and 30 months, the new toddler module is now available (Luyster et al., 2009). Module 2 addresses children who use phrases flexibly but not yet in a complex way. Again, there are two ways of scoring depending on the child’s age: below 5 years or 5 years and older. Module 3 is designed for children with fluent speech used in a complex way and is scored using one algorithm. In the revised ADOS algorithms, items from the Social Interaction and Communication domains have been restructured to form the SA domain. The RRB domain, not part of the original algorithm scoring, has been added. Even though consisting of different combinations, all revised algorithms have the same number of items being summed to a total score to increase their comparability. On all algorithms, the scores of 14 items from the SA and the RRB domains are combined. Like in the standard algorithms, the new algorithms provide cutoffs for autism spectrum and autism, except for the toddler module that has three different “ranges of concern” instead of a diagnostic classification. Five children were given the toddler module, 93 were given module 1 no words algorithm, 93 module 1 some words algorithm, 72 were given module 2 (<5 years algorithm), and two module 3. 
In the current sample, correlations between age, VIQ and NVIQ and ADOS domain totals were r ≤ −.40 in most cases. In module 1 no words, the correlation of VIQ and NVIQ with the ADOS total score was r = −.47 (p < .001), and in module 1 some words, the correlation of VIQ with the ADOS total and SA domain totals was r = −.41 (p < .001) in both cases. The correlation between age and domain totals was r ≤ −.21 in all modules.
Intellectual and adaptive functioning
Merrill-Palmer-R (Roid and Sampers, 2005), Wechsler Preschool and Primary Scales of Intelligence–Third Edition (Wechsler, 2004), or Mullen Scales of Early Learning (Mullen, 1995) were used to measure NVIQ and VIQ (see Tables 1 and 2). They were calculated by averaging the age equivalents of the nonverbal subtests (Visual Reception and Fine Motor (Mullen) and Cognitive and Fine Motor (Merrill-Palmer-R)), and the verbal subtests to obtain mental age, which was divided by chronological age and multiplied by 100. ASD children had lower NVIQ than NS children, except for the NV ASD group. To measure adaptive function, the Vineland Adaptive Behavior Scales-II (Sparrow et al., 2005) was used, a structured caregiver interview that assesses age appropriate self-sufficiency skills. We applied the summary score of the Vineland-II, the Adaptive Behavior Composite, to describe functional level of the participants (Table 1). Levels of adaptive functioning in ASD were generally lower than in NS.
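The ratio IQ computation described above (mean age equivalent as mental age, divided by chronological age and multiplied by 100) can be sketched as follows. This is a minimal illustration only; the function name `ratio_iq` and the example values are hypothetical and are not taken from the study data.

```python
def ratio_iq(age_equivalents_months, chronological_age_months):
    """Ratio IQ: mean subtest age equivalent (mental age) / chronological age * 100."""
    if not age_equivalents_months:
        raise ValueError("at least one subtest age equivalent is required")
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    # Mental age is taken as the average of the subtest age equivalents.
    mental_age = sum(age_equivalents_months) / len(age_equivalents_months)
    return 100.0 * mental_age / chronological_age_months


# Hypothetical child: two nonverbal subtests with age equivalents of 24 and
# 30 months, chronological age 36 months (illustrative values only).
nviq = ratio_iq([24, 30], 36)
print(nviq)  # mental age 27 months -> 75.0
```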
Procedure
Participants were referred to the Neuropsychiatric Resource Team Southeast due to unclear developmental concerns, for instance, language delay or global developmental delay, interaction difficulties and internalizing or externalizing behavior problems. Subsequently, children underwent a developmental assessment routine by a multidisciplinary team consisting of child psychiatrists, psychologists, and social workers. First, history taking was carried out by a child psychiatrist interviewing at least one caregiver, followed by a psychologist testing the child’s cognitive abilities and administering the Vineland-II. Parents without sufficient knowledge of the Swedish language were interviewed (history taking, ADI-R, Vineland-II) with the assistance of a certified interpreter. In addition, a psychologist, social worker, or child psychiatrist spent half a day observing the child in his or her (natural) preschool environment and interviewing the staff. The ADOS was administered by a psychologist and/or a social worker, one administering and the other only passively observing. Both scored the examination independently and reached consensus scores after discussion. The ADI-R was administered by any of the team members but most often by one of the child psychiatrists not familiar with the child and not taking part in the diagnostic clinical consensus discussion. Clinicians administering the ADI-R or ADOS were either officially clinically trained on the instruments or certified trainers. The first and last authors are certified ADI-R and ADOS trainers. All available information, results and observational data from the assessments were discussed by the team members who had seen the child and his or her parents to generate a clinical consensus diagnosis according to DSM-IV-TR.
Analyses
Diagnostic validity was determined by calculating sensitivity and specificity with 95% confidence intervals (Wilson score method; Newcombe, 1998) as well as classification accuracy (% correctly diagnosed) and positive likelihood ratios (LR+) for different combinations of single and combined use of the ADI-R and the ADOS compared to clinical consensus diagnosis. In order to enable comparison with previous studies, the sample was divided into developmental groups according to the new ADI-R algorithms (NV, SW, and PH) when analyzing ADI-R data and ADI-R/ADOS combined, and according to module 1 (no words and some words) and module 2 when analyzing the ADOS data alone (no separate analyses for the toddler module (n = 5) and module 3 (n = 2)). Receiver operating characteristics (ROC) statistics with area under the curve (AUC) were computed for all developmental groups of the ADI-R and modules of the ADOS to examine the discriminative properties of the instruments. Pearson’s correlations between the domain and the total scores of the ADI-R and the ADOS were calculated to assess the agreement between the two instruments, and differences between correlation coefficients were analyzed with Fisher’s Z transformation (Cohen and Cohen, 1983). Kappa statistics (κ) were used to determine categorical agreement between the ADI-R, ADOS, and clinical consensus diagnosis. Characteristics of correctly and misclassified children as well as the differences of total scores on the ADI-R and the ADOS between groups were examined with Bonferroni post hoc tests and effect sizes were calculated with Cohen’s d.
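As an illustration of the accuracy measures named above, the following minimal sketch computes sensitivity, specificity, classification accuracy, LR+ and Wilson score intervals from a 2 × 2 classification table. The function names and the counts are hypothetical and chosen for illustration; they are not the study's data, and the sketch is not the software used for the reported analyses.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z = 1.96 gives 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


def diagnostic_summary(tp, fn, fp, tn):
    """Accuracy measures of an instrument against a reference diagnosis."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    false_positive_rate = fp / (fp + tn)
    # LR+ = sensitivity / (1 - specificity); undefined when specificity is 100%.
    lr_positive = sensitivity / false_positive_rate if fp > 0 else None
    return {
        "sensitivity": sensitivity,
        "sensitivity_ci": wilson_interval(tp, tp + fn),
        "specificity": specificity,
        "specificity_ci": wilson_interval(tn, tn + fp),
        "accuracy": accuracy,
        "lr_positive": lr_positive,
    }


# Hypothetical counts for illustration only (not from the study's tables).
summary = diagnostic_summary(tp=80, fn=20, fp=10, tn=40)
for name, value in summary.items():
    print(name, value)
```

Returning `None` for LR+ when there are no false positives mirrors the situation in which a specificity of 100% leaves the ratio undefined.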
Results
ADI-R
The domain totals in all developmental groups differed statistically significantly between the children with ASD and the children with NS except for the RRB domain in the NV group (F(1, 71) = 3.80, p = .055) and PH group (F(1, 92) = 3.59, p = .061) (Table 1). Using the clinical cutoff of the new algorithms, sensitivities ranged from 53% to 70% and specificities from 69% to 81%, while the research cutoff yielded sensitivities from 44% to 52% and specificities >90%. The amount of correctly diagnosed cases ranged between 60% and 70% for both sets of algorithms. LR+ were 2.2–2.8 for the clinical cutoff, and 4.6–10.5 for the research cutoff. The AUC was NV: .79, SW: .75, and PH: .74 (all p < .001). Inspecting the ROC curves revealed that lowering cutoff points in the current sample compared with the ones proposed by Kim and Lord (2012b) improved the balance between sensitivity and specificity and the classification accuracy in all groups (see Table 3). A more useful clinical cutoff, that is, higher sensitivity and accuracy, for this sample in the NV group was 8 (instead of 11 in Kim and Lord, 2012b), 7 (instead of 8) in the SW group, and 9 (instead of 13) in the PH group. The sensitivities then increased to around 80% and the accuracies to >70% in two of the three groups but did not affect the LR+s in a positive way (1.9–2.2).
Validity of ADI-R, ADOS, and ADI-R/ADOS combined for all conditions tested, with the sample divided by developmental cells of the ADI-R.
12–20/NV21–47: all children 12–20 months and nonverbal children 21–47 months; SW21–47: children 21–47 months with single words; PH21–47: children 21–47 months with phrase speech; ADI-R: Autism Diagnostic Interview–Revised; ADOS: Autism Diagnostic Observation Schedule; CLI: clinical cutoff; RES: research cutoff; ADJUSTED: own adjusted cutoff; AS: autism spectrum cutoff; AUT: autism cutoff; CI: confidence interval; TP: true positives; TN: true negatives; FP: false positives; FN: false negatives; ASD: autism spectrum disorder; NS: nonspectrum.
Accuracy: % correctly classified. The cutoffs of the 12–20/NV21–47 group are 13 (RES), 11 (CLI), and 8 (ADJUSTED); of the SW21–47 group are 13 (RES), 8 (CLI), and 7 (ADJUSTED); and of the PH21–47 group are 16 (RES), 13 (CLI), and 9 (ADJUSTED).
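The effect of moving a cutoff can be made concrete with a small sketch that tabulates sensitivity and specificity for every candidate cutoff and computes the AUC by pairwise comparison of scores. The scores are hypothetical, a score at or above the cutoff is assumed to count as a positive classification, and Youden's J is included only as one common summary of the trade-off; the adjusted cutoffs reported here were chosen by inspecting the ROC curves, not necessarily by this index.

```python
def sweep_cutoffs(asd_scores, ns_scores):
    """Return (cutoff, sensitivity, specificity, Youden's J) for each observed score."""
    rows = []
    for cutoff in sorted(set(asd_scores) | set(ns_scores)):
        # Assumption: a total score at or above the cutoff is classified as ASD.
        sensitivity = sum(s >= cutoff for s in asd_scores) / len(asd_scores)
        specificity = sum(s < cutoff for s in ns_scores) / len(ns_scores)
        rows.append((cutoff, sensitivity, specificity, sensitivity + specificity - 1))
    return rows


def auc(asd_scores, ns_scores):
    """AUC as the probability that an ASD score exceeds an NS score (ties count 0.5)."""
    wins = sum((a > n) + 0.5 * (a == n) for a in asd_scores for n in ns_scores)
    return wins / (len(asd_scores) * len(ns_scores))


# Hypothetical algorithm totals for illustration only (not study data).
asd = [5, 8, 9, 12, 14]
ns = [2, 4, 6, 8, 10]

for cutoff, sens, spec, j in sweep_cutoffs(asd, ns):
    print(f"cutoff {cutoff:>2}: sensitivity {sens:.2f}, specificity {spec:.2f}, J {j:.2f}")
print("AUC:", auc(asd, ns))
```

Lowering the cutoff in such a table can only raise sensitivity and lower specificity, which is the pattern reported for the adjusted ADI-R cutoffs.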
Kappas for the agreement between different ADI-R cutoffs and clinical consensus diagnosis were κ = .21 (p = .024) for the clinical cutoff, κ = .25 (p = .003) for the research cutoff, and κ = .36 (p = .001) for the adjusted cutoff in the NV group. The agreement in the SW group was κ = .37 (p < .001) (clinical), κ = .28 (p = .001) (research), and κ = .45 (p < .001) (adjusted). In the PH group corresponding kappas were κ = .34 (p = .001) (clinical), κ = .40 (p < .001) (research), and κ = .36 (p < .001) (adjusted).
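For readers less familiar with κ, the statistic corrects the observed agreement between instrument classification and consensus diagnosis for the agreement expected by chance. A minimal sketch for a 2 × 2 table follows; the function name and the counts are hypothetical and not drawn from the study.

```python
def cohens_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a 2 x 2 table of instrument vs. reference classification."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of both classifications.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical counts for illustration only (not from the study's tables).
kappa = cohens_kappa(tp=80, fn=20, fp=10, tn=40)
print(round(kappa, 3))
```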
When looking at the children who were misclassified by the ADI-R, some differences emerged regarding age and Vineland-II scores between true positives (TP) and false negatives (FN) as well as between false positives (FP) and true negatives (TN) on the ADI-R (Table 4). FP (NS children falsely classified as ASD) were younger than TN (NS children correctly classified as NS) on the ADI-R (F(3, 250) = 8.22, post hoc: TN > FP, p < .001, d = .90), while FN (children with ASD falsely classified as NS) had higher Vineland-II scores than the TP (children with ASD correctly classified) (F(3, 249) = 34.27, post hoc: FN > TP, p < .001, d = .83).
Characteristics of misclassified children.
PPV: positive predictive value; NPV: negative predictive value; ADI-R: Autism Diagnostic Interview–Revised; ADI-R adjusted: own adjusted cutoff; ADOS: Autism Diagnostic Observation Schedule; TP: true positives; FN: false negatives; FP: false positives; TN: true negatives.
Sample size was too limited to permit an analysis of data for the developmental cells separately.
ADOS
The ASD group had higher domain totals (SA and RRB together) than the NS group in all modules: module 1 no words: F(1, 91) = 76.5, p < .001; module 1 some words: F(1, 91) = 81.9, p < .001; and module 2: F(1, 70) = 109.0, p < .001 (see Table 2). When using the autism spectrum cutoff, sensitivities for module 1 (no words and some words) and module 2 ranged between 94% and 100%, while specificities ranged between 52% and 76%. Using the autism cutoff yielded more balanced sensitivities and specificities in all modules, that is, lower sensitivities (81%−94%), alongside higher specificities (81%−83%). The rate of correct classifications was in the same range for both sets of algorithms, 80%−88% for the autism spectrum cutoff and 82%−88% for the autism cutoff. The LR+s ranged between 2.1 and 3.9 for the autism spectrum cutoff and between 4.4 and 5.4 for the autism cutoff (Table 5).
Diagnostic validity of the ADOS by ADOS algorithms.
ADOS: Autism Diagnostic Observation Schedule; AS: autism spectrum cutoff; AUT: autism cutoff; CI: confidence interval; ASD: autism spectrum disorder; NS: nonspectrum; TP: true positives (children with ASD classified as ASD); TN: true negatives (children without ASD classified as NS); FP: false positives (children without ASD misclassified as ASD); FN: false negatives (children with ASD misclassified as NS).
The AUCs of the different modules were .91 for module 1 no words, .92 for module 1 some words, and .95 for module 2 (all p < .001). ROC curve analyses showed that the published cutoffs of the revised algorithms suited the present sample. Agreement between the different modules and clinical diagnosis was κ = .60 (autism spectrum cutoff) and κ = .57 (autism cutoff) for module 1 no words, and κ = .72 (autism spectrum) and κ = .60 (autism) for module 1 some words. For module 2, agreement was κ = .62 (autism spectrum) and κ = .75 (autism) (all p < .001). No differences emerged regarding age and Vineland-II scores between TP and FN or between FP and TN on the ADOS (Table 4).
ADI-R and ADOS combined
In all developmental groups, the different combinations of the ADI-R and the ADOS yielded higher specificities (88%−100%) than sensitivities (34%−64%) with correct classifications ranging from 57% to 73%. The LR+s ranged between 5.1 and 14.4 (for some analyses, LR+ could not be computed due to a specificity of 100%). Using the lower adjusted cutoff for the ADI-R and the autism spectrum cutoff for the ADOS gave better balanced sensitivities (77%−80%) and specificities (87%−90%), a higher rate of correct classification (81%−82%), and LR+s of 5.8–8.0 (see Table 3 for details).
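One way to read the combined results is as a conjunctive rule, in which a child is classified as ASD only when both instruments reach their cutoffs. The sketch below assumes that rule, which is not spelled out in this section, and assumes that a score at or above the cutoff counts as positive. The ADI-R cutoff of 8 is the adjusted NV cutoff given in the table note; the ADOS cutoff, the class and function names, and all scores are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    adi_r_total: int
    ados_total: int
    consensus_asd: bool  # clinical consensus diagnosis (reference standard)


def combined_positive(case, adi_r_cutoff, ados_cutoff):
    """Assumed AND rule: positive only if both instruments meet their cutoff."""
    return case.adi_r_total >= adi_r_cutoff and case.ados_total >= ados_cutoff


def confusion_counts(cases, adi_r_cutoff, ados_cutoff):
    """Count TP, FN, FP, TN for the combined rule against consensus diagnosis."""
    tp = fn = fp = tn = 0
    for case in cases:
        positive = combined_positive(case, adi_r_cutoff, ados_cutoff)
        if case.consensus_asd:
            tp, fn = tp + positive, fn + (not positive)
        else:
            fp, tn = fp + positive, tn + (not positive)
    return tp, fn, fp, tn


# Hypothetical cases for illustration only (not study data).
cases = [
    Assessment(12, 15, True),   # both positive, ASD     -> TP
    Assessment(6, 14, True),    # ADI-R below cutoff     -> FN
    Assessment(9, 5, False),    # ADOS below cutoff      -> TN
    Assessment(10, 12, False),  # both positive, NS      -> FP
]
HYPOTHETICAL_ADOS_CUTOFF = 11  # placeholder value, not taken from the text
print(confusion_counts(cases, adi_r_cutoff=8, ados_cutoff=HYPOTHETICAL_ADOS_CUTOFF))
```

Under such a rule the combination can never be more sensitive than either instrument alone, which is consistent with lowering the ADI-R cutoff before combining.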
ADI-R/ADOS agreement
Agreement between different combinations of the ADI-R and the ADOS comparing all participants yielded a κ = .23 (p < .001) for the ADI-R clinical and research cutoffs and ADOS autism spectrum cutoff. Using the ADOS autism cutoff and the ADI-R clinical and research cutoff gave a κ = .29 and κ = .30, respectively (both p < .001). The agreement between the adjusted ADI-R cutoff and the ADOS autism spectrum and autism cutoffs was κ = .31 and .34, respectively (both p < .001), for all participants. Looking at each developmental group of the ADI-R combined with the ADOS autism spectrum and autism cutoffs, findings were more variable and ranged from κ = .15 to κ = .36 (all p ≤ .046). The correlations between the ADI-R and the ADOS total scores were r = .53 (NV, p < .001), r = .31 (SW, p = .004), and r = .42 (PH, p < .001). The correlations between the SA domain of both instruments showed the same pattern: NV r = .50 (p < .001), SW r = .28 (p = .009), and PH r = .45 (p < .001), while the RRB domains correlated weakly for the SW and PH groups: r = .19 (p = .076 and p = .062, respectively) but more strongly for the NV group (r = .40, p = .001).
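Differences between correlation coefficients from independent groups, such as those above, are commonly tested with Fisher's Z transformation. The following is a minimal sketch of that test under the assumption of independent samples; the function name, the correlations and the group sizes are hypothetical and do not reproduce any comparison from the study.

```python
import math


def compare_correlations(r1, n1, r2, n2):
    """Two-sided z test for the difference between two independent Pearson r values."""
    if n1 <= 3 or n2 <= 3:
        raise ValueError("each group needs more than 3 observations")
    # Fisher's transformation: z' = atanh(r), with variance 1 / (n - 3).
    z1, z2 = math.atanh(r1), math.atanh(r2)
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / standard_error
    # Two-sided p value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical correlations and group sizes for illustration only.
z, p = compare_correlations(r1=0.50, n1=80, r2=0.30, n2=90)
print(f"z = {z:.2f}, p = {p:.3f}")
```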
Discussion
Few studies have examined the psychometric properties of combined ADI-R and ADOS use, when applying their new diagnostic algorithms, in young children in purely clinical settings. Our findings indicate that the combined use of the ADI-R and the ADOS adds value to the accuracy of ASD diagnoses in toddlers and young preschoolers with unclear developmental concerns. Diagnostic validity of combined ADI-R and ADOS use was superior to that of either instrument alone. ADOS usage alone achieved an almost equally high diagnostic validity when the ADOS autism cutoff was applied. Both instruments separately showed, in some cases after cutoff adjustment, diagnostic validity comparable to the results from US research settings.
We found the ADI-R alone to have a pattern of substantially lower sensitivities than specificities, especially when using the research cutoff, where sensitivities rarely reached 50%, while specificities exceeded 90%. In two previous studies that examined the ADI-R new research algorithms, higher and better balanced sensitivities and specificities were found (Kim et al., 2013; Kim and Lord, 2012b). Unlike the present sample, however, their samples were not purely clinical. The lack of balance between sensitivity and specificity using the recommended cutoffs in our sample was to a certain degree remedied by the adjusted cutoffs, which also largely removed the differences from the US samples. Lowering the cutoffs increased sensitivities to ~80%, decreased specificities to ~60%, and increased the proportion of children who were accurately classified. Nevertheless, LR+s decreased and were lower than in previous studies, and none of the AUCs exceeded a fair level. The sample characteristics in this study differed to a certain extent from previous US research (Kim et al., 2013; Kim and Lord, 2012b). While age, NVIQ, and the associations between participant characteristics and domain totals resembled the US samples rather closely, all domain totals were lower, especially those of the RRB domain. Finally, there were some differences in participant characteristics of large effect size (d ≥ .83) between correctly classified and misclassified children in this study. Consistent with Kim and Lord (2012a), the children with ASD misclassified as NS had higher Vineland-II composite scores, that is, were more adaptively able than the children with ASD who were correctly classified. This indicates a negative relationship between adaptive skills and autistic symptomatology as measured by the ADI-R. Contrary to Kim and Lord (2012a), children with NS misclassified as ASD were younger than the correctly classified children with NS.
This suggests that even the new research algorithms of the ADI-R might be sensitive to age effects, leading to some overinclusivity in young children. Thus, in sum, in our clinical sample of toddlers and young children with serious developmental concerns, the ADI-R showed limited diagnostic accuracy for ASD and did not classify children as efficiently as in the US samples.
With regard to the evaluation of the ADOS in the current study, our sample and analyses differed from previous research. In particular, no earlier study has presented data separately for each ADOS module using the new algorithms in a sample restricted to children aged 47 months and younger classified into an ASD versus an NS group. In most other studies, older children, often up to the age of 12 years, were included; the diagnostic classifications were conducted for autism versus NS groups and ASD (except autism) versus NS groups, respectively; or, when the same age range was studied, data were presented for the different modules lumped together (i.e. Kim and Lord, 2012a). Besides, our sample was a purely clinical one, whereas most other studies reported data from research samples. Looking at NVIQ, it also seems that the participants in this study were higher functioning than those in the two Gotham et al. (2007, 2008) studies, the Gray et al. (2008) study, and the De Bildt et al. (2009) study, but lower functioning than those in Oosterling et al. (2010); however, all these studies reported data for children with autism and ASD separately. The association between participant characteristics and domain totals resembled that of Gotham et al. (2007), but the correlations were higher than in De Bildt et al. (2009). Therefore, the comparison with other studies where children were older and results were presented for children with autism and other ASDs separately is somewhat compromised. This notwithstanding, and taking the variation of the reported domain totals into account, we conclude that our ASD sample did not deviate from the reviewed studies in any substantial way, especially not with regard to the diagnostic validity of the ADOS.
For the diagnostic validity of the ADOS autism spectrum cutoff in our sample, we found excellent sensitivities alongside lower specificities, consistent with prior studies by Kim and Lord (2012a), Molloy et al. (2011) (except for module 2), and Risi et al. (2006). In contrast, Oosterling et al. (2010) found lower sensitivities and higher specificities for module 1 (some words) and module 2. The autism cutoff showed more balanced sensitivities and specificities (≥81%), although these were considerably lower than those found by Gotham et al. (2007, 2008). However, the latter values were obtained when the autism cutoff was used in children with core autism (not ASDs) versus NS. When the 95% confidence intervals are considered (where reported), many differences between this and previous studies were attenuated or disappeared. AUCs in our sample were excellent; kappas for autism spectrum and autism cutoffs versus clinical diagnosis were in the moderate to substantial range, while LR+s were modest.
In our cohort, the ADOS (κ = .57–.75) was superior to the ADI-R (κ = .27–.45) in terms of diagnostic accuracy, but the best diagnostic validity was generated by the combination of the ADI-R and the ADOS, using the adjusted ADI-R cutoff and the ADOS autism spectrum cutoff. This combination generated sensitivities and specificities that were basically in the same range as in Kim and Lord (2012a) and Risi et al. (2006; using the standard algorithms with children below 36 months of age), except for a somewhat lower sensitivity of the SW group in our study. The agreement of the ADI-R and the ADOS was limited in terms of correlation between domain totals. However, aside from the NV group (Z = 2.90, p = .004), the correlations did not differ significantly from those found by Kim and Lord (2012a) (Z = 1.39, p = .16 for SW and Z = 1.61, p = .11 for PH). Older studies using the standard algorithms of both instruments, Risi et al. (2006) and Le Couteur et al. (2008), reported substantial correlations between domain totals across instruments. Ventola et al. (2006), in contrast, found weak agreement, measured by kappa statistics, between the ADI-R and the ADOS classifications, probably due to a reduced capacity of the ADI-R alone to classify children correctly, as in our study. However, such findings should not be misinterpreted as meaning that the two instruments add little to diagnostic decision making when used together, as has been carefully pointed out by others (Kim and Lord, 2012a; Risi et al., 2006).
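Z statistics for the difference between two correlations from independent samples are commonly obtained with Fisher's r-to-z transformation; whether exactly this procedure produced the values above is an assumption, as the text does not name the test. The sketch below shows the calculation under that assumption. The function name is hypothetical, and the sample sizes in the example are invented for illustration, not taken from either study.

```python
import math


def compare_independent_correlations(r1: float, n1: int, r2: float, n2: int):
    """Two-sided Fisher r-to-z test for correlations from independent samples.

    Returns (z, p). Assumes |r| < 1 and n > 3 for both samples.
    """
    if min(n1, n2) <= 3:
        raise ValueError("each sample needs more than 3 observations")
    z1 = math.atanh(r1)  # Fisher transformation of r1
    z2 = math.atanh(r2)  # Fisher transformation of r2
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    # Two-sided p-value under the standard normal distribution:
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# r1 = .53 is the NV total-score correlation reported in this study;
# r2 and both sample sizes are hypothetical placeholders.
z, p = compare_independent_correlations(0.53, 103, 0.20, 103)
print(f"Z = {z:.2f}, p = {p:.3f}")
```

Because the standard error depends only on the two sample sizes, a difference of a given magnitude between correlations is more likely to reach significance in larger samples.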
The considerably lower levels of the ADI-R totals in the current study compared to the previous three samples in Kim and colleagues (Kim et al., 2013; Kim and Lord, 2012b) need to be addressed. They were not associated with any differences in child characteristics, and scores on the ADOS in our sample did not reveal lower levels compared to the US samples. Thus, we speculate that the differences are linked to parent/caregiver characteristics, to the fact that we examined a purely clinical sample, and to cultural factors. In our experience, parents of toddlers and young preschoolers who are referred to clinical services for the first time due to a suspicion of a neurodevelopmental disorder are initially often either reluctant to describe their child’s behavior in terms of abnormality or generally unaware that their child’s behavior might be developmentally altered. This tendency is reinforced by governmental educational policies in Sweden, which are predominantly socially inclusive; exclusion from regular schools is mostly viewed as a societal failure (almost all children in our study attended regular preschools). Moreover, societal awareness of or sensitivity to ASD symptoms and traits might still be lower in many countries compared to the United States. For instance, other cross-cultural studies on parent-report-based tools of autistic behaviors have found lower scores in the general population and ASD cases (see, for example, Bölte, 2012; Bölte et al., 2008). All of these latter aspects might account for the lower ADI-R scores in this study. Indeed, a multicountry European replication study of the new research algorithms of the ADI-R also found lower levels of domain totals than reported in the US studies (De Bildt et al., in press). Finally, a purely statistical note of caution is warranted concerning the scoring differences between the current and prior ADI-R studies.
Except for the RPI domain of the PH developmental cell, there was an overlap in confidence intervals with at least one of the samples for the SA/SC domain in all developmental cells but not for the RRB domain. Hence, at least some of the differences discussed above might reflect chance rather than true differences.
This study has several limitations. First, the ADI-R and the ADOS are currently considered the most efficient tools (“gold standard”) for standardized collection of information to assist clinical consensus diagnosis in clinical settings. Thus, in this study, which used data from a specialized clinical unit, and as in most other diagnostic validation studies (Gotham et al., 2007, 2009; Kim et al., 2013; Kim and Lord, 2012a, 2012b; Le Couteur et al., 2008; Risi et al., 2006), the clinical consensus diagnosis was not fully independent of the results of the ADI-R and the ADOS. Nevertheless, the diagnostic procedure leading to the consensus diagnosis comprised comprehensive information from a variety of sources, especially naturalistic observation of the child in its preschool, which enriched and balanced the results of the ADI-R and the ADOS. Second, in this study, 30 children of the NS group had no DSM-IV-TR diagnosis. It is well known that including typically developing children is likely to lead to overestimation of diagnostic validity (see, for example, Kim and Lord, 2012b, where specificities increased when typically developing children were introduced into the comparison). However, the children without diagnosis in this sample were not typically developing in a narrow sense. They had been clinically referred and assessed owing to developmental concerns. Most of them exhibited different kinds of developmental problems even though these were not sufficiently severe to justify a diagnosis. As a matter of fact, the specificities decreased slightly when the NS children without diagnosis were excluded from the analysis.
Implications for clinical practice and research
Our results indicate that a combination of the ADI-R and the ADOS should be viewed as the first choice to assist the diagnostic assessment of ASD in toddlers and young preschoolers in clinical practice. Earlier research shows that the same is true in research contexts (Kim and Lord, 2012a; Risi et al., 2006). Aside from the scientific advantages of combined ADI-R/ADOS use demonstrated by our and other studies, it can be argued that applying these two ways of collecting information has additional pedagogical effects for the parents, which might serve the child’s future well-being. That is, being exposed to a comprehensive set of concrete questions about the child’s development and behavior, plus watching the child interact with an expert in a standardized setting, followed by feedback on the child’s strengths and weaknesses based on these assessments, is often experienced as a key first step toward better understanding the expression of ASD in their child. If for whatever reason a combined ADI-R/ADOS use is not possible, using the ADOS only (autism cutoff) can be viewed as the second choice to assist the diagnostic decision making, with seemingly equivalent psychometric properties, at least in clinical settings.
Kim and Lord (2012b) presented different cutoffs for the novel research algorithms of the ADI-R with different psychometric properties and discussed a rationale for their usage. We completely agree with their reasoning: in rare clinical and virtually all research settings where FP need to be avoided, the higher research cutoff is appropriate, as it maximizes specificity. In most clinical and rare research settings where access to support and services is prioritized, the lower clinical cutoff or even our adjusted, even lower, cutoff could be applied. Theoretically, the same applies to the ADOS in young children, where the higher autism cutoff might be utilized as a counterpart to the research cutoff of the ADI-R, and the autism spectrum cutoff as the counterpart to the ADI-R’s clinical cutoff.
Footnotes
Acknowledgements
We sincerely thank Per-Olof Björck, head of the southeastern part of the Division of Child and Adolescent Psychiatry in Stockholm, for facilitating the realization of this study.
Funding
Sven Bölte was supported by the Swedish Research Council. This study was supported by the Swedish Research Council and the Swedish Research Council in partnership with FAS, FORMAS and VINNOVA (cross-disciplinary research program concerning children’s and youth’s mental health), Riksbankens Jubileumsfond and Jerringfonden.
