Abstract
The Autism Diagnostic Observation Schedule (ADOS) is a semi-structured, standardized assessment designed for use in the diagnostic evaluation of individuals with suspected autism spectrum disorder (ASD). The ADOS has been effective in classifying children who clearly do or do not have autism, but it has lower specificity, and sometimes lower sensitivity, for identifying children with milder forms of ASD. Revised ADOS algorithms have recently been developed. The aim of this study was to analyse the predictive validity of the different ADOS algorithms for module 3, in particular for high-functioning autism spectrum disorder. The participants were 252 children and adolescents aged between four and 16 years with a full-scale IQ of at least 70 (126 with a diagnosis of ASD, 126 with a heterogeneous non-spectrum diagnosis). As a main finding, sensitivity was substantially higher for the newly developed ‘revised algorithm’, both for autism versus non-spectrum and, using the higher cut-off, for the broader ASD versus non-spectrum comparison. The strength of the original algorithm lies in its positive predictive power, while the revised algorithm shows weaknesses in specificity for non-autism ASD. As the ADOS proved valid and reliable even for higher-functioning ASD, the findings of the present study are used to make recommendations regarding the best use of the ADOS algorithms in a high-functioning sample.
Keywords
The Autism Diagnostic Observation Schedule (ADOS; Lord et al., 1999), in combination with the Autism Diagnostic Interview-Revised (ADI-R; Rutter et al., 2003), is regarded as the ‘gold standard’ in the diagnostic investigation of autism spectrum disorders (ASD). The ADOS is a semi-structured, standardized assessment of communication, social interaction, play and imagination, designed for use in the diagnostic evaluation of individuals with suspected ASD. It comprises four modules with different activities for observing the behaviour of participants at various developmental and language levels, ranging from those without expressive language to the verbally fluent, and from individuals with profound mental retardation to cognitively high-functioning children and adults. The activities are designed to provide systematic opportunities to elicit behaviours associated with ASD. Observed behaviour is coded on a four-point scale, with code 0 ‘when the behavior shows no evidence of abnormality as specified’ and code 3 ‘when the behavior is markedly abnormal in a way that interferes with the interview, or when the behavior is so limited that judgments about quality are impossible’ (Lord et al., 1999: 6). A selection of these coded behaviours enters the diagnostic algorithm (11 of 28 items in module 3). For an ADOS classification of autism or ASD under the original algorithms, the scores must reach separate cut-offs in the communication domain, in the social interaction domain and in the sum of the two. Repetitive behaviours are coded as part of the clinical observation but do not contribute to the total score that yields an autism, autism spectrum or non-spectrum classification. This decision was based on the narrow time frame available for observing such behaviours during administration (Lord et al., 2000).
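To make the original decision rule concrete, the following Python sketch shows how such a classification could be computed: item scores are summed per domain, and a classification is reached only if the communication total, the social interaction total and their sum all meet their thresholds. The function name, argument layout and the cut-off values in the usage example are illustrative placeholders, not a statement of the published module 3 thresholds (consult the ADOS manual, Lord et al., 1999, for those). The sketch also assumes that the item scores passed in are already the values entered into the algorithm.

```python
from typing import Sequence

# (communication, social interaction, combined) thresholds
Cutoffs = tuple[int, int, int]


def classify_original(
    communication_items: Sequence[int],
    social_items: Sequence[int],
    autism_cutoffs: Cutoffs,
    asd_cutoffs: Cutoffs,
) -> str:
    """Hypothetical helper: original-style ADOS classification.

    RRB items are not arguments because they do not enter the original total.
    """
    communication = sum(communication_items)
    social = sum(social_items)
    scores = (communication, social, communication + social)

    def meets_all(cutoffs: Cutoffs) -> bool:
        # Every domain score, and the combined total, must reach its own cut-off.
        return all(score >= cut for score, cut in zip(scores, cutoffs))

    if meets_all(autism_cutoffs):
        return "autism"
    if meets_all(asd_cutoffs):
        return "autism spectrum"
    return "non-spectrum"


# Placeholder thresholds for illustration only; check the ADOS manual.
AUTISM_CUTOFFS = (3, 6, 10)
ASD_CUTOFFS = (2, 4, 7)

print(classify_original([1, 1, 0, 1], [1, 1, 1, 0, 1, 1, 1], AUTISM_CUTOFFS, ASD_CUTOFFS))
# -> autism spectrum (3 and 6 reach the autism domain cut-offs, but the sum 9 < 10)
```

With these placeholder values, a profile that meets both autism domain thresholds can still be classified only as autism spectrum because its combined total falls short; conversely, a high social score cannot compensate for a communication score below its cut-off.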
The ADOS algorithm was derived from an a priori operationalization of the Diagnostic and Statistical Manual of Mental Disorders (DSM)-IV/International Classification of Diseases (ICD)-10 criteria, together with factor analysis and receiver operating characteristic (ROC) curve analysis (Lord et al., 2000).
Sensitivity and specificity are statistical measures of the performance of a diagnostic test: sensitivity is the proportion of individuals who have the condition and are correctly identified as positive by the test, and specificity is the proportion of individuals who do not have the condition and are correctly identified as negative. Sensitivity, specificity and clinical validity of the ADOS have been examined in several studies (see Table 1). In the original norming sample, sensitivity of the ADOS generally ranged from 0.80 to 1.00 and specificity from 0.68 to 1.00 (Lord et al., 2000). Lower values were found for the German version (Bölte and Poustka, 2004), especially for specificity (0.48).
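A minimal Python sketch of the two definitions, computed from the cells of a 2 × 2 table (reference diagnosis versus test result). The function names are our own and the counts are hypothetical; they serve only to show the arithmetic.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of people WITH the condition whom the test classifies as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of people WITHOUT the condition whom the test classifies as negative."""
    return true_neg / (true_neg + false_pos)


# Hypothetical 2 x 2 table: 50 cases (45 test-positive), 50 non-cases (40 test-negative).
print(sensitivity(true_pos=45, false_neg=5))   # 0.9
print(specificity(true_neg=40, false_pos=10))  # 0.8
```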
Current studies with relevance for validity of the Autism Diagnostic Observation Schedule (ADOS).
Module 1: Pre-Verbal/Single Words; Module 2: Phrase Speech; Module 3: Fluent Speech, Child/Adolescent; Module 4: Fluent Speech, Adolescent/Adult. ASD: autism spectrum disorder; M: module; PDD-NOS: pervasive developmental disorder not otherwise specified.
Whereas most studies have examined modules 1 and 2 of the ADOS, data are scarce for module 3 (for children/adolescents with fluent speech) and particularly for module 4 (for adolescents/adults with fluent speech). Because participants in these studies were mostly young and of low cognitive functioning, adults and high-functioning individuals are underrepresented. The main finding of these studies is that the ADOS is a reliable and valid method for identifying children with a clear diagnosis of autism or for excluding this condition, but that it has less discriminating power for children with milder forms of ASD.
Two recent studies (Gotham et al., 2007, 2008) examined the existing ADOS domain structure and algorithm, as well as correlations between domain scores and participant characteristics. In particular for module 2, domain scores correlated significantly with both age and IQ. These results prompted a revision of the original ADOS algorithm, and a revised algorithm with improved diagnostic validity (referred to below as the ‘revised algorithm’) was developed. The revised algorithms have two cut-offs: the lower one for broad ASD and the higher one for more narrowly defined autism. Previous studies of the factor structure of ASD in different data sets (ADI-R, ADOS) identified a single factor underlying the social and communication domains (for an overview, see Kamp-Becker et al., 2009), labelled ‘social communication’ or ‘social affect’. It was therefore recommended that the original structure of two independent domains (as suggested in the classification systems) be adapted. The findings of Gotham et al. (2007) supported this recommendation: they identified a main factor, ‘Social Affect’, which comprises items from both domains and forms part of the revised algorithm. They also supported a finding of further studies, namely that observations of repetitive behaviours make an independent contribution to diagnostic stability (Lord et al., 2006), and restricted, repetitive behaviour (RRB) items were accordingly included in the total of the revised algorithm. Finally, the revisions were intended to increase comparability across modules by creating algorithms with a fixed number of items of similar conceptual content (Gotham et al., 2007, 2008, 2009). The main objectives were to improve the predictive validity of the ADOS modules used for children (modules 1 to 3) and to increase comparability across these modules.
The revised algorithm increases specificity, particularly in the classification of ASD in lower-functioning populations, and generally maintains the high predictive validity of the ADOS for autism cases. Moreover, the new ‘Social Affect’ domain increases construct validity, and the inclusion of RRB items in the revised algorithm facilitates the differentiation of pervasive developmental disorder not otherwise specified (PDD-NOS) from non-spectrum cases. Like the original algorithm, the revised algorithm has separate diagnostic thresholds: one for autism and one for ASD.
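For contrast with the original rule, the revised algorithm can be sketched as a single total (Social Affect plus RRB) compared with two thresholds. As before, the helper name and cut-off values are hypothetical placeholders rather than the published module 3 values, and the item lists in the usage example are invented.

```python
from typing import Sequence


def classify_revised(
    social_affect_items: Sequence[int],
    rrb_items: Sequence[int],
    asd_cutoff: int,
    autism_cutoff: int,
) -> str:
    """Hypothetical helper: revised-style classification from one combined total."""
    if asd_cutoff > autism_cutoff:
        raise ValueError("the ASD cut-off must not exceed the autism cut-off")
    # Unlike the original algorithm, RRB items count towards the total.
    total = sum(social_affect_items) + sum(rrb_items)
    if total >= autism_cutoff:
        return "autism"
    if total >= asd_cutoff:
        return "autism spectrum"
    return "non-spectrum"


# Placeholder thresholds and invented item scores, for illustration only.
social_affect = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1]  # sums to 6
print(classify_revised(social_affect, [1, 0, 0, 0], asd_cutoff=7, autism_cutoff=9))  # autism spectrum
print(classify_revised(social_affect, [0, 0, 0, 0], asd_cutoff=7, autism_cutoff=9))  # non-spectrum
```

The two calls show how RRB items can move a case across a threshold: the same Social Affect profile falls below the lower cut-off when every RRB item is scored zero.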
The revised algorithm was replicated in two independent samples (de Bildt et al., 2009; Gotham et al., 2008). A Dutch study (de Bildt et al., 2009) evaluated the revised algorithm and found better-balanced sensitivity and specificity for modules 2 and 3. However, the study also reported that children with severely low functioning are difficult to classify on the basis of ADOS results, and specificity was lower than in previous studies. Another Dutch study (Oosterling et al., 2010) replicated the predictive validity, factor structure and correlations with age and IQ for modules 1 and 2. Improvement in diagnostic validity was most apparent for autism, except in very young or low-functioning children, whereas results for other ASD were less consistent. Figure 1 illustrates the item composition of the revised and the original algorithm.

Item affiliation of the original and the revised algorithm (complete names of items from the ‘Autism Diagnostic Observation Schedule Manual’, Lord et al., 1999).
To investigate the effects of age and IQ, correlations between the newly developed domain scores and participant characteristics were examined. In particular for module 2, the social affect domain correlated significantly with both age and IQ (Gotham et al., 2007, 2008). In contrast, the Dutch studies found no significant correlations with IQ (de Bildt et al., 2009) or only low to moderate correlations (Oosterling et al., 2010). Because sensitivity and specificity were generally lower for module 3 than for the other modules in all studies, further studies with higher-functioning children and verbal adolescents were called for (Gotham et al., 2007: 622). The present study was therefore designed to examine the sensitivity and specificity of the ADOS algorithms for module 3 in participants with high-functioning ASD. By focusing on high-functioning individuals with ASD as the population of interest, the study allows our results to be compared with those for both the revised algorithm (Gotham et al., 2007, 2008) and the original algorithm (Lord et al., 2000). Examining how the ADOS performs in a high-functioning sample should deepen our understanding of its psychometric properties and may extend our knowledge of this population.
Methods
Participants
The sample consisted of 252 children and adolescents (17 female) between four and 16 years of age (mean age 10.2 ± 2.7 years). All participants were consecutive referrals with possible ASD to the specialist outpatient clinic for children and adolescents and/or participants in an ASD research programme at the Department of Child and Adolescent Psychiatry, Philipps-University Marburg. They came from different regions of Germany and were referred by specialists or self-referred. All participants completed a diagnostic evaluation, and parents gave informed consent. The study design was approved by the ethics commission of the Philipps-University Marburg. Inclusion criteria were suspected ASD and a full-scale IQ of at least 70. Consensus best-estimate ICD-10 diagnoses were made by at least two examiners who saw the participants in one to three sessions and reviewed all assessment results (ADOS, ADI-R, IQ, neuropsychological testing, reports from other institutions, school reports and home videos). In all, 126 participants were diagnosed with ASD (42 autistic disorder, 77 Asperger’s syndrome, 7 atypical autism) and 126 received a non-ASD (non-spectrum) diagnosis (63 attention deficit/hyperactivity disorder (ADHD), 29 emotional disorder, 15 unspecified disorder of psychological development, 10 combined emotional and conduct disorder, 4 learning disability, 2 conduct disorder, 2 receptive language disorder, 1 personality disorder). To estimate the level of cognitive functioning, the German adaptations of the Wechsler Intelligence Scales for Children were administered (Petermann and Petermann, 2007; Tewes et al., 1999); otherwise, other IQ tests (K-ABC, CFT20) were used. The sample characteristics are shown in Table 2.
Sample characteristics.
The ASD and non-spectrum groups did not differ significantly in age or IQ (t-test). As in a prior study (Gotham et al., 2008), the non-spectrum group consisted of participants suspected of having ASD who did not meet the criteria for any ASD. The sample therefore reflects the range of patients who are likely to receive ADOS module 3 testing in practice.
Compared with previous studies, our sample is similar in age to the samples used for the revised algorithm (de Bildt et al., 2009; Gotham et al., 2007, 2008; total mean age 9.5 ± 2.5 years) but slightly older than the samples in the original study (Lord et al., 2000; mean age 8.83 ± 2.84 years). The IQ of our sample is higher than in the samples used for the revised algorithm (overall mean IQ ~90.74 ± ~3.07, range ~20–159) and in the original sample (non-verbal mental age slightly more than one year below chronological age).
Measures and procedure
The ADOS was administered by experienced clinicians (clinical psychologists or psychiatrists) who had completed research training and met the standard requirements for research reliability (Lord et al., 1999). Module 3 (for children/adolescents with fluent speech) was used in all cases. The diagnostic evaluation began with the ADOS, and the examiner was blind to all other information about the patient. IQ was assessed with the Wechsler scales in most cases (n = 184, 73%); other tests (K-ABC, CFT20) were used in the remaining 27%.
Statistical analysis
Domain totals for the original and revised algorithms were computed by summing the scores of the items belonging to each algorithm. For exploratory analysis, Spearman correlation coefficients were calculated between algorithm domain scores and participant characteristics (IQ and age) to identify variables that might limit validity. Sensitivity and specificity were calculated for the cut-off points of each algorithm, together with predictive values and efficiency. ROC analyses were used to examine changes in sensitivity and specificity. All statistical analyses were conducted with SPSS version 17.0.
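The correlational part of the analysis can be illustrated with a dependency-free Python sketch of Spearman's rank correlation, i.e. the Pearson correlation of ranks, with tied values given their average rank. The study itself used SPSS 17.0; this sketch is not the authors' code, it omits the significance test, and the IQ and domain-score values are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values 1..n; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend 'end' over the run of values tied with values[order[start]].
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of 1-based positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Invented values: full-scale IQ and an algorithm domain total per participant.
iq = [75, 88, 95, 102, 110, 121]
domain_total = [14, 12, 12, 9, 8, 6]
print(round(spearman_rho(iq, domain_total), 3))  # -0.986
```

For these invented values rho is about -0.99: higher IQ goes with lower domain scores, the same direction as, though far stronger than, the correlations reported in the Results.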
Results
Correlations with participant characteristics
For both algorithms, modest correlations between full-scale IQ and algorithm domain scores were found only in the ASD sample, not in the non-spectrum sample (original algorithm r = –.290, revised algorithm r = –.320, p < .01). The significant correlations for individual items are given in Table 3. To control for multiple testing, the significance level was set at p < .002.
Correlation (Spearman correlation coefficient) between Autism Diagnostic Observation Schedule (ADOS) items and IQ.
Because the items are assigned to domains differently in each algorithm, the correlations vary across algorithm domains; the highest correlation was found for the ‘Social Affect’ domain (r = –.294, p = .001). To study this important variable further, the sample was divided by IQ: as the overall mean IQ was 101 (median 102), we pragmatically split the sample at an IQ of 100, forming one group with an IQ up to 99 and one with an IQ ≥ 100. The characteristics of these subsamples are also shown in Table 2. To investigate diagnostic validity within these subgroups, sensitivity, specificity and predictive values were also calculated for each group. In contrast to IQ, age showed no significant correlation with ADOS domain scores for either the original or the revised algorithm.
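The subgroup analysis can be sketched in Python as follows: cases are split at an IQ threshold, and sensitivity, specificity and efficiency are computed within each half. We assume here that ‘efficiency’ means the overall proportion of correctly classified cases, its usual meaning in diagnostic accuracy studies; the `Case` record, the function names and the example cases are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Case:  # hypothetical record for one participant
    iq: int
    has_asd: bool        # consensus clinical diagnosis (reference standard)
    ados_positive: bool  # ADOS reached the cut-off under evaluation


def accuracy_metrics(cases):
    """Sensitivity, specificity and efficiency (overall share correctly classified).

    Assumes the group contains both ASD and non-ASD cases; otherwise a ratio divides by zero.
    """
    tp = sum(c.has_asd and c.ados_positive for c in cases)
    fn = sum(c.has_asd and not c.ados_positive for c in cases)
    tn = sum(not c.has_asd and not c.ados_positive for c in cases)
    fp = sum(not c.has_asd and c.ados_positive for c in cases)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "efficiency": (tp + tn) / len(cases),
    }


def metrics_by_iq(cases, split=100):
    """One group below `split`, one group at or above it."""
    lower = [c for c in cases if c.iq < split]
    upper = [c for c in cases if c.iq >= split]
    return {f"IQ < {split}": accuracy_metrics(lower), f"IQ >= {split}": accuracy_metrics(upper)}


# Invented cases: (IQ, clinical ASD diagnosis, ADOS above cut-off).
cases = [
    Case(85, True, True), Case(92, True, True), Case(78, False, True), Case(95, False, False),
    Case(105, True, False), Case(118, True, True), Case(101, False, False), Case(130, False, False),
]
for group, values in metrics_by_iq(cases).items():
    print(group, values)
```

With real data, each subgroup would be evaluated separately for each algorithm and each cut-off, as in the analyses reported below.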
Sensitivity, specificity and predictive values
First, the diagnostic classifications at the different cut-offs were analysed. Table 4 shows the distribution of participants by final ADOS algorithm classification and clinical diagnosis. Interestingly, as Table 4 shows, more than half of the participants with non-autism ASD (58.3%) reached the autism cut-offs of the original algorithm. Moreover, approximately one in six participants with autism (16.7%) reached only the non-autism ASD cut-offs. For the revised algorithm, the proportion of such ‘misclassified’ participants was lower for autism and considerably higher for non-spectrum cases.
Distributions of participants by Autism Diagnostic Observation Schedule (ADOS) classification and diagnosis.
ASD: autism spectrum disorder.
False positive classifications (ADOS classification = ASD, overall diagnosis = non-spectrum) occurred in 27% (n = 33) of non-spectrum cases with the original algorithm and in 31% (n = 38) with the revised algorithm. These false positive non-spectrum cases had the following clinical diagnoses: attention deficit/hyperactivity disorder (ADHD) (19% of all participants with this diagnosis using the original algorithm/22% using the revised algorithm), emotional disorder (35%/38%), combined emotional and conduct disorder (50%/60%) and some cases with other diagnoses. ADOS classifications for each of these non-ASD clinical diagnoses are presented in Table 5. Sample sizes were too small for formal analysis, but false positive ADOS classifications appeared to be most frequent in participants with emotional and/or conduct disorders.
Autism Diagnostic Observation Schedule (ADOS) classification and diagnoses of non-spectrum individuals.
ASD: autism spectrum disorders; ADHD: attention deficit/hyperactivity disorder.
The area under the ROC curve (AUC) provides an adequate measure of the overall quality of an algorithm (Zhou et al., 2002). We further examined the predictive values of the cut-offs. The positive predictive value (PPV) is the proportion of patients with positive test results who are correctly diagnosed. However, its value, like that of the NPV, depends on the prevalence of the disorder in the tested sample (i.e. the proportion of children tested who actually have autism). The negative predictive value (NPV) is the proportion of patients with negative test results who are correctly diagnosed.
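Because predictive values depend on prevalence, it can help to compute them from sensitivity, specificity and prevalence via Bayes' rule. The Python sketch below uses hypothetical values (sensitivity 0.90, specificity 0.80), not results from this study, to show that the same test yields a much lower PPV when the disorder is rarer among those tested.

```python
def predictive_values(sens: float, spec: float, prevalence: float) -> tuple[float, float]:
    """Return (PPV, NPV) implied by sensitivity, specificity and prevalence (Bayes' rule)."""
    tp = sens * prevalence              # expected share of true positives among those tested
    fn = (1 - sens) * prevalence        # expected false negatives
    tn = spec * (1 - prevalence)        # expected true negatives
    fp = (1 - spec) * (1 - prevalence)  # expected false positives
    return tp / (tp + fp), tn / (tn + fn)


# Hypothetical test with sensitivity 0.90 and specificity 0.80:
for prevalence in (0.5, 0.1):
    ppv, npv = predictive_values(0.90, 0.80, prevalence)
    print(f"prevalence {prevalence}: PPV {ppv:.2f}, NPV {npv:.2f}")
```

At a prevalence of 0.5 (a balanced sample, as with the 126 ASD and 126 non-spectrum cases here) this gives PPV ≈ 0.82 and NPV ≈ 0.89; at a prevalence of 0.1 the PPV drops to about 0.33 while the NPV rises to about 0.99.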
AUC, sensitivity, specificity and positive and negative predictive values were calculated for both algorithms and both cut-offs; the results are shown in Table 6. Below, sensitivity, specificity, NPV and PPV are reported for all diagnostic groups (differences of more than 0.05 between the two algorithms are highlighted in the text; all other values are listed in Table 6).
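The AUC can be computed without constructing the curve explicitly. For ordinal scores such as ADOS algorithm totals, the empirical AUC equals the probability that a randomly chosen case scores higher than a randomly chosen non-case, with ties counted as one half (the Mann-Whitney formulation). A minimal Python sketch with invented scores; the function name is our own.

```python
def auc_mann_whitney(case_scores, noncase_scores):
    """Empirical ROC AUC: P(case score > non-case score), with ties counted as 1/2."""
    favourable = 0.0
    for a in case_scores:
        for b in noncase_scores:
            if a > b:
                favourable += 1.0
            elif a == b:
                favourable += 0.5
    return favourable / (len(case_scores) * len(noncase_scores))


# Invented algorithm totals for three cases and three non-cases.
print(round(auc_mann_whitney([12, 9, 7], [3, 7, 5]), 3))  # 0.944
```

An AUC of 0.5 means the scores do not separate the groups at all, and 1.0 means every case outscores every non-case. The double loop costs one comparison per case/non-case pair, which is adequate for samples of a few hundred participants.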
Characteristics of the different algorithms.
ASD: autism spectrum disorders; Sens: sensitivity; Spec: specificity; PPV: positive predictive value; NPV: negative predictive value.
Autism versus non-spectrum: The revised algorithm shows higher sensitivity (0.93) than the original algorithm (0.79), but the original algorithm reaches higher specificity (0.89 versus 0.81 for the revised algorithm). The original algorithm also produces a higher PPV for autism cases than the revised algorithm (0.70 versus 0.62).
Non-autism ASD versus non-spectrum: Only minor differences between the two algorithms were found.
ASD versus non-spectrum: We conducted an additional analysis for the whole sample (autism and non-autism ASD versus non-spectrum), examining two cut-offs: one for high sensitivity and one for high specificity. For the lower cut-off, sensitivity, specificity, PPV and NPV are very similar for the original and revised algorithms. Sensitivity and specificity for ‘whole ASD versus non-spectrum’ are equivalent to those for ‘non-autism ASD versus non-spectrum’, but PPV is better for the whole ASD sample (0.78 for the original and 0.75 for the revised algorithm) than for non-autism ASD (0.70 for the original and 0.67 for the revised algorithm), while NPV is only slightly lower for the original algorithm (0.93 versus 0.95). Compared with ‘autism versus non-spectrum’, sensitivity is better for the whole ‘ASD versus non-spectrum’ comparison (0.79 to 0.94 for the original algorithm), but specificity is reduced (0.89 to 0.73 for the original and 0.81 to 0.69 for the revised algorithm). Despite the lower specificity, PPV is higher (0.70 to 0.78 for the original and 0.62 to 0.75 for the revised algorithm), reflecting the higher proportion of ASD cases in this comparison (126 of 252, versus 42 of 168 in the autism comparison), while NPV remains the same.
For the higher cut-off, sensitivity was considerably lower for the original algorithm (0.65) than for the revised algorithm (0.83). Specificity was better for the original (0.89) than for the revised (0.81) algorithm. The higher cut-off of the revised algorithm also shows a better NPV (0.83) than the original one (0.72).
When the results for the whole ‘ASD versus non-spectrum’ comparison are compared with those for ‘non-autism ASD versus non-spectrum’ and for ‘autism versus non-spectrum’, sensitivity is reduced for both algorithms. Specificity, however, is at about the same level as in the ‘autism versus non-spectrum’ results and higher than in the ‘non-autism ASD versus non-spectrum’ results (0.89 versus 0.73 for the original and 0.81 versus 0.69 for the revised algorithm). PPV is again higher for the whole ASD sample than for autism or non-autism ASD, while NPV is lower.
IQ subgroups: For the subsample with IQ < 100, the revised algorithm shows lower specificity than the original one, especially at the lower cut-off (0.57). At the higher cut-off, however, the revised algorithm shows better sensitivity (0.87) than the original one (0.78). Interestingly, for the subsample with IQ ≥ 100, the original algorithm shows reduced sensitivity (0.54), whereas the revised one retains sufficient sensitivity (0.80) even for children with very high IQs.
Overall, the values are very similar for the two algorithms, with a slight advantage for the revised algorithm, which shows a considerable gain in sensitivity for autism cases but slightly reduced specificity for non-autism ASD. For the whole ASD sample, PPVs at both cut-offs were higher than those for autism cases and for non-autism ASD cases. At the higher cut-off, NPV for the whole ASD sample is lower than for the autism sample.
Discussion
The current study re-examined the module 3 algorithms of the ADOS in a high-functioning sample. Following the suggestions of previous studies (Gotham et al., 2007; Lord et al., 2000), it provides further information on the psychometric properties of the ADOS algorithms in another independent, high-functioning sample. Altogether, the most consistent and potentially most meaningful finding is that the revised algorithm tends to have greater sensitivity, with a slight drop in specificity.
For module 3, we found equal overall accuracy for the two algorithms, as shown by the identical AUC values (see Table 6). Extending existing studies, our results demonstrate that both the original and the revised algorithms are effective in diagnosing high-functioning ASD. The original algorithm is superior with regard to PPV, while the revised algorithm produces a better NPV. Whether a better PPV or a better NPV is preferable depends on the purpose for which the ADOS is administered. For research purposes, the proportion of patients with positive test results who are correctly diagnosed (PPV) may be important; for clinical purposes, the proportion of patients with negative test results who are correctly diagnosed (NPV) may be more meaningful. Higher sensitivity is important to avoid underdiagnosing possible ASD, especially milder variants, while higher specificity helps to avoid overdiagnosing ASD. Reliable and early diagnosis of ASD is important for access to early intervention (Le Couteur et al., 2008). In any case, these advantages and disadvantages must be balanced by additional diagnostic procedures such as the ADI-R. The ADOS should never be used as the only tool in diagnosing ASD; a full diagnostic protocol should be followed for all participants.
This requirement is underlined by the fact that around 30% of the non-spectrum participants in our study were falsely classified as positive by the ADOS. ASD are serious, pervasive disorders with a worse outcome than many other disorders (Hofvander et al., 2009; Howlin et al., 2005), so the PPV is of considerable importance from a clinical perspective (Mazefsky and Oswald, 2006). Fortunately, around 80% of the participants with a clinical diagnosis of ADHD were correctly classified by the ADOS algorithms (i.e. the cut-offs were not reached). Many studies have shown that children with ADHD frequently exhibit social difficulties comparable in degree to those seen in autism spectrum disorders (Mulligan et al., 2009; Santosh and Mijovic, 2004). Despite these social difficulties, most of these children were differentiated from ASD by the ADOS. On the other hand, at most half of the participants with a combined diagnosis of emotional and conduct disorder were correctly differentiated by the ADOS, but this subsample is far too small to draw representative conclusions. Likewise, the number of misclassified participants with emotional disorder is too small for firm conclusions.
The ADOS is not a screening instrument but rather part of the diagnostic assessment of ASD. It requires clinical experience and very specific training, both of which demand time and resources. Therefore, the focus should be on PPV and specificity rather than on sensitivity (which matters most for screening tools), in order to avoid overdiagnosing ASD.
Compared with the findings of Gotham et al. (2008), this study (like that of de Bildt et al., 2009) found lower specificity for both the original and the revised algorithms. However, our specificity values were better than those of another German study (Bölte and Poustka, 2004), and we also found better sensitivity, especially for the non-autism ASD group. Altogether, the original algorithm has a clear advantage in specificity over the revised algorithm. The original algorithm is based on a limited number of items, whereas the revised algorithm includes additional items, especially items on stereotyped behaviour (‘Unusual Sensory Interest’, ‘Mannerisms’ and ‘Highly Specific Topics’), as well as ‘Shared Enjoyment’ from the social interaction domain. The item ‘Insight’ is omitted from the revised algorithm. With these changes, the specificity of the ADOS is reduced for participants with an IQ < 100. One could assume that stereotyped behaviour in participants with good cognitive functioning (IQ > 100) is a specific indicator of ASD (Hartley and Sikora, 2009), but not in participants with an IQ between 70 and 99. Contrary to the results of Gotham et al. (2007) and de Bildt et al. (2009), we found significant correlations between the revised algorithm and IQ. These findings may be attributable to the number of participants with above-average IQ (IQ > 115: 23.4% of the sample) and to the variation in IQ scores.
Strictly speaking, it is not the aim of the ADOS to differentiate between autism and non-autism ASD, although there is an ongoing debate about its two thresholds, the lower one for ‘non-autism ASD’ and the higher one for ‘autism’. Several studies argue that there is no qualitative difference between autism and non-autism ASD and have even found evidence for a continuum across ASD (Howlin, 2003; Kamp-Becker et al., 2010; Macintosh and Dissanayake, 2004; Sanders, 2009; Witwer and Lecavalier, 2008). Additionally, studies have indicated that inter-rater reliability in distinguishing non-autism ASD from autism, and in distinguishing ASD from non-spectrum disorders, is often poor (Lord et al., 1999; Szatmari et al., 2002). The forthcoming DSM-5 proposes a new category, ‘autism spectrum disorder’, comprising autistic disorder (autism), Asperger’s syndrome, childhood disintegrative disorder and PDD-NOS. The rationale is that ‘differentiation of ASD from typical development and other non-spectrum disorders is done reliably and with validity; while distinctions among disorders have been found to be inconsistent over time, variable across sites and often associated with severity, language level or intelligence rather than features of the disorder’ (American Psychiatric Association: www.dsm5.org/ProposedRevisions). This new conceptualization in DSM-5 may imply that PPV and NPV should be examined only for the lower cut-offs.
Similarly to our results, 53% of the PDD-NOS sample in the original study (Lord et al., 2000) reached the higher ‘autism’ scores. The cut-offs were selected to give the best sensitivity and specificity for autism or non-autism ASD versus non-spectrum. The rationale was that falsely categorizing a child with a clinical diagnosis of PDD-NOS as having autism was considered less detrimental than falsely categorizing a child with autism as non-spectrum. A further argument for these different cut-offs was to give a true indication of how well the measure performs within the most conservative diagnostic groupings (Gotham et al., 2007: 620). In our clinical experience, however, the different cut-offs lead to confusion rather than helpful discrimination. The confusion arises from the diagnostic implications of the cut-offs, for example when a participant with a clinical diagnosis of autism reaches only the autism spectrum cut-off.
In view of these empirical and clinical arguments, we propose two cut-offs: a higher threshold for higher specificity and a lower threshold for higher sensitivity. The rationale is as follows. The ‘gold standard’ for diagnosing ASD is the combined use of the ADI-R and ADOS; combining information from both sources yields higher sensitivity and, in particular, higher specificity, whereas a diagnosis based on a single instrument entails a significant loss of specificity (Risi et al., 2006). We therefore propose that if a patient reaches the lower cut-off, it should be thoroughly checked whether the observed symptoms are specific to ASD or can also be found in other developmental or psychiatric disorders. A differentiated diagnostic process should follow, including the ADI-R, home videos, school reports, specific neuropsychological testing and so on. Reaching the higher cut-off gives a more specific, and therefore more reliable, indication of ASD, and any further diagnostic process should proceed on this basis. The results of our analysis (autism and non-autism ASD versus non-spectrum) indicate that with this approach the ADOS achieves greater sensitivity, even for milder variants of ASD, combined with equal specificity relative to the optimal cut-off point for all algorithms. As the ADOS should always be considered only one part of a comprehensive diagnostic evaluation, the lower threshold should be used if high sensitivity is the goal and the higher threshold if high specificity is the goal, depending on how conclusive the other diagnostic findings are. Further studies, as well as clinical practice, are needed to evaluate the benefit of these cut-offs.
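The idea of a lower, sensitivity-oriented cut-off and a higher, specificity-oriented cut-off can be illustrated by scanning every candidate threshold of an algorithm total. The selection rule below (the highest threshold that still meets a target sensitivity, and the lowest threshold that meets a target specificity), the 0.9 targets and the function names are our illustrative assumptions, not a procedure described in this study; the scores are invented.

```python
def roc_table(case_scores, noncase_scores):
    """Sensitivity and specificity for every candidate cut-off c (positive if score >= c)."""
    table = []
    for c in sorted(set(case_scores) | set(noncase_scores)):
        sens = sum(s >= c for s in case_scores) / len(case_scores)
        spec = sum(s < c for s in noncase_scores) / len(noncase_scores)
        table.append((c, sens, spec))
    return table


def two_cutoffs(case_scores, noncase_scores, min_sens=0.9, min_spec=0.9):
    """Hypothetical rule: lower cut-off = highest threshold with sensitivity >= min_sens;
    higher cut-off = lowest threshold with specificity >= min_spec (None if unattainable)."""
    table = roc_table(case_scores, noncase_scores)
    lower = max((c for c, sens, _ in table if sens >= min_sens), default=None)
    higher = min((c for c, _, spec in table if spec >= min_spec), default=None)
    return lower, higher


# Invented algorithm totals.
case_totals = [6, 8, 9, 10, 11, 12, 13, 14, 15, 16]
noncase_totals = [2, 3, 4, 5, 5, 6, 7, 8, 9, 11]
print(two_cutoffs(case_totals, noncase_totals))  # (8, 10)
```

With these scores the sketch returns a lower cut-off of 8 (sensitivity 0.9, specificity 0.7) and a higher cut-off of 10 (sensitivity 0.7, specificity 0.9). If the lower cut-off is not below the higher one, a single threshold already meets both targets.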
The additional analysis dividing the sample into IQ groups provided further insight into the algorithms. The lower cut-off of the original algorithm is better at detecting ASD in participants with IQ < 100 (efficiency 0.92), whereas the revised algorithm has weaker specificity at this cut-off (efficiency 0.83). However, the higher cut-off of the original algorithm shows lower sensitivity for participants with IQ ≥ 100 but high specificity (0.90), which is not seen with the revised algorithm. Our proposal of two cut-offs is therefore especially useful for the original algorithm: using this procedure, sensitivity and specificity could be enhanced even for participants with high IQs.
Several limitations should be taken into account. As in all previous studies, the ADOS was part of the reference standard and thus included in the diagnostic process. This is potentially circular and may have confounded the results. Although the ADOS in our study was administered first and blind to all other data, its results were a constitutive part of the diagnostic classification and might have biased the later diagnostic process.
The sensitivity and specificity of the ADOS may also depend on the raters’ experience in diagnosing ASD and administering the ADOS. A relevant consideration for developing a revised algorithm may be the inter-rater reliability of each item: the algorithm should include not only the items that differentiate well between ASD and non-spectrum cases but also those with the best inter-rater reliability. Further limitations concern the use of module 3 only and the external validity of the ADOS. Further studies are required to shed light on these aspects.
Conclusions
The compared algorithms achieve similar values on all measures of accuracy (AUC, sensitivity, specificity and predictive values). This indicates that, in the hands of experienced examiners, the ADOS has satisfactory predictive power. The revised algorithm shows higher negative predictive power with little loss of PPV, especially when the two cut-offs are used as suggested. It captures the variability of ASD symptoms better and represents observed diagnostic features through empirically developed domains. Additionally, it makes algorithm content and number of items comparable across modules. The original algorithm shows advantages in positive predictive power, whereas the revised one shows disadvantages in specificity for non-autism ASD. Our study demonstrates that the ADOS is a valid and reliable measure for higher-functioning ASD, but a full diagnostic protocol should be followed for all participants/clients. We evaluated the suggestion of two cut-offs for any ASD and found that this offers the opportunity to increase sensitivity and specificity.
Footnotes
Acknowledgements
We thank Gerti Gerber (data input) and Elisabeth Goy (proofreading) for their help and the families of our patients for their participation in this study.
