Abstract
Autism Spectrum Disorder has seen a drastic increase in prevalence over the past two decades, accompanied by public discourse rife with debate and misinformation. Much of this discourse takes place online, which is the main source of information for parents seeking to learn about autism. One potential tool for navigating this information is ChatGPT-4, a conversational artificial intelligence program that responds in a question-and-answer format. Although ChatGPT shows great promise, no empirical work has evaluated its viability as a tool for providing information about autism to caregivers. The current study evaluated ChatGPT's answers to questions covering basic information about autism, myths/misconceptions, and resources. Our results suggested that ChatGPT was largely correct, concise, and clear, but did not provide much actionable advice, which was further limited by inaccurate references and hyperlinks. The authors conclude that ChatGPT-4 is a viable tool for parents seeking accurate information about autism, with opportunities for improvement in actionability and reference accuracy.
Introduction
Autism Spectrum Disorder (hereafter autism) is one of the most common neurodevelopmental disorders, currently diagnosed in one out of every 36 youth. 1 As the rates of autism have rapidly increased, 2 so has web-based information surrounding it. 3 Given that parents/caregivers use the internet as their primary source when seeking information about autism, 4 it is more important than ever to ensure the information available online is easily accessible and accurate.
Given the sheer amount of online information, evidence-based information coexists with misinformation and myths. Misinformation has been particularly salient since autism's conception, with myths (e.g., the role of vaccines, “refrigerator mothers”) permeating common knowledge even after being debunked. 5 Separating fact from fiction can be overwhelming, especially when the majority of evidence-based information is presented in scientific writing, which is not always easily interpretable. 5 One potential solution to this conundrum has emerged in the arena of artificial intelligence: ChatGPT.
ChatGPT was launched in November 2022 as a conversational question-and-answer system. 6 ChatGPT was constructed and trained by OpenAI and is based on a revised Generative Pretrained Transformer framework, currently in its fourth version. 6 ChatGPT's natural language proficiency is further bolstered by reinforcement learning from human feedback. 6 It shows great promise: it can pass licensing exams, 7 generate literature reviews, 8 debunk myths and misconceptions about COVID-19 vaccines, 9 and demonstrate proficiency as a source of health information that could “challenge conspiracy ideas with clear, concise, and nonbiased content.” 9 In a recent systematic review, benefits were highlighted in 51/60 studies, underscoring its utility for health care research, including improving health literacy. 10 These potential strengths make ChatGPT an excellent candidate for sharing evidence-based information about autism in an easily accessible format.
With the excitement about ChatGPT's abilities comes skepticism from academic communities.9,11,12 A systematic review reported concerns regarding its potential ethical issues, limited knowledge, inaccurate citations, and misinformation. 10 Thus, ChatGPT has the potential to be a valuable resource about autism, but also another contributor to misinformation. The current study sought to characterize ChatGPT's responses to common questions about autism, including questions seeking general information, questions surrounding common myths and misconceptions, and questions seeking resources from a parent perspective.
Method
We conducted an IRB-exempt, pre-registered qualitative search using ChatGPT on April 7, 2023. Thirteen open-ended questions were divided into three sections addressing common questions asked by caregivers: (a) basic information about autism (e.g., “What are the first signs of autism?”; n = 7; Table 2), (b) myths and misconceptions about autism (e.g., “Do vaccines cause autism?”; n = 3; Table 3), and (c) resources (e.g., “How long is the waitlist for autism services?”; n = 3; Table 4). Responses were collected in a new ChatGPT account with no previous activity and then qualitatively analyzed. Because ChatGPT's learning environment changes rapidly, it was asked to “regenerate” a second response to each question 9 and thus provided two codable responses for each question: response 1 (R1) and response 2 (R2). ChatGPT was also asked to provide a list of references for each question.
Table 1. Interrater Reliability Metrics
ICCs >0.90 and Kappa >0.80 = near-perfect reliability.
ICCs 0.75–0.90 and Kappa 0.61–0.80 = substantial reliability.
ICCs 0.50–0.75 and Kappa 0.41–0.60 = moderate reliability.
Kappa 0.21–0.40 = fair reliability.
ICCs <0.50 = poor reliability.
ICCs, intraclass correlations; URL, Uniform Resource Locator.
Table 2. Basic Information About Autism
ADDM, Autism and Developmental Disabilities Monitoring; ASD, Autism Spectrum Disorder; CDC, Centers for Disease Control and Prevention; DSM-5, Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition; R1, Response 1; R2, Response 2.
Table 3. Myths/Misconceptions About Autism
MMR, measles, mumps, and rubella.
Table 4. Searching for Resources
Coding
Accuracy
The criteria for evaluating ChatGPT responses were replicated from previous research evaluating ChatGPT's responses to COVID-19 questions 9 and included the “3Cs”: (a) scientific accuracy of content (Correctness), (b) Clarity of response, and (c) Conciseness (degree to which all the available knowledge is conveyed). Correctness, Clarity, and Conciseness were each scored on a scale from one to four (4 = completely correct, clear, or concise; 3 = almost correct, clear, or concise; 2 = partially correct, clear, or concise; 1 = completely incorrect, unclear, or unconcise). Scores on the three domains were then averaged into one overall 3C score, replicating previous research. 9
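The overall 3C score described above can be sketched as a simple average of the three ratings. This is a minimal illustration only: the function name `overall_3c` and the example ratings are hypothetical and are not taken from the study's coding manual.

```python
from statistics import mean


def overall_3c(correctness: int, clarity: int, conciseness: int) -> float:
    """Average the three 1-4 ratings into one overall 3C score."""
    ratings = (correctness, clarity, conciseness)
    for rating in ratings:
        # Each domain is rated on the four-point scale described in the text.
        if rating not in (1, 2, 3, 4):
            raise ValueError(f"rating {rating} is outside the 1-4 scale")
    return mean(ratings)


# Hypothetical response: completely correct and clear, almost concise.
print(round(overall_3c(correctness=4, clarity=4, conciseness=3), 2))  # 3.67
```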
Language
We assessed language use by ChatGPT to (a) mirror a bias measure used in previous research, 9 and (b) reflect recent debates about language preferences in the autistic community. 13 Language was coded according to recent recommendations on “medical” versus “neurodiversity-affirming” language for discussing autism 13 (see Note a): 1 = use of medical language, 2 = use of a combination of medical and neurodiversity-affirming language, 3 = use of neurodiversity-affirming language.
Understandability and actionability
Responses were assessed for Understandability and Actionability using the Patient Education Materials Assessment Tool for Printable materials (PEMAT-P). The PEMAT-P is a validated instrument used to assess how digestible and actionable printed patient education materials are for a lay audience. 14 ChatGPT responses were coded for understandability (n = 10 items) and actionability (n = 6 items). Each PEMAT-P statement is scored as 0 (disagree) or 1 (agree). Understandability and Actionability are reported in percentages, wherein scores are averaged across construct items. As per PEMAT-P guidelines, 14 n = 8 codes were excluded due to the short nature of ChatGPT's responses.
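As a rough sketch of how a PEMAT-P percentage can be derived from the 0/1 item ratings described above, the snippet below averages the applicable items and treats excluded items as `None`. The function name `pemat_percent` and the item ratings are hypothetical; the actual item-level codes are not reported in the text.

```python
from typing import Optional, Sequence


def pemat_percent(items: Sequence[Optional[int]]) -> float:
    """Percent of applicable items rated 1 (agree); None marks an excluded item."""
    applicable = [item for item in items if item is not None]
    if not applicable:
        raise ValueError("no applicable items to score")
    if any(item not in (0, 1) for item in applicable):
        raise ValueError("items must be 0, 1, or None")
    return 100.0 * sum(applicable) / len(applicable)


# Hypothetical ratings: 10 understandability items and 6 actionability items.
understandability = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]
actionability = [0, 0, 0, 0, 0, 0]
print(pemat_percent(understandability))  # 80.0
print(pemat_percent(actionability))      # 0.0
```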
References
References were evaluated for hyperlink accuracy (0 = incorrect Uniform Resource Locator [URL], 1 = correct URL) and date published. As responses often included multiple references, reference scores were coded as percentages of references that met the above criteria for each question (e.g., 80 percent of URLs were correct).
Interrater reliability
ChatGPT responses were generated for 13 questions. Each evaluator independently assessed responses, followed by a comparison of the scores to assess the degree of interrater agreement for the seven scores. Interrater reliability was assessed using intraclass correlations (see Note b) for continuous measures (References, Understandability, Actionability) and weighted Cohen's Kappa (see Note c) for categorical measures (3Cs and Language). 15 Rater disagreements were resolved by consensus. The coding manual is available on OSF. Interrater reliability ranged from fair to near perfect (Table 1).
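To illustrate the weighted Cohen's Kappa used for the ordinal codes, here is a minimal pure-Python sketch. It assumes linear disagreement weights; the text does not state which weighting scheme was used, and the function name `weighted_kappa` and the example ratings are hypothetical.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, categories):
    """Linearly weighted Cohen's kappa for two raters on ordered categories."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must supply the same non-zero number of codes")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")
    index = {category: i for i, category in enumerate(categories)}
    n = len(rater_a)
    observed = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marginal_a = Counter(index[a] for a in rater_a)
    marginal_b = Counter(index[b] for b in rater_b)

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # assumption: linear weights
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += weight * (marginal_a[i] / n) * (marginal_b[j] / n)
    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical language codes (1 = medical, 2 = combination, 3 = affirming).
coder_1 = [1, 2, 3, 1]
coder_2 = [1, 2, 3, 2]
print(round(weighted_kappa(coder_1, coder_2, categories=[1, 2, 3]), 3))  # 0.714
```

In practice a library routine such as scikit-learn's `cohen_kappa_score` with its `weights` argument would typically be used instead; the hand-written version is shown only to make the calculation explicit.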
Results
Basic information about autism
Basic information questions and responses are available in Table 2. The average 3C score was 3.67 out of a maximum possible 4.0 (standard deviation [SD] = 0.27) for R1 and 3.76 (SD = 0.42) for R2; there was no significant difference between responses, t(6) = −0.548, p = 0.604. PEMAT-P Understandability was an average of 78 percent for R1 and R2. PEMAT-P Actionability was 0 percent for R1 and R2. Lastly, language was predominantly medical, comprising 85.7 percent (R1) and 71.4 percent (R2) of text. The two responses did not significantly differ in language use, χ² = 0.467, p = 0.495.
Myths and misconceptions about autism
Myth-related questions are available in Table 3. The average 3C score was 3.56 (SD = 0.19) for R1 and 3.33 (SD = 0.88) for R2; there was no significant difference between responses, t(2) = 0.378, p = 0.742. PEMAT-P understandability was an average of 78 percent for R1 and 75 percent for R2; there was no significant difference between responses, t(2) = 0.718, p = 0.547. PEMAT-P actionability was 0 percent for both R1 and R2. Lastly, language was evenly split between neurodiversity-affirming (33 percent), medical (33 percent), and a combination (33 percent), and did not significantly differ between responses, χ² = 6.00, p = 0.199.
Autism resources
Resource questions are available in Table 4. The average 3C score was 3.44 (SD = 0.38) for R1 and 3.89 (SD = 0.19) for R2; there was no significant difference between responses, t(2) = −4.00, p = 0.057. Understandability from the PEMAT-P was an average of 79 percent for R1 and 78 percent for R2; there was no significant difference between responses, t(2) = 1.00, p = 0.423. PEMAT-P actionability was an average of 40 percent for R1 (SD = 0.20) and 47 percent for R2 (SD = 0.12); there was no significant difference between responses, t(2) = −1.00, p = 0.423. Lastly, language was a combination of medical and neurodiversity-affirming, comprising 67 percent of text from R1 and R2. The two responses did not significantly differ in language use, χ² = 0.750, p = 0.386.
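The R1 versus R2 comparisons reported in the sections above are repeated-measures (paired) t-tests. The sketch below shows how the t statistic and its degrees of freedom are computed from paired scores. The per-question scores are not given in the text, so the values used here are illustrative only, chosen to be consistent with the reported Resources actionability means and SDs; the function name `paired_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """Return (t, df) for a paired t-test on two equal-length score lists."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [a - b for a, b in zip(first, second)]
    spread = stdev(diffs)  # sample SD of the paired differences
    if spread == 0:
        raise ValueError("t is undefined when all differences are identical")
    t = mean(diffs) / (spread / sqrt(len(diffs)))
    return t, len(diffs) - 1


# Illustrative actionability proportions for three resource questions (assumed).
r1 = [0.2, 0.4, 0.6]
r2 = [0.4, 0.4, 0.6]
t, df = paired_t(r1, r2)
print(round(t, 2), df)  # -1.0 2
```

The p value additionally requires the t distribution, which the Python standard library does not provide; a routine such as `scipy.stats.ttest_rel` returns both the statistic and the p value.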
Quality of references
Functional hyperlinks were provided for 42 percent of references, with no significant difference in accuracy between R1 and R2, t(12) = 0.880, p = 0.396 (R1: M = 45.5 percent, SD = 0.32; R2: M = 39.1 percent, SD = 0.30). When dates were available for references that existed (n = 57/123 total references, 46 percent), dates ranged from 2006 to 2023 (mode = 2023).
Domain similarities and differences
Scores on the three domains of interest (Basic Information, Myths/Misconceptions, Resources) were submitted to a one-way analysis of variance (ANOVA) to evaluate whether ChatGPT was more accurate, understandable, or actionable in one domain of questions than in the others. As might be expected, actionability was significantly higher in the Resources domain than in the Basic Information or Myths/Misconceptions domains, R1: F(2,12) = 23.077, p < 0.001; R2: F(2,12) = 94.231, p < 0.001. No other significant differences emerged (ps > 0.401). Table 5 includes average scores for primary outcomes of interest for ChatGPT's overall responses as well as domain-specific responses.
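The one-way ANOVA above compares mean scores across the three question domains. A minimal sketch of the F statistic follows. The Resources values are illustrative only (chosen to be consistent with the reported R1 mean of 40 percent and SD of 0.20), since per-question scores are not given in the text; the function name `one_way_anova` is hypothetical.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(group) for group in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more scores than groups")
    grand_mean = mean([score for group in groups for score in group])
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((score - mean(g)) ** 2 for g in groups for score in g)
    if ss_within == 0:
        raise ValueError("F is undefined when there is no within-group variance")
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# R1 actionability by domain: 0 percent for the first two domains (as reported),
# plus illustrative Resources values (assumed, not the study's raw data).
basic = [0.0] * 7
myths = [0.0] * 3
resources = [0.2, 0.4, 0.6]
f_stat, df_b, df_w = one_way_anova([basic, myths, resources])
print(round(f_stat, 3), df_b, df_w)  # about 23.077 with df 2 and 10
```

With 13 questions in three groups, this calculation gives 10 within-group degrees of freedom, so the reported F(2,12) may reflect a different model specification. A library routine such as `scipy.stats.f_oneway` also returns the p value.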
Table 5. ChatGPT Scores Overall and Within Question Domain
***p < 0.001, **p < 0.01, *p < 0.05.
t, repeated measures t-test result; N/A, not applicable.
Discussion
In the current climate where the internet serves as a primary source of medical information for consumers, 4 the aims of this investigation were to evaluate whether ChatGPT could be an accurate, reliable, and useful tool for parents/caregivers seeking autism information. Using 13 questions across three domains, we scored ChatGPT's responses on correctness, conciseness, clarity, language use, understandability, actionability, and reference accuracy.
Was ChatGPT correct, concise, and clear?
ChatGPT produced accurate, concise, and clear information, as indicated by domain-specific and overall scores on the 3C metric. On average, responses scored between “almost” and “completely” correct, clear, and concise (mean 3C scores of 3.33 to 3.89 out of 4). Of the three criteria, Conciseness scores were lowest: some responses provided too much information, whereas others did not convey the information required to answer the question. The only inaccuracies observed were due to a lack of updated information (i.e., outdated statistics), a limitation noted by ChatGPT's creators. Importantly, correctness was maintained when ChatGPT was asked about common myths and misconceptions, which suggests that it is a viable tool for combating stigma and misinformation commonly found online.
Does ChatGPT produce actionable and understandable information?
ChatGPT's content was evaluated for actionability and understandability, two crucial elements for widespread use by a lay audience. Understandability was high for ChatGPT's responses, lending support to its role as a lay-audience-friendly source of information. Actionability, however, was limited: 0 percent for “Basic Information” and “Myths/Misconceptions,” and 44 percent for “Resources.” Although understandability may need little further improvement, users would benefit from more actionable recommendations.
Does ChatGPT provide accurate references?
References, a previously noted area of weakness, 16 were often incorrect. Fewer than half of the hyperlinks worked at all, and fewer still took the reader to the correct website. Incorrect references frequently returned a 404 error and occasionally took viewers to unrelated websites. When links were correct, ChatGPT frequently cited relevant medical and government webpages (e.g., Centers for Disease Control and Prevention [CDC], World Health Organization [WHO]), which provided up-to-date references. The likely cause of this discrepancy is the outdated nature of ChatGPT's sources (updated through 2021). 17 The inaccuracy of the references substantially limited how actionable the responses were for the user, especially in the Resources domain, and represents a significant limitation.
What language did ChatGPT predominantly use?
Although ChatGPT used a combination of medical and neurodiversity-affirming language throughout, it primarily used “medical” language observed in the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition. 18 This may reflect the language used in the predominantly medical and psychiatric resources where ChatGPT gets its information. As recent debates about language use have emerged in the autism research community,13,19 this may be an area of growth as ChatGPT adapts to specific user preferences and societal shifts.
Did regeneration improve ChatGPT's responses?
ChatGPT added a “regenerate” response feature to capture the quickly changing learning environment and to enable human feedback to improve responses. 6 As such, we were interested in whether a regenerated response to the same question would improve ChatGPT's scores. Our results showed no significant differences between R1 and R2 on any score, indicating that regeneration did not improve responses at this time.
Conclusion
ChatGPT is a viable tool for parents/caregivers seeking information about autism. It provides responses that are clear, concise, accurate, and understandable to the public, but it is limited by low actionability and by inaccurate references and hyperlinks. Overall, ChatGPT is a valuable instrument for parents and caregivers who want to acquire information, learn more about their child's potential presentation, and counter myths and misconceptions.
Notes
a. Examples of medical language include “person with autism spectrum disorder,” “disorder,” “comorbid,” and “risk of autism.” Examples of neurodiversity-affirming language include “autistic person,” “autism,” “co-occurring,” and “elevated likelihood for autism.” For the full list, see Table 1 in Bottema-Beutel et al., 2021.
b. Intraclass correlations (ICCs) were interpreted according to Koo and Li 20 : <0.50 = poor reliability, 0.50–0.75 = moderate reliability, 0.75–0.90 = good reliability, and >0.90 = excellent reliability.
c. Kappa values were interpreted according to McHugh 21 : 0–0.20 = no to slight agreement, 0.21–0.40 = fair agreement, 0.41–0.60 = moderate agreement, 0.61–0.80 = substantial agreement, 0.81–1.00 = almost perfect agreement.
Footnotes
Authors' Contributions
T.C.M.: conceptualization, methodology, software, validation, formal analysis, investigation, data curation, writing (original draft), writing (review and editing), visualization, and project administration. S.B.: software, validation, formal analysis, investigation, data curation, and writing (review and editing). O.P.: conceptualization, methodology, and writing (review and editing). C.H.: conceptualization, methodology, writing (review and editing), resources, and supervision.
Author Disclosure Statement
No competing financial interests exist.
Funding Information
This research was supported by a training grant from the U.S. Department of Education (H325D180099; O.P.) and a training fellowship from NICHD (T32 HD040127-21; T.C.M.).
