Abstract
Objective
ChatGPT is a popular artificial intelligence (AI) tool used to answer questions on any subject. Given ChatGPT’s popularity, it is prudent to investigate its ability to answer common patient questions in the field of hand therapy to better guide patients as they navigate the resources available to them.
Methods
This is a cross-sectional, rater-based comparison study. Four common hand therapy questions were entered into ChatGPT version 3.5. The first five answer tabs that appeared with a Google search for the same four questions were downloaded. Three certified hand therapists blindly graded ChatGPT and Google’s answers using Likert scales to assess for answer accuracy (0-6), comprehensiveness (0-3), and conciseness (0-3).
Results
ChatGPT was significantly more accurate, with an estimated marginal mean (EMM) of 5.75 (95% CI: 4.96, 6.54) compared to Google’s 3.48 (95% CI: 2.86, 4.10) (p < 0.001). ChatGPT was significantly more complete, with an EMM of 2.50 (95% CI: 2.10, 2.90) compared to Google’s 1.48 (95% CI: 1.19, 1.77) (p < 0.001). ChatGPT was significantly more concise, with an EMM of 3.00 (95% CI: 2.66, 3.34) versus 1.60 (95% CI: 1.29, 1.91) for Google (p < 0.001).
Conclusion
ChatGPT is a concise, comprehensive, and accurate alternative to a Google search for people seeking information on hand therapy. The free version of ChatGPT does not update its sourcing past 2019, and the software is known to occasionally present false information. Frequently updated academic websites should therefore remain the primary online medical resource for patients.
Introduction
Artificial intelligence (AI) programs like Chat Generative Pretrained Transformer (ChatGPT) are digital technologies that allow computers to replicate human intelligence. A user of an AI application like ChatGPT can provide the software with a question to, within seconds, receive a humanlike answer. ChatGPT version 3.5 (ChatGPTv3.5) is free to use, and the initial answer can be modified by engaging in a ‘text dialogue’ during which the user prompts the software to adjust its output. This technology has been instrumental in enhancing the business industry, the field of law, and the education system.1–3 ChatGPT writes code, translates text, and summarizes documents, along with many other functions. The technology is also used for more routine tasks applicable to individual users, like creating recipes or planning trips to new destinations. With so many potential uses, it is no surprise that ChatGPT exploded onto the market to become one of the fastest growing consumer applications in history.3,4 The role of AI software like ChatGPT in the healthcare field is more complicated given safety concerns associated with the potential for dissemination of incorrect information.
Rising healthcare costs and physician shortages, combined with commonplace access to the internet, have led many people to seek medical information online.5,6 There are a handful of websites crafted by academic institutions that provide evidence-based medical advice, but the majority of information online is of questionable quality and has the potential to undermine the patient-provider relationship and the plan of treatment or therapy.7–9 ChatGPT is similarly flawed, being known to ‘hallucinate’ and confidently present a user with false information.10 With so many people using ChatGPT, there are likely a large number of patients asking the software medical questions. Researchers have examined ChatGPT’s role as an avenue for providers to look up information in clinic.11,12 Research has also been conducted to assess ChatGPT’s ability to answer patient questions relative to medical websites, and its ability to pass medical board exams.13–16
No studies have assessed ChatGPT’s proficiency at answering common questions patients have about hand therapy. We sought to determine whether ChatGPT or the first five tabs appearing for a Google search, which account for the majority of web traffic, can better answer common questions about hand therapy.
Methods
This was a cross-sectional, rater-based comparison study. We systematically compared responses from Google and ChatGPTv3.5 based on predefined criteria without manipulating variables.17
Selecting questions and downloading answers
Four questions about hand therapy were selected for inclusion in this study through a poll and vote among all certified hand therapists at one institution.
Data collection
Likert scales were used to grade answers for accuracy, completeness, and conciseness.
Overall, for each question, the therapists graded six answers. Five of these came from Google, and one was from ChatGPTv3.5. The Google answers were always presented to the therapists in the order in which the websites appeared after the search, although the therapists were not aware of this. A six-sided die was used to randomize the position in which the ChatGPTv3.5 answer was presented (see Table S1).
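The ordering procedure described above can be illustrated with a short sketch. It assumes a simulated die in place of the physical one, and the function name, placeholder answer strings, and seed are hypothetical; it is not the authors' actual procedure or the allocation reported in Table S1.

```python
import random


def build_grading_order(google_answers, chatgpt_answer, rng):
    """Insert the ChatGPT answer at a die-roll position, keeping Google order."""
    # Simulated six-sided die: randint is inclusive on both ends (1..6).
    roll = rng.randint(1, 6)
    order = list(google_answers)  # copy so the search order is left untouched
    order.insert(roll - 1, chatgpt_answer)
    return roll, order


# Hypothetical placeholders for the five Google results of one question.
google = [f"Google result {i}" for i in range(1, 6)]
rng = random.Random(2024)  # arbitrary seed, only for reproducibility here

for question in ["Q1", "Q2", "Q3", "Q4"]:
    roll, order = build_grading_order(google, "ChatGPT answer", rng)
    print(question, "-> ChatGPT shown in position", roll)
```

Under this scheme the five Google answers keep their relative search order for every question, and only the slot occupied by the ChatGPT answer varies.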
Statistical analysis
Descriptive statistics, including means, were calculated for accuracy, completeness, and conciseness. A linear mixed model (LMM) approach was used to analyze differences between ChatGPT and Google while accounting for potential variability among graders and questions. Three separate LMMs were constructed with accuracy, completeness, and conciseness as the dependent variables. Each model included answer source (Google vs ChatGPT), question type (Q1–Q4), and their interaction as fixed effects. A random intercept for grader was included to account for repeated measures.
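For readers who prefer notation, the model structure described above can be written as follows. The symbols are our own shorthand rather than the paper's, and the normality assumptions shown are the conventional ones for an LMM, which the text does not state explicitly. Here s indexes answer source, q the question, g the grader, and a the individual answer within a source and question.

```latex
Y_{sqga} = \mu + \alpha_s + \beta_q + (\alpha\beta)_{sq} + u_g + \varepsilon_{sqga},
\qquad u_g \sim \mathcal{N}(0, \sigma_u^2),
\qquad \varepsilon_{sqga} \sim \mathcal{N}(0, \sigma^2)
```

In the formula syntax of the R package lme4, a model of this shape would be written as score ~ source * question + (1 | grader). The paper does not say which R package was used, so this is given only as an illustration.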
LMMs produced estimated marginal means (EMMs) and 95% confidence intervals (CIs), and pairwise comparisons were conducted using Holm’s adjustment to control for multiple comparisons. The significance level was set at p < 0.05.
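As a reminder of how Holm's step-down adjustment works, the following minimal sketch applies it to a set of raw p-values. The p-values are hypothetical and are not taken from this study; the authors' analysis was run in R, so this is an illustration of the method rather than their code.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p-value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03]  # hypothetical raw p-values for three comparisons
adjusted = holm_adjust(raw)  # approximately [0.03, 0.06, 0.06]
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

With these hypothetical inputs only the first comparison remains below the 0.05 threshold after adjustment, even though all three raw p-values were below it.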
To assess whether graders could correctly distinguish between ChatGPT and Google responses, the proportion of correct guesses was calculated for each source. Cohen’s kappa statistic was used to assess inter-rater agreement beyond chance.
All analyses were conducted using R (version 4.4.2).
Results
Descriptive statistics for accuracy, completeness, and conciseness scores for ChatGPT and Google responses are presented in Figure 1. Across all graded responses, ChatGPT consistently outperformed Google in all three metrics.
Figure 1. Mean scores for ChatGPT and Google on Likert scales for accuracy (a), completeness (b), and conciseness (c). Scores from the three graders are averaged for question 1 (Q1), question 2 (Q2), question 3 (Q3), and question 4 (Q4). The comparison between ChatGPT and Google is also represented by averaging scores across Q1-Q4, represented by ‘overall’.
Accuracy
The estimated marginal mean (EMM) for accuracy was 5.75 (95% CI: 4.96, 6.54) for ChatGPT and 3.48 (95% CI: 2.86, 4.10) for Google. A linear mixed model analysis indicated that ChatGPT’s responses were significantly more accurate than Google’s (p < 0.001).
Completeness
ChatGPT was significantly more complete, with an EMM of 2.50 (95% CI: 2.10, 2.90) compared to Google’s 1.48 (95% CI: 1.19, 1.77) (p < 0.001).
Conciseness
ChatGPT was significantly more concise, with an EMM of 3.00 (95% CI: 2.66, 3.34) versus 1.60 (95% CI: 1.29, 1.91) for Google (p < 0.001).
Identifying answer source
Therapists correctly identified the source of the response in 91.7% of cases for ChatGPT and 96.7% of cases for Google. The Cohen’s kappa value was 0.85, indicating a high level of agreement beyond chance.
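The arithmetic behind these figures can be sketched as follows. The counts are an assumption on our part: three graders and four questions would give 12 ChatGPT ratings and 60 Google ratings, and 11 of 12 and 58 of 60 correct guesses reproduce the reported percentages. Treating the actual source and the grader's guess as the two classifications is likewise an assumption. Under it, kappa comes out at about 0.85, but the paper does not state how the statistic was computed.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square agreement table (rows: rating A, columns: rating B)."""
    k = len(table)
    total = sum(sum(row) for row in table)
    # Observed agreement: share of cases on the diagonal.
    observed = sum(table[i][i] for i in range(k)) / total
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / total**2
    return (observed - expected) / (1 - expected)


# ASSUMED counts inferred from the reported percentages, not stated in the paper.
table = [
    [11, 1],  # actual ChatGPT: guessed ChatGPT, guessed Google
    [2, 58],  # actual Google:  guessed ChatGPT, guessed Google
]

kappa = cohens_kappa(table)
print(f"ChatGPT correctly identified: {11 / 12:.1%}")  # 91.7%
print(f"Google correctly identified: {58 / 60:.1%}")   # 96.7%
print(f"kappa = {kappa:.2f}")                          # 0.85
```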
Discussion
The internet is an easily accessible source of information for patients, with previous reports demonstrating that more than 50% of patients use the internet to investigate health concerns.18,19 Given the popularity of AI platforms such as ChatGPT, the use of AI in the field of orthopedics has garnered increasing interest in recent literature.13,20 However, to date, no studies have investigated the use of AI in the field of hand therapy.
We report that ChatGPT answers to common hand therapy questions were significantly more accurate, complete, and concise than Google answers when evaluated by certified hand therapists at our institution. These findings are similar to those previously reported in a study of ChatGPT responses to questions about common hand surgeries, which demonstrated that ChatGPT provided high-quality answers.13 However, that study reported that ChatGPT-generated answers required a college reading level to comprehend and often failed to reference the source of their material, calling their reliability into question.13 While our study did not assess the readability of ChatGPT and Google answers, we demonstrated that ChatGPT responses were significantly more concise than Google answers.
Figure 1 shows that the average conciseness scores for Google and ChatGPT were 1.60 and 3.00, respectively. An average conciseness score below 2.00, according to the scale shown in Table 2, suggests that Google answers sometimes did not answer the question at all. ChatGPT routinely answered questions concisely, achieving a perfect score on the conciseness scale, as would be expected given its design: ChatGPT is built to provide direct, well-organized answers only to what is being asked, without unnecessary details. Google instead provides, from its collection of webpages, the best matches to what is searched. Google responses therefore vary significantly in length and sometimes fail to directly answer the queried question.
Regarding accuracy and comprehensiveness, ChatGPT scored consistently higher because Google is unable to perfectly answer the question being asked. Google provides the user with webpages based on the key words used in the search, but Google itself is not responsible for the content of those webpages. Questions are therefore often only partially answered, leaving the reader with only some of the information they need, whereas ChatGPT responses are tailored to the exact question being asked. Moreover, many Google pages are biased in that they aim to sell the user something, whereas ChatGPT is not designed to advertise products, although it may occasionally generate content that resembles promotional language.
One potential approach to optimize information gathering is to use ChatGPT to obtain an initial, synthesized overview of a topic, followed by verification through traditional search engines such as Google, which provide access to multiple primary sources and allow users to evaluate information across different websites.
Previous literature has questioned the accuracy of AI-generated responses.21 While the current study demonstrates high-quality responses that were deemed both more accurate and more complete than answers generated by Google, these results do not negate the possibility that ChatGPT could mislead a patient with inaccurate information. The term ‘hallucination’ has been coined to refer to situations where ChatGPT confidently presents false information. The AI software has even been known to fabricate citations.10 Google is not without flaws either, being known to frequently present websites that contain incorrect medical information.22 Therefore, just as a provider should be hesitant to instruct a patient to ‘Google it’, unless of course they direct patients to trustworthy academic sites, we believe patients should not be told to ‘ChatGPT it’. ChatGPT will continue to be used by patients to seek information, including answers to medical questions. Our results are helpful for context, as it is important for providers to understand the resources frequently accessed by their patients in order to ensure that their patients refer to accurate sources. As mentioned previously, ChatGPT is a great tool for obtaining an overview of a topic that can be followed up with a Google search to access more in-depth information from reputable sites.
Accurate sources include websites constructed by academic institutions that are regularly updated using evidence-based information. ChatGPT version 3.5, which was used in this study, does not have access to real-time data and can only source information up to the most recent update which was 2 years prior to conducting this study. The new version of ChatGPT, version 4.0, is frequently updated but requires a paid subscription for use and is therefore less likely to be used by patients. Previous studies have shown that ChatGPT version 3.5 is worse at answering orthopedic hand questions compared to an academic website. 13 The present study compared ChatGPT with the first five Google tabs to more accurately reflect patient searching practices. An individual can select whichever website they want on Google, but research shows that the first result on Google captures 28.5% of users, and the top five results account for 70% of clicks. 23 Therefore, we sought to compare ChatGPT with the websites that patients often navigate to.
This study has several limitations. First, despite blinding efforts, the hand therapists correctly identified which answers were produced by ChatGPT and which came from Google, which introduces bias. Second, ChatGPT is a novel tool and, with further development such as updated sourcing of information for the free version, could become a more reliable resource for patients. This study did not compare the readability of information between ChatGPT and Google, which also affects the way people interpret information. The conciseness scale was developed specifically for a prior study by our group and lacks validation, thus limiting the interpretation of its results. Google results were presented in the order in which they appeared on the Google search page, introducing possible order effects. As all of our graders were certified hand therapists with experience in the field, our results cannot necessarily be extrapolated to people who are unfamiliar with medical jargon. Users can instruct ChatGPT to provide answers at a lower readability level, but not all users are familiar with this function. When selecting questions, hand therapists were likely subject to recall bias. Moreover, patients at our institution may ask different questions than patients at other locations. Importantly, patients may ask ChatGPT questions that they do not feel comfortable asking their hand therapist. In addition, our study obtained responses from Google and ChatGPT using a cleared browser with location services turned off. It is well known that Google and ChatGPT both use an individual’s search history on the respective platform to modify the output. Google also incorporates a user’s location to provide more relevant answers. It is therefore likely that users would obtain different answers than those we tested in this study, which could affect the comparison. For example, the Google output for a student in school to become a certified therapist would differ from that for someone who works as an artist.
Moreover, despite clearing history and disabling location services, the Google search return is still influenced by algorithmic variability, which could be mitigated in future studies by using multiple devices or locations. Clinical experts were selected to evaluate the accuracy and appropriateness of responses; however, future work including individuals with lived experience may provide additional insight into how patients interpret and apply this information. Furthermore, the clinical experts all came from one institution, possibly limiting the extrapolation of the findings. Finally, large language models have also been shown to reproduce societal biases present in training data, including gender and racial biases. Such biases may influence the perceived trustworthiness or usability of AI-generated medical information among individuals from equity-deserving groups and represent an important consideration when evaluating the role of AI in patient education.
Conclusion
ChatGPT version 3.5 is more accurate, comprehensive, and concise compared to a Google search when answering common questions about hand therapy. Nevertheless, version 3.5 in its current form is not qualified to replace websites built by medical societies or academic institutions and therefore should not be recommended by providers to patients as a source of medical information.
Supplemental material
Supplemental Material for ChatGPT is a concise, comprehensive, and accurate alternative to google for answering common hand therapy questions by Jack C Casey, Liangkang Wang, John D. Milner, Joseph Cusano, Mohammad Daher, Karen Carney, Susanna Gregorzek, Michael Platz, Joseph A. Gil in Hand Therapy
Footnotes
Author contributions
JCC and JAG were responsible for the conception and design of the study and data collection; LW and MD were involved in the processing and statistical analysis of data; JCC, JDM, JC, MD, KC, SG, and MP were involved in the drafting of the manuscript and data production. All authors contributed to the interpretation of the data for the work and revising it critically for important intellectual content. All authors have read and agreed to the final version of the manuscript.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
References
