ChatGPT is a concise, comprehensive, and accurate alternative to google for answering common hand therapy questions

Abstract

Objective

ChatGPT is a popular artificial intelligence (AI) tool used to answer questions on any subject. Given ChatGPT’s popularity, it is prudent to investigate its ability to answer common patient questions in the field of hand therapy to better guide patients as they navigate the resources available to them.

Methods

This is a cross-sectional, rater-based comparison study. Four common hand therapy questions were entered into ChatGPT version 3.5. The first five answer tabs that appeared with a Google search for the same four questions were downloaded. Three certified hand therapists blindly graded ChatGPT’s and Google’s answers using Likert scales to assess answer accuracy (1-6), comprehensiveness (1-3), and conciseness (1-3).

Results

ChatGPT was significantly more accurate, with an estimated marginal mean (EMM) of 5.75 (95% CI: 4.96, 6.54) compared to Google’s 3.48 (95% CI: 2.86, 4.10) (p < 0.001). ChatGPT was significantly more complete, with an EMM of 2.50 (95% CI: 2.10, 2.90) compared to Google’s 1.48 (95% CI: 1.19, 1.77) (p < 0.001). ChatGPT was significantly more concise, with an EMM of 3.00 (95% CI: 2.66, 3.34) versus 1.60 (95% CI: 1.29, 1.91) for Google (p < 0.001).

Conclusion

ChatGPT is a concise, comprehensive, and accurate alternative to a Google search for people seeking information on hand therapy. The free version of ChatGPT does not update its sourcing past 2019, and the software is known to occasionally present false information. Frequently updated academic websites should therefore remain the primary online medical resource for patients.

Keywords

orthopaedics, hand therapy, occupational therapy, wrist, hand

Introduction

Artificial intelligence (AI) programs like Chat Generative Pretrained Transformer (ChatGPT) are digital technologies that allow computers to replicate human intelligence. A user of an AI application like ChatGPT can provide the software with a question to, within seconds, receive a humanlike answer. ChatGPT version 3.5 (ChatGPTv3.5) is free to use, and the initial answer can be modified by engaging in a ‘text dialogue’ during which the user prompts the software to adjust its output. This technology has been instrumental in enhancing the business industry, the field of law, and the education system.^1–3 ChatGPT writes code, translates text, and summarizes documents, along with many other functions. The technology is also used for more routine tasks applicable to individual users, like creating recipes or planning trips to new destinations. With so many potential uses, it is no surprise that ChatGPT exploded onto the market to become one of the fastest growing consumer applications in history.^3,4 The role of AI software like ChatGPT in the healthcare field is more complicated given safety concerns associated with the potential for dissemination of incorrect information.

Rising healthcare costs and physician shortages, combined with commonplace access to the internet, have led many people to seek medical information online.^5,6 There are a handful of websites crafted by academic institutions that provide evidence-based medical advice, but the majority of information online is of questionable quality and has the potential to undermine the patient-provider relationship and plan of treatment or therapy.^7–9 ChatGPT is similarly flawed, being known to ‘hallucinate’ and confidently present a user with false information.¹⁰ With so many people using ChatGPT, there are likely a large number of patients asking the software medical questions. Researchers have examined ChatGPT’s role as an avenue for providers to look up information in clinic.^11,12 Research has also been conducted to assess ChatGPT’s ability to answer patient questions relative to medical websites, and its ability to pass medical board exams.^13–16

No studies have assessed ChatGPT’s proficiency at answering common questions patients have about hand therapy. We sought to determine whether ChatGPT or the first five tabs appearing for a Google search, which account for the majority of web traffic, can better answer common questions about hand therapy.

Methods

This was a cross-sectional, rater-based comparison study. We systematically compared responses from Google and ChatGPTv3.5 based on predefined criteria without manipulating variables.¹⁷

Selecting questions and downloading answers

Ten certified hand therapists at a single health care institution each individually noted the most common questions patients ask them. Four questions were listed most frequently and were selected, after a group discussion and agreement, to represent common things their patients ask them and potentially search on the internet (Table 1). These questions were then answered using Google and ChatGPT version 3.5 (ChatGPTv3.5) (OpenAI, San Francisco, CA, USA) in January of 2024. To remove bias and influence from previous searches, the browsing history and cache were cleared for both Google and ChatGPTv3.5 prior to conducting the search. No additional information beyond the question was given to either Google or ChatGPT during the search. Location services were also turned off. All searches were performed on the same device on the same network. The answer outputs from ChatGPTv3.5 and the content from the first five webpages on Google, excluding advertisements, for each search were copied and anonymized. Anonymization included removal of any content that would indicate the source of the information so that blinded grading of the answers could be performed. Non-text content from Google, including images and links, was not included. All responses were reformatted to a uniform font, size, and spacing prior to grading.

Table 1.

Four questions about hand therapy selected for inclusion in this study from a poll and vote among all certified hand therapists at one institution.

Question#	Question
Q1	When can I resume activities after carpal tunnel surgery?
Q2	When will the swelling go down after carpal tunnel surgery?
Q3	Should I squeeze a ball after carpal tunnel surgery?
Q4	When will my finger stop hurting after a finger fracture?

Data collection

Three certified hand therapists (KC, SG, MP) blindly and individually graded the answers from ChatGPTv3.5 and Google using Likert scales (Table 2). The three participating therapists were Certified Hand Therapists with clinical experience ranging from early-career to more than 30 years in hand therapy. Accuracy was assessed using a six-point Likert scale.¹² Completeness was assessed using a three-point Likert scale.¹² A simple, three-point Likert scale was used to assess the conciseness of answers¹⁷ (Table 2). The hand therapists also indicated whether they thought the answer they were grading was generated by ChatGPTv3.5 or Google. All raters participated in a brief orientation session to explain the study design and randomization method to ensure consistent application of the scales with minimized bias.

Table 2.

Likert scales used to grade answers for accuracy, completeness, and conciseness.

What is being graded	Likert scale
Accuracy	1- Completely incorrect
	2- More incorrect than correct
	3- Approximately equal correct and incorrect
	4- More correct than incorrect
	5- Nearly all correct
	6- Correct
Completeness	1- Incomplete (addresses some aspects of the question, but significant parts are missing or incomplete)
	2- Adequate (addresses all aspects of the question and provides the minimum amount of information required to be considered complete)
	3- Comprehensive (addresses all aspects of the question and provides additional information or context beyond what was expected)
Conciseness	1- Did not answer the question
	2- Answered the question, but not concisely
	3- Answered the question concisely

Overall, for each question the therapists graded six answers. Five of these came from Google, and one was from ChatGPTv3.5. The Google answers were always presented to the therapists in the order in which the websites appeared after the search, although the therapists were not aware of this. A six-sided die was rolled to randomize the position in this order at which the ChatGPTv3.5 answer was presented (see Table S1).
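The ordering procedure described above can be illustrated with a short sketch. It is not the procedure used in the study, which relied on a physical die; the function name `build_presentation_order`, the placeholder answer strings, and the seed are hypothetical and serve only to make the example reproducible.

```python
import random


def build_presentation_order(google_answers, chatgpt_answer, rng=None):
    """Return the six answers in grading order and ChatGPT's 1-based slot.

    Google answers keep their search-rank order; the ChatGPT answer is
    inserted at a slot chosen by a simulated roll of a six-sided die.
    """
    if len(google_answers) != 5:
        raise ValueError("expected exactly five Google answers")
    if rng is None:
        rng = random.Random()
    slot = rng.randint(1, 6)  # simulated die roll, inclusive on both ends
    ordered = list(google_answers)  # copy so the caller's list is untouched
    ordered.insert(slot - 1, chatgpt_answer)
    return ordered, slot


# Hypothetical placeholder answers for one question
google = [f"Google result {rank}" for rank in range(1, 6)]
order, slot = build_presentation_order(google, "ChatGPT answer", random.Random(2024))
print(slot, order)
```

Passing a seeded `random.Random` makes the ordering reproducible, which a physical die roll is not; this is a design convenience of the sketch rather than a feature of the study.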

Statistical analysis

Descriptive statistics, including means, were calculated for accuracy, completeness, and conciseness. A linear mixed model (LMM) approach was used to analyze differences between ChatGPT and Google while accounting for potential variability among graders and questions. Three separate LMMs were constructed with accuracy, completeness, and conciseness as the dependent variables. Each model included answer source (Google vs ChatGPT), question type (Q1–Q4), and their interaction as fixed effects. A random intercept for grader was included to account for repeated measures.
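For readers who prefer a formula, one way to write the model described above is given below. The notation is introduced here for illustration and is not taken from the original analysis; the normality assumptions are the usual ones for a linear mixed model and are stated as an assumption.

```latex
% y_{sqgi}: score given by grader g to the i-th answer from source s for question q
% \alpha_s: fixed effect of answer source (Google vs ChatGPT)
% \gamma_q: fixed effect of question (Q1 to Q4)
% (\alpha\gamma)_{sq}: source-by-question interaction
% u_g: random intercept for grader
\[
  y_{sqgi} = \beta_0 + \alpha_s + \gamma_q + (\alpha\gamma)_{sq} + u_g + \varepsilon_{sqgi},
  \qquad
  u_g \sim \mathcal{N}(0, \sigma_u^2),
  \quad
  \varepsilon_{sqgi} \sim \mathcal{N}(0, \sigma^2)
\]
```

The same structure would be fitted three times, once for each of accuracy, completeness, and conciseness as the dependent variable.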

LMMs produced estimated marginal means (EMMs) and 95% confidence intervals (CIs), and pairwise comparisons were conducted using Holm’s adjustment to control for multiple comparisons. The significance level was set at p < 0.05.
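Holm’s adjustment is a step-down procedure: the smallest raw p-value is multiplied by the number of comparisons, the next smallest by one fewer, and so on, with the adjusted values forced to be non-decreasing and capped at 1. The sketch below illustrates the general method only; the function name `holm_adjust` and the raw p-values are hypothetical and are not results from this study.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The multiplier shrinks from m down to 1 as rank increases
        scaled = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never falls below an earlier one
        running_max = max(running_max, scaled)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values from four pairwise comparisons
raw = [0.010, 0.040, 0.030, 0.005]
adjusted = holm_adjust(raw)
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

A comparison is then declared significant when its adjusted p-value falls below the chosen level, here 0.05 as in the text.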

To assess whether graders could correctly distinguish between ChatGPT and Google responses, the proportion of correct guesses was calculated for each source. Cohen’s kappa statistic was used to assess inter-rater agreement beyond chance.
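Cohen’s kappa compares the observed proportion of agreement with the agreement expected by chance from the marginal label frequencies. The sketch below assumes that the two label sequences are a rater’s guessed source and the true source; the text does not state which pairs of labels were compared, so this pairing, the function name `cohens_kappa`, and the example labels are all assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label in both sequences
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    # Chance agreement from the product of marginal proportions per category
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical example: six answers to one question, one Google answer misjudged
true_source = ["ChatGPT", "Google", "Google", "Google", "Google", "Google"]
guessed = ["ChatGPT", "ChatGPT", "Google", "Google", "Google", "Google"]
kappa = cohens_kappa(true_source, guessed)
print(round(kappa, 3))
```

Because kappa corrects for chance, it can be much lower than the raw proportion of correct guesses when one category dominates, as it does here with five Google answers for every ChatGPT answer.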

All analyses were conducted using R (version 4.4.2).

Results

Descriptive statistics for accuracy, completeness, and conciseness scores for ChatGPT and Google responses are presented in Figure 1. Across all graded responses, ChatGPT consistently outperformed Google in all three metrics.

Figure 1.

Mean scores for ChatGPT and Google on Likert scales for accuracy (a), completeness (b), and conciseness (c). Scores from the three graders are averaged for question 1 (Q1), question 2 (Q2), question 3 (Q3), and question 4 (Q4). The comparison between ChatGPT and Google is also represented by averaging scores across Q1-Q4, represented by ‘overall’.

Accuracy

The estimated marginal mean (EMM) for accuracy was 5.75 (95% CI: 4.96, 6.54) for ChatGPT and 3.48 (95% CI: 2.86, 4.10) for Google. A linear mixed model analysis indicated that ChatGPT’s responses were significantly more accurate than Google’s (p < 0.001).

Completeness

ChatGPT was significantly more complete, with an EMM of 2.50 (95% CI: 2.10, 2.90) compared to Google’s 1.48 (95% CI: 1.19, 1.77) (p < 0.001).

Conciseness

ChatGPT was significantly more concise, with an EMM of 3.00 (95% CI: 2.66, 3.34) versus 1.60 (95% CI: 1.29, 1.91) for Google (p < 0.001).

Identifying answer source

Therapists correctly identified the source of the response in 91.7% of cases for ChatGPT and 96.7% of cases for Google. The Cohen’s kappa value was 0.85, indicating a high level of agreement beyond chance.

Discussion

The internet is an easily accessible source of information for healthcare patients with previous reports demonstrating that greater than 50% of patients use the internet to investigate health concerns.^18,19 Given the popularity of AI platforms, such as ChatGPT, recent literature regarding the use of AI in the field of orthopedics has garnered increasing interest.^13,20 However, to date, no studies have investigated the use of AI in the field of hand therapy.

We report that ChatGPT answers to common hand therapy questions were significantly more accurate, complete, and concise than Google answers when evaluated by certified hand therapists at our institution. These findings are similar to those previously reported in a study of ChatGPT responses to questions about common hand surgeries, which demonstrated that ChatGPT provided high quality answers.¹³ However, that study reported that ChatGPT-generated answers required a college reading level to comprehend and often failed to reference the source of their material, calling their reliability into question.¹³ While our paper did not assess the readability of ChatGPT and Google answers, we demonstrated that ChatGPT responses were significantly more concise than Google answers.

Figure 1 shows that the average conciseness scores for Google and ChatGPT were 1.60 and 3.00, respectively. Because the lowest score on the scale in Table 2 is 1 (‘Did not answer the question’), an average below 2.00 implies that at least some Google answers were graded as not answering the question at all. ChatGPT routinely answered questions concisely, achieving a perfect score on the conciseness scale, as might be expected given that it is designed to provide direct, well-organized answers only to what is being asked, without unnecessary details. Google provides, from its collection of webpages, the best matches to what is searched. Google responses therefore vary significantly in length and sometimes fail to directly answer the queried question. Regarding accuracy and comprehensiveness, ChatGPT scored consistently higher because Google does not itself answer the question being asked. Google provides the user with webpages based on the keywords used in the search, but Google itself is not responsible for the content of the webpages. Therefore, questions are often only partially answered, leaving the reader with only some of the information they need, whereas ChatGPT responses are tailored to the exact question being asked. Moreover, many Google pages are biased in that they aim to sell the user something, whereas ChatGPT is not designed to advertise products, although it may occasionally generate content that resembles promotional language. One potential approach to optimize information gathering is to use ChatGPT to obtain an initial, synthesized overview of a topic, followed by verification through traditional search engines such as Google, which provide access to multiple primary sources and allow users to evaluate information across different websites.

Previous literature has questioned the accuracy of AI generated responses.²¹ While the current study demonstrates high quality responses that were deemed both more accurate and complete than answers generated by Google, these results do not negate the possibility for ChatGPT to mislead a patient with inaccurate information. The term ‘hallucination’ has been coined to refer to situations where ChatGPT confidently presents false information. The AI software has even been known to fabricate citations.¹⁰ Google is not without flaws either, being known to frequently present websites that contain incorrect medical information.²² Therefore, just as a provider should be hesitant to instruct a patient to ‘Google it’, unless of course they direct patients to trustworthy academic sites, we believe patients should not be told to ‘ChatGPT it’. ChatGPT will continue to be used by patients to seek information, including answers to medical questions. Our results are helpful for context, as it is important for providers to understand the resources frequently accessed by their patients in order to ensure that their patients refer to accurate sources. As mentioned previously, ChatGPT is a useful tool for obtaining an overview of a topic that can be followed up with a Google search to access more in-depth information from reputable sites.

Accurate sources include websites constructed by academic institutions that are regularly updated using evidence-based information. ChatGPT version 3.5, which was used in this study, does not have access to real-time data and can only source information up to its most recent update, which was 2 years prior to conducting this study. The new version of ChatGPT, version 4.0, is frequently updated but requires a paid subscription for use and is therefore less likely to be used by patients. Previous studies have shown that ChatGPT version 3.5 is worse at answering orthopedic hand questions compared to an academic website.¹³ The present study compared ChatGPT with the first five Google tabs to more accurately reflect patient searching practices. An individual can select whichever website they want on Google, but research shows that the first result on Google captures 28.5% of users, and the top five results account for 70% of clicks.²³ Therefore, we sought to compare ChatGPT with the websites that patients often navigate to.

This study has several limitations. First, despite blinding efforts, the hand therapists correctly identified in most cases which answers were produced by ChatGPT and which came from Google, which introduces a bias. Second, ChatGPT is a novel tool and, with further development such as updated sourcing of information for the free version, could become a more reliable resource for patients. This study did not compare the readability of information between ChatGPT and Google, which also affects the way people interpret information. The conciseness scale was developed specifically for a prior study by our group and lacks validation, thus limiting the interpretation of its results. Google results were presented in the order in which they appeared on the Google search page, introducing possible order effects. As all of our graders were certified hand therapists with experience in the field, our results cannot necessarily be extrapolated to people who are unfamiliar with medical jargon. Users can instruct ChatGPT to provide answers at a lower readability level, but not all users are familiar with this function. When selecting questions, hand therapists were likely subject to recall bias. Moreover, patients at our institution may ask different questions than patients at other locations. Importantly, patients may ask ChatGPT questions that they do not feel comfortable asking their hand therapist. In addition, our study obtained responses from Google and ChatGPT using a cleared browser with location services turned off. It is well known that Google and ChatGPT both use an individual’s search history on the respective platform to modify the output. Google also incorporates a user’s location to provide more relevant answers. It is therefore likely that users would obtain different answers than those we tested in this study, which could affect the comparison. For example, the Google output for a student in school to become a certified therapist would differ from that for someone who works as an artist.

Moreover, despite clearing the history and disabling location services, the Google search return is still influenced by algorithmic variability, which could be mitigated in future studies by using multiple devices or locations. Clinical experts were selected to evaluate the accuracy and appropriateness of responses; however, future work including individuals with lived experience may provide additional insight into how patients interpret and apply this information. Furthermore, the clinical experts all came from one institution, possibly limiting the extrapolation of the findings. Finally, large language models have also been shown to reproduce societal biases present in training data, including gender and racial biases. Such biases may influence the perceived trustworthiness or usability of AI-generated medical information among individuals from equity-deserving groups and represent an important consideration when evaluating the role of AI in patient education.

Conclusion

ChatGPT version 3.5 is more accurate, comprehensive, and concise compared to a Google search when answering common questions about hand therapy. Nevertheless, version 3.5 in its current form is not qualified to replace websites built by medical societies or academic institutions and therefore should not be recommended by providers to patients as a source of medical information.

Supplemental material

Supplemental Material - ChatGPT is a concise, comprehensive, and accurate alternative to google for answering common hand therapy questions

Supplemental Material for ChatGPT is a concise, comprehensive, and accurate alternative to google for answering common hand therapy questions by Jack C Casey, Liangkang Wang, John D. Milner, Joseph Cusano, Mohammad Daher, Karen Carney, Susanna Gregorzek, Michael Platz, Joseph A. Gil in Hand Therapy

Footnotes

ORCID iD

Jack C Casey

Author contributions

JCC and JAG were responsible for the conception and design of the study and data collection; LW and MD were involved in the processing and statistical analysis of data; JCC, JDM, JC, MD, KC, SG, and MP were involved in the drafting of the manuscript and data production. All authors contributed to the interpretation of the data for the work and revising it critically for important intellectual content. All authors have read and agreed to the final version of the manuscript.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental material

Supplemental material for this article is available online.

References

1. Dreyer. ChatGPT for lawyers: embracing AI to enhance your legal practice, 2023.
2. Ghorashi, Ismail, Ghosh, et al. AI-Powered chatbots in medical education: potential applications and implications. Cureus 2023; 15: e43271. https://doi.org/10.7759/cureus.43271
3. Nazir, Wang. A comprehensive survey of ChatGPT: advancements, applications, prospects, and challenges. Meta Radiol 2023; 1: 100022. https://doi.org/10.1016/j.metrad.2023.100022
4. ChatGPT sets record for fastest-growing user base - analyst note. Reuters, February 2, 2023.
5. McCarthy, Scott, Courtney, et al. What did you google? Describing online health information search patterns of ED patients and their relationship with final diagnoses. West J Emerg Med 2017; 18: 928–936. https://doi.org/10.5811/westjem.2017.5.34108
6. Nolke, Mensing, Kramer, et al. Sociodemographic and health-(care-)related characteristics of online health information seekers: a cross-sectional German study. BMC Public Health 2015; 15: 31. https://doi.org/10.1186/s12889-015-1423-0
7. Luo, Qin, Yuan, et al. The effect of online health information seeking on physician-patient relationships: systematic review. J Med Internet Res 2022; 24: e23354. https://doi.org/10.2196/23354
8. Ren, Deng, Hong, et al. Health information in the digital age: an empirical study of the perceived benefits and costs of seeking and using health information from online sources. Health Inf Libr J 2019; 36: 153–167. https://doi.org/10.1111/hir.12250
9. Tan, Goonawardene. Internet health information seeking and the patient-physician relationship: a systematic review. J Med Internet Res 2017; 19: e9. https://doi.org/10.2196/jmir.5729
10. Bhattacharyya, Miller, Bhattacharyya, et al. High rates of fabricated and inaccurate references in ChatGPT-Generated medical content. Cureus 2023; 15: e39238. https://doi.org/10.7759/cureus.39238
11. Daher, Koa, Boufadel, et al. Breaking barriers: can ChatGPT compete with a shoulder and elbow specialist in diagnosis and management? JSES Int 2023; 7: 2534–2541. https://doi.org/10.1016/j.jseint.2023.07.018
12. Goodman, Patrinely, Stone Jr, et al. Accuracy and reliability of chatbot responses to physician questions. JAMA Netw Open 2023; 6: e2336483. https://doi.org/10.1001/jamanetworkopen.2023.36483
13. Crook, Park, Hurley, et al. Evaluation of online artificial intelligence-generated information on common hand procedures. J Hand Surg Am 2023; 48: 1122–1127. https://doi.org/10.1016/j.jhsa.2023.08.003
14. Kung, Marshall, Gauthier, et al. Evaluating ChatGPT performance on the orthopaedic In-Training examination. JB JS Open Access 2023; 8: e23. https://doi.org/10.2106/JBJS.OA.23.00056
15. Lum. Can artificial intelligence pass the American board of orthopaedic surgery examination? Orthopaedic residents versus ChatGPT. Clin Orthop Relat Res 2023; 481: 1623–1630. https://doi.org/10.1097/CORR.0000000000002704
16. Massey, Montgomery, Zhang. Comparison of ChatGPT-3.5, ChatGPT-4, and orthopaedic resident performance on orthopaedic assessment examinations. J Am Acad Orthop Surg 2023; 31: 1173–1179. https://doi.org/10.5435/JAAOS-D-23-00396
17. Casey, Dworkin, Winschel, et al. ChatGPT: a concise google alternative for people seeking accurate and comprehensive carpal tunnel syndrome information. Hand Surg Rehabil 2024; 43: 101757. https://doi.org/10.1016/j.hansur.2024.101757
18. Fox. The social life of health information. Pew Research Center.
19. Fraval, Ming Chong, Holcdorf, et al. Internet use by orthopaedic outpatients - current trends and practices. Australas Med J 2012; 5: 633–638. https://doi.org/10.4066/AMJ.2012.1530
20. Giorgino, Alessandri-Bonetti, Luca, et al. ChatGPT in orthopedics: a narrative review exploring the potential of artificial intelligence in orthopedic practice. Front Surg 2023; 10: 1284015. https://doi.org/10.3389/fsurg.2023.1284015
21. McGowan, Gui, Dobbs, et al. ChatGPT and bard exhibit spontaneous citation fabrication during psychiatry literature search. Psychiatry Res 2023; 326: 115334. https://doi.org/10.1016/j.psychres.2023.115334
22. Chung, Oden, Joyner, et al. Safe infant sleep recommendations on the internet: let's google it. J Pediatr 2012; 161: 1080–1084. https://doi.org/10.1016/j.jpeds.2012.06.004
23. Steffens, Koob. [Diagnosis and therapy of tendovaginitis of the extensor carpi ulnaris (stenosis of the 6th extensor compartment)]. Z Orthop Ihre Grenzgeb 1994; 132: 437–440. https://doi.org/10.1055/s-2008-1039850
