Abstract
Background/Aim:
To evaluate the performance of Chat Generative Pre-trained Transformer (ChatGPT), a large language model trained by OpenAI, in the field of urology.
Materials and Methods:
This study had three main steps to evaluate the effectiveness of ChatGPT in the urologic field. The first step involved 35 questions from our institution's experts, who have at least 10 years of experience in their fields. The responses of the ChatGPT versions were qualitatively compared with the responses of urology residents to the same questions. The second step assessed the reliability of the ChatGPT versions in answering current debate topics. The third step assessed the reliability of the ChatGPT versions in providing medical recommendations and directives in response to questions commonly asked by patients in the outpatient and inpatient clinics.
Results:
In the first step, version 4 provided correct answers to 25 of 35 questions, while version 3.5 provided only 19 (71.4% vs 54.3%). Residents in their last year of education in our clinic also provided a mean of 25 correct answers, and 4th year residents provided a mean of 19.3 correct responses. In the second step, which evaluated the responses of both versions to debate situations in urology, both versions provided variable and inappropriate results. In the last step, both versions had a similar success rate in providing recommendations and guidance to patients based on expert ratings.
Conclusion:
The difference between the two versions on the 35 questions in the first step of the study was thought to be due to the improvement of ChatGPT's literature and data synthesis abilities. It may be a logical approach to use ChatGPT versions to answer the questions of people who are not health care providers quickly and safely, but they should not be used as a diagnostic tool or to choose among different treatment modalities.
Introduction
In the modern era of health care, advancements in artificial intelligence (AI) and natural language processing (NLP) have allowed for the development of chatbots that can assist in answering patient questions and providing health care information. 1
One such chatbot is Chat Generative Pre-trained Transformer (ChatGPT), a large language model trained by OpenAI. ChatGPT can provide accurate and relevant information in various medical specialties, including urology. OpenAI has developed two major iterations of the ChatGPT model: GPT-3.5 and GPT-4. The latest version, GPT-4, has increased efficiency and accuracy compared to its predecessor. Significant improvements include enhanced contextual understanding, improved language fluency, and an expanded knowledge base. 2 One notable difference between these versions when this study was conducted was that version 4 was not accessible by free account users.
However, the reliability and accuracy of ChatGPT's information compared to those provided by human specialists have yet to be thoroughly investigated for either version 3.5 or version 4. One previous study has shown that chatbots can be a valuable resource for patients seeking health care information. However, concerns have been raised regarding the accuracy and reliability of the information provided by these systems. 3 As such, it is essential to evaluate the performance of chatbots compared to human specialists to determine their potential role in health care.
Our study compares the behavior, knowledge, and interpretation capacity of two ChatGPT versions against the views and approaches of an academic institution's expert health care providers, the European Association of Urology Guidelines, and literature reviews, using a three-stage survey.
Materials and Methods
Our study has three main steps to evaluate the effectiveness of ChatGPT in the urologic field. For the first step, we compiled 35 urology questions from our institution's experts, who have at least 10 years of experience in their fields, such as andrology, pediatric urology, functional urology, endourology, and urooncology (Table 1).
Table 1. 35 Questions to Measure General Knowledge
18F-FDG = [18F]-Fluorodeoxyglucose; AML = angiomyolipoma; BCG = Bacille Calmette-Guerin; BRCA1 = Breast Cancer gene 1; BRCA2 = Breast Cancer gene 2; CIS = carcinoma in situ; DMSA = Dimercapto Succinic Acid; DRE = Digital Rectal Examination; EORTC = European Organisation for Research and Treatment of Cancer; HCG = Human Chorionic Gonadotropin; HOLEP = Holmium Laser Enucleation of the Prostate; HOXB13 = Homeobox-B13; İS = ; IV = intravenous; LH-RH = Luteinizing Hormone-Releasing Hormone; MSH2 = MutS homolog 2; mTOR = Mammalian Target of Rapamycin; NO = nitric oxide; OP = open prostatectomy; PARP = Poly-ADP Ribose Polymerase; PDL1 = Programmed Cell Death Ligand 1; PET = positron emission tomography; PSA = prostate specific antigen; SUI = stress urinary incontinence; T1HG = T1, high grade; TRUS = transrectal ultrasound; TURP = transurethral resection of prostate; UUI = urgency urinary incontinence; VEGF = vascular endothelial growth factor; VHL = Von Hippel-Lindau; VUR = vesicoureteric reflux.
Study data were collected and managed using REDCap 4,5 electronic data capture tools licensed to the Urology Department of Marmara University, School of Medicine. All responses were evaluated for consistency with the 2023 European Association of Urology Guidelines and were double-checked by another academic expert. We then created an answer key, after which each question was posed to ChatGPT version 3.5 and ChatGPT version 4 separately.
The answers of the two ChatGPT versions were compared using chi-square and Fisher's exact tests. Furthermore, at this stage, the questions were presented to residents in different years of training at our clinic, and their responses were compared with the responses of the ChatGPT versions. This comparison was used to assess how reliably the ChatGPT versions could support a medical care provider's clinical practice and decision-making with high-quality evidence and provide responses consistent with evidence-based medicine.
The next step aimed to assess the reliability of the ChatGPT versions on topics that are currently debated even among academic urologists and mentors in the field. As the debate questions do not have an absolute correct answer, the success rate and approach of ChatGPT were assessed against the experts' most common opinions. In this context, we prepared a total of 15 “Debate Questions” for the expert urologists who work at Marmara University School of Medicine and collected the answers via an online survey (Table 2). The most common opinion was defined as the option selected by more than 3/4 of the experts. If there was no agreement on the subject, that question was excluded from the analysis. The most common answers among the health care providers were compared with the ChatGPT versions' answers using Fisher's exact test.
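The consensus rule described above can be illustrated with a short sketch. It is not the authors' procedure: the function name, the option labels, and the vote counts are hypothetical, and a panel of seven experts is assumed for illustration only.

```python
from collections import Counter
from fractions import Fraction


def consensus_option(votes, threshold=Fraction(3, 4)):
    """Return the option chosen by more than `threshold` of the experts.

    Returns None when no option reaches consensus, in which case the
    question would be excluded from the analysis.
    """
    if not votes:
        return None
    option, count = Counter(votes).most_common(1)[0]
    # Exact fractions avoid floating-point surprises at the boundary.
    if Fraction(count, len(votes)) > threshold:
        return option
    return None


# Hypothetical votes from an assumed panel of seven experts.
agreed = consensus_option(["A"] * 6 + ["B"])        # 6/7 exceeds 3/4
excluded = consensus_option(["A"] * 5 + ["B"] * 2)  # 5/7 does not exceed 3/4
```

Under this assumption, six identical votes out of seven are needed to pass the threshold, so a 5 to 2 split leaves the question without a reference answer.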
Table 2. Debate Questions
PET-BT = Positron Emission Tomography and Computed Tomography; PI-RADS = Prostate Imaging—Reporting and Data System; PSMA = Prostate-Specific Membrane Antigen; PTENS = Parasacral Transcutaneous Electrical Nerve Stimulation; SWL = extracorporeal shockwave lithotripsy; USG = ultrasonography.
The last part of the study was prepared to assess the reliability of ChatGPT versions' recommendations and directives on subjects that were commonly asked by the patients. Those 10 questions were generated after an interview with health care professionals and patients admitted to our outpatient clinic. ChatGPT versions 3.5 and 4 were asked separately, and their answers and medical directions were noted (Supplementary Table S1).
Seven expert urologists were asked to score the responses from the ChatGPT versions to these patient questions. The experts were asked to rate the responses to individual questions subjectively. The responses were rated between 0 and 10 based on the adequacy of the information, the clarity of language, and the presence of a phrase that refers the patient to a health care professional for definitive information. The raters were also asked to note any missing or incorrect information or potentially dangerous responses. The final mean ratings of versions 3.5 and 4 were noted and compared with the Mann–Whitney U test for statistical significance.
Our three-stage survey was analyzed both quantitatively and qualitatively. Comparisons with chi-square tests and Mann–Whitney U tests were performed using IBM SPSS Statistics for Windows, Version 27.0 (released 2020; IBM Corp., Armonk, NY, USA).
Results
In the first step, the overall success rates of ChatGPT version 3.5 and version 4 differed at a statistically significant level, in favor of version 4 (p = 0.022) (Table 1).
Version 4 provided correct answers to 25 of 35 questions, while version 3.5 provided only 19 (71.4% vs 54.3%). The number of correct answers given by version 4 was similar to that of the residents in the last year of their education, whose mean was also 25. Version 3.5, by contrast, performed similarly to 4th year residents, whose mean score was 19.3.
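The percentages follow directly from the raw counts. A minimal sketch of the arithmetic, using only the counts stated above (the function and variable names are illustrative):

```python
def percent_correct(correct: int, total: int) -> float:
    """Share of correct answers as a percentage, rounded to one decimal."""
    return round(100 * correct / total, 1)


TOTAL_QUESTIONS = 35
correct_counts = {"ChatGPT version 4": 25, "ChatGPT version 3.5": 19}

rates = {
    name: percent_correct(count, TOTAL_QUESTIONS)
    for name, count in correct_counts.items()
}
# rates == {"ChatGPT version 4": 71.4, "ChatGPT version 3.5": 54.3}
```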
For the second step of the study, 9 of the 15 questions were answered identically by 3/4 or more of the experts. We used those answers as references, because no true answer key exists given medical care providers' different clinical approaches. ChatGPT v3.5 gave the same answer as the experts for 3 of those 9 reference questions (33.3%), while ChatGPT v4 did so for 1 of 9 (11.1%); the p-value was 0.567 (Table 2).
For the ratings of the responses to the patients' questions, the mean value for version 3.5 was 7.4 ± 1.6, while that for version 4 was 8.6 ± 1.0 out of 10 points. There was no statistically significant difference in the mean response ratings between versions 3.5 and 4 (p = 0.309) (Table 3). The raters did not report any incorrect or potentially dangerous information in the responses given by either version.
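For readers who want to see how a two-sided Fisher's exact test operates on concordance counts of this kind, the following is a minimal pure-Python sketch. It is not the authors' SPSS analysis, the function name is hypothetical, and its output is not guaranteed to match the reported p-value.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Rows: ChatGPT v3.5 and v4; columns: concordant, not concordant.
p_value = fisher_exact_two_sided(3, 6, 1, 8)
# p_value is approximately 0.576 for these counts
```

With only nine reference questions per version, a difference of two concordant answers is far too small to be distinguished from chance.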
Table 3. Patient Questions and Their Rankings, ChatGPT Version 3.5 vs Version 4 on Step-3
Discussion
Our study showed that ChatGPT version 3.5 and version 4 were both effective in informing patients about their commonly asked questions and providing basic guidance. However, as ChatGPT version 3.5's success rate was only 54% in the first step, it is not considered safe for diagnostics and treatment planning. Moreover, both versions' low concordance with expert consensus on debate topics demonstrates that chatbots currently lack clinical integrity.
The difference between the two versions of ChatGPT for the first step's 35 questions was thought to be due to the improvement of ChatGPT's literature and data synthesis abilities. Our study showed that ChatGPT version 4 could give correct responses at the level of the last-year residents, whereas version 3.5 was at the level of 4th year residents.
The concordance between the expert opinions and both ChatGPT versions was insufficient, as can be seen in the second step. The difference between the two versions was not statistically significant, and the numerically higher concordance of ChatGPT v3.5 compared with ChatGPT v4 is thought to reflect the current literature's inconsistencies on those topics. Its developers claim that the latest version, GPT-4, has increased contextual understanding, efficiency, and accuracy compared to its predecessor, so confusion about the “reference” answer is understandable. This step of the study does not rank the similarity rates as “better or worse” but describes the results as high or low concordance, because all the options of those debate questions are still being discussed in the literature, MasterClasses, and congresses. The answer options of those reference questions remain unsettled across different communities and departments, with each option supported by high-quality evidence.
A recent study has demonstrated the competency of ChatGPT on medical licensing examinations. 6
Our study also showed the competency of ChatGPT in answering general knowledge questions.
Information is now incredibly accessible and readily available in the age of AI. A chatbot is a computer program that uses AI and NLP to understand questions and automate responses, simulating human conversation. These technologies rely on machine learning and deep learning elements and build an increasingly granular knowledge base of questions and responses based on user interactions. This improves their ability to predict user needs accurately and respond correctly over time.
Considering the burnout of physicians all over the world, it is essential to obtain good quality information from both the literature and guidelines without taking up much time. 7 There have been several studies comparing AI and real doctors, 8 but for the time being, it is a better option to use AI to ease the health care services provided by professionals and to support clinical practice, for example by creating discharge summaries, triaging patients, and preparing patient information forms. 8,9
Another issue is to provide patients with high-quality guidance about the medical system and refer them to the health care they need, even when the medical system is overloaded. On the other end of the scale, there are concerns about its utilization in real-world situations and ethical issues regarding patient data sharing and scanning. 10
It has previously been shown that ChatGPT provided incorrect information and cited sources incorrectly. 11 In this study, we demonstrated that the responses gathered from both versions of ChatGPT were not always in concordance with the responses of the experts in the area. This suggests that ChatGPT may not be a reliable source of information, especially on matters that lack completely clear evidence.
Also, it should be noted that AI is not always competent enough to assess patients' medical and socioeconomic conditions. It lacks information about the available therapeutic and diagnostic tools at the selected center. This lack of information may cause AI to give unrealistic or unfit responses, which in turn may cause a trust issue between patients and physicians. 12
Our study illustrates that it may be a logical approach to use ChatGPT versions to answer the questions of people who are not health care providers quickly and safely, but they should not be used as a diagnostic tool or to choose among different treatment modalities.
The patient information system may be developed even further and be much more beneficial for high-volume centers to provide high-quality patient care while saving manpower and time.
Therefore, as a free chatbot, “ChatGPT version 3.5” is competent enough to inform individuals who are not health care professionals and are seeking practical and fast medical directions. In guiding such individuals, ChatGPT version 4 received higher ratings than version 3.5 in our study, although the difference was not statistically significant. It should also be noted that version 4 was not publicly available for free use while this article was written.
It may be beneficial for health care professionals to use ChatGPT version 4 only for guideline-based epidemiologic information and classifications, as those sections are addressed more clearly than diagnostic workup, clinical management, decision-making, and follow-up protocols. The underlying reason is that the chatbots were not proficient in understanding and applying guidelines to clinical scenarios, as the results of the study's second step show.
Limitations
There are several limitations of this study that need to be acknowledged. First, the questions were assessed in just one language (English). The real-world experience of patient counseling requires a native-language engine to see whether people can express themselves correctly and understand the directives given by chatbots. Another limitation is that the reference answers were based on only seven urology experts in a single center; expanding the panel to multiple centers may change the results, as mentioned before. 13
Conclusions
This study should encourage academic institutions to investigate AI further and expand the sample size. We believe that AI will ease health care providers' workload and may prevent burnout of physicians while assisting them under professional supervision. However, our results demonstrate that the accuracy and security of the information obtained with generative AI models should be strictly controlled, and the decisions taken during the patient treatment and follow-up process should never be based solely on the information obtained with AI.
Footnotes
Authors' Contributions
Conception: B.Ş. Design: B.Ş. and S.Y. Supervision: T.T. and H.K.Ç. Fundings: None. Materials: B.Ş. Data collection and processing: Y.E.G., K.D., and T.E.Ş. Analysis and interpretation: B.Ş. and Ç.A.Ş. Literature review: B.Ş. and Ç.A.Ş. Writer: B.Ş. Critical review: Y.T., S.Y., H.K.Ç., and T.T.
Author Disclosure Statement
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.
Funding Information
No funding was received before, during, or after the study from any source.
Supplementary Material
Supplementary Table S1
