Evaluating the Performance of ChatGPT in Urology: A Comparative Study of Knowledge Interpretation and Patient Guidance

Abstract

Background/Aim:

To evaluate the performance of Chat Generative Pre-trained Transformer (ChatGPT), a large language model trained by OpenAI, in the field of urology.

Materials and Methods:

This study used three main steps to evaluate the effectiveness of ChatGPT in the urologic field. The first step involved 35 questions prepared by our institution's experts, each of whom has at least 10 years of experience in their field. The responses of the ChatGPT versions were qualitatively compared with the responses of urology residents to the same questions. The second step assessed the reliability of the ChatGPT versions in answering current debate topics. The third step assessed the reliability of the ChatGPT versions in providing medical recommendations and directives in response to questions commonly asked by patients in the outpatient and inpatient clinics.

Results:

In the first step, version 4 provided correct answers to 25 questions out of 35 while version 3.5 provided only 19 (71.4% vs 54%). It was observed that residents in their last year of education in our clinic also provided a mean of 25 correct answers, and 4th year residents provided a mean of 19.3 correct responses. The second step involved evaluating the response of both versions to debate situations in urology, and it was found that both versions provided variable and inappropriate results. In the last step, both versions had a similar success rate in providing recommendations and guidance to patients based on expert ratings.

Conclusion:

The difference between the two versions on the 35 questions in the first step of the study was thought to be due to the improvement of ChatGPT's literature and data synthesis abilities. It may be reasonable to use ChatGPT versions to give quick and safe answers to questions from people who are not health care providers, but ChatGPT should not be used as a diagnostic tool or to choose among different treatment modalities.

Introduction

In the modern era of health care, advancements in artificial intelligence (AI) and natural language processing (NLP) have allowed for the development of chatbots that can assist in answering patient questions and providing health care information.¹

One such chatbot is Chat Generative Pre-trained Transformer (ChatGPT), a large language model trained by OpenAI. ChatGPT can provide accurate and relevant information in various medical specialties, including urology. OpenAI has developed two major iterations of the ChatGPT model: GPT-3.5 and GPT-4. The latest version, GPT-4, has increased efficiency and accuracy compared to its predecessor. Significant improvements include enhanced contextual understanding, improved language fluency, and an expanded knowledge base.² One notable difference between these versions when this study was conducted was that version 4 was not accessible by free account users.

However, the reliability and accuracy of ChatGPT's information compared to those provided by human specialists have yet to be thoroughly investigated for either version 3.5 or version 4. A previous study has shown that chatbots can be a valuable resource for patients seeking health care information. However, concerns have been raised regarding the accuracy and reliability of the information provided by these systems.³ As such, it is essential to evaluate the performance of chatbots compared to human specialists to determine their potential role in health care.

Our study compares the behavior, knowledge, and interpretation capacity of two ChatGPT versions through a three-stage survey, using as references the views and approaches of an academic institution's expert health care providers, the European Association of Urology Guidelines, and literature reviews.

Materials and Methods

Our study has three main steps to evaluate the effectiveness of ChatGPT in the urologic field. We generated 35 urology questions with our institution's experts, each of whom has at least 10 years of experience in a subspecialty such as andrology, pediatric urology, functional urology, endourology, or urooncology (Table 1).

Table 1.

35 Questions to Measure General Knowledge

1. Which is the most common postoperative complication of grafting procedures for Peyronie's disease treatment?
a. Erectile dysfunction.
b. Penile shortening.
c. Poor cosmetic result.
d. Urethral injury.
Correct Answer: A, v3.5:B, V4:B

2. What is the most cost-effective strategy for preventing urethral catheterization-related strictures?
a. Avoiding unnecessary urethral catheterizations.
b. Using larger catheter sizes (<18F) after radical prostatectomy.
c. Using noncoated latex catheters for patients undergoing cardiac surgery.
d. Performing urethral catheterization over a guidewire instead of standard Foley catheterization.
Correct Answer: A, v3.5:A, V4:A

3. Why should postvoiding residual volume be evaluated during the clinical assessment of patients with lower urinary tract symptoms?
a. If large, it is a contraindication to watchful waiting.
b. If large, it is a contraindication to medical therapy.
c. Monitoring changes over time may help identify patients at risk of acute urinary retention.
d. It allows obstruction to be distinguished from detrusor underactivity.
Correct Answer: C, v3.5:D, V4:D

4. What is the value of preoperative urodynamics in the management of urinary incontinence?
a. Grade severity of incontinence.
b. Influence the choice of treatment.
c. Predict postsurgical complications.
d. Foresee effectiveness of the treatment.
Correct Answer: B, v3.5:B, V4:B

5. A newborn boy is diagnosed with bilateral undescended testes.
In testicular dysgenesis syndrome another frequent alteration is;
a. Epispadias.
b. Hypospadias.
c. Penoscrotal web.
d. Ambiguous genitalia.
Correct Answer: B, v3.5:D, V4:B

6. Compared to TURP and OP, HOLEP has been proven to have:
a. Lower rates of stress-incontinence.
b. Similar blood loss and transfusion rates.
c. Similar urethral stricture and reintervention rates.
d. Lower potency and continence rates postoperatively.
Correct Answer: C, v3.5:A, V4:A

7. In women with SUI, what are the adverse effects and the effectiveness in curing SUI of open colposuspension compared to midurethral synthetic sling?
a. Both procedures are equally effective.
b. Pelvic organ prolapse is less common after colposuspension.
c. Voiding dysfunction occurs more often after colposuspension.
d. Morbidity and complications are higher after colposuspension.
Correct Answer: A, v3.5:D, V4:C

8. A patient with a neurologic disease that involves suprapontine lesions will most likely present with;
a. Underactive bladder.
b. Detrusor overactivity and underactive bladder.
c. Detrusor overactivity.
d. Detrusor overactivity with detrusor sphincter dyssynergia.
Correct Answer: C, v3.5:C, V4:D

9. When treating patients with UUI, the correct affirmation is;
a. Lifestyle interventions such as reduced caffeine intake may improve urinary incontinence.
b. An alternative antimuscarinic or mirabegron offers no improvement after failure of a first-line antimuscarinic treatment.
c. Bladder wall injections of botulinum toxin A are an alternative as first-line treatment.
d. Long-term antimuscarinic treatment should be used with caution in elderly patients.
Correct Answer: D, v3.5:A, V4:A

10. Appropriate antibiotic therapy for Fournier's gangrene includes:
a. Cefotaxime (2 g every 6 hours IV) plus clindamycin (600–900 mg every 8 hours IV).
b. Ciprofloxacin (400 mg every 12 hours IV).
c. Vancomycin (15 mg/kg every 12 hours).
d. Cefotaxime (2 g every 6 hours IV) plus fosfomycin (5 g every 8 hours IV).
Correct Answer: A, v3.5:A, V4:A

11. What is the most frequent complication after prostate biopsy procedure?
a. Urinary tract infection.
b. Urinary retention.
c. Hematuria.
d. Rectal bleeding.
Correct Answer: C, v3.5:C, V4:C

12. Which is the least common type of stone composition?
a. Calcium oxalate dihydrate.
b. Magnesium ammonium phosphate (Struvite).
c. Calcium phosphate dihydrate (Brushite).
d. Uric acid.
Correct Answer: C, v3.5:A, V4:C

13. In case of urosepsis with obstructive stone:
a. If the stone is <5 mm, try to remove it, then place a ureteral stent.
b. Urgently decompress the collecting system using ureteral stenting or percutaneous drainage.
c. Intravenous wide spectrum antibiotics and follow-up
d. If the stone is >10 mm, prefer a percutaneous drainage to ureteral stenting.
Correct Answer: B, v3.5:B, V4:B

14. After which procedure would a patient with uncorrected bleeding disorder likely develop perinephric hematoma?
a. Ureteroscopy.
b. Shock wave lithotripsy.
c. Ureteral stent extraction.
d. Ureteral catheterization.
Correct Answer: B, v3.5:A, V4:B

15. What is the most likely complication of vasectomy?
a. Chronic testicular pain.
b. Testicular malignancy.
c. Erectile dysfunction.
d. Lower urinary tract symptoms.
Correct Answer: A, v3.5:A, V4:A

16. Which of the following is embryologically derived from the urogenital sinus?
a. Vas deferens.
b. Epididymis.
c. Prostate.
d. Seminal vesicles.
Correct Answer: C, v3.5:C, V4:C

17. Which of the following is the strongest predictor of nonmuscle invasive bladder cancer progression according to the EORTC bladder calculator?
a. T1 stage.
b. G3 grade.
c. Number of tumors.
d. Concurrent CIS
Correct Answer: D, v3.5:D, V4:A

18. Which of the following is a predictor of relapse of stage I seminoma?
a. Tumor size >4 cm.
b. Invasion of the spermatic cord.
c. Lymphovascular invasion.
d. Beta-HCG above upper limit of normal.
Correct Answer: A, v3.5:C, V4:A

19. In patients with new testicular cancer biopsy of the contralateral testis should be performed in which clinical scenario?
a. Testicular volume 10 mL.
b. Alpha-fetoprotein >1000 ng/mL.
c. Brain metastasis.
d. History of hypospadias.
Correct Answer: A, v3.5:B, V4:A

20. Retractile testis in prepubertal boys should;
a. Best be treated by HCG.
b. Best be treated by LH-RH.
c. Best be treated by scrotal orchidopexy.
d. Not be treated, neither medically nor surgically
Correct Answer: D, v3.5:D, V4:D

21. A 5-year-old girl presents with lifelong minimal urinary incontinence; she has never been dry. The ultrasound of the abdomen is normal.
An ectopic ureter is suspected. What could be the most appropriate method for establishing the diagnosis?
a. Cystoscopy.
b. DMSA renal scintigraphy.
c. Magnetic resonance urography.
d. Contrast CT scan.
Correct Answer: C, v3.5:B, V4:C

22. Regarding treatment of angiomyolipomas, we should do the following:
a. Treat all patients with tumors larger than 4 cm.
b. Not treat females of childbearing age.
c. Treat patients in whom follow-up or access to emergency care may be inadequate.
d. Actively treat patients with acute life-threatening bleeding episodes, and follow-up those with repeated bleeding episodes without hemodynamic risk.
Correct Answer: C, v3.5:C, V4:C

23. A patient having a penile lesion on the glans penis with a severe phimosis and infection undergoes a local excision of the lesion and a circumcision.
The histology report shows a pT1a penile cancer. The patient has palpable inguinal lymph nodes. The next step is as follows:
a. Surveillance, antibiotic, and re-evaluation of the lymph nodes after 4 weeks as they can be due to the infection.
b. Radical inguinal lymphadenectomy.
c. Neoadjuvant chemotherapy followed by radical inguinal lymphadenectomy in responders.
d. 18F-FDG PET/CT scan.
Correct Answer: B, v3.5:C, V4:A

24. Which of the following is the most common histopathologic subtype of penile cancers?
a. Basaloid carcinoma.
b. Verrucous carcinoma.
c. Squamous cell carcinoma.
d. Papillary carcinoma.
Correct Answer: C, v3.5:C, V4:C

25. What is correct regarding renal oncocytoma?
a. Diagnostic accuracy of imaging modalities (CT, MRI) in renal oncocytoma is high.
b. Diagnostic accuracy of histopathology is limited.
c. The only reliable diagnostic modality is PET/CT.
d. Is a benign tumor representing 3%–7% of all solid renal tumors.
Correct Answer: D, v3.5:A, V4:D

26. After a transurethral resection of a single tumor of the lateral bladder wall, which histologically proves to be urothelial carcinoma, T1HG the next step should be:
a. Transurethral resection biopsy of the prostatic urethra.
b. Second resection of the primary tumor site after 2–6 weeks.
c. Induction course of intravesical BCG instillation series.
d. Intravesical instillation series of chemotherapy
Correct Answer: B, v3.5:B, V4:B

27. Which is the recommended PSA cutoff for biochemical recurrence after radiotherapy treatment for prostate cancer?
a. An increase of 1 ng/mL above the post-treatment PSA nadir.
b. An increase of 2 ng/mL above the post-treatment PSA nadir.
c. A PSA level >0.2 ng/mL.
d. Three consecutive PSA rises.
Correct Answer: B, v3.5:B, V4:B

28. A patient with metastatic castration-resistant prostate cancer, progresses on first-line androgen therapy (Abiraterone). You suggest starting with PARP inhibitor (Olaparib). Before treatment is commenced you should demonstrate alterations in:
a. PDL1.
b. BRCA1.
c. MSH2.
d. HOXB13
Correct Answer: B, v3.5:B, V4:B

29. In patients with tuberous sclerosis, angiomyolipomas can be reduced in size by inhibition of which of the following:
a. mTOR.
b. VEGF.
c. VEGF receptor.
d. VHL protein complex.
Correct Answer: A, v3.5:A, V4:A

30. Active treatment of angiomyolipoma is most recommended in:
a. Tumors exceeding 4 cm in diameter
b. Patients on long-term anticoagulant therapy.
c. Females of childbearing age
d. Biopsy-proven fat-free AML due to its unpredictable behavior.
Correct Answer: C, v3.5:A, V4:A

31. Which of the following is a contraindication for minimally invasive radical nephroureterectomy?
a. Multifocal tumors.
b. Old age.
c. T3 disease.
d. Obese patients.
Correct Answer: C, v3.5:A, V4:C

32. In case of a carbon dioxide embolism during a laparoscopic procedure, patient should be placed in:
a. Reverse trendelenburg combined with left lateral decubitus position.
b. Trendelenburg combined with left lateral decubitus position.
c. Reverse trendelenburg combined with right lateral decubitus position.
d. Trendelenburg combined with right lateral decubitus position.
Correct Answer: B, v3.5:B, V4:B

33. What is recommended for a 40-year-old man with a family history of prostate cancer?
a. Early PSA testing at age 45.
b. Annual follow-up with DRE and TRUS.
c. Urine-based biomarker tests.
d. Screening for BRCA2 mutations
Correct Answer: A, v3.5:A, V4:A

34. Urinary bladder cancer with metastasis of >2 cm in a single common iliac artery lymph node is:
a. N0.
b. N1.
c. N2.
d. N3.
Correct Answer: C, v3.5:C, V4:C

35. What is the prevalence of VUR in a newborn with prenatal hydronephrosis?
a. 5%–15%
b. 15%–25%
c. 25%–35%
d. 35%–45%.
Correct Answer: B, v3.5:B, V4:B

18F-FDG = [18F]-Fluorodeoxyglucose; AML = angiomyolipoma; BCG = Bacille Calmette-Guerin; BRCA1 = Breast Cancer gene 1; BRCA2 = Breast Cancer gene 2; CIS = carcinoma in situ; DMSA = Dimercapto Succinic Acid; DRE = Digital Rectal Examination; EORTC = European Organisation for Research and Treatment of Cancer; HCG = Human Chorionic Gonadotropin; HOLEP = Holmium Laser Enucleation of the Prostate; HOXB13 = Homeobox-B13; IV = intravenous; LH-RH = Luteinizing Hormone-Releasing Hormone; MSH2 = MutS homolog 2; mTOR = Mammalian Target of Rapamycin; OP = open prostatectomy; PARP = Poly-ADP Ribose Polymerase; PDL1 = Programmed Cell Death Ligand 1; PET = positron emission tomography; PSA = prostate specific antigen; SUI = stress urinary incontinence; T1HG = T1, high grade; TRUS = transrectal ultrasound; TURP = transurethral resection of prostate; UUI = urgency urinary incontinence; VEGF = vascular endothelial growth factor; VHL = Von Hippel-Lindau; VUR = vesicoureteric reflux.

Study data were collected and managed using REDCap⁴,⁵ electronic data capture tools licensed to the Urology Department of Marmara University, School of Medicine. All responses were evaluated for consistency with the 2023 European Association of Urology Guidelines and were double-checked by another academic expert. We then created an answer key, and each question was posed to ChatGPT version 3.5 and ChatGPT version 4 separately.

The answers of the two ChatGPT versions were compared using the chi-square test or Fisher's exact test. Furthermore, at this stage, the questions were presented to the residents in different years of training at our clinic, and their responses were compared with the responses of the ChatGPT versions. This comparison was used to assess how reliably the ChatGPT versions could support a medical care provider's clinical practice and decision-making with high-quality evidence and provide responses consistent with evidence-based medicine.

The next step aimed to assess the reliability of the ChatGPT versions on topics that are currently debated even among academic urologists and mentors in the field. As the debate questions do not have an absolute correct answer, the success rate and approach of ChatGPT were assessed against the experts' most common opinions. In this context, we prepared a total of 15 “Debate Questions” for the expert urologists who work at Marmara University School of Medicine and collected the answers via an online survey (Table 2). The most common opinion was defined as the option selected by more than 3/4 of the experts. If there was no agreement on the subject, that question was excluded from the analysis. The most common answers among the health care providers were compared with the answers of the ChatGPT versions. For this comparison, Fisher's exact test was used.
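The consensus rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `consensus_answer`, the `inclusive` switch, and the eight example votes per question are hypothetical and do not come from the study data. The switch exists because a threshold of "more than 3/4" and one of "3/4 or more" give different outcomes when exactly three quarters of the experts agree.

```python
from collections import Counter
from fractions import Fraction


def consensus_answer(votes, threshold=Fraction(3, 4), inclusive=True):
    """Return the option picked by a large enough share of experts, else None.

    inclusive=True treats exactly 3/4 agreement as consensus;
    inclusive=False requires strictly more than 3/4.
    """
    if not votes:
        return None
    option, count = Counter(votes).most_common(1)[0]
    share = Fraction(count, len(votes))  # exact arithmetic, no rounding issues
    reached = share >= threshold if inclusive else share > threshold
    return option if reached else None


# Hypothetical votes from eight experts on two debate questions.
q_agree = ["B", "B", "B", "B", "B", "B", "A", "D"]  # 6 of 8 chose B
q_split = ["A", "A", "A", "B", "B", "C", "C", "D"]  # no dominant option

print(consensus_answer(q_agree))                   # B
print(consensus_answer(q_agree, inclusive=False))  # None
print(consensus_answer(q_split))                   # None
```

A question for which the function returns `None` would be excluded from the analysis, mirroring the exclusion rule in the text.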

Table 2.

Debate Questions

1. Which of the following is the cutoff PSA value to perform a PSMA PET-BT scan in a prostate cancer patient who had Radical Prostatectomy 3 years ago?
a. 0.02
b. 0.2
c. 0.5
d. Nadir +0.2
v3.5:D, V4:B

2. Should left varicocelectomy surgery be performed for isolated testicular pain without infertility or testicular atrophy?
a. Yes, always.
b. No, never.
c. Yes, only for pediatric patients.
d. Only if the pain is unresponsive to analgesics.
v3.5:D, V4:B

3. Before any kind of urinary incontinence surgery, do all patients need to undergo a urodynamic evaluation?
a. Yes, all.
b. No, none.
c. Yes, excluding pure stress urinary incontinence.
d. Only after the failure of previous treatments.
v3.5:C, V4:D

4. Should an asymptomatic 8-year-old male child who has unilateral high-grade vesicoureteral reflux and no known lower urinary tract dysfunction be covered with continuous antibiotic prophylaxis?
a. Yes, always.
b. No, never.
c. Just for uncircumcised children.
d. Only if there is a cortical scar in the DMSA scan.
v3.5:D, V4:A

5. What is the ideal age range for circumcision?
a. ≤1 year
b. 2–6
c. <2 and >5
d. Circumcision should not be performed under any circumstances.
v3.5:A, V4:A

6. What is the cutoff age for PSA testing?
a. <80 years old (>10 years of life expectancy)
b. <80 years old (regardless of life expectancy)
c. No cutoff if the patient has >10 years of life expectancy.
d. No cutoff regardless of life expectancy.
v3.5:A, V4:C

7. What is the preferred treatment for a 1 cm lower pole kidney stone?
a. Follow-up
b. Flexible ureteroscopy
c. MiniPerc
d. SWL
v3.5:A, V4:B

8. To be able to say that a patient is stone-free after stone surgery, which of the following should be the size of the remaining fragments?
a. <1 mm
b. <2 mm
c. <3 mm
d. <4 mm
e. There should be no fragments.
v3.5:E, V4:B

9. What should be the most ideal imaging method for control evaluation in the 1st month after a stone surgery where the patient is regarded as stone-free?
a. CT
b. Urinary USG
c. X-ray
d. No imaging is necessary.
v3.5:B, V4:B

10. After a pelvic trauma with complete urethral avulsion, what is the optimum timing for performing a urethroplasty?
a. Immediately, in 48 hours.
b. After the first 48 hours, before 6 weeks.
c. After 6 weeks.
d. After 3 months.
v3.5:B, V4:C

11. What should be done in a 60-year-old male patient with no known comorbidities and no family history of prostate cancer who has a PI-RADS 3 transitional zone lesion in his multiparametric prostate MRI? (PSA <4)
a. Follow-up
b. Just systematic prostate biopsy
c. Systematic prostate biopsy + 2 cores from the lesion
d. Systematic prostate biopsy + at least 4 cores from the lesion
v3.5:C, V4:B

12. What is the best time to start phosphodiesterase 5 inhibitor therapy for penile rehabilitation in a patient who will undergo radical prostatectomy due to prostate cancer?
a. Just before surgery
b. Just after surgery
c. 1 month after surgery
d. Phosphodiesterase 5 inhibitors should not be given for penile rehabilitation.
v3.5:B, V4:D

13. Should varicocelectomy be performed in an azoospermic male with clinically detected varicocele?
a. Yes, always.
b. No, never.
c. Only if the partner is younger than 30 years old.
d. Only if the patient has grade 3 varicocele.
v3.5:B, V4:A

14. Is it necessary to perform a hysterectomy for the treatment of uterine prolapsus?
a. Yes, always.
b. No, never.
c. Only when oncologic concerns exist.
v3.5:B, V4:B

15. What is the most suitable first-line treatment for urge-type urinary incontinence?
a. Antimuscarinic drugs
b. Beta 3 agonists
c. PTENS
d. Conservative management (lifestyle changes, etc.)
v3.5:D, V4:D

PET-BT = Positron Emission Tomography and Computed Tomography; PI-RADS = Prostate Imaging—Reporting and Data System; PSMA = Prostate-Specific Membrane Antigen; PTENS = Parasacral Transcutaneous Electrical Nerve Stimulation; SWL = extracorporeal shockwave lithotripsy; USG = ultrasonography.

The last part of the study was prepared to assess the reliability of ChatGPT versions' recommendations and directives on subjects that were commonly asked by the patients. Those 10 questions were generated after an interview with health care professionals and patients admitted to our outpatient clinic. ChatGPT versions 3.5 and 4 were asked separately, and their answers and medical directions were noted (Supplementary Table S1).

Seven expert urologists were asked to score the responses from the ChatGPT versions to these patient questions. The experts were asked to rate the responses to individual questions subjectively. The responses were rated between 0 and 10 based on the adequacy of the information, the clarity of language, and the presence of a phrase that refers the patient to a health care professional for definitive information. The raters were also asked to note any missing or incorrect information or potentially dangerous responses. The final mean ratings of versions 3.5 and 4 were noted and compared with the Mann–Whitney U test for statistical significance.

Our three-stage survey was analyzed both quantitatively and qualitatively. Comparisons with chi-square tests and Mann–Whitney U tests were performed with IBM SPSS Statistics for Windows, Version 27.0 (IBM Corp. Released 2020. IBM SPSS Statistics for Windows, Version 27.0; IBM Corp., Armonk, NY, USA).

Results

In the first step, the overall success rates of ChatGPT version 3.5 and version 4 differed at a statistically significant level, in favor of version 4 (p = 0.022) (Table 1).

Version 4 provided correct answers to 25 questions out of 35, while version 3.5 provided only 19 (71.4% vs 54.3%). The number of correct answers from version 4 matched that of the residents in the last year of their education, whose mean was also 25. Version 3.5, in contrast, performed similarly to 4th year residents, whose mean score was 19.3.
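As a quick arithmetic check, the percentages follow directly from the reported counts of correct answers. The helper name `percent_correct` is introduced here only for illustration.

```python
def percent_correct(correct, total):
    """Share of correct answers as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


TOTAL_QUESTIONS = 35  # number of general-knowledge questions in the first step

print(percent_correct(25, TOTAL_QUESTIONS))  # version 4   -> 71.4
print(percent_correct(19, TOTAL_QUESTIONS))  # version 3.5 -> 54.3
```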

In the second step, 9 of the 15 questions received the same answer from 3/4 or more of the experts. We used those answers as references, because no true answer key exists given the different clinical approaches of medical care providers. ChatGPT v3.5 gave the same answer as the experts on 3 of those 9 reference questions (33.3%), while ChatGPT v4 did so on 1 of 9 (11.1%); the p-value was 0.567 (Table 2).
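For readers who want to see how a two-sided Fisher's exact test works on counts of this size, the following is a minimal standard-library sketch. It is not the SPSS procedure used in the study; the function name `fisher_exact_two_sided` and the table layout (rows are the two versions, columns are concordant and non-concordant answers) are assumptions made for illustration. With the counts above (3 of 9 vs 1 of 9) the sketch yields about 0.576, which is close to, though not identical with, the reported value.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def weight(x):
        # Unnormalised hypergeometric weight of the table whose top-left cell is x
        # (integers are used so that ties are compared exactly).
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Add up every possible table that is no more likely than the observed one.
    extreme = sum(
        weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed
    )
    return extreme / comb(n, col1)


# Rows: v3.5 and v4; columns: concordant and non-concordant with the experts.
p_value = fisher_exact_two_sided(3, 6, 1, 8)
print(round(p_value, 3))  # 0.576
```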

For the ratings of the responses to the patients' questions, the mean value was 7.4 ± 1.6 for version 3.5 and 8.6 ± 1.0 for version 4 out of 10 points, with no statistically significant difference in the mean response ratings between versions 3.5 and 4 (p = 0.309) (Table 3). The raters did not report any incorrect or potentially dangerous information in the responses given by either version.

Table 3.

Patient Questions and Their Ratings, ChatGPT Version 3.5 vs Version 4 in Step 3

Patient questions	V3.5, mean rating	V4, mean rating
Overall	7.4 ± 1.6	8.6 ± 1.0
Q1. Our physician prescribed desmopressin for enuresis to my child. I have heard that this drug can cause infertility. Is it true?	7.8	9
Q2. My doctor prescribed sildenafil for erectile dysfunction. It is effective as long as I take the drug, but I want a permanent cure that would not require me to take a drug. Are there such treatments?	7.8	9
Q3. I have had a penile curvature for 3 months with a dense region in the back portion of my penis. Are there any drugs or other treatments I could try?	7.4	8.2
Q4. What is the ideal age for circumcision in children?	6.4	8.2
Q5. I had an endoscopic prostate surgery for BPH several months ago. Now I have urinary incontinence, especially with coughing. Is it normal?	7.2	8.8
Q6. I have a 3 cm cyst in my kidney. My doctor said it is a simple cyst and recommended follow-up but I am concerned. May that be a kidney cancer? What should I do?	7.8	8.6
Q7. I have a uterine prolapsus. My doctor recommended surgery. Is a hysterectomy performed in every uterine prolapsus surgery?	7.8	8.8
Q8. In my routine check-up, 6 erythrocyte cells were seen in urine analysis. I have no other known health conditions and no unhealthy habits except smoking. Is it normal? What could be the cause of it? What should be done?	7	8.6
Q9. I am 64 years old. My two consecutive PSA tests resulted in 4.5 and 5. My doctor suggested an MRI scan and then a prostate biopsy but I have no complaints. Is it necessary to perform those? If my MRI scan results as benign should a biopsy still be performed?	7.4	8.4
Q10. My doctor recommended a urodynamic evaluation for urge-type urinary incontinence and diminished sensation in the bladder. What is urodynamic evaluation? How is it performed? Are there any risks or alternative methods that can give the same information?	7.8	9
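The per-question values in Table 3 can be summarized with a short script. This is only an unweighted check on the tabulated per-question means. The individual expert ratings behind the overall mean ± standard deviation are not given in the table, so small rounding differences from the overall row are to be expected. The variable names are illustrative.

```python
from statistics import mean

# Per-question mean ratings transcribed from Table 3 (Q1 to Q10).
v35 = [7.8, 7.8, 7.4, 6.4, 7.2, 7.8, 7.8, 7.0, 7.4, 7.8]
v4 = [9.0, 9.0, 8.2, 8.2, 8.8, 8.6, 8.8, 8.6, 8.4, 9.0]

mean_v35 = round(mean(v35), 2)
mean_v4 = round(mean(v4), 2)
# Number of questions on which version 4 was rated higher than version 3.5.
v4_higher = sum(b > a for a, b in zip(v35, v4))

print(mean_v35)   # 7.44
print(mean_v4)    # 8.66
print(v4_higher)  # 10
```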

Discussion

Our study showed that ChatGPT version 3.5 and version 4 were both effective in answering patients' commonly asked questions and providing basic guidance. However, because version 3.5 answered only 19 of the 35 first-step questions correctly, it cannot be considered safe for diagnostics and treatment planning. Moreover, the low concordance of both versions with expert consensus on debate topics demonstrates that chatbots currently lack clinical integrity.

The difference between the two versions of ChatGPT for the first step's 35 questions was thought to be due to the improvement of ChatGPT's literature and data synthesis abilities. Our study showed that ChatGPT version 4 could give correct responses at the level of the last-year residents, whereas version 3.5 was at the level of 4th year residents.

As the second step shows, neither ChatGPT version reached sufficient concordance with the expert opinions. The lack of a statistically significant difference between the two versions, and the numerically higher concordance of ChatGPT v3.5 compared with ChatGPT v4, are thought to reflect inconsistencies in the current literature on those topics. Its developers claim that the latest version, GPT-4, has better contextual understanding, efficiency, and accuracy than its predecessor, so some confusion about the “reference” answer is understandable. This step of the study does not grade the similarity rates as “better or worse” but reports the results as high or low concordance, because all the options of those debate questions are still being discussed in the literature, masterclasses, and congresses. The answer options of those reference questions remain contested across different communities and departments, each supported by high-quality evidence.

A recent study has demonstrated the competency of ChatGPT on medical licensing examinations.⁶

Our study also showed the competency of ChatGPT in answering general knowledge questions.

Information is now incredibly accessible and readily available in the age of AI. A chatbot is a computer program that uses AI and NLP to understand questions and automate responses, simulating human conversation. These technologies rely on machine learning and deep learning elements and are becoming an increasingly granular knowledge base of questions and responses based on user interactions. This improves their ability to predict user needs accurately and respond correctly over time.

Considering the burnout of physicians all over the world, it is essential to obtain good quality information from both literature and guidelines without taking up much time.⁷ There have been several studies comparing AI and real doctors,⁸ but for the time being, it is a better option to use AI to ease the health care services provided by professionals and to support clinical practice, for example by creating discharge summaries, triaging patients, and preparing patient information forms.⁸,⁹

Another issue is to give patients a high-level service about medical systems and refer them to the health care they need, even when the medical system is overloaded. On the other end of the scale, there are concerns about its utilization in real-world situations and ethical issues regarding patient data sharing and scanning.¹⁰

It has previously been shown that ChatGPT provided incorrect information and miscited its sources.¹¹ In this study, we demonstrated that the responses gathered from both versions of ChatGPT were not always in concordance with the responses of the experts in the area. This suggests that ChatGPT may not be a reliable source of information, especially on matters for which the evidence is not completely clear.

Also, it should be noted that AI is not always competent enough to assess patients' medical and socioeconomic conditions. It lacks information about the available therapeutic and diagnostic tools at the selected center. This lack of information may cause AI to give unrealistic or unfit responses, which in turn may cause a trust issue between patients and physicians.¹²

Our study suggests that it may be reasonable to use ChatGPT versions to give quick and safe answers to questions from people who are not health care providers, but ChatGPT should not be used as a diagnostic tool or to choose among different treatment modalities.

The patient information system may be developed even further and be much more beneficial for high-volume centers to provide high-quality patient care while saving manpower and time.

Therefore, as a free chatbot, “ChatGPT version 3.5” is competent enough to inform nonhealth care professional individuals seeking practical and fast medical directions. Regarding guidance to nonhealth care professional individuals, ChatGPT version 4 was more competent than version 3.5 in our study, although the difference was not statistically significant. But it should also be noted that version 4 was not publicly available for free use while this article was written.

It may be beneficial for health care professionals to use ChatGPT version 4 only for guideline-based epidemiologic information and classifications, as those sections are addressed more clearly than diagnostic workup, clinical management, decision-making, and follow-up protocols. The underlying reason is that the chatbots were not proficient in understanding and applying guidelines to clinical scenarios, as the results of the study's second step indicate.

Limitations

There are several limitations of this study that need to be acknowledged. First, the questions were assessed in only one language (English). Real-world patient counseling requires a native-language engine to see whether people can express themselves correctly and understand the directives given by chatbots. Another limitation is that the reference answers were based on only seven urology experts in a single center; expanding the panel to multiple centers may change the results, as mentioned before.¹³

Conclusions

This study should encourage academic institutions to investigate AI further with larger sample sizes. We believe that AI will ease health care providers' workload and may prevent physician burnout while assisting them under professional supervision. However, our results clearly demonstrate that the accuracy and security of the information obtained with generative AI models should be strictly controlled, and that decisions made during patient treatment and follow-up should never be based solely on information obtained from AI.

Footnotes

Authors' Contributions

Conception: B.Ş. Design: B.Ş. and S.Y. Supervision: T.T. and H.K.Ç. Funding: None. Materials: B.Ş. Data collection and processing: Y.E.G., K.D., and T.E.Ş. Analysis and interpretation: B.Ş. and Ç.A.Ş. Literature review: B.Ş. and Ç.A.Ş. Writer: B.Ş. Critical review: Y.T., S.Y., H.K.Ç., and T.T.

Author Disclosure Statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.

Funding Information

No funding was received before, during, or after the study from any source.

Supplementary Material

Supplementary Table S1

Abbreviations Used

References

1. Milne-Ives, de Cock, Lim, et al. The effectiveness of artificial intelligence conversational agents in health care: Systematic review. J Med Internet Res, 2020; 22(10):e20346; doi: 10.2196/20346.

2. GPT-4. Available from: https://openai.com/gpt-4 [Last accessed: September 24, 2023].

3. Why ChatGPT Should Not Be Used to Write Academic Scientific Manuscripts for Publication. Ann Fam Med, 2023; 21(6):2958; doi: 10.1370/afm.2982.

4. Harris, Taylor, Thielke, et al. Research electronic data capture (REDCap)—A metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform, 2009; 42(2):377–381.

5. Harris, Taylor, Minor, et al. The REDCap consortium: Building an international community of software partners. J Biomed Inform, 2019; 95:103208; doi: 10.1016/j.jbi.2019.103208.

6. Kung, Cheatham, Medenilla, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digit Health, 2023; 2(2):e0000198; doi: 10.1371/journal.pdig.0000198.

7. Sallam. ChatGPT utility in healthcare education, research, and practice: Systematic review on the promising perspectives and valid concerns. Healthcare (Basel), 2023; 11(6):887; doi: 10.3390/healthcare11060887.

8. Bibault, Chaix, Guillemassé, et al. A chatbot versus physicians to provide information for patients with breast cancer: Blind, randomized controlled noninferiority trial. J Med Internet Res, 2019; 21(11):e15787; doi: 10.2196/15787.

9. Earnshaw, Pedersen, Evans, et al. Improving the quality of discharge summaries through a direct feedback system. Future Healthc J, 2020; 7(2):149–154; doi: 10.7861/fhj.2019-0046.

10. Rahimi, Talebi Bezmin Abadi. ChatGPT and publication ethics. Arch Med Res, 2023; 54(3):272–274; doi: 10.1016/j.arcmed.2023.03.004.

11. Alkaissi, McFarlane. Artificial hallucinations in ChatGPT: Implications in scientific writing. Cureus, 2023; 15(2):e35179; doi: 10.7759/cureus.35179.

12. Sallam, Salim, Al-Tammemi, et al. ChatGPT output regarding compulsory vaccination and COVID-19 vaccine conspiracy: A descriptive study at the outset of a paradigm shift in online search for information. Cureus, 2023; 15(2):e35029; doi: 10.7759/cureus.35029.

13. Zhou, Wang, Li, et al. Is ChatGPT an evidence-based doctor? Eur Urol, 2023; 84(3):355–356; doi: 10.1016/j.eururo.2023.03.037.
