Evaluation of the Current Status of Artificial Intelligence for Endourology Patient Education: A Blind Comparison of ChatGPT and Google Bard Against Traditional Information Resources

Abstract

Introduction:

Artificial intelligence (AI) platforms such as ChatGPT and Bard are increasingly utilized to answer patient health care questions. We present the first study to blindly evaluate AI-generated responses to common endourology patient questions against official patient education materials.

Methods:

Thirty-two questions and answers spanning kidney stones, ureteral stents, benign prostatic hyperplasia (BPH), and upper tract urothelial carcinoma were extracted from official Urology Care Foundation (UCF) patient education documents. The same questions were input into ChatGPT 4.0 and Bard, limiting responses to within ±10% of the word count of the corresponding UCF response to ensure fair comparison. Six endourologists blindly evaluated responses from each platform using Likert scales for accuracy, clarity, comprehensiveness, and patient utility. Reviewers identified which response they believed was not AI generated. Finally, Flesch–Kincaid Reading Grade Level formulas assessed the readability of each platform response. Ratings were compared using analysis of variance (ANOVA) and chi-square tests.

Results:

ChatGPT responses were rated the highest across all categories, including accuracy, comprehensiveness, clarity, and patient utility, while UCF answers were consistently scored the lowest, all p < 0.01. A subanalysis revealed that this trend was consistent across question categories (i.e., kidney stones, BPH, etc.). However, AI-generated responses were more likely to be classified at an advanced reading level, while UCF responses showed improved readability (college or higher reading level: ChatGPT = 100%, Bard = 66%, and UCF = 19%), p < 0.001. When asked to identify which answer was not AI generated, 54.2% of responses indicated ChatGPT, 26.6% indicated Bard, and only 19.3% correctly identified it as the UCF response.

Conclusions:

In a blind evaluation, AI-generated responses from ChatGPT and Bard surpassed the quality of official patient education materials in endourology, suggesting that current AI platforms are already a reliable resource for basic urologic care information. AI-generated responses do, however, tend to require a higher reading level, which may limit their applicability to a broader audience.

Introduction

The recent surge in popularity of artificial intelligence (AI) chatbots, notably OpenAI's ChatGPT and Google's Bard, marks a transformative era in various fields, including medicine. These AI platforms have specifically garnered interest in the urology community for their ability to streamline administrative tasks and support clinical decision-making, among other capabilities.^1–3 Of particular interest, however, is the potential role for AI chatbots in urologic patient education.

It is important to evaluate the quality of health information provided by these platforms as patients are increasingly turning to the internet for health guidance. This is especially true for sensitive topics, many of which are urological.^2,4 Furthermore, online information available for prevalent urologic conditions such as benign prostatic hyperplasia (BPH) and erectile dysfunction has been found to be of poor quality.^4–6

Since patients are increasingly using AI chatbots for health information, it is important to critically evaluate their responses as misinformation could lead to inaccurate self-diagnosis or to delay in seeking necessary medical attention.^7–10 For these reasons, we sought to evaluate the quality of AI-generated responses in endourology, which encompasses some of the most common urologic conditions such as kidney stones and BPH.

Several recent studies have evaluated the appropriateness of ChatGPT's responses to urologic patient questions, while no such studies exist for Bard.^11–17 These studies have yielded mixed results, with ChatGPT providing accurate information for some queries, but not others. While informative, a major drawback to these studies is that most evaluate ChatGPT without comparison with a validated benchmark. This is a significant limitation as without a benchmark, it is difficult to discern ChatGPT's usefulness given the availability of other, potentially more useful, patient education resources.

Given these limitations, our objective was to blindly compare the quality of AI-generated responses against established questions and answers found in patient education documents issued by the Urology Care Foundation (UCF), the official patient-facing organization of the American Urological Association (AUA).¹⁸

We also sought to provide the first evaluation of Bard's ability to answer urology patient questions. We believe that including Bard is essential as newer platforms are rapidly entering the chatbot landscape and it remains unclear which AI platform will emerge as the most utilized in the future.

Together, we provide the first study to blindly compare AI-generated responses to common endourology patient questions from ChatGPT and Bard against traditional information sources.

Methods

Question identification

The UCF website was queried for endourology-focused patient education resources. All question-and-answer documents related to kidney stones, ureteral stents, BPH, and upper tract urothelial carcinoma (UTUC) were identified. Questions and answers were extracted verbatim, but images were excluded. Duplicate questions were only included once. Questions were classified as related to symptoms, etiology, diagnosis, treatment, prevention, or general/other common concerns.

A total of 32 questions and answers from UCF, spanning topics on kidney stones (n = 12), ureteral stents (n = 7), BPH (n = 7), and UTUC (n = 6), were included in the study (Table 1).

Table 1.

Common Endourology Questions Extracted from Urology Care Foundation Patient Education Materials

Kidney stones

1. What are kidney stones?

2. What are the symptoms of kidney stones?

3. What are the different types of kidney stones?

4. What causes kidney stones?

5. How are kidney stones diagnosed?

6. How are kidney stones treated?

7. What are the different types of surgeries available to remove a kidney stone?

8. What diet tips may prevent kidney stones?

9. What drugs may prevent kidney stones?

10. What is a staghorn kidney stone?

11. Will my children get kidney stones?

12. Can kidney stones damage my kidneys?

Ureteral stents

1. What is a ureteral stent?

2. What can I expect with a ureteral stent?

3. How do I handle ureteral stent symptoms?

4. When should I contact my doctor about my ureteral stent?

5. Will having a ureteral stent change my daily routine?

6. How is a ureteral stent removed?

7. What happens after a ureteral stent is removed?

Benign prostatic hyperplasia

1. What is BPH?

2. Who is at risk for BPH?

3. What are the symptoms of BPH?

4. How is BPH diagnosed?

5. How is BPH treated?

6. How can you prevent recurrence of BPH?

7. How can you prevent BPH?

Upper tract urothelial carcinoma

1. What is UTUC?

2. What causes UTUC and who is at risk?

3. What are the symptoms of UTUC?

4. What are the types of UTUCs?

5. How is UTUC diagnosed?

6. How is UTUC treated?

BPH = benign prostatic hyperplasia; UTUC = upper tract urothelial carcinoma.

Generation of AI responses

The UCF questions were entered into ChatGPT 4.0 and Bard on October 10, 2023, and responses were recorded. A new session was created for each query to prevent prior responses from influencing new output. For each question, the word count of the verbatim UCF response was calculated. Both platforms were instructed to limit the generated response to within ±10% of the word count of the corresponding UCF response to ensure a fair comparison between UCF and the AI chatbots.
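The ±10% length constraint can be checked programmatically. The following is a minimal sketch, assuming that words are counted as whitespace-separated tokens (the counting rule used in the study is not stated); the function names and sample texts are hypothetical.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens (an assumed counting rule)."""
    return len(text.split())


def within_tolerance(reference: str, candidate: str, tolerance: float = 0.10) -> bool:
    """Return True if the candidate's word count is within +/- tolerance of the reference's."""
    ref_words = word_count(reference)
    lower = ref_words * (1 - tolerance)
    upper = ref_words * (1 + tolerance)
    return lower <= word_count(candidate) <= upper


# Hypothetical texts for illustration only (not taken from the study materials).
ucf_answer = " ".join(["word"] * 100)
print(within_tolerance(ucf_answer, " ".join(["word"] * 108)))  # True
print(within_tolerance(ucf_answer, " ".join(["word"] * 115)))  # False
```

A check of this kind could be run on each recorded chatbot response before it enters the reviewer survey.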

Response evaluation

Six endourology fellowship-trained urologists (M.G., W.A., J.A.K., A.J.Y., R.K., and K.G.) evaluated responses from each of the three modalities through a survey that blinded the reviewer to the response source (Supplementary Fig. S1). The response order was randomized for each question. All reviewers indicated that they had not previously viewed the relevant UCF documents. Quality was assessed using 5-point Likert scales for accuracy, comprehensiveness, and clarity, similar to prior studies (Fig. 1).²
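The blinding and per-question randomization described above could be implemented along the following lines. This is an illustrative sketch only: the survey software used in the study is not specified, and the labels A/B/C, the seed, and the placeholder texts are assumptions.

```python
import random

SOURCES = ["UCF", "ChatGPT", "Bard"]


def blind_responses(responses: dict, rng: random.Random):
    """Shuffle the three responses and relabel them A, B, C.

    Returns (blinded, key): `blinded` maps label -> text shown to the reviewer,
    `key` maps label -> true source and is withheld from reviewers.
    """
    order = list(responses)
    rng.shuffle(order)
    labels = ["A", "B", "C"]
    blinded = {label: responses[source] for label, source in zip(labels, order)}
    key = dict(zip(labels, order))
    return blinded, key


rng = random.Random(2023)  # fixed seed (hypothetical) so the answer key can be regenerated
example = {source: f"<{source} answer text>" for source in SOURCES}
blinded, key = blind_responses(example, rng)
print(sorted(key.values()) == sorted(SOURCES))  # True
```

Calling `blind_responses` once per question with the same `rng` gives each question its own order while keeping the whole survey reproducible.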

FIG. 1.

Evaluation instruments for responses to common questions in endourology.

Additionally, evaluators rated patient utility using the Global Quality Scale (GQS), which assigns a score from 1 to 5.^13,19 Treatment-related questions were further evaluated using Section 2 of the DISCERN instrument, which provides a score between 5 and 35 based on a series of questions related to the quality of information on treatment choices.² Reviewers also identified which of the three responses they believed was not AI generated.

To assess readability, validated measures, including the Flesch–Kincaid Reading Ease Score (FKRE) and Flesch–Kincaid Grade Level (FKGL), were calculated for each response.²⁰ The FKRE calculates a reading score from 0 to 100, where higher scores indicate text that is easier to read, while the FKGL assigns a reading level from “5th grade” to “College Graduate.”
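For reference, both measures are simple functions of average sentence length and average syllables per word. The sketch below uses the standard published Flesch and Flesch–Kincaid coefficients; the article does not state which software or syllable-counting rule was used, so the counts are taken as inputs and the example values are hypothetical.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher values indicate text that is easier to read."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade required."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for a 120-word passage (not data from the study).
print(round(flesch_reading_ease(120, 8, 180), 1))   # 64.7
print(round(flesch_kincaid_grade(120, 8, 180), 1))  # 8.0
```

Because both formulas penalize long sentences and polysyllabic words, a response dense in medical terminology will score as harder to read even when its content is accurate.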

Statistical analysis

Quality and readability measures were compared using t-tests or one-way analysis of variance (ANOVA) for continuous variables and chi-square tests for categorical variables. Multivariate logistic regression was performed to assess whether ChatGPT or Bard was a predictor of high ratings (score ≥4) of each quality outcome when compared with UCF after controlling for the question category and type. Intraclass correlation coefficient (ICC) values were calculated to assess the degree of inter-rater reliability for ratings of accuracy, comprehensiveness, clarity, patient utility, and DISCERN score.
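To illustrate the one-way ANOVA used to compare the three response sources, the F statistic can be computed from the between-group and within-group sums of squares. This is a sketch with hypothetical ratings, not the study data or the authors' statistical software; a p-value would additionally require the F distribution, which is omitted here.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual ratings around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical Likert ratings for three sources (not the study data).
ratings = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat, df_b, df_w = one_way_anova_f(ratings)
print(f_stat, df_b, df_w)  # 3.0 2 6
```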

This study was exempt from institutional review board evaluation as it was deemed nonhuman subjects research.

Results

Each of the six reviewers evaluated 96 responses (32 questions answered by each of the three sources), generating 576 unique assessments across UCF, ChatGPT, and Bard. On average, ChatGPT responses scored the highest across all quality assessments, while UCF scored the lowest in each category (Fig. 2). For accuracy, mean ratings of UCF, ChatGPT, and Bard were 3.48, 3.79, and 3.58, respectively. These differences were significant between ChatGPT and UCF (p < 0.001) as well as ChatGPT and Bard (p = 0.014), but not between UCF and Bard.

FIG. 2.

Mean accuracy, comprehensiveness, clarity, and patient utility scores by response source.

For comprehensiveness, mean ratings of UCF, ChatGPT, and Bard were 3.21, 3.69, and 3.43, respectively, and each group was significantly different from each other, p < 0.05. For clarity, mean ratings of UCF, ChatGPT, and Bard were 3.53, 3.79, and 3.67, respectively. ChatGPT responses were significantly different than UCF responses (p < 0.001), while other comparisons were not significant.

For patient utility, mean ratings of UCF, ChatGPT, and Bard were 3.31, 3.70, and 3.45, respectively. The differences between ChatGPT and UCF/Bard were significant, both p < 0.01, while the difference between UCF and Bard was not significant.

Upon subanalysis, these trends in quality remained consistent across questions related to kidney stones, ureteral stents, BPH, and UTUC, in which ChatGPT responses were always rated highest and UCF responses were rated the lowest (Table 2). Treatment-related responses generated by ChatGPT received higher total DISCERN scores than UCF or Bard, but these differences were not significant, except for questions related to UTUC. Furthermore, inter-rater reliability was generally high for quality ratings across all questions, with the median ICC value being 0.845 (interquartile range [IQR]: 0.772–0.920).

Table 2.

Response Quality Outcomes for Urology Care Foundation, ChatGPT, and Bard

	Urology Care Foundation	ChatGPT	Bard	p
Mean accuracy (STD)
All questions	3.48 (±0.81)	3.79 (±0.81)	3.58 (±0.84)	<0.001^*
Kidney stone questions	3.29 (±0.85)	3.63 (±0.91)	3.43 (±0.84)	0.070
Ureteral stent questions	3.52 (±0.83)	3.64 (±0.69)	3.57 (±0.86)	0.790
BPH questions	3.57 (±0.86)	3.98 (±0.71)	3.67 (±0.87)	0.065
UTUC questions	3.69 (±0.58)	4.06 (±0.71)	3.78 (±0.76)	0.071
Mean comprehensiveness (STD)
All questions	3.21 (±0.93)	3.69 (±0.87)	3.43 (±0.89)	<0.001^*
Kidney stone questions	2.97 (±0.90)	3.47 (±0.95)	3.15 (±0.85)	0.004^*
Ureteral stent questions	3.21 (±0.92)	3.57 (±0.83)	3.40 (±0.80)	0.188
BPH questions	3.40 (±0.88)	3.86 (±0.78)	3.71 (±0.91)	0.053
UTUC questions	3.47 (±0.81)	4.08 (±0.65)	3.67 (±0.89)	0.005^*
Mean clarity (STD)
All questions	3.53 (±0.81)	3.79 (±0.69)	3.67 (±0.82)	0.006^*
Kidney stone questions	3.44 (±0.85)	3.69 (±0.71)	3.61 (±0.79)	0.155
Ureteral stent questions	3.57 (±0.83)	3.71 (±0.64)	3.67 (±0.87)	0.699
BPH questions	3.50 (±0.83)	3.83 (±0.66)	3.69 (±0.84)	0.151
UTUC questions	3.69 (±0.67)	4.00 (±0.72)	3.78 (±0.79)	0.190
Mean patient utility (STD)
All questions	3.31 (±0.87)	3.70 (±0.84)	3.45 (±0.91)	<0.001^*
Kidney stone questions	3.08 (±0.96)	3.49 (±0.92)	3.14 (±0.91)	0.020^*
Ureteral stent questions	3.43 (±0.86)	3.62 (±0.80)	3.52 (±0.83)	0.576
BPH questions	3.48 (±0.83)	3.86 (±0.75)	3.71 (±0.81)	0.091
UTUC questions	3.42 (±0.69)	4.03 (±0.70)	3.67 (±0.95)	0.006^*
Mean DISCERN total (STD)
All questions	16.83 (±3.38)	19.00 (±3.85)	16.83 (±4.18)	0.155
Kidney stone questions	15.67 (±3.50)	16.33 (±3.77)	15.50 (±3.39)	0.913
BPH questions	17.33 (±4.13)	20.17 (±3.60)	19.50 (±4.64)	0.482
UTUC questions	17.50 (±2.67)	20.50 (±3.21)	15.50 (±3.67)	0.049^*

p-Values derived from one-way ANOVA. Accuracy, comprehensiveness, clarity, and patient utility scores ranged from 1 to 5; Section 2 DISCERN scores have a possible range from 5 to 35. DISCERN scores not collected for the ureteral stent category. Significance set to p < 0.05 (shown in bold and ^* for reference).

ANOVA = analysis of variance; GQS = Global Quality Scale; STD = standard deviation.

On multivariate analysis, a ChatGPT-generated response was predictive of scores ≥4 for accuracy (odds ratio [OR] = 2.381), comprehensiveness (OR = 3.844), clarity (OR = 2.129), and patient utility (OR = 2.900) when compared with UCF, all p < 0.001 (Table 3). Compared with UCF, a Bard-generated response was predictive of comprehensiveness (OR = 1.937) and patient utility (OR = 1.668), both p < 0.05, but not accuracy or clarity.

Table 3.

Multivariate Logistic Regression Analysis for High Ratings of Quality Outcomes

Factor	Accuracy ≥4			Comprehensiveness ≥4			Clarity ≥4			Patient utility (GQS) ≥4
	OR	95% CI	p	OR	95% CI	p	OR	95% CI	p	OR	95% CI	p
Response source
UCF	Ref	—	—	Ref	—	—	Ref	—	—	Ref	—	—
ChatGPT	2.381	1.535–3.692	<0.001^*	3.844	2.477–5.965	<0.001^*	2.129	1.371–3.304	<0.001^*	2.900	1.881–4.471	<0.001^*
Bard	1.467	0.966–2.228	0.072	1.937	1.269–2.956	0.002^*	1.408	0.925–2.143	0.110	1.668	1.101–2.529	0.016^*
Question category
Kidney stones	Ref	—	—	Ref	—	—	Ref	—	—	Ref	—	—
Ureteral stents	1.001	0.602–1.665	0.997	1.341	0.810–2.220	0.254	0.763	0.453–1.286	0.310	1.326	0.803–2.191	0.271
BPH	1.640	1.007–2.670	0.047^*	2.840	1.727–4.669	<0.001^*	0.930	0.572–1.510	0.768	2.529	1.541–4.152	<0.001^*
UTUC	2.379	1.375–4.131	0.002^*	3.042	1.798–5.145	<0.001^*	1.335	0.780–2.285	0.292	2.191	1.309–3.668	0.003^*
Question type
General/other	Ref	—	—	Ref	—	—	Ref	—	—	Ref	—	—
Symptoms	0.995	0.551–1.797	0.986	0.824	0.466–1.457	0.505	1.318	0.712–2.440	0.379	1.356	0.752–2.446	0.311
Etiology	0.774	0.390–1.539	0.466	0.367	0.184–0.729	0.004^*	0.796	0.402–1.574	0.512	0.377	0.193–0.735	0.004^*
Diagnosis	0.928	0.460–1.871	0.835	1.335	0.667–2.674	0.414	1.041	0.514–2.111	0.911	1.103	0.557–2.185	0.779
Treatment	0.717	0.393–1.311	0.280	1.229	0.670–2.255	0.505	0.596	0.329–1.080	0.088	0.869	0.480–1.573	0.643
Prevention	0.545	0.295–1.005	0.052	0.619	0.331–1.158	0.133	0.579	0.313–1.070	0.081	0.514	0.277–0.955	0.035^*

A high rating was defined as a score of 4 or 5 on the 5-point Likert scales of accuracy, comprehensiveness, clarity, and patient utility. Significance set to p < 0.05 (shown in bold and ^* for reference).

CI = confidence interval; OR = odds ratio; Ref = reference; UCF = Urology Care Foundation.

When asked to identify which of the three responses to a particular question was not AI generated, 54.2% of responses indicated ChatGPT, 26.6% indicated Bard, and only 19.3% correctly identified it as the UCF response (Fig. 3). For readability, mean FKRE scores were the highest for UCF and lowest for ChatGPT (UCF = 63.60 ± 10.97, ChatGPT = 25.69 ± 9.66, and Bard = 49.83 ± 15.68), p < 0.001.

FIG. 3.

Frequency in which UCF, ChatGPT, or Bard responses were selected when evaluators were asked: “Which response do you think was not AI generated?” AI = artificial intelligence; UCF = Urology Care Foundation.

AI-generated responses were also more likely to be classified at an advanced reading level, while UCF responses showed improved readability (college or higher reading level: ChatGPT = 100%, Bard = 66%, and UCF = 19%), p < 0.001 (Fig. 4).

FIG. 4.

Flesch–Kincaid Reading Grade Level ratings by response source.

Discussion

As AI platforms become more mainstream, it is likely that these chatbots will be increasingly used to answer patient health care questions. We blindly evaluated AI-generated responses to endourology patient questions against UCF materials and found that ChatGPT, and to a lesser extent Bard, surpassed the quality of official patient education information. Moreover, our results suggest that readers were unable to distinguish between AI-generated and physician-generated output. However, we also found that AI-generated responses were more likely to be classified at an advanced reading level than UCF materials.

A key finding in our study was that ChatGPT outperformed UCF and Bard on all quality metrics. It is difficult to contextualize these results as there are few similar endourology-related studies—none of which compare ChatGPT with an established benchmark. Among these studies, results are varied. One study found that ChatGPT elicited misinformation related to BPH, while in a different analysis, ChatGPT correctly answered 95% of urolithiasis-related questions.^12,16 While these studies are informative, the utility of AI compared with existing patient education resources is unclear without a benchmark.

Of note, one study adopted a methodology similar to ours and reported no significant differences in accuracy or understandability between ChatGPT 3.5 responses and UCF materials on men's health.¹⁰ While this study covered a different subject, it suggests that ChatGPT can provide responses to urologic questions that are comparable with those from UCF. Our study went further, showing ChatGPT responses to be of higher quality than UCF materials. One explanation for this discrepancy is that our study used ChatGPT 4.0, the more advanced model.

ChatGPT also generally outperformed Bard on quality. While we did not identify any other similar urology-focused studies, studies in other specialties have similarly affirmed that ChatGPT provides more accurate and understandable responses to patient questions than Bard.^21,22 We also found that while the Bard score was slightly higher than UCF across all quality domains, these differences were only significant for comprehensiveness. These findings suggest that Bard may be comparable with UCF in quality, but that in the current iterations of these platforms, ChatGPT 4.0 is superior to Bard for urology patient education.

Our study also found that reviewers were rarely able to identify physician-created responses such as those from UCF. Perhaps the simplest explanation is that evaluators assumed that the highest quality answer was least likely to be AI generated, thus explaining why ChatGPT was selected as the non-AI-generated response in most cases. Nonetheless, in a prior study that asked respondents to identify whether medical advice was written by ChatGPT 3.5 or clinicians, survey takers correctly identified the human response in most cases.²³

Accordingly, we hypothesized that our evaluators would correctly identify human-generated responses more frequently and at least in 33% of cases based on random chance. Yet, UCF responses were only selected in 19% of cases on average. Collectively, our study suggests that with the advancement of AI, text generated by AI chatbots may be becoming less distinguishable from human-written text.
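The comparison with random chance can be made concrete with a goodness-of-fit calculation. The sketch below is not an analysis reported in the article: the counts are back-calculated from the reported percentages under the assumption that each of the six reviewers made one selection for each of the 32 questions (192 selections), and the test treats those selections as independent, which repeated ratings by the same reviewer may not be.

```python
def chi_square_gof(observed, expected):
    """Pearson chi-square goodness-of-fit statistic."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


total = 6 * 32  # reviewers x questions = 192 selections (an inferred total)
# Counts back-calculated from the reported 54.2%, 26.6%, and 19.3%.
observed = {"ChatGPT": 104, "Bard": 51, "UCF": 37}
assert sum(observed.values()) == total

expected = [total / 3] * 3  # uniform guessing: each source chosen one-third of the time
statistic = chi_square_gof(observed.values(), expected)
print(round(statistic, 2))  # 39.03
print(statistic > 5.991)    # True; 5.991 is the 5% critical value for 2 degrees of freedom
```

Under these assumptions, the selection pattern departs from uniform guessing, consistent with the interpretation that evaluators systematically attributed the ChatGPT responses to a human author.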

Despite their high quality, AI responses were more likely to be classified at an advanced reading level than UCF materials. These findings are not unique; several other studies have found that ChatGPT-generated text often shows poor readability.^2,10,24 A head-to-head comparison of AI chatbots also similarly found that ChatGPT required a higher reading level than Bard.²¹

The American Medical Association and National Institutes of Health recommend that patient education materials be written at a sixth- to eighth-grade reading level, which happens to be the reading level for most UCF responses in our study.³

This may seem like a major issue for AI-generated materials, which tended to be at a college reading level. However, AI chatbots enable the user to adjust the readability of the output. For example, Shah et al. added the prompt “Explain it to me like I am in sixth grade” to improve the readability of ChatGPT responses related to men's health. Interestingly, when comparing ChatGPT output with and without the prompt, the response quality was found to be similar.

Therefore, AI readability may be less of a concern, as users can prompt the chatbot to produce more readable output, potentially without a decrease in quality. Still, an important caveat is that if users do not request a readability modification, default AI chatbot responses may show limited accessibility. Accordingly, it may be preferable for the developers of AI platforms to ensure that chatbots automatically respond at an appropriate reading level for the general public when queried with health care-related questions.

Almost three-quarters of Americans seek medical advice online and acknowledge that this information influences their medical decision-making.^25,26 Inaccurate or biased online health care information can lead to inaccurate self-diagnosis and delays in treatment, among a host of other adverse outcomes.

With the rising popularity of AI chatbots, it was important to question whether AI-generated patient education was appropriate. We demonstrate that AI platforms such as ChatGPT and Bard are useful for endourology-related questions and that ChatGPT may even be superior to existing resources in some cases.

Research on these AI platforms provides insights into how urologists could integrate emerging AI technologies into their practice. Our results suggest that it is conceivable that AI platforms could be used to create new patient education content that is later refined by urologists.

This would be advantageous as existing patient education materials are often confined to information on the most common questions for each urologic condition. Given that AI chatbots can theoretically respond to a seemingly infinite number of unique queries, this would allow for rapid development of more specific information on urologic diseases.

Looking ahead, organizations such as the AUA may have the opportunity to play a larger role in how AI impacts patients. For instance, OpenAI recently announced the introduction of custom GPTs that allow users to create a tailored version of ChatGPT that is optimized for specific topics or tasks.²⁷ The AUA could feasibly develop a custom patient education GPT that is trained specifically on AUA-vetted information, ensuring that the material provided is relevant, accurate, and up to date.

This study is not without limitations. First, the most obvious limitation to our study is that our AI queries may be unrealistic as it is unlikely that users would ask for a response of a specified length. However, if urologists used AI to generate de novo patient education materials, this scenario may be more realistic. Regardless, the reason for our methodology is that prior studies have demonstrated that AI chatbot output length is often significantly longer than physician-generated responses to the same question.^10,28

Therefore, we believe that equalizing response length with UCF was important as it allowed for a fair comparison of quality across all platforms. Furthermore, if certain responses were consistently longer or shorter than the other options, our survey takers may have associated these responses with a particular platform—thus biasing their ability to identify the non-AI-generated response.

Second, ChatGPT and Bard results are known to fluctuate over time as these technologies advance. To limit this variability, we asked all questions on a single date. Third, we utilized UCF materials as they are widely trusted and created by the AUA. Nonetheless, UCF may not accurately reflect the quality of all urologic patient education resources.

Last, all quality outcomes were subjectively evaluated, resulting in potential bias. However, we attempted to mitigate this risk by including six evaluators, exceeding those in most prior studies of AI response quality in urology, which typically used two to three reviewers.

Conclusions

In a blind evaluation, AI-generated responses from ChatGPT and Bard surpassed the quality of endourology patient education materials from UCF. Furthermore, expert urologists were unable to distinguish between AI-generated and human-generated responses. Together, our results suggest that AI platforms may be a reliable resource for basic urologic care information.

However, AI responses tended to require a higher reading level, meaning that AI-generated text may need to be modified to improve its readability to make it more accessible to a broader audience. Future studies should explore how patients perceive the usefulness and accessibility of AI-generated patient education materials.

Footnotes

Authors' Contributions

C.C. was involved in conceptualization, methodology, formal analysis, investigation, data curation, writing—original draft, writing—review and editing, and visualization. K.G. and J.K. were involved in conceptualization, methodology, investigation, writing—original draft, and writing review and editing. R.K. and A.Y. were involved in conceptualization, investigation, and writing—original draft. M.L. was involved in writing—original draft, and writing—review and editing. B.G. was involved in methodology, data curation, and project administration. W.A. was involved in conceptualization, investigation, and supervision. M.G. was involved in conceptualization, methodology, investigation, writing—original draft, writing—review and editing, supervision, and project administration.

Author Disclosure Statement

No competing financial interests exist.

Funding Information

No funding was received for this article.

Supplementary Material

Supplementary Figure S1

References

1. Gabrielson, Odisho, Canes. Harnessing generative artificial intelligence to improve efficiency among urologists: Welcome ChatGPT. J Urol, 2023; 209(5):827–829; doi: 10.1097/ju.0000000000003383

2. Cocci, Pezzoli, Lo Re, et al. Quality of information and appropriateness of ChatGPT outputs for urology patients. Prostate Cancer Prostatic Dis, 2023; In Press; doi: 10.1038/s41391-023-00705-y

3. Eppler, Ganjavi, Knudsen, et al. Bridging the gap between urological research and patient understanding: The role of large language models in automated generation of Layperson's Summaries. Urol Pract, 2023; 10(5):436–443; doi: 10.1097/upj.0000000000000428

4. Shah, Beiriger, Mehta, et al. Analysis of patient education materials on TikTok for erectile dysfunction treatment. Int J Impot Res, 2023; In Press; doi: 10.1038/s41443-023-00726-0

5. Betschart, Pratsinis, Müllhaupt, et al. Information on surgical treatment of benign prostatic hyperplasia on YouTube is highly biased and misleading. BJU Int, 2020; 125(4):595–601; doi: 10.1111/bju.14971

6. Loeb, Reines, Abu-Salha, et al. Quality of bladder cancer information on YouTube. Eur Urol, 2021; 79(1):56–59; doi: 10.1016/j.eururo.2020.09.014

7. McCarthy, Berkowitz, Ramalingam, et al. Evaluation of an artificial intelligence Chatbot for delivery of IR patient education material: A comparison with societal website content. J Vasc Interv Radiol, 2023; 34(10):1760–1768.e32; doi: 10.1016/j.jvir.2023.05.037

8. Millenson, Baldwin, Zipperer, et al. Beyond Dr. Google: The evidence on consumer-facing digital tools for diagnosis. Diagnosis (Berl), 2018; 5(3):95–105; doi: 10.1515/dx-2018-0009

9. Tung JYM, Lim DYZ, Sng GGR. Potential safety concerns in use of the artificial intelligence chatbot ‘ChatGPT’ for perioperative patient communication. BJU Int, 2023; 132(2):157–159; doi: 10.1111/bju.16042

10. Shah, Ghosh, Hochberg, et al. Comparison of ChatGPT and traditional patient education materials for men's health. Urol Pract, 2024; 11(1):87–94; doi: 10.1097/upj.0000000000000490

11. Caglar, Yildiz, Meric, et al. Evaluating the performance of ChatGPT in answering questions related to pediatric urology. J Pediatr Urol, 2024; 20(1):26.e1–26.e5; doi: 10.1016/j.jpurol.2023.08.003

12. Cakir, Caglar, Yildiz, et al. Evaluating the performance of ChatGPT in answering questions related to urolithiasis. Int Urol Nephrol, 2024; 56(1):17–21; doi: 10.1007/s11255-023-03773-0

13. Coskun, Ocakoglu, Yetemen, et al. Can ChatGPT, an artificial intelligence language model, provide accurate and high-quality patient information on prostate cancer? Urology, 2023; 180:35–58; doi: 10.1016/j.urology.2023.05.040

14. Gabriel, Shafik, Alanbuki, et al. The utility of the ChatGPT artificial intelligence tool for patient education and enquiry in robotic radical prostatectomy. Int Urol Nephrol, 2023; 55(11):2717–2732; doi: 10.1007/s11255-023-03729-4

15. Musheyev, Pan, Loeb, et al. How well do artificial intelligence chatbots respond to the top search queries about urological malignancies? Eur Urol, 2024; 85(1):13–16; doi: 10.1016/j.eururo.2023.07.004

16. Szczesniewski, Tellez Fouz, Ramos Alba, et al. ChatGPT and most frequent urological diseases: Analysing the quality of information and potential risks for patients. World J Urol, 2023; 41(11):3149–3153; doi: 10.1007/s00345-023-04563-0

17. Whiles, Bird, Canales, et al. Caution! AI Bot has entered the Patient Chat: ChatGPT has limitations in Providing Accurate Urologic Healthcare Advice. Urology, 2023; 180:278–284; doi: 10.1016/j.urology.2023.07.010

18. Stork. The Urology Care Foundation™: Addressing medical misinformation in urology. J Urol, 2022; 208(4):765–766; doi: 10.1097/ju.0000000000002919

19. Gul, Diri. YouTube as a source of information about premature ejaculation treatment. J Sex Med, 2019; 16(11):1734–1740; doi: 10.1016/j.jsxm.2019.08.008

20. Jindal, MacDermid. Assessing reading levels of health information: Uses and limitations of flesch formula. Educ Health (Abingdon), 2017; 30(1):84–88; doi: 10.4103/1357-6283.210517

21. Cheong RCT, Unadkat, McNeillis, et al. Artificial intelligence chatbots as sources of patient education material for obstructive sleep apnoea: ChatGPT versus Google Bard. Eur Arch Otorhinolaryngol, 2024; 281(2):985–993; doi: 10.1007/s00405-023-08319-9

22. Rahsepar, Tavakoli, Kim GHJ, et al. How AI responds to common lung cancer questions: ChatGPT vs Google Bard. Radiology, 2023; 307(5):e230922; doi: 10.1148/radiol.230922

23. Nov, Singh, Mann. Putting ChatGPT's medical advice to the (Turing) test: Survey Study. JMIR Med Educ, 2023; 9:e46939; doi: 10.2196/46939

24. Davis, Eppler, Ayo-Ajibola, et al. Evaluating the effectiveness of artificial intelligence-powered large language models application in disseminating appropriate and readable health information in urology. J Urol, 2023; 210(4):688–694; doi: 10.1097/ju.0000000000003615

25. Cisu, Mingin, Baskin. An evaluation of the readability, quality, and accuracy of online health information regarding the treatment of hypospadias. J Pediatr Urol, 2019; 15(1):40.e1–40.e6; doi: 10.1016/j.jpurol.2018.08.020

26. Perez, Swindell, Herndon, et al. Assessing the readability of online information about Achilles Tendon Ruptures. Foot Ankle Spec, 2020; 13(6):470–477; doi: 10.1177/1938640019888058

27. Introducing GPTs. Available from: https://openai.com/blog/introducing-gpts [Last accessed: November 28, 2023].

28. Johnson, King, Warner, et al. Using ChatGPT to evaluate cancer myths and misconceptions: Artificial intelligence and cancer information. JNCI Cancer Spectr, 2023; 7(2):pkad015; doi: 10.1093/jncics/pkad015
