When Wrong Answers Matter: Consequence-Weighted Evaluation of Large Language Models for ERCP Triage

Abstract

Background

Large language models (LLMs) increasingly generate clinical recommendations, but their ability to translate biliary guidelines into safe procedural triage remains uncertain. We evaluated how accurately next-generation LLMs determine the indication for endoscopic retrograde cholangiopancreatography (ERCP) in suspected choledocholithiasis and tested whether their errors would delay care in a simulated workflow.

Methods

A cross-sectional in-silico diagnostic accuracy study was conducted from May 14 to May 18, 2026. One hundred locked synthetic vignettes were mapped to ASGE/ESGE-based standards: 45 ERCP-indicated and 55 nonindicated cases. GPT-5.5, Gemini 3.0 Pro, and Claude 4 Opus were queried with an identical zero-shot prompt at temperature 0.0. Outcomes included accuracy, sensitivity, specificity, kappa, error phenotype, and simulated under-triage delay.
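The outcome measures listed above can all be derived from a single 2x2 table of model output against the reference standard. The sketch below is a minimal illustration, not the authors' analysis code. It assumes that kappa means Cohen's kappa and that an under-triage error is a false negative (ERCP indicated, model withholds it). The example counts are reconstructed from the Results figures for Claude 4 Opus (84% accuracy, 9 under-triage errors, 45 indicated and 55 nonindicated cases) under that assumption; the names `Confusion` and `triage_metrics` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # ERCP indicated, model recommends ERCP
    fn: int  # ERCP indicated, model withholds ERCP (assumed "under-triage")
    fp: int  # ERCP not indicated, model recommends ERCP
    tn: int  # ERCP not indicated, model withholds ERCP


def triage_metrics(c: Confusion) -> dict:
    """Accuracy, sensitivity, specificity, and Cohen's kappa from a 2x2 table."""
    n = c.tp + c.fn + c.fp + c.tn
    accuracy = (c.tp + c.tn) / n
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    # Chance agreement: product of the marginal proportions for each class.
    p_yes = ((c.tp + c.fn) / n) * ((c.tp + c.fp) / n)
    p_no = ((c.fp + c.tn) / n) * ((c.fn + c.tn) / n)
    p_chance = p_yes + p_no
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }


# Assumed reconstruction: 16 total errors, 9 of them false negatives.
example = triage_metrics(Confusion(tp=36, fn=9, fp=7, tn=48))
```

Under these assumed counts the sketch gives an accuracy of 0.84 and a kappa that rounds to 0.68, matching the values reported for Claude 4 Opus. The sensitivity and specificity it produces are derived quantities, not figures stated in the abstract.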

Results

GPT-5.5 achieved the highest accuracy (96.0%; 95% CI, 90.2%-98.4%), followed by Gemini 3.0 Pro (90.0%; 95% CI, 82.6%-94.5%) and Claude 4 Opus (84.0%; 95% CI, 75.6%-89.9%). Agreement was near-perfect for GPT-5.5 (kappa = 0.92), substantial for Gemini 3.0 Pro (kappa = 0.80), and weaker for Claude 4 Opus (kappa = 0.68). GPT-5.5 outperformed Claude 4 Opus (McNemar P = .004). Claude 4 Opus produced the most under-triage errors (n = 9) and the largest simulated delay burden (163.8 hours per 100 vignettes; Kruskal-Wallis P = .007).
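The abstract does not name the method used for the confidence intervals. As a hedged check, the sketch below computes Wilson score intervals for 96, 90, and 84 correct answers out of 100 vignettes; the function name `wilson_interval` is hypothetical, and the choice of the Wilson method is an assumption rather than something the authors state.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple:
    """Two-sided Wilson score interval for a binomial proportion (z for 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Correct classifications per 100 vignettes, taken from the reported accuracies.
correct = {"GPT-5.5": 96, "Gemini 3.0 Pro": 90, "Claude 4 Opus": 84}
intervals = {
    model: tuple(round(100 * bound, 1) for bound in wilson_interval(k, 100))
    for model, k in correct.items()
}
```

With these inputs the rounded bounds are 90.2 to 98.4, 82.6 to 94.5, and 75.6 to 89.9, which agree with the reported intervals. That agreement suggests, but does not prove, that Wilson intervals were used.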

Conclusion

Next-generation LLMs can approximate guideline-based ERCP triage, but clinically meaningful differences emerge when errors are weighted by procedural delay and safety. GPT-5.5 showed the most balanced profile. Conservative under-triage remains the key hazard and requires clinical supervision.

Keywords

artificial intelligence; biliary tract; choledocholithiasis; clinical decision support; diagnostic accuracy; endoscopic retrograde cholangiopancreatography; large language model; patient safety; surgical triage
