The future of large language models in clinical and academic medicine

Abstract

The use of large language models (LLMs) has expanded at an unprecedented rate since 2022; ChatGPT alone now has over 800 million weekly users. Unsurprisingly, medical communities have explored the scope of their application in clinical and academic contexts. There is little doubt that LLMs are particularly useful for developing educational material, creating literature summaries and simplifying challenging concepts. There are also numerous examples of LLM software that can improve the efficiency of patient communication and streamline administrative tasks. However, what challenges remain, and what barriers must be overcome, for the meaningful integration of LLMs within clinical and academic medicine?

The application of LLMs as a time-efficient tool to aid diagnosis and support clinical decision-making, including patient triage, has been proposed. Publications have evaluated their role as diagnostic adjuncts and report varying degrees of predictive accuracy. In practice, predictive performance alone does not necessarily result in improved diagnosis compared with standard clinical practice.¹,² While these systems may correctly identify diagnoses in a proportion of cases, there is little evidence that their integration improves overall diagnostic accuracy in real-world clinical settings. Evidence that the addition of LLMs improves patient outcomes (e.g. mortality, length of hospital stay) is weaker still.

The principal barrier to utilising LLMs is the heterogeneity and complexity of clinical medicine. Effective decision-making extends beyond diagnostic prediction and requires dynamic risk stratification, incorporation of comorbidities and functional status, and shared decision-making. Unless LLMs are developed, trained and prospectively validated within clearly defined scenarios, and draw on multiple disciplines of artificial intelligence (e.g. multi-modal LLMs), they may not address this clinical complexity or provide value beyond current practice.

Prior to their implementation in the clinical setting, LLMs require rigorous training and validation to ensure their safety. The potential consequences of errors in healthcare are of a different order of magnitude from those in other industries. Where a hallucination (the presentation of misinformation as fact) may result in a logistical error or financial inefficiency in another sector, the same phenomenon may expose patients to avoidable harm. As long as hallucination remains a potential risk, LLMs are unlikely to assume an independent or autonomous role in patient care.

For LLMs to be formally integrated into clinical workflows, they may fall under the regulatory framework of a ‘medical device’ as per the UK Medical Devices Regulations 2002.³ Since their training data and internal specifications are not publicly accessible, they cannot be easily validated. Without transparency regarding their data sources, these models may not obtain ethical approval. Compliance with UK GDPR and the Data Protection Act 2018, to ensure patient confidentiality and appropriate governance of sensitive information, is also essential.⁴

Within academic medicine, the application of LLMs has extended to the summarisation and review of relevant literature. However, recent evidence suggests that LLMs do not necessarily provide the granularity and subspecialty-specific expertise that may be achieved through targeted search-engine (e.g. Google) approaches. Reported recall of eligible studies in systematic reviews or meta-analyses may approach 80% when relying on LLMs. In academic questions that depend upon comprehensive and reproducible extraction of data, such performance may be insufficient. As with other analytical tools commonly used in medical research, the use of LLMs should be reported in academic literature. Statistical software packages such as ‘SPSS’ or ‘RStudio’ are routinely cited in the methods section to allow reproducibility and critical appraisal. The same principle should apply to the use of LLMs in drafting, literature identification, data synthesis or manuscript preparation. Clear disclosure would permit readers, reviewers and editors to understand the extent of LLM involvement and to identify sources of bias.
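To make the recall figure above concrete, the following is a minimal Python sketch of how recall might be calculated for a literature search. The study identifiers, the counts and the `recall` helper are hypothetical and chosen purely for illustration; they are not drawn from any cited study.

```python
def recall(retrieved: set[str], eligible: set[str]) -> float:
    """Fraction of eligible studies that the search actually retrieved."""
    if not eligible:
        raise ValueError("eligible set must not be empty")
    return len(retrieved & eligible) / len(eligible)


# Hypothetical review: 10 eligible studies found by a reference-standard search.
eligible = {f"study_{i}" for i in range(1, 11)}

# Hypothetical LLM-assisted search: retrieves 8 eligible studies
# plus 2 records that do not meet the inclusion criteria.
llm_retrieved = {f"study_{i}" for i in range(1, 9)} | {"other_a", "other_b"}

# Eligible studies that the LLM-assisted search failed to identify.
missed = eligible - llm_retrieved

print(recall(llm_retrieved, eligible))  # 0.8
print(len(missed))                      # 2
```

In this illustration, a recall of 80% means that two of ten eligible studies would be absent from the pooled analysis, which is why such performance may be insufficient where comprehensive identification is required.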

In conclusion, LLMs demonstrate clear benefit in education, communication and administrative efficiency, but their contribution to advancing clinical and academic medicine remains limited by safety concerns, the need for validation and the lack of demonstrable patient benefit. Without rigorous prospective evaluation and adherence to regulatory and ethical standards, the role of LLMs may remain limited to that of a supportive adjunct in clinical and academic medicine.

Footnotes

ORCID iD

James Lucocq

References

1. Masanneck, Schmidt, Seifert, et al. Triage performance across large language models, ChatGPT, and untrained doctors in emergency medicine: comparative study. J Med Internet Res 2024; 26: e53297.

2. Zhou, Zhang, et al. Large language models for disease diagnosis: a scoping review. npj Artif Intell 2025; 1: Article no: 9. DOI: 10.1038/s44387-025-00011-z

3. MDR. The Medical Devices Regulations 2002. Nation Arch 2022; SI 2002/618. Published online 2022.

4. Data Protection Act 2018, c. 12. https://www.legislation.gov.uk/ukpga/2018/12/contents/enacted.