Application of large language models in medical diagnosis: A bibliometric review

Abstract

Background

The integration of large language models (LLMs) into medical diagnosis represents an emerging field with the potential to support diagnostic workflows across diverse clinical settings. However, the trends and evolutionary trajectory of LLM-assisted diagnostic research remain insufficiently understood.

Objective

This bibliometric review aims to map the global research landscape, identify key research clusters, and analyze the development trajectory of LLM technologies in medical diagnosis, with an emphasis on descriptive synthesis rather than formal evaluation.

Methods

A bibliometric analysis was conducted on relevant publications retrieved from the Web of Science Core Collection, covering the period from Q1 2023 to Q1 2025. The extracted data were processed and visualized using Excel, ArcGIS, VOSviewer, CiteSpace, and Pajek. The analyses included publication trends, influential authors and institutions, collaboration networks, and research cluster mapping.

Results

A total of 650 publications were included in the analysis. Research output increased markedly from Q1 2023 onward, rising from 2 publications to 148 by Q1 2025, corresponding to an average quarterly growth rate of 71.25%. The United States (273 publications), China (135 publications), and Germany (65 publications) emerged as the leading contributing countries. The three most productive institutions were all based in the United States: Harvard University (26 publications), Stanford University (26 publications), and the Icahn School of Medicine at Mount Sinai (20 publications). Keyword co-occurrence analysis identified 10 core clusters, with a modularity Q value of 0.8231 and a silhouette S value of 0.9412, indicating a highly coherent clustering structure and strong internal consistency.

Conclusion

The development of LLM technologies has substantially influenced the research landscape of medical diagnostics. As this field continues to evolve, it is crucial to refine model performance, integrate multimodal data, and address ethical considerations. Future research should focus on optimizing LLMs for specific clinical applications and evaluating their implementation in real-world healthcare settings.

Keywords

large language models, LLM-assisted diagnosis, bibliometric analysis, clinical reasoning, artificial intelligence

Introduction

Artificial intelligence (AI) has been extensively applied across diverse scientific and engineering domains.^1–4 Within medicine, its potential applications have long attracted sustained scholarly attention and practical exploration.⁵ The release of OpenAI’s ChatGPT in late 2022, followed by subsequent large language models (LLMs), has intensified global interest in this domain. LLM-assisted diagnostic research refers to studies that examine large transformer-based language models (e.g., GPT, Gemini, and Llama) primarily as clinician-facing decision-support tools and natural-language interfaces for diagnostic workflows. As decision-support systems, LLMs may assist physicians by generating differential diagnostic suggestions, synthesizing clinical knowledge, and retrieving diagnostically relevant information from heterogeneous medical data sources, with the understanding that final clinical judgment remains with the physician.⁶ Simultaneously, as language interfaces, LLMs may facilitate conversational interaction with electronic health records, medical literature, imaging reports, and other digital health resources.⁷ These functions distinguish LLM-assisted diagnosis from general AI paradigms, which encompass broader algorithmic approaches such as convolutional neural networks or rule-based systems,⁸ as well as from conventional clinical natural language processing, which has typically relied on smaller, task-specific architectures for narrow tasks such as information extraction and text classification.⁹ Importantly, this study does not conceptualize LLMs as fully autonomous reasoning agents capable of independently establishing definitive diagnoses without human oversight.

The integration of LLMs into medical diagnosis has emerged as a rapidly expanding area of academic and clinical interest. A recent SWOT-based review examining artificial intelligence in clinical medicine highlighted the potential role of LLMs in supporting diagnostic processes.¹⁰ As a new generation of foundation models, LLMs are characterized by advanced natural language understanding, contextual reasoning, and knowledge synthesis capabilities, enabling them to process heterogeneous medical data such as clinical narratives, electronic health records, imaging reports, and biomedical literature.¹¹ These capabilities may offer potential avenues to support clinical decision-making, diagnostic reasoning, and knowledge-assisted medical interpretation; however, their clinical utility remains to be established through further empirical investigation.

This growing interest has prompted an expanding body of research investigating the application of LLMs in diagnostic contexts. A notable development in this field occurred in December 2022, with the publication of “Large Language Models Encode Clinical Knowledge” by Karan Singhal et al. on arXiv. In this study, the authors introduced “MultiMedQA,” a comprehensive benchmarking framework encompassing seven medical question-answering datasets, including MedQA, MedMCQA, and PubMedQA, to evaluate LLM performance across diverse clinical tasks. Using this framework, they assessed the capabilities of Flan-PaLM, an instruction-tuned variant of the PaLM model, and suggested the potential applicability of LLMs in diagnostic contexts. This work attracted substantial academic attention and was subsequently published in Nature, contributing to growing scholarly interest in this research domain.¹² Following this influential publication, the literature on LLM-assisted medical diagnosis has expanded across multiple clinical specialties, including gastrointestinal pathology,¹³ hepatology,¹⁴ dermatology,¹⁵ epileptology,¹⁶ ophthalmology,¹⁷ orthopedics,¹⁸ oral medicine,¹⁹ pulmonology,²⁰ cardiology,²¹ cognitive neurology,²² and affective disorders such as depression.²³

Simultaneously, the integration of LLMs into clinical practice has begun to receive increasing attention. A survey conducted by the American Psychiatric Association in late 2023 revealed that more than 70% of psychiatrists reported having used ChatGPT in professional practice, with many respondents reporting its use in diagnostic deliberation.²⁴ Similarly, a 2024 report by the European Commission documented the implementation of LLM-assisted diagnostic tools in selected European hospitals for breast cancer screening, with preliminary findings suggesting potential clinical utility.²⁵ Evidence from Japan further suggested that ChatGPT-4 demonstrated diagnostic performance approaching that of board-certified physicians in selected experimental evaluation settings, with the correct diagnosis appearing among the top 10 differential diagnoses in 83% of cases and as the top-ranked diagnosis in 60% of cases.²⁶ In addition, in February 2025, Peking Union Medical College and the Chinese Academy of Sciences launched Xiehe·Taichu, an LLM designed for rare disease diagnostic support.²⁷ Complementing these developments, a recent global survey involving approximately 800 psychiatrists found that over half of respondents believed LLMs could assist diagnostic decision-making through the synthesis and interpretation of patient information.²⁸ Despite these advances, the integration of LLMs into clinical settings has also prompted ongoing discussion regarding ethical, legal, and practical challenges, including concerns regarding data privacy, model reliability, interpretability, and regulatory governance.

In response to the rapidly growing academic and clinical interest in the application of LLMs to diagnostic medicine, a comprehensive synthesis of the existing literature is both timely and essential. To date, no bibliometric analysis has systematically mapped the intellectual landscape or traced the research trajectory of this emerging field. Bibliometric analysis offers a robust, quantitative method for mapping scholarly output, identifying influential contributors, and characterizing dominant research themes.^29,30 This study applies bibliometric techniques, integrating descriptive, relational, and evolutionary approaches, to examine the literature on LLMs in diagnostic medicine, with a focus on publication trends, influential authors and institutions, collaborative networks, and research clusters. In addition, to complement the bibliometric findings, a targeted analysis of PubMed-indexed clinical studies was conducted to characterize clinical application domains, methodological features, evidence levels, and emerging translational challenges associated with LLM-assisted diagnosis. It should be emphasized that this study aims to provide high-level, trend-based insights rather than direct guidance for the design or validation of LLM-assisted diagnostic tools. The BIBLIO Checklist for this bibliometric review is provided in Supplemental File 1.

Methods

Literature search

The overall study framework is illustrated in Figure 1. The Web of Science Core Collection (WOSCC), a widely used multidisciplinary citation database that indexes high-impact, peer-reviewed scholarly publications, was systematically searched to retrieve relevant literature from database inception through April 13, 2025. Although LLMs gained widespread academic and public attention primarily after late 2022, the search strategy was intentionally extended to the inception date. This approach was adopted to minimize the likelihood of excluding potentially relevant earlier studies that may have employed precursor language-model architectures or conceptually related terminology within diagnostic contexts.

Figure 1.

Study framework.

The search strategy was as follows: TS=((“Large Language Model*” OR LLM OR GPT OR Claude* OR Gemini OR Llama* OR LaMDA OR GEMMA) AND (“medical diagnos*” OR “clinical diagnos*” OR “disease diagnos*” OR “patient diagnos*” OR “computer-aided diagnosis” OR “AI-assisted diagnosis” OR “LLM-assisted diagnosis” OR “intelligent diagnosis” OR “automated diagnosis” OR “diagnostic medicine” OR “diagnostic evaluation” OR “diagnostic conversation” OR “diagnostic process” OR “diagnostic support” OR “diagnostic consultation” OR “diagnostic aid” OR “diagnostic assistant” OR “diagnostic chatbot” OR “diagnostic reasoning” OR “diagnosis assistance” OR “diagnosis process” OR “disease identification” OR “disorder identification” OR “health condition identification” OR “health problem identification” OR “disease detection” OR “disorder detection” OR “health condition detection” OR “health problem detection” OR “symptom analysis” OR “clinical assessment” OR “clinical reasoning” OR “clinical decision-making” OR “medical decision-making” OR “diagnostic decision-making” OR “clinical decision support” OR “medical decision support” OR “diagnostic decision support”))

Inclusion/exclusion criteria

In this study, inclusion criteria were defined to ensure the selection of methodologically robust and thematically relevant literature. Specifically, studies were included if they (1) were original research articles, (2) were peer-reviewed publications, (3) were published in English, and (4) focused on the application, evaluation, or methodological development of LLMs in medical diagnostic contexts. Correspondingly, studies were excluded if they (1) were not original research articles, including editorials, letters, commentaries, correspondence, conference abstracts, and errata, (2) were not peer-reviewed publications, such as theses and dissertations, technical reports, preprints, and book chapters, (3) were published in languages other than English, or (4) lacked substantive relevance to LLM-assisted diagnosis.

Literature selection

The literature screening process was independently undertaken by two reviewers (Haokun Wang and Quan Zhang) according to the predefined inclusion and exclusion criteria. Prior to formal screening, a calibration assessment involving a random subset of 100 records was performed to ensure consistent interpretation of the inclusion and exclusion criteria. Inter-reviewer agreement was evaluated using Cohen’s kappa coefficient, yielding a value of 0.796, indicating substantial agreement between reviewers.³¹ Disagreements that arose during the screening process were adjudicated by a third researcher (Hongjuan Li), and all discrepancies were resolved through structured deliberation until consensus was achieved.³² The literature identification and selection workflow (see Figure 2) was conducted in alignment with the PRISMA reporting framework³³ and relevant methodological guidance for bibliometric studies.³⁴ A total of 826 records were initially identified from the WOSCC database. Following manual deduplication, 3 duplicate records were removed, leaving 823 unique records for subsequent evaluation. The titles and abstracts of all remaining records were then screened, and no studies were excluded due to inaccessible bibliographic information. Full-text eligibility assessment was subsequently conducted according to the predefined inclusion and exclusion criteria. Specifically, non-original research articles (n = 42) were excluded, alongside non-peer-reviewed publications (n = 24), non-English studies (n = 13), and articles lacking substantive thematic relevance to LLM-assisted diagnostic applications (n = 94). Ultimately, 650 articles met the eligibility criteria and were included in the final bibliometric analysis. All eligible records were subsequently imported into the analytical platforms for bibliometric processing and visualization.
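For readers unfamiliar with the agreement statistic used in the calibration step, the following is a minimal sketch of how Cohen's kappa is computed for two reviewers making include/exclude decisions. The decision lists are hypothetical and are not the study's calibration data, so the resulting value is not expected to match the reported 0.796.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same records."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of records")
    n = len(rater_a)
    # Observed agreement: share of records given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label proportions.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical decisions on a 100-record calibration subset (illustrative only).
reviewer_1 = ["include"] * 70 + ["exclude"] * 30
reviewer_2 = ["include"] * 65 + ["exclude"] * 5 + ["include"] * 5 + ["exclude"] * 25

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is preferred over simple percent agreement when one decision (here, inclusion) is much more frequent than the other.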

Figure 2.

PRISMA flow diagram.
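The selection counts reported above can be checked with simple arithmetic. The sketch below uses only the figures stated in the text; the dictionary keys are shorthand labels introduced here for illustration.

```python
# Record counts as reported in the literature selection paragraph.
identified = 826
duplicates_removed = 3

# Exclusions at full-text eligibility assessment (labels are shorthand).
exclusions = {
    "not original research": 42,
    "not peer-reviewed": 24,
    "not in English": 13,
    "not relevant to LLM-assisted diagnosis": 94,
}

unique_records = identified - duplicates_removed
included = unique_records - sum(exclusions.values())

print(unique_records, included)
```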

Data analysis

The analytical process integrated multiple software platforms, including Microsoft Excel 2019, ArcGIS, VOSviewer (Version 1.6.20), CiteSpace (Version 6.4.R1), and Pajek, to support bibliometric analysis and visualization. Microsoft Excel 2019 was used to visualize temporal trends in publication output. ArcGIS was utilized to map the geographic distribution of publications across countries and regions. CiteSpace was employed to identify and visualize thematic research clusters within the publication corpus. The analytical parameters were configured as follows. The time-slicing window spanned 2023–2025 with a one-year slice length. Node selection was operationalized using the g-index (k = 25). Network pruning was performed using the Pathfinder, Minimum Spanning Tree, and Pruning the Merged Network algorithms. VOSviewer was applied to quantify publication and citation performance and to construct collaboration and co-citation networks involving countries, authors, journals, and institutions. In country co-authorship analysis, each country was required to contribute at least one publication. In author co-authorship analysis, a minimum threshold of three publications per author was imposed. In journal co-citation analysis, only cited sources receiving at least 40 citations were included. For institutional collaboration networks, a minimum threshold of four publications per institution was enforced. Threshold parameters were selected with reference to established practices in bibliometric visualization and were adjusted to balance information retention, network readability, and interpretability.³⁵ CiteSpace was further used to generate a dual-map overlay that contextualized disciplinary linkages between citing and cited journals. Detailed analytical procedures and software configurations are provided in Supplemental File 2 to facilitate methodological transparency and reproducibility.
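To illustrate what the scaling factor k does in g-index-based node selection, the sketch below assumes the commonly described criterion in which g is the largest rank satisfying g² ≤ k × (sum of the g highest counts). This is an illustrative reading of the criterion and not CiteSpace's own implementation; the function name and the counts are hypothetical.

```python
def scaled_g_index(counts, k=1):
    """Largest g such that g**2 <= k * (sum of the g highest counts)."""
    ranked = sorted(counts, reverse=True)
    best, running_total = 0, 0
    for g, count in enumerate(ranked, start=1):
        running_total += count
        if g * g <= k * running_total:
            best = g
    return best


# Hypothetical occurrence counts for candidate nodes within one time slice.
counts = [10, 8, 5, 4, 3, 1, 1]

standard = scaled_g_index(counts, k=1)   # classic g-index
relaxed = scaled_g_index(counts, k=25)   # larger k admits more nodes
print(standard, relaxed)
```

Under this reading, a larger k relaxes the threshold, so more nodes from each time slice are retained in the network.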

Results

Temporal-spatial distribution of publications

In this emerging field, the first indexed publication appeared in February 2023. Since then, scholarly output has expanded markedly, increasing from 2 articles in Q1 2023 to 148 in Q1 2025. This corresponds to an average quarterly growth rate of 71.25%, calculated using the geometric mean method (Figure 3). By April 13, 2025, the WOSCC database had indexed a total of 650 publications on this topic. This increase in publication volume may reflect growing academic interest in the application of LLM-assisted diagnosis in clinical contexts.

Figure 3.

Temporal distribution of publications.
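The geometric-mean growth rate reported above can be approximated as follows. This is a sketch that assumes eight quarter-to-quarter intervals between Q1 2023 and Q1 2025; the function name is introduced here for illustration, and small differences from the reported 71.25% may arise from rounding conventions.

```python
def mean_quarterly_growth(first, last, intervals):
    """Geometric-mean growth rate per interval between two positive counts."""
    if first <= 0 or last <= 0 or intervals <= 0:
        raise ValueError("counts and number of intervals must be positive")
    return (last / first) ** (1 / intervals) - 1


# 2 publications in Q1 2023, 148 in Q1 2025, assumed 8 quarterly intervals.
rate = mean_quarterly_growth(first=2, last=148, intervals=8)
print(f"{rate:.2%}")
```

The geometric mean is used rather than the arithmetic mean of quarterly changes because growth compounds from one quarter to the next.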

Research on LLM-assisted diagnosis spanned 69 countries, indicating widespread international research activity in this emerging field (Figure 4). The United States led with 273 publications, followed by China (135) and Germany (65), suggesting a geographically uneven distribution of research output. Although bibliometric data are well suited to characterizing geographic patterns, they do not support causal inference regarding the determinants of national research productivity. Accordingly, the interpretations presented below should be regarded as contextual and speculative reflections informed by the broader scientific and policy literature, rather than empirically validated explanations established by the present analysis.

The dominance of these countries may reflect differences in research infrastructure, funding capacity, data availability, and technological ecosystems. For instance, the development of advanced LLMs by major technology organizations, together with the presence of leading academic medical centers and robust academic-clinical collaboration ecosystems, may create favorable conditions for research activity. In the United States, the development of frontier LLMs, including GPT-4, Claude, Gemini, and Llama, by U.S.-based organizations, along with contributions from institutions such as the Mayo Clinic, Harvard Medical School, and the University of Pennsylvania, may contribute to a technologically enriched environment for clinical AI research. In addition, access to large-scale biomedical datasets, which are essential resources for this field, may also facilitate scholarly activity. In Germany, the Health Data Use Act has established a legislative framework supporting a centralized digital health infrastructure, which may facilitate researchers’ access to pseudonymized health data across institutions. Moreover, national investments in digital health and strategic policy initiatives may be associated with scholarly output. In China, the National Health Commission issued the Guidelines for Artificial Intelligence Application Scenarios in the Health Sector in November 2024, which outlined priority domains for AI deployment and referenced the potential integration of advanced LLMs into clinical contexts. Such policy initiatives may be viewed as part of the broader institutional and scholarly discourse surrounding emerging AI technologies. Nevertheless, it is important to emphasize that the present study did not examine the causal relationship between policy environments and publication output. Accordingly, these interpretations should be regarded as hypothesis-generating contextual observations and interpreted with appropriate caution.

Figure 4.

Spatial distribution of publications.

Co-authorship network analysis (Figure 5) demonstrated a progressively interconnected yet markedly asymmetrical global collaboration landscape. The United States occupied the dominant central node in the network, exhibiting the highest degree of connectivity and functioning as a principal bridge linking multiple international research clusters. China, Germany, and the United Kingdom emerged as important secondary hubs, each maintaining robust cross-national collaborative linkages. In contrast, countries such as Japan, South Korea, Turkey, Pakistan, Portugal, Ireland, and Poland contributed to the literature but remained in comparatively peripheral positions within the network. Notably, the analysis indicated limited representation from countries in Africa and South America, with these regions either appearing only at the margins of the network or remaining largely absent. Overall, the collaboration structure was concentrated around a limited number of global centers, suggesting an uneven distribution of collaborative influence across the international research network.

Figure 5.

Collaborative networks of the 69 contributing countries.

The dominance of high-income countries, particularly the United States, as collaborative hubs warrants consideration from the perspective of global equity in AI-driven healthcare research, even though bibliometric network analysis cannot directly measure disparities in healthcare access or outcomes. Specifically, this pattern may suggest that the literature surrounding the development and refinement of LLM-based diagnostic technologies is more strongly shaped by the clinical contexts, datasets, and priorities of high-income settings. Conversely, the limited participation of many African and South American countries may indicate that diverse regional perspectives and healthcare contexts are underrepresented within the research landscape. Such underrepresentation may raise concerns regarding contextual validity and the potential for algorithmic bias. For instance, models trained primarily on Western clinical data may exhibit uncertain generalizability in low-resource settings characterized by distinct disease burdens, healthcare infrastructures, and clinical practices. This imbalance may also have implications for future inequities in the translation and accessibility of AI-assisted diagnostic technologies, particularly in regions where healthcare resources remain limited. Although these interpretations remain speculative and are inferred from collaboration structures rather than direct clinical evidence, they highlight the potential value of inclusive international partnerships, capacity-building initiatives, and cross-regional data-sharing frameworks to support the equitable advancement of AI-assisted healthcare technologies worldwide.

Distribution and co-authorship analysis of authors

A total of 3,852 authors contributed to the 650 publications on LLM-assisted diagnostics. VOSviewer analysis identified the 10 most productive authors, together with their institutions, countries, publication counts, citations, and total link strength (TLS, i.e., the total strength of co-authorship links between a given researcher and other researchers),³⁶ as summarized in Table 1. Klang E (Icahn School of Medicine at Mount Sinai, USA) recorded the highest publication output, followed by Nadkarni G (Icahn School of Medicine at Mount Sinai, USA), Shimizu T (Dokkyo Medical University, Japan), Hirosawa T (Dokkyo Medical University, Japan), and Glicksberg B (University of California, USA). Klang E also ranked highest in TLS, followed by Nadkarni G, Chen J (Stanford University, USA), Shimizu T, and Glicksberg B. Notably, these TLS values substantially exceeded the network average of 9.04, underscoring stronger-than-average collaborative connectivity within the co-authorship network.

Table 1.

Top 10 most productive authors.

Rank	Author	Affiliation (country)	Number of publications	Citations	Total link strength
1	Klang, E	Icahn School of Medicine at Mount Sinai (USA)	13	31	83
2	Nadkarni, G	Icahn School of Medicine at Mount Sinai (USA)	9	16	53
3	Shimizu, T	Dokkyo Medical University (Japan)	8	80	33
4	Hirosawa, T	Dokkyo Medical University (Japan)	7	64	27
5	Glicksberg, B	University of California (USA)	6	6	32
6	Cheungpasitporn, W	Mayo Clinic (USA)	6	42	25
7	Thongprayoon, C	Mayo Clinic (USA)	6	42	25
8	Omar, M	Icahn School of Medicine at Mount Sinai (USA)	5	10	23
9	Harada, Y	Dokkyo Medical University (Japan)	5	58	25
10	Chen, J	Stanford University (USA)	5	105	49
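As an illustration of how a total link strength of the kind reported in Table 1 can arise, the sketch below assumes full counting, in which the link strength between two authors equals the number of publications they co-authored and an author's TLS is the sum of that author's link strengths. The author labels and publication lists are hypothetical, and the code is not VOSviewer's implementation.

```python
from collections import defaultdict
from itertools import combinations


def total_link_strength(publications):
    """Sum of co-authorship link strengths per author under full counting."""
    link_strength = defaultdict(int)
    for authors in publications:
        # Each unordered pair of distinct co-authors gains one unit per paper.
        for a, b in combinations(sorted(set(authors)), 2):
            link_strength[(a, b)] += 1
    tls = defaultdict(int)
    for (a, b), strength in link_strength.items():
        tls[a] += strength
        tls[b] += strength
    return dict(tls)


# Hypothetical author lists for three publications (illustrative only).
papers = [
    ["Author A", "Author B", "Author C"],
    ["Author A", "Author B"],
    ["Author C", "Author D"],
]

tls = total_link_strength(papers)
print(tls)
```

Because TLS accumulates over every co-authored paper, an author with a moderate publication count but large author teams can reach a high TLS, which is consistent with Chen J ranking third in TLS despite ranking tenth in publication output.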

A co-authorship analysis was also conducted using VOSviewer to examine collaboration patterns among authors (Figure 6). The dataset comprised 76 researchers, each with at least three publications. The analysis indicated that several authors, including Klang E, Glicksberg B, Hirosawa T, Shimizu T, and Omar M, exhibited extensive collaborative linkages within the network. Additionally, a well-defined collaborative network was observed among Klang E, Glicksberg B, Omar M, and Nadkarni G. Notably, these leading collaborators were predominantly U.S.-based, with three affiliated with the Icahn School of Medicine at Mount Sinai, suggesting that collaboration within this cluster was largely concentrated within a single national context. Meanwhile, authors such as Cheungpasitporn W, Chen J, Cho S, Craici I, Ferber D, and Fink A occupied positions within several collaborative subnetworks, reflecting their participation in ongoing research collaborations within the field.

Figure 6.

Collaborative networks of top 76 authors.

Distribution and co-citation analysis of journals

The dataset comprised publications from a total of 650 distinct sources. Table 2 lists the ten most prolific journals, specifying each journal’s number of publications, citations, citation-per-publication (C/P), impact factor (IF), core collection edition (CCE), and journal citation reports (JCR) category. Here, C/P reflects the mean citations per article in the dataset and serves as a journal-level indicator of scholarly influence, whereas IF measures the average annual citation frequency of articles published within the journal. Among the most productive venues were Cureus Journal of Medical Science, Journal of Medical Internet Research (JMIR), and Journal of the American Medical Informatics Association (JAMIA). The prominence of these journals may partly reflect thematic specialization, as journals with established foci on digital health, artificial intelligence, or medical informatics are inherently aligned with LLM-assisted diagnosis research, thereby potentially enhancing their attractiveness as publication venues. In addition, open-access publishing models or hybrid access policies may facilitate broader dissemination and visibility of published work, potentially contributing to higher submission volumes. In terms of absolute citation volume, JMIR, NPJ Digital Medicine, and Cureus Journal of Medical Science ranked highest. By contrast, NPJ Digital Medicine demonstrated the highest C/P value (28.64), followed by JMIR (24.69) and JAMIA (22.23), indicating greater average citation influence relative to publication volume.

Table 2.

Top 10 most productive publication sources.

Rank	Publication title	Number of publications	Citations	C/P	IF	CCE	JCR category
1	CUREUS JOURNAL OF MEDICAL SCIENCE	17	305	17.94	1.0	ESCI	MEDICINE, GENERAL & INTERNAL
2	JOURNAL OF MEDICAL INTERNET RESEARCH	13	321	24.69	5.8	SCIE	HEALTH CARE SCIENCES & SERVICES
3	JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION	13	289	22.23	4.7	SCIE	COMPUTER SCIENCE, INFORMATION SYSTEMS
4	EUROPEAN ARCHIVES OF OTO-RHINO-LARYNGOLOGY	12	71	5.92	1.9	SCIE	OTORHINOLARYNGOLOGY
5	JMIR MEDICAL EDUCATION	12	261	21.75	3.2	ESCI	EDUCATION, SCIENTIFIC DISCIPLINES
6	JOURNAL OF CLINICAL MEDICINE	12	53	4.42	3.0	SCIE	MEDICINE, GENERAL & INTERNAL
7	NPJ DIGITAL MEDICINE	11	315	28.64	12.4	SCIE	HEALTH CARE SCIENCES & SERVICES
8	LECTURE NOTES IN COMPUTER SCIENCE	10	155	15.50	0.3	SCIE	COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
9	APPLIED SCIENCES-BASEL	8	63	7.88	2.5	SCIE	CHEMISTRY, MULTIDISCIPLINARY
10	JMIR MEDICAL INFORMATICS	8	91	11.38	3.2	SCIE	MEDICAL INFORMATICS
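The C/P column is simply total citations divided by the number of publications in the dataset. A brief sketch using three rows from Table 2 shows how the ranking by C/P differs from the ranking by publication volume; the variable names are introduced here for illustration.

```python
# (publications, citations) for three journals, taken from Table 2.
journals = {
    "NPJ DIGITAL MEDICINE": (11, 315),
    "JOURNAL OF MEDICAL INTERNET RESEARCH": (13, 321),
    "JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION": (13, 289),
}

# Citations per publication, rounded to two decimals as in the table.
cp = {
    name: round(citations / publications, 2)
    for name, (publications, citations) in journals.items()
}

ranked_by_cp = sorted(cp, key=cp.get, reverse=True)
for name in ranked_by_cp:
    print(f"{name}: {cp[name]:.2f}")
```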

A VOSviewer-based co-citation analysis was subsequently conducted to examine inter-journal citation relationships among peer-reviewed journals, deliberately excluding preprint platforms such as arXiv and medRxiv to avoid potential network distortion (Figure 7). The analysis incorporated 80 journals that satisfied the predefined citation threshold. The findings indicated that Cureus Journal of Medical Science was frequently co-cited with JMIR, JMIR Medical Education, Nature, PLOS Digital Health, and Radiology. Other prominent co-citation links involved Bioinformatics, Lecture Notes in Computer Science, and JAMIA, as well as cross-disciplinary clusters connecting Advances in Neural Information Processing Systems, Journal of Biomedical Informatics, and IEEE Access.

Figure 7.

Co-citation network of top 80 journals.

A dual-map overlay generated via CiteSpace (Figure 8) further contextualized the disciplinary intersections between citing and cited journals. On the left side of the map, the dominant domains of the citing journals included “Mathematics, Systems, Mathematical,” “Medicine, Medical, Clinical,” “Ecology, Earth, Marine,” and “Molecular, Biology, Immunology.” On the right side, the principal domains of the cited journals comprised “Systems, Computing, Computer,” “Environmental, Toxicology, Nutrition,” “Chemistry, Materials, Physics,” and “Health, Nursing, Medicine.” Green citation paths indicated that journals categorized within the “Medicine, Medical, Clinical” domain predominantly cited literature originating from the domains of “Health, Nursing, Medicine” as well as “Molecular, Biology, Genetics.”

Figure 8.

Dual-map overlay of journals related to LLMs in medical diagnosis.

Distribution and co-authorship analysis of institutions

A total of 1,363 institutions contributed to the research on LLM-assisted diagnosis, with the ten most productive institutions listed in Table 3. Among these institutions, eight are in North America and two in Asia. Harvard University (USA) and Stanford University (USA) ranked as the most productive institutions, followed by the Icahn School of Medicine at Mount Sinai (USA). Tel Aviv University (Israel) and Peking University (China) were the only non-U.S. institutions in the top-ranking group. In terms of citation counts, Brown University (USA) ranked first, followed by Harvard University (USA) and Stanford University (USA). Brown University accumulated 1,326 citations and demonstrated a C/P value of 147.33, markedly exceeding that of other institutions. This exceptionally high citation performance was largely associated with a single highly influential publication entitled “Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models,” which had accrued a total of 2,389 citations in the WOSCC database as of April 7, 2025.³⁷ This disproportionately cited article substantially contributed to both the total citation count and the C/P metric for Brown University. Collectively, the top 10 institutions accounted for 24.15% of all publications, indicating a concentration of research output among leading institutions and suggesting uneven institutional participation across the field. Regarding collaborative connectivity, Harvard University led with a TLS of 113, followed by Stanford University (99) and Tel Aviv University (59), all of which exceeded the network average of 7.42. These findings indicated relatively strong collaborative connectivity and central positioning within the institutional collaboration network.

Table 3.

Top 10 most productive institutions.

Rank	Organization	Country/Region	Number of publications	Citations	C/P	Total link strength
1	Harvard University	USA	26	411	15.81	113
2	Stanford University	USA	26	225	8.65	99
3	Icahn School of Medicine at Mount Sinai	USA	20	114	5.70	52
4	Mayo Clinic	USA	19	179	9.42	46
5	Tel Aviv University	Israel	18	66	3.67	59
6	Yale University	USA	12	95	7.92	35
7	Peking University	China	9	26	2.89	49
8	Columbia University	USA	9	32	3.56	12
9	University of Pennsylvania	USA	9	138	15.33	22
10	Brown University	USA	9	1326	147.33	15

Institutional collaboration patterns were further examined using a network analysis approach, focusing on the top 89 institutions with at least four publications each (Figure 9). The findings indicated that institutions such as Harvard University, Stanford University, Tel Aviv University, Mayo Clinic, and the Icahn School of Medicine at Mount Sinai maintained particularly high levels of collaborative activity. Notable bilateral partnerships included Harvard University–Mayo Clinic, Icahn School of Medicine at Mount Sinai–Tel Aviv University, and Chaim Sheba Medical Center–Tel Aviv University. Additional strong collaborations were observed between Harvard University and Stanford University, the Zucker School of Medicine and Ohio State University, Ben-Gurion University of the Negev and Tel Aviv University, the Icahn School of Medicine at Mount Sinai and Chaim Sheba Medical Center, as well as the VA Palo Alto Health Care System and Stanford University. The network also revealed clusters of collaborating institutions, including a U.S.-centered cluster primarily composed of Harvard University, Stanford University, Mayo Clinic, and the VA Palo Alto Health Care System, as well as an Israeli-centered cluster involving Tel Aviv University, Chaim Sheba Medical Center, the Icahn School of Medicine at Mount Sinai, and Ben-Gurion University of the Negev. Overall, these findings suggested a high degree of domestic collaboration within the United States and relatively dense intra-regional collaboration among Israeli institutions, whereas cross-regional collaboration remained comparatively limited.

Figure 9.

Collaborative networks of top 89 institutions.
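The collaboration strengths described above can be illustrated with a minimal co-authorship count. The sketch below assumes full counting, in which each shared publication adds one to the link between two institutions. The counting scheme actually used by the authors is not stated in this section, and the `papers` list is hypothetical sample data rather than the study's dataset.

```python
from collections import Counter
from itertools import combinations

# HYPOTHETICAL input: one set of affiliated institutions per publication.
papers = [
    {"Harvard University", "Mayo Clinic"},
    {"Harvard University", "Mayo Clinic", "Stanford University"},
    {"Tel Aviv University", "Chaim Sheba Medical Center"},
    {"Tel Aviv University", "Icahn School of Medicine at Mount Sinai"},
]


def link_strengths(papers):
    """Count, for each unordered institution pair, the papers they share."""
    strengths = Counter()
    for institutions in papers:
        for pair in combinations(sorted(institutions), 2):
            strengths[pair] += 1
    return strengths


def total_link_strength(strengths):
    """Sum each institution's link strengths over all of its partners."""
    totals = Counter()
    for (a, b), weight in strengths.items():
        totals[a] += weight
        totals[b] += weight
    return totals


strengths = link_strengths(papers)
totals = total_link_strength(strengths)

for institution, total in totals.most_common():
    print(f"{institution}: {total}")
```

Pairs with high link strength correspond to the bilateral partnerships named in the text, and institutions with high totals correspond to the hubs of the network.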

Research clusters of the publications

Research clusters were identified through frequency and link-strength analysis of 259 keywords extracted from the literature database using CiteSpace, thereby characterizing the distribution and interrelationships of the principal thematic domains (Figure 10). Ten clusters were identified and subsequently categorized into five overarching groups.

Figure 10.

Research clusters network map.
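The keyword co-occurrence analysis and the modularity statistic reported for it can be sketched on a toy example. The code below is not CiteSpace's procedure: the keyword lists are hypothetical, the cluster labels are assigned by hand, and Q is computed with the standard Newman definition, which is assumed here to correspond to the reported modularity value.

```python
from collections import Counter
from itertools import combinations

# HYPOTHETICAL keyword lists, one per publication (illustrative only).
keyword_lists = [
    ["large language model", "machine learning", "big data"],
    ["medical imaging", "CT", "diabetic retinopathy"],
    ["large language model", "medical imaging"],
]

# Hand-assigned cluster labels; CiteSpace's own clustering is not reproduced.
partition = {
    "large language model": 0, "machine learning": 0, "big data": 0,
    "medical imaging": 2, "CT": 2, "diabetic retinopathy": 2,
}


def cooccurrence(keyword_lists):
    """Edge weight = number of publications in which two keywords co-occur."""
    weights = Counter()
    for keywords in keyword_lists:
        for pair in combinations(sorted(set(keywords)), 2):
            weights[pair] += 1
    return weights


def modularity(weights, partition):
    """Newman modularity Q of a weighted undirected graph for a partition."""
    m = sum(weights.values())  # total edge weight
    degree, internal = Counter(), Counter()
    for (a, b), w in weights.items():
        degree[a] += w
        degree[b] += w
        if partition[a] == partition[b]:
            internal[partition[a]] += w
    community_degree = Counter()
    for node, k in degree.items():
        community_degree[partition[node]] += k
    return sum(
        internal[c] / m - (community_degree[c] / (2 * m)) ** 2
        for c in community_degree
    )


weights = cooccurrence(keyword_lists)
q = modularity(weights, partition)
print(f"edges={len(weights)}, Q={q:.4f}")  # edges=7, Q=0.3571
```

Higher Q values indicate that more co-occurrence weight falls within clusters than would be expected by chance, which is the sense in which a high modularity is read as a well-separated cluster structure.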

Group 1 encompassed Cluster 0 and Cluster 2, which together represented major technical domains identified within the keyword co-occurrence network.

• Cluster 0 (23 keywords, including artificial intelligence, neural networks, machine learning, large language model, multimodal large language models, big data, and AI-powered algorithms) primarily reflected the concentration of research activity on the fundamental computational frameworks and data infrastructures underpinning this field. Specifically, big data was frequently referenced as a foundational component encompassing heterogeneous datasets, including patient records, laboratory findings, diagnostic imaging, and biomedical text corpora.³⁸ Machine learning, neural networks, and LLMs also emerged as central methodological themes within the cluster.³⁹,⁴⁰

• Cluster 2 (22 keywords, including medical imaging, identification, CT, diagnosis, image processing, images, knowledge graph, and diabetic retinopathy) highlighted the prominence of imaging-related research within the broader LLM-assisted diagnosis literature.⁴¹ Imaging modalities such as CT, MRI, and retinal imaging were frequently represented as key data domains associated with LLM applications.⁴² Similarly, terms such as image processing and knowledge graphs indicated ongoing scholarly interest in enhancing data representation and semantic integration. The presence of disease-specific terms, such as diabetic retinopathy, reflected specific application areas explored within the literature.⁴³

Together, Clusters 0 and 2 represented two major thematic domains identified through keyword co-occurrence analysis, namely computational methodologies and medical imaging applications. These clusters represented thematic concentrations within the literature rather than evidence of established clinical effectiveness or real-world implementation. The keyword associations within these clusters indicated thematic attention to LLM architectures for medical data processing,⁴⁴ multimodal data integration,⁴⁵ image analysis techniques,⁴⁶ and structured knowledge representation,⁴⁷ alongside ethical and legal considerations.⁴⁸

Group 2 included Cluster 3, which was characterized by 21 keywords, including clinical reasoning, language models, natural language processing, evidence-based medicine, search engine, clinical informatics, and medical informatics. This cluster was centered on clinical reasoning and was characterized by keywords related to clinical decision-making, clinical informatics, and medical informatics.⁴⁹ Importantly, the presence of terms such as “clinical reasoning” and “evidence-based medicine” should be interpreted as indicative of research interest rather than evidence of validated LLM capabilities. Keyword associations within this cluster indicated several thematic sub-areas, including LLM-based modeling of clinical reasoning,⁵⁰ personalization of clinical reasoning,⁵¹ contextual and multimodal clinical reasoning,⁵² interpretability and transparency in LLM-assisted decision-making,⁵³ evaluation of LLM-assisted clinical reasoning,⁵⁴ and human-LLM collaboration.⁵⁵

Group 3 integrated Clusters 4, 5, and 6, which collectively reflected major thematic domains of scholarly attention regarding the potential applications of LLMs in medical diagnosis.

• Cluster 4 (20 keywords, including decision making, disease detection, electronic medical records, multimodal data, skin cancer, multiple sclerosis, cervical cancer, and diagnostic uncertainty) was characterized by keyword associations related to diagnostic complexity and uncertainty as represented in the literature. Such uncertainty is commonly associated with complex or rare conditions characterized by overlapping symptoms or incomplete clinical information. The co-occurrence of terms related to multimodal data and electronic medical records suggested thematic attention to the integration of diverse clinical data sources, such as electronic health records, laboratory results, and imaging data, within LLM-related studies. Keyword associations within this cluster indicated multiple thematic areas of investigation, including contextual interpretation of medical information,⁵⁶ addressing ambiguity and incomplete information,⁵⁷ supporting differential diagnosis and multi-criteria decision-making,⁵⁸ and adaptive learning and model refinement.⁵⁹

• Cluster 5 (18 keywords, including information, electronic health records, named entity recognition, artificial hallucinations, and diagnostic accuracy) represented a complementary thematic domain focused on the evaluation of the reliability and validity of LLM-generated diagnostic outputs. Keywords such as artificial hallucinations and diagnostic accuracy underscored sustained scholarly attention to the reliability, limitations, and dependability of LLM-generated results. Keyword associations within this cluster indicated thematic attention to probabilistic reasoning and confidence estimation,⁶⁰ mitigation of LLM hallucinations,⁶¹ and strategies for bias reduction in LLM-assisted decision-making.⁶²

• Cluster 6 (17 keywords, including medical education and training, clinical practice guidelines, quality, physicians, clinical practice, errors, accuracy, and best practices) highlighted an additional thematic direction focused on the intersection of LLMs with medical education, professional training, and clinical practice contexts, as suggested by the co-occurrence of terms such as medical education, clinical practice, and best practices. Within this body of work, LLMs were frequently discussed within experimental or conceptual contexts (for example, case-based learning and clinical training scenarios) rather than as validated clinical tools. Accordingly, these discussions remained largely exploratory and should not be interpreted as evidence of established effectiveness. Keyword associations within this cluster indicated thematic attention to simulation-based learning,⁶³ real-time feedback and instructional support,⁶⁴ multimodal scenario integration,⁶⁵ and the design of adaptive or personalized educational tools.⁶⁶

Group 4 integrated Clusters 1, 7, and 9, which collectively illustrated major application-oriented thematic domains identified within the literature on LLM-assisted diagnosis.

• Cluster 1 (23 keywords, including cancer, breast cancer, rare disease, thyroid cancer, human phenotype ontology, random forest, differential diagnosis, risk assessment, care, active surveillance, and survival) reflected a substantial concentration of scholarly attention on disease-focused applications, particularly within oncology and rare diseases. The frequent co-occurrence of terms such as risk assessment, survival, and differential diagnosis suggested thematic attention to disease detection, prognostic evaluation, and clinical management applications within the literature. Keyword associations within this cluster indicated several thematic areas, including early detection,⁶⁷ active surveillance strategies,⁶⁸ robust risk-assessment frameworks,⁶⁹ and predictive modeling tailored to disease-specific contexts.⁷⁰

• Cluster 7 (15 keywords, including data models, adaptation models, critical care, acute respiratory distress syndrome, critical care nephrology, bias, and diagnostic errors) highlighted a thematic domain centered on critical care and high-acuity clinical settings. The co-occurrence of keywords related to critical illness, diagnostic errors, and adaptation models suggested scholarly attention to the potential use of LLM-assisted systems in complex and time-sensitive environments. Keyword associations within this cluster indicated thematic attention to real-time diagnostic support in acute settings,⁷¹ prediction of patient deterioration and complications,⁷² guidance for treatment protocols and pharmacological management,⁷³ and monitoring of patient safety.⁷⁴

• Cluster 9 (13 keywords, including patient care, ai chatbot evaluation, education, artificial intelligence applications, chatbots, and first aid) emphasized an application-oriented thematic domain associated with routine patient care and healthcare delivery contexts. The co-occurrence of terms such as chatbots, patient care, and education suggested thematic attention to clinical communication, patient interaction, and healthcare delivery processes. Keyword associations within this cluster indicated multiple thematic areas of investigation, including preventive care and screening,⁷⁵ automation of administrative tasks and documentation,⁷⁶ optimization of healthcare workflows,⁷⁷ and chronic disease management and monitoring.⁷⁸

Group 5 comprised Cluster 8, which examined the role of prompt engineering in diagnostic contexts. This cluster was defined by 14 keywords, including prompt engineering, transformer, knowledge retrieval, foundation model, generative artificial intelligence, clinical decision tool, medical AI, and AI in healthcare. The prominence of these terms indicated that prompt engineering has emerged as a distinct and increasingly investigated methodological domain within the literature. The co-occurrence of keywords suggested thematic attention to how input design influences LLM-generated outputs in clinical contexts. Prompt engineering was frequently investigated as a strategy for structuring interactions with foundation models and enhancing the relevance, consistency, and contextual coherence of generated outputs. Investigations within this cluster primarily reflected thematic attention to structured prompt design for clinical applications,⁷⁹ the development of domain-specific or context-aware prompting strategies,⁸⁰ the examination of clinician–AI interaction practices,⁸¹ and the evaluation of output quality and reliability.⁸² Importantly, the bibliometric approach employed in this study does not permit causal inferences regarding the impact of prompt engineering on diagnostic performance or patient outcomes.

Analysis of clinical research

A targeted PubMed search retrieved 96 clinical studies meeting the inclusion criteria. The search strategy is detailed in Supplemental File 3, and the inclusion and exclusion criteria are described in the Methods section. These publications spanned a broad spectrum of clinical contexts, thereby providing a fine-grained overview of the clinical domains, study designs, and evaluation settings represented within the emerging literature on LLM-assisted diagnosis. The analysis below is organized according to five key axes.

The 96 included studies demonstrated a highly heterogeneous distribution across clinical specialties, reflecting both the broad range of clinical domains investigated and the concentration of research activity within selected specialties. Gastroenterology and Hepatology represented the largest cluster (27.1%, n = 26), primarily comprising studies involving LLM-assisted colonoscopy for colorectal polyp and adenoma detection, as well as upper gastrointestinal endoscopy applications focusing on esophageal squamous cell carcinoma, Helicobacter pylori infection diagnosis, and bowel preparation quality assessment. Oncology constituted the second largest domain (19.8%, n = 19), encompassing applications such as prostate cancer detection using MRI and biopsy interpretation, sentinel lymph node evaluation in breast cancer, and melanoma detection in dermatology. Internal Medicine and General Practice accounted for 13.5% (n = 13), including diagnostic reasoning support, retrieval-augmented generation–based pre-procedure informed consent chatbots, and health economic evaluations of LLM-assisted diagnostic interventions in nutrition-related settings. Cardiology and Cardiovascular Medicine represented 10.4% (n = 10), covering ECG-based mortality alert systems, cardiomyopathy screening in pregnancy, cardiovascular risk stratification, and heart failure diuretic management. The remaining studies were distributed across Emergency Medicine (6.2%, n = 6), Pulmonology and Respiratory Medicine (5.2%, n = 5), Neurology (4.2%, n = 4), Ophthalmology (4.2%, n = 4), and other specialties.

The methodological characteristics of the included studies were heterogeneous, with randomized controlled trials (RCTs) constituting the largest study design category and accounting for 66.7% (n = 64) of all studies. Among these RCTs, parallel-group superiority designs predominated, whereas non-inferiority and equivalence designs represented a substantial minority. Non-randomized interventional studies (including single-arm cohort studies, paired-reader studies, and observational implementation studies) accounted for 8.3% (n = 8) of the sample. These designs were primarily used to examine real-world implementation and generally provided a lower level of causal inference than randomized study designs. Prospective observational studies comprised 12.5% (n = 12) of the total and generally evaluated LLM-assisted diagnostic systems within routine clinical workflows. Retrospective observational studies and secondary analyses of prior trial data comprised the remaining 12.5% (n = 12), predominantly serving to validate AI algorithms using previously collected imaging or histopathological datasets, or to explore emerging biomarker endpoints within completed trial cohorts.

Classification according to the Oxford Centre for Evidence-Based Medicine (OCEBM) hierarchy revealed the following distribution of evidence levels across the included studies. A total of 10.4% (n = 10) of studies qualified at OCEBM Level 1, encompassing high-quality RCTs. The majority of studies (64.6%, n = 62) were classified as OCEBM Level 2, including individual cohort studies or RCTs that did not meet Level 1 criteria. A further 15.6% (n = 15) of studies were categorized as Level 3, comprising cohort studies and case-control studies lacking strong controls. The remaining 9.4% (n = 9) of studies were classified as Level 4, predominantly consisting of descriptive case reports and low-quality controlled observational studies.
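The reported shares for study design and evidence level can be reproduced directly from the counts given in the text. The sketch below assumes each percentage is the category count divided by the 96 included studies, rounded to one decimal place. The category labels are shortened from the prose, and the variable names are illustrative.

```python
TOTAL_STUDIES = 96  # clinical studies retrieved in the targeted PubMed search

# Counts as stated in the text.
study_designs = {
    "Randomized controlled trials": 64,
    "Non-randomized interventional studies": 8,
    "Prospective observational studies": 12,
    "Retrospective observational studies and secondary analyses": 12,
}
evidence_levels = {"Level 1": 10, "Level 2": 62, "Level 3": 15, "Level 4": 9}


def shares(counts, total):
    """Percentage share of each category, rounded to one decimal place."""
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


design_shares = shares(study_designs, TOTAL_STUDIES)
level_shares = shares(evidence_levels, TOTAL_STUDIES)

# Both classifications should account for every included study.
assert sum(study_designs.values()) == TOTAL_STUDIES
assert sum(evidence_levels.values()) == TOTAL_STUDIES

for label, pct in {**design_shares, **level_shares}.items():
    print(f"{label}: {pct}%")
```

Both classifications sum to 96, so each partitions the full set of included studies without overlap or omission.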

Despite the growing body of evidence, the literature identified several translational challenges that may constrain the integration of LLMs into clinical practice. Several studies highlighted concerns regarding performance variability across heterogeneous clinical environments characterized by differences in data distributions, workflow conditions, and operator-dependent factors. LLMs also raised potential safety concerns, including the risk of model hallucinations, while standardized prospective monitoring frameworks remained insufficiently developed. Heterogeneity in operator expertise was additionally reported as a factor that may influence the consistency of AI-assisted diagnostic performance, as performance gains may disproportionately amplify pre-existing levels of clinical expertise rather than reduce variability in diagnostic performance. In addition, geographic underrepresentation—with research output heavily concentrated in high-income regions—may limit the available evidence regarding the transferability of findings to populations with distinct epidemiological profiles. Furthermore, barriers to workflow integration persisted, including increased consultation time requirements and the lack of structured integration into clinical workflows, rendering technical performance alone insufficient for meaningful adoption.

Discussion

This study analyzed 650 publications indexed between Q1 2023 and Q1 2025, with the aim of providing a comprehensive overview of the development trajectory, research clusters, current research status, and emerging trends in LLM-assisted diagnosis.

Temporal distribution analysis demonstrated sustained and rapid growth in this field since 2023, with an average quarterly growth rate exceeding 70%. While bibliometric analysis does not support causal inference, this growth may be associated with several concurrent developments. From a technological perspective, advances in frontier models such as GPT-4 may have contributed to lowering barriers to experimentation through improvements in natural language understanding, multimodal reasoning, and API accessibility.⁸³ Clinically, increasing pressures associated with population aging, the escalating burden of chronic disease, workforce shortages, and rising healthcare expenditures may likewise have contributed to growing interest in scalable diagnostic solutions.⁸⁴,⁸⁵ In addition, the broader transition toward data-driven medicine may have further stimulated scholarly interest in LLM-assisted diagnostic approaches.⁸⁶ These factors should be interpreted as contextual considerations rather than definitive causal explanations.
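One way to arrive at a quarterly growth rate of this size is a compound growth calculation over the study period. The sketch below assumes that the rate is the constant per-quarter growth that takes the 2 publications of Q1 2023 to the 148 publications of Q1 2025 across eight quarter-to-quarter steps. The authors' exact formula is not given in this section, so this is a plausibility check rather than a reproduction of their method.

```python
def compound_quarterly_growth(first: float, last: float, intervals: int) -> float:
    """Constant per-quarter growth rate that takes `first` to `last`."""
    return (last / first) ** (1 / intervals) - 1


# Q1 2023 to Q1 2025 spans nine quarters, i.e. eight quarter-to-quarter steps.
# ASSUMPTION: compound (geometric) growth; the source does not state the formula.
rate = compound_quarterly_growth(first=2, last=148, intervals=8)
print(f"{rate:.2%}")
```

Under this assumption the result is approximately 71%, in line with the average quarterly growth rate of 71.25% reported in the abstract. A simple arithmetic mean of quarter-on-quarter changes would generally give a different figure.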

The United States, China, and Germany emerged as the three leading contributors to research on LLM-assisted diagnosis. Notably, the United States dominated the most productive authors and institutions, underscoring its leading position in this field. In terms of scientific collaboration, a global cooperation network has already formed, connecting the United States, China, Germany, Italy, the United Kingdom, and Israel. The United States functioned as the principal hub, maintaining extensive partnerships with numerous countries. Regionally, a China-centered cluster has developed in Asia, whereas a Germany-centered cluster has emerged in Europe. It is also noteworthy that countries across Africa and South America remained markedly underrepresented. This pattern may be viewed, within the broader literature, as potentially associated with disparities in computational infrastructure, access to large-scale biomedical datasets, and differences in regulatory or governance environments. This geographic imbalance may also be interpreted as potentially influencing the datasets, implementation priorities, and validation contexts underlying future LLM-assisted diagnostic systems, which may further raise theoretical concerns regarding contextual generalizability and algorithmic bias in low-resource settings. Although these interpretations remain speculative and are inferred from collaboration structures rather than direct empirical evidence, they underscore the potential importance of inclusive international collaborations, cross-regional data-sharing frameworks, and sustained investment in digital research infrastructure.

Ten clusters were identified in the field and subsequently grouped into five overarching categories. The keyword co-occurrence analysis indicated that Group 1 was primarily characterized by the intersection of two closely related technical domains: multimodal AI architectures and medical imaging. This group reflected a concentration of research activity centered on computational frameworks and data integration strategies within the literature. From an interpretive perspective, these bibliometric patterns may reflect increasing scholarly attention to multimodal diagnostic frameworks.⁸⁷,⁸⁸ The frequent co-occurrence of keywords related to textual, imaging, and other data modalities further supports this interpretation. However, such patterns should be regarded as narrative synthesis rather than evidence of demonstrated improvements in diagnostic performance or clinical effectiveness.⁸⁹,⁹⁰ Consistent with these thematic patterns, recent literature has increasingly explored disease-specific applications of multimodal LLMs in domains such as radiology and ophthalmology, including CT-based intracranial assessment and retinal image interpretation for diabetic retinopathy screening. These studies commonly emphasize the integration of radiologic findings, textual interpretation, and knowledge-grounded contextual reasoning, illustrating how multimodal frameworks are being investigated across different diagnostic settings. Nevertheless, current investigations remain predominantly experimental, and future studies may benefit from prospective validation across heterogeneous populations, rigorous benchmarking against clinician performance, and systematic evaluation of generalizability beyond curated datasets. In addition, the presence of terms related to ethical and legal considerations indicated scholarly attention to governance and trustworthiness within this technical domain. Consistent with this observation, the broader literature has increasingly discussed issues such as regulatory readiness, algorithmic fairness, accountability frameworks, data privacy, informed consent, and liability allocation, highlighting the growing prominence of governance considerations in multimodal diagnostic AI research.

Group 2 focused on the translation of LLM technical capabilities into clinically meaningful reasoning processes. The literature frequently discussed the application of LLMs in contexts associated with clinical decision-making. From a narrative synthesis perspective, these themes may reflect growing scholarly interest in how LLMs could be conceptually aligned with selected components of clinical reasoning, particularly information synthesis and hypothesis generation. However, the bibliometric methodology does not permit inferences regarding whether LLMs replicate or approximate human cognitive processes; consequently, such interpretations should remain appropriately qualified.⁴⁹ Building on this framing, recent research has focused on the reasoning and collaborative aspects of LLM-assisted diagnosis, particularly diagnostic accuracy and calibration, human–AI collaboration, clinical decision-making, and uncertainty management. These aspects have been examined in experimental or simulated settings, with the aim of improving the reliability, interpretability, and clinical relevance of model-supported diagnostic reasoning. In parallel, translational and implementation considerations have also received attention, including clinical validation, robustness and safety assessment, and dataset representativeness, which together shape the methodological and institutional requirements for the responsible future deployment of LLM-based diagnostic systems.

Group 3 elucidated a set of application-oriented research themes identified through keyword co-occurrence analysis, particularly those related to diagnostic uncertainty and medical education. The literature frequently engaged with challenges associated with complex clinical data, model reliability, and educational or training contexts. From an interpretive perspective, these bibliometric findings may reflect broader scholarly interest in exploring how LLMs could be applied in contexts characterized by diagnostic uncertainty, particularly those involving ambiguous or incomplete clinical information. Consistent with these thematic patterns, recent studies have increasingly explored strategies for managing diagnostic uncertainty, improving the reliability of LLM-generated outputs, and mitigating risks associated with hallucinations and inaccurate recommendations; integration of multimodal data sources and complementary AI approaches has emerged as a recurring area of investigation.⁹¹–⁹⁴ However, these strategies remain largely exploratory and have not yet been systematically validated in real-world clinical settings. In addition, the literature has increasingly examined the use of LLMs in educational and training environments, with simulation-based learning, feedback generation, and case-based instruction frequently discussed as potential application areas.

Group 4 emphasized application contexts evident in the bibliometric structure, including disease-focused research, critical care settings, and patient-facing healthcare services, thereby indicating coverage across multiple clinical domains and care levels. From a narrative synthesis perspective, these patterns may reflect increasing scholarly attention to the potential integration of LLM-assisted systems into diverse healthcare environments, ranging from high-acuity clinical settings to routine care and patient-facing services.⁹⁵,⁹⁶ Consistent with these thematic patterns, recent studies have increasingly examined disease-specific applications, real-world implementation scenarios, and the use of domain-adapted models across heterogeneous clinical settings. At the clinical deployment level, the literature has increasingly emphasized implementation-related considerations, including clinician oversight, workflow integration, consultation-time burden, auditability, and prospective safety monitoring. These recurring themes suggest that successful translational deployment may depend not only on algorithmic performance but also on the broader sociotechnical environments in which LLM-assisted systems are implemented. This highlights the potential value of integrating perspectives from clinical informatics, human factors engineering, and implementation science alongside conventional model-evaluation frameworks.

Group 5 highlighted prompt engineering as a distinct methodological theme emerging from the keyword co-occurrence analysis, indicating growing scholarly attention to interaction design with LLMs. From a bibliometric interpretive perspective, these patterns may reflect increasing interest in how input formulation may influence LLM-generated outputs in medical contexts. Consistent with these thematic patterns, recent studies have increasingly examined structured prompting, domain-specific adaptation, and clinician–AI interaction as approaches for optimizing interactions with foundation models and improving the relevance and reliability of generated outputs.⁹⁷,⁹⁸ Emerging literature has also explored adaptive prompting strategies responsive to patient complexity, automated frameworks for prompt evaluation, and the integration of prompting techniques into broader clinical decision-support ecosystems. In parallel, retrieval-augmented prompting, self-reflective reasoning pipelines, and automated prompt-optimization frameworks are increasingly discussed as methodological approaches that may further support the reproducibility, robustness, interpretability, and safety of LLM-assisted diagnostic interactions.

Building on the translational challenges identified in the Results section, the current literature suggests several complementary directions for future investigation. Large-scale multicenter studies may benefit from extending beyond surrogate performance metrics to examine patient-centered outcomes, including disease-specific survival, quality-adjusted life-years (QALYs), and complication rates. The literature has also demonstrated increasing attention to the development of standardized safety evaluation frameworks for LLM-assisted diagnosis, particularly those incorporating prospective assessment of diagnostic errors, inaccuracies, and potential clinical harm. Research on implementation across heterogeneous healthcare systems may further benefit from rigorous evaluation of adoption processes, workflow integration, and cost-effectiveness. In addition, equity-focused research may benefit from the inclusion of underrepresented populations from resource-constrained settings to improve generalizability across diverse epidemiological and healthcare contexts. Human–AI collaboration likewise emerged as a recurring theme, encompassing questions related to variability in clinician expertise, appropriate degrees of algorithmic autonomy, explainability mechanisms, uncertainty communication, and calibration of clinician reliance. Finally, longitudinal investigations may help clarify the potential influence of LLM integration on clinician decision-making, including issues related to automation bias and changes in diagnostic performance over time. Collectively, these directions reflect emerging research priorities identified within the current literature and may inform the responsible, safe, and clinically meaningful integration of LLM systems into clinical practice.

Limitations

This study has several limitations. First, the search strategy was iteratively developed based on commonly used LLM-related terminology, representative model names, and terminology identified through preliminary scoping searches and relevant review literature. Nevertheless, given the rapid evolution of nomenclature within the fields of LLMs and medical informatics, it remains possible that certain relevant publications employing unconventional, emerging, or insufficiently indexed terminology were not captured. Second, the present study primarily relied on the Web of Science Core Collection as the principal source database, potentially resulting in incomplete coverage of studies indexed in other databases and the underrepresentation of emerging research disseminated through preprint repositories and conference proceedings frequently used in artificial intelligence and computational medicine research. Consequently, the findings may partially reflect the indexing structure and publication dynamics of the WOSCC database rather than the full spectrum of ongoing scholarly activity within the field. Accordingly, future studies incorporating multiple databases, preprint repositories, conference proceedings, and domain-specific AI indexing platforms may provide a more comprehensive characterization of the evolving LLM research landscape. Third, the relatively short observation period (2023–2025) may accentuate apparent publication growth trends while limiting the ability to derive stable long-term evolutionary conclusions regarding the field. Accordingly, the observed growth patterns should be interpreted as early indicators of rapid scholarly expansion rather than definitive longitudinal trajectories. Fourth, bibliometric methods inherently capture publication volume, citation patterns, and thematic structures; however, they do not assess technical performance, clinical effectiveness, implementation outcomes, or patient-level impact. Consequently, the findings should be interpreted as reflecting patterns of research activity rather than evidence of clinical utility or effectiveness. Finally, keyword-based clustering analyses are sensitive to author-selected terminology, indexing practices, and database-specific metadata structures, which may introduce thematic overlap, cluster instability, and classification inaccuracies.

Conclusion

This bibliometric review mapped the evolving landscape of LLM-assisted diagnosis, highlighting the growth in scholarly publications, which reflected increasing academic and clinical interest rather than empirically validated technological maturity or demonstrated clinical advancement. The literature was characterized by prominent thematic concentrations related to computational foundations, clinical reasoning, diagnostic uncertainty, clinical applications, and prompt engineering methodologies. Keyword clustering and thematic analyses further identified recurrent areas of scholarly attention, including multimodal data integration, interpretability, clinician–AI interaction, and domain-specific diagnostic applications. Equally critical are recurring ethical and equity considerations—including bias mitigation, data governance and security, and global equity in access and representation—which continue to constitute subjects of ongoing scholarly and policy discourse rather than established safeguards or universally realized outcomes. Collectively, these findings provide a structured overview of the current research landscape and synthesize its major thematic directions, thereby informing future investigation of LLM-assisted diagnostic systems.

Supplemental material

Supplemental material for Application of large language models in medical diagnosis: A bibliometric review by Quan Zhang, Haokun Wang, Hongjuan Li, Fengbo Jiao, Hongchen Zhou, and Meiyu Li in Digital Health.

Footnotes

Acknowledgments

The authors acknowledge the contributions of researchers whose work has advanced the field of LLMs in medical diagnosis.

ORCID iD

Quan Zhang

Ethical considerations

This study exclusively used publicly available bibliographic data retrieved from the Web of Science Core Collection and PubMed. No human participants, clinical data, or identifiable personal information were involved. Therefore, in accordance with institutional and international research ethics guidelines, ethical approval was not required.

Author contributions

Investigation: HW, ML; Data curation: ML, FJ; Project administration: QZ, HL; Methodology: HW, HL, QZ; Software: HW, HZ; Visualization: HW, HZ; Formal analysis: QZ, ML; Writing—original draft: HW, ML, QZ; Conceptualization: HL, HW; Writing—review & editing: FJ, QZ; Supervision: QZ, HL; Funding acquisition: QZ, HL.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research was supported by the National Social Science Fund of China (25BSH015). The funding body had no role in study design, data collection, data analysis, or manuscript preparation.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The datasets generated and analyzed during this study are available from the corresponding author upon reasonable request.

Guarantor

QZ and HL are the guarantors of this work.

Supplemental material

Supplemental material for this article is available online.

