Abstract
This study critically evaluated the performance of Generative Pre-trained Transformer (GPT)-4-based large language models (LLMs) for extracting imaging findings from oncology records, with a primary focus on quantifying the impact of reference data quality on measured performance.
A two-phase study was conducted on 40 oncology medical records. In Phase 1, model outputs were compared against existing, uncurated reference summaries. In Phase 2, outputs for a 20-record subset were re-evaluated against a new “gold standard” of expert-curated, standardized summaries created by a board-certified radiologist. We systematically tested two model versions (text-only GPT-4.0 vs. multimodal GPT-4.1), two prompt designs, two input modalities (text vs. image), and two document scopes. Performance was assessed using lexical metrics (BLEU, ROUGE, METEOR) and a semantic alignment metric (Kullback–Leibler [KL] Divergence).
A profound performance disparity was observed between phases. Phase 1 evaluation against uncurated references yielded modest scores (e.g., max ROUGE-1 ≈ 0.45, BLEU ≈ 0.15) and high semantic divergence (KL > 7.7). In contrast, Phase 2 evaluation against the gold-standard references resulted in substantial improvements across all configurations. The top-performing configuration (multimodal GPT-4.1 using image-based input on the full document) achieved a ROUGE-1 of 0.57, a BLEU of 0.25, and a markedly lower KL Divergence of 5.96, more closely approaching the expert standard.
The quality and consistency of the reference standard are the most critical drivers of measured LLM performance in clinical information extraction tasks. Standard NLP metrics can be misleading when applied to uncurated “ground truth.” With a clinically validated reference, advanced multimodal models like GPT-4.1 demonstrate a powerful capability to accurately summarize complex oncology reports, highlighting the necessity of codeveloping AI models and their evaluation frameworks.
Introduction
The data challenge in precision oncology
Modern precision oncology is fundamentally data-driven, relying on the synthesis of complex patient information to tailor therapies to individual tumor biology. 1 Central to this paradigm is the longitudinal analysis of medical imaging, including computed tomography, magnetic resonance imaging, and positron emission tomography scans. These imaging studies provide the objective evidence needed for initial diagnosis and staging, the assessment of treatment response through standardized criteria, and long-term surveillance for disease recurrence. 2 However, the critical findings from these studies—such as tumor measurements, the appearance of new metastases, or signs of therapeutic response—are often encapsulated within unstructured, narrative radiology reports embedded within complex electronic health records (EHRs). 2 The manual process of extracting and synthesizing this longitudinal information for multidisciplinary tumor boards, clinical trial eligibility screening, or routine clinical follow-up is time-consuming, inefficient, and susceptible to error, creating a significant bottleneck in the delivery of timely and precise cancer care. 3
The emergence of LLMs for clinical information extraction
Artificial intelligence (AI), particularly the subfield of Natural Language Processing (NLP), offers a powerful solution to this unstructured data problem. 2 While traditional NLP methods have been applied in oncology for years, the recent advent of transformer-based large language models (LLMs) represents a paradigm shift in the ability to comprehend and generate human-like text. 4 Models from the Generative Pre-trained Transformer (GPT) family, especially GPT-4 and its successors, have demonstrated remarkable capabilities in a variety of clinical tasks, including information extraction and summarization. 5 A growing body of literature highlights the potential of these models to parse complex clinical documents, with studies showing GPT-4 outperforming earlier models in extracting actionable details from radiology reports and other clinical notes. 3 Consequently, LLMs are now the predominant technology being explored for automating data extraction in oncology research and care. 5
The unaddressed gap: Evaluation standards and multimodality
Despite the promise of LLMs, a critical challenge remains: robust and meaningful evaluation. Many studies report performance metrics without critically examining the quality or consistency of the “ground truth” data used for comparison. 5 This is particularly problematic in clinical domains, where reference summaries may be inconsistent, incomplete, or not created for the purpose of systematic evaluation. This raises a fundamental question: when an LLM’s output fails to match a reference, is it a failure of the model or a flaw in the reference? The urgent need for standardized, scalable, and clinically meaningful evaluation frameworks is a recurring theme in the literature. 5
Furthermore, the technological frontier is rapidly advancing beyond text-only models. Real-world medical records are frequently scanned documents or PDFs, not clean text files. Multimodal LLMs, which can process and interpret multiple data types such as text and images simultaneously, represent a significant technological leap. 6 These models can potentially bypass errors from Optical Character Recognition (OCR) and leverage visual and spatial cues within a document—such as layout, section headings, and tables—that are lost in plain text conversion. 7 Evaluating the practical advantage of these multimodal capabilities in a real-world clinical context is an area of active and vital research. 8
Study rationale and contributions
This study was designed to address these critical gaps through a rigorous, two-phase evaluation of GPT-4-based models for extracting imaging findings from oncology medical records. The primary objective was to isolate and quantify the impact of reference standard quality on the measured performance of these advanced AI systems.
The study was guided by two central hypotheses:
1. Standardizing the reference summaries to create a clinically curated “gold standard” (Phase 2) will lead to a significant and measurable improvement in all performance metrics compared with using uncurated, heterogeneous references (Phase 1).
2. The multimodal GPT-4.1 model will outperform the text-only GPT-4.0, with its advantage being most pronounced when provided with image-based document input and evaluated against a high-quality reference standard.
This work offers several novel contributions to the field of AI in precision oncology. It is the first study to directly quantify the performance delta attributable to reference data curation for this clinical task. It provides a systematic comparison of text versus image input for a real-world document extraction challenge, offering empirical evidence for the utility of multimodal models. Finally, it yields actionable insights into prompt engineering and evaluation practices that are essential for the responsible and effective deployment of clinical AI.
Materials and Methods
Dataset and cohort
The evaluation was conducted on a dataset of 40 de-identified, multi-page oncology medical records sourced from a clinical trial management system. Each record represented a single patient and contained a complex mixture of document types, including narrative radiology reports, pathology reports, and general clinical notes, reflecting the heterogeneity of real-world EHRs. 9 As an initial pre-processing step, a clinical analyst performed manual, page-by-page tagging of each record to identify all pages containing substantive imaging findings or radiologist impressions. This process created a focused subset of pages used in one of the experimental arms (“Tagged Pages”) to test the model’s performance with prefiltered, relevant content. 9
The two-phase evaluation design
The core of the methodology was a two-phase design, which functioned as a controlled experiment to measure the impact of the reference standard’s quality. Reference standard quality (uncurated vs. expert-curated) served as the study’s primary independent variable.
Phase 1 (Baseline evaluation)
This phase included all 40 medical records. The “ground truth” for evaluation consisted of pre-existing reference summaries available within the source system. These summaries were not created for systematic evaluation and were characterized by significant heterogeneity in style, variable levels of completeness, and a lack of a unified structure. They represented a realistic, albeit flawed, baseline for comparison. 9
Phase 2 (Gold-Standard evaluation)
This phase focused on a subset of 20 records selected randomly from the initial 40 to ensure a representative sample of document complexity and length, minimizing selection bias. For this phase, a new “gold standard” set of reference summaries was created. A board-certified radiologist (J.P.) meticulously reviewed each of the 20 records and rewrote the corresponding reference summary to adhere to a consistent, clinically relevant, and structured format. This format mandated one line per imaging study, beginning with the modality and date, followed by a concise description of key findings and critical negative results. This expert curation process was the key intervention designed to create a high-fidelity, reliable benchmark for model evaluation. 9
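The one-line-per-study format described above can be sketched as a small data structure. This is a minimal illustration only: the field names, the separators, the ISO date format, and the chronological ordering are assumptions, since the text specifies only the order of elements (modality, date, key findings, critical negatives). The example values are hypothetical placeholders, not study data.

```python
from dataclasses import dataclass
from datetime import date
from typing import Sequence


@dataclass(frozen=True)
class ImagingStudy:
    """One imaging study as it would appear in a curated reference summary."""
    modality: str                  # e.g. "CT", "MRI", "PET"
    study_date: date
    key_findings: str
    critical_negatives: str = ""   # optional critical negative results


def format_reference_line(study: ImagingStudy) -> str:
    """Render one study as a single line: modality, date, findings, negatives."""
    # Separator and date format are assumptions, not the paper's specification.
    line = f"{study.modality} {study.study_date.isoformat()}: {study.key_findings.strip()}"
    negatives = study.critical_negatives.strip()
    if negatives:
        line += f" {negatives}"
    return line


def format_reference_summary(studies: Sequence[ImagingStudy]) -> str:
    """Emit one line per imaging study, ordered by study date (an assumption)."""
    ordered = sorted(studies, key=lambda s: s.study_date)
    return "\n".join(format_reference_line(s) for s in ordered)


# Hypothetical placeholder content, used only to show the layout.
EXAMPLE_STUDIES = [
    ImagingStudy("MRI", date(2021, 3, 2), "Placeholder finding B.", "No placeholder negative."),
    ImagingStudy("CT", date(2020, 11, 15), "Placeholder finding A."),
]
EXAMPLE_SUMMARY = format_reference_summary(EXAMPLE_STUDIES)
```

A fixed layout of this kind is what makes lexical overlap metrics interpretable: two summaries that agree on content also tend to agree on token order and phrasing.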
Experimental configurations
A systematic, multi-factorial experimental design was employed to test the influence of several key variables on model performance. Each record in each phase was processed using 16 unique configurations, derived from the combination of the following factors.
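Model versions: GPT-4.0 and GPT-4.1.
Prompt templates: a concise prompt (Prompt V1) and a more detailed prompt (Prompt V2).
Input modalities: text input and image input.
Document scopes: the full record (“Whole MR”) and only the pages manually tagged as containing imaging findings (“Tagged Pages”).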
The experimental design is summarized in Table 1.
Table 1. Summary of Experimental Design
OCR, optical character recognition.
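The factorial structure (2 × 2 × 2 × 2 = 16 configurations) can be sketched as follows. The factor levels are taken from the text; the dictionary keys and function name are illustrative assumptions, and this is not the authors' code.

```python
from itertools import product

# Four two-level factors described in the study design.
FACTORS = {
    "model": ("GPT-4.0", "GPT-4.1"),
    "prompt": ("Prompt V1", "Prompt V2"),
    "input_modality": ("text", "image"),
    "document_scope": ("Whole MR", "Tagged Pages"),
}


def enumerate_configurations(factors=FACTORS):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


CONFIGURATIONS = enumerate_configurations()

# The configuration reported as top-performing in Phase 2.
TOP_CONFIGURATION = {
    "model": "GPT-4.1",
    "prompt": "Prompt V1",
    "input_modality": "image",
    "document_scope": "Whole MR",
}
```

Each record in each phase would then be processed once per entry in `CONFIGURATIONS`, which keeps the comparison between any two configurations balanced across the remaining factors.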
Evaluation framework
Model outputs were evaluated against the corresponding reference summaries using a combination of standard lexical similarity metrics and a more advanced semantic alignment metric.
Lexical similarity metrics:
Standard NLP metrics were used to measure the textual overlap between the generated summaries and the reference summaries. These included BLEU, ROUGE, and METEOR.
While these metrics are standard, their known limitations in capturing deeper semantic meaning, especially in specialized domains such as medicine, were acknowledged. 12
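To make the notion of lexical overlap concrete, the following is a minimal from-scratch sketch of ROUGE-1 (unigram overlap) as precision, recall, and F1. It is a simplification under stated assumptions: the tokenizer is a hypothetical one that lowercases and keeps words and decimal numbers, no stemming is applied, and the text does not state which library or which ROUGE variant (precision, recall, or F1) the study reported.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into words and (decimal) numbers; an assumed tokenizer."""
    return re.findall(r"\d+(?:\.\d+)?|[a-z]+", text.lower())


def rouge1(candidate, reference):
    """Unigram-overlap precision, recall, and F1 between two summaries."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((cand_counts & ref_counts).values())
    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Because the score depends only on shared tokens, a model output that is more complete than an incomplete reference loses precision, which is the failure mode examined in this study.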
Semantic alignment metric: Kullback–Leibler (KL) Divergence:
To complement the lexical metrics, KL Divergence was employed to assess semantic content alignment.
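For two discrete distributions P and Q over the same vocabulary, KL Divergence is defined as D_KL(P || Q) = Σ P(x) log(P(x) / Q(x)); it is zero when the distributions are identical and grows as they diverge. The text does not specify how the distributions were estimated, so the sketch below rests on assumptions: smoothed unigram distributions over the joint vocabulary, the reference taken as P and the model output as Q, and natural logarithms. It will not reproduce the scores reported in this study.

```python
import math
from collections import Counter


def unigram_distribution(tokens, vocab, alpha=1e-3):
    """Additively smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = sum(counts.values()) + alpha * len(vocab)
    return {word: (counts[word] + alpha) / total for word in vocab}


def kl_divergence(reference_tokens, output_tokens, alpha=1e-3):
    """D_KL(P || Q) in nats, with P from the reference and Q from the output."""
    vocab = sorted(set(reference_tokens) | set(output_tokens))
    if not vocab:
        return 0.0
    p = unigram_distribution(reference_tokens, vocab, alpha)
    q = unigram_distribution(output_tokens, vocab, alpha)
    # Smoothing keeps every q[word] positive, so the logarithm is always defined.
    return sum(p[word] * math.log(p[word] / q[word]) for word in vocab)
```

The smoothing constant `alpha` strongly affects the magnitude of the result when the two texts share few tokens, so absolute values are only comparable between evaluations that use the same estimation settings.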
Results
Primary finding: Reference standardization dramatically improves measured performance
The most significant finding of the study was the profound disparity in performance metrics between Phase 1 and Phase 2, attributable primarily to the quality of the reference standard. In Phase 1, where models were evaluated against uncurated and inconsistent reference texts, performance was uniformly modest across all 16 configurations. The highest ROUGE-1 score achieved was approximately 0.45, with BLEU scores rarely exceeding 0.15. The KL Divergence scores were consistently high, with the best configuration achieving a score of 7.73, indicating significant semantic divergence between the model outputs and the baseline references. 9,10
In stark contrast, Phase 2, which utilized the expert-curated, standardized gold-standard references, demonstrated substantial improvements across every metric for nearly all configurations. The average ROUGE-1 score across all configurations rose substantially, and the average KL Divergence dropped, signaling a much closer alignment between the models’ outputs and the expert standard. This shift strongly suggests that the modest scores observed in Phase 1 were not solely an indictment of the models’ capabilities but were largely an artifact of an inconsistent and unreliable evaluation benchmark. The models were capable of producing high-quality summaries, but this was only revealed when they were measured against an equally high-quality standard.
Dissecting performance under gold-standard conditions (Phase 2)
Focusing on the more reliable Phase 2 results allowed for a clearer assessment of the different experimental configurations.
GPT-4.1 superiority
In the controlled environment of Phase 2, GPT-4.1 consistently and clearly outperformed its predecessor, GPT-4.0. For instance, comparing the top-performing configuration (Prompt V1, Image input, Whole MR) for GPT-4.1 against a comparable text-based configuration for GPT-4.0, the KL Divergence was 5.96 for GPT-4.1 versus 7.88 for GPT-4.0. 9,10 This pattern held across lexical metrics as well, with GPT-4.1 achieving higher ROUGE and BLEU scores, indicating its superior ability to both identify and correctly phrase the key findings when guided by a clear reference.
The multimodal advantage
The study provided concrete evidence for the value of multimodal capabilities. The highest-performing configurations for GPT-4.1 in Phase 2 all utilized image-based input, which surpassed their text-only counterparts. The overall best configuration used image input and achieved a ROUGE-1 of 0.57 and a BLEU score of 0.25. 7 The same configuration using text input yielded slightly lower scores. This suggests that the ability to process the document in its native visual format allows the model to leverage formatting cues and potentially circumvent OCR imperfections, leading to higher-fidelity extraction. 7
Impact of prompt design
A counter-intuitive finding emerged regarding prompt design. In the noisy conditions of Phase 1, there was no discernible difference between the concise Prompt V1 and the detailed Prompt V2. However, in Phase 2, the simpler Prompt V1 consistently outperformed the more detailed Prompt V2. The top three configurations all used Prompt V1. For example, the best configuration with Prompt V1 achieved a BLEU score of 0.25, while the same setup with Prompt V2 scored only 0.21. 8 This suggests that when the reference standard is clear and consistent, a simple, aligned prompt is more effective than a complex, over-constrained one.
Impact of document scope
The optimal document scope also reversed between phases. In Phase 1, using only the pre-filtered “Tagged Pages” was beneficial, as it reduced noise. In Phase 2, however, providing the “Whole MR” yielded the best results. This indicates that when the model has a clear objective defined by the gold-standard reference, it can effectively leverage the broader context of the entire record to ensure no relevant findings are missed, even if they fall outside of the initially tagged sections. 14 Table 2 provides a quantitative summary of the top-performing configurations and key comparisons.
Table 2. Quantitative Summary of the Top-Performing Configurations and Key Comparisons
Bold values represent statistical significance.
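The size of the shift between phases can be made concrete with a short calculation on the headline figures reported in this article (best Phase 1 values of approximately 0.45 ROUGE-1, 0.15 BLEU, and 7.73 KL Divergence; top Phase 2 configuration at 0.57, 0.25, and 5.96). This is a descriptive sketch only: the Phase 1 lexical values are approximate, the two phases cover different record sets (40 vs. 20), and the function and variable names are illustrative.

```python
def relative_change(baseline, new):
    """Fractional change from baseline to new, e.g. 0.25 means +25%."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


# Headline values as reported in the text (Phase 1 lexical scores are approximate).
PHASE1_BEST = {"ROUGE-1": 0.45, "BLEU": 0.15, "KL Divergence": 7.73}
PHASE2_BEST = {"ROUGE-1": 0.57, "BLEU": 0.25, "KL Divergence": 5.96}

CHANGES = {
    metric: relative_change(PHASE1_BEST[metric], PHASE2_BEST[metric])
    for metric in PHASE1_BEST
}

for metric, change in CHANGES.items():
    print(f"{metric}: {change:+.1%}")
```

On these figures, ROUGE-1 rises by roughly 27%, BLEU by roughly 67%, and KL Divergence falls by roughly 23% (lower is better for KL Divergence).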
Qualitative analysis: Making the metrics meaningful
To illustrate the practical significance of these quantitative differences, Table 3 presents a side-by-side comparison for a representative case. It shows a snippet of the original radiology report, the vague and incomplete reference from Phase 1, the precise and structured expert-curated reference from Phase 2, and the output from the top-performing GPT-4.1 configuration.
Table 3. Qualitative Analysis Example
The qualitative example makes the abstract metrics tangible. The Phase 1 reference is clinically inadequate, lacking specific measurements and failing to mention the critical new finding of potential metastases. In contrast, the output from the top-performing GPT-4.1 configuration in Phase 2 almost perfectly mirrors the expert-curated summary, capturing the specific measurement, the interval change, and the presence of new nodules. This demonstrates that the higher scores in Phase 2 correspond directly to a dramatic improvement in clinical utility and accuracy.
Discussion
Principal finding: Rethinking “ground truth” in the era of generative AI
The central finding of this study is a methodological one: in the evaluation of clinical LLMs, the quality of the reference standard can be a more significant determinant of measured performance than the model, prompt, or input modality. The “moving target” problem observed in Phase 1, where the uncurated references were themselves inconsistent and often incomplete, rendered the evaluation metrics ambiguous. An LLM could produce a summary that was clinically superior to the reference yet be penalized by lexical metrics for the very act of being more accurate and comprehensive. 18,19 This paradox highlights a critical flaw in applying standard NLP evaluation practices without first validating the “ground truth.” This work provides empirical weight to the argument that the development of standardized, clinically meaningful evaluation frameworks is not an afterthought but a prerequisite for making real progress in the field. 5 The results suggest a shift in focus is needed: from a purely model-centric approach to a more holistic, data-and-evaluation-centric one.
This raises a practical question of scalability: if high-quality references are required for performance, must radiologists manually curate all data? We argue that expert curation should be viewed as a high-value investment in evaluation benchmarks, not a requirement for daily workflow.
By investing in a representative “Gold Standard” set for validation, institutions can verify that the AI model accurately handles heterogeneous “garbage in” data. Once validated, the AI model itself serves as the solution to the scalability problem, acting as an automated standardization layer that converts messy narrative reports into structured, high-fidelity outputs without further human intervention.
The clinical advantage of multimodal document analysis
The superior performance of GPT-4.1 with image-based input in Phase 2 provides strong evidence for the practical benefits of multimodality in clinical document processing. The advantage likely extends beyond simply circumventing OCR errors. By processing the document image directly, the model can leverage the rich visual and spatial information that is lost in plain-text conversion. 7 This includes the structure of the report, such as the clear delineation of “FINDINGS” and “IMPRESSION” sections, the use of bolding or capitalization to emphasize critical results, and the layout of tabular data. This ability to interact with medical documents in their native format, much like a human clinician does, represents a significant step toward more robust and less brittle AI systems. It reduces the reliance on complex and potentially error-prone pre-processing pipelines and allows the model to access a richer set of contextual cues, ultimately leading to higher-fidelity information extraction. 6
Prompt engineering: Simplicity, alignment, and redundancy
The finding that the simpler Prompt V1 outperformed the more detailed Prompt V2 in the high-fidelity conditions of Phase 2 offers a nuanced insight into prompt engineering. This study proposes a “prompt-reference alignment” hypothesis. When the reference standard is of high quality and internally consistent, it implicitly provides the model with a strong, repeated example of the desired output style. A concise prompt that aligns with this inherent style is sufficient to guide the model effectively. The more detailed instructions in Prompt V2, while well-intentioned, may have become redundant or even counterproductive. By specifying phrasing or formatting rules that deviated slightly from the expert radiologist’s natural summarization style, Prompt V2 may have “over-constrained” the model, forcing it into less natural constructions that lowered its lexical similarity scores relative to the gold standard. 8 This underscores a key principle for practitioners: prompt complexity should be minimized, and instructions should be added only as necessary to correct specific, observed failures, rather than pre-emptively. 8
Context is king: The role of document scope
The reversal in the optimal document scope between the two phases is also instructive. The superiority of the “Whole MR” input in Phase 2 demonstrates that with a clear and comprehensive objective (as defined by the gold-standard reference), advanced LLMs can effectively manage and utilize a large context. Providing the full document empowers the model to synthesize information from across different sections, ensuring that no relevant finding is missed. This is particularly important in real-world scenarios where automated document classification or manual tagging may be imperfect. The ability to process the entire record reduces the brittleness of the system and increases its potential to capture a complete clinical picture, mirroring how a human expert would review a patient’s chart. 19
Contextualizing with broader literature
The performance scores achieved in Phase 2 (e.g., ROUGE-1 of 0.57, BLEU of 0.25) are comparable to or exceed those reported in other recent studies on LLM-based clinical information extraction and summarization. 6 However, it is crucial to contextualize these numbers. The broader NLP literature has increasingly recognized that automated metrics like ROUGE and BLEU often correlate poorly with human judgments of clinical quality, particularly regarding factual consistency, completeness, and the absence of harmful errors. 12 While this study used a more sophisticated semantic metric (KL Divergence) and a qualitative analysis to add depth, the findings reinforce the message that these automated scores should be interpreted with caution and always be supplemented with expert human evaluation before any clinical deployment. 8
Emerging evaluation frameworks such as RadFact, 20 RadCliQ, 21 and the Provider Documentation Summarization Quality Instrument (PDSQI-9), 22 along with methods such as FineRadScore, 23 provide more clinically meaningful metrics and systematic error analyses. Information extraction studies using InstructGPT and other instruction-tuned models highlight how careful prompt design and annotation guidelines can enhance extraction accuracy and reliability. 15,18 Beyond lexical overlap metrics, researchers have explored semantic similarity measures such as BERTScore 24 and dense passage retrieval techniques 25 to better align model outputs with expert intent. Even large-scale neural machine translation systems demonstrate that architectural choices and scale can dramatically impact performance. 26 Together, these works underscore the importance of developing robust evaluation tools and adapting insights from adjacent fields to the assessment of clinical language models. 27
Limitations
This study has several limitations that should be considered. First, the sample size was relatively small (40 records in Phase 1, 20 in Phase 2) and sourced from a single institution. This may limit the generalizability of the findings, and validation on a larger, multi-institutional dataset with more diverse reporting styles is a necessary next step. 17
Second, while the Phase 2 gold standard was a significant improvement, it was created by a single board-certified radiologist. Clinical summarization can involve subjective judgment, and another expert might have a different style or emphasize different findings. A more robust standard would involve a consensus reference created by multiple experts to account for inter-rater variability.
Third, the evaluation relied primarily on automated metrics. Although these were supplemented with a qualitative example, the study did not include a formal human evaluation of clinical accuracy, factuality, or a systematic error analysis. Quantifying the rates of clinically significant errors, such as the omission of a critical finding or the hallucination of a nonexistent one, is an essential step for assessing clinical readiness. 6
Fourth, the evaluation relied on descriptive statistics to characterize performance differences. Formal hypothesis testing (e.g., p-values) was not performed because of the pilot sample size (n = 20 in Phase 2). The magnitude of the performance delta (e.g., >25% improvement in lexical metrics) and the consistency of the trend across configurations support the direction of the findings, but they do not establish statistical significance.
Fifth, the model comparison was limited to two variants of GPT-4. A broader comparison including other state-of-the-art proprietary models (e.g., from Google or Anthropic) or fine-tuned open-source models would provide a more comprehensive view of the current technology landscape. Finally, this work did not evaluate practical implementation factors such as computational cost or inference latency, which are critical considerations for deploying such systems at scale in a clinical environment. 6
Conclusion
This comprehensive two-phase evaluation demonstrates that the fidelity of AI-driven information extraction from clinical documents is a joint function of the AI model’s intrinsic capability and, critically, the clarity and quality of the framework used to evaluate it. The single most important factor in achieving high measured performance was the curation of a gold-standard reference dataset. When evaluated against this expert-curated benchmark, the multimodal GPT-4.1 model, provided with full document context via image input and guided by a concise prompt, was able to generate summaries of imaging findings that closely emulated those of a clinical expert.
The results carry a clear message for the field of AI in precision oncology: progress requires a dual focus. While continuing to advance the underlying model technology is important, it is equally, if not more, crucial to invest in the development of rigorous, clinically-grounded, and high-quality benchmark datasets and evaluation standards. Without them, we risk misinterpreting model performance and slowing the translation of these powerful technologies into tools that can genuinely improve patient care.
Clinical Relevance
The findings of this study have direct and significant implications for the application of AI in clinical oncology workflows.
Authors’ Contributions
All authors confirm that they have made substantial contributions to this work and have approved the final article. Roles are defined using the CRediT taxonomy as follows: Conceptualization: A.L.-B., E.T., and S.K. Methodology: A.L.-B. and N.G.T. Software: E.T. and S.Y. Validation: E.T. and S.Y. Formal analysis: N.G.T. Investigation: E.T. and S.Y. Resources: S.K. Data curation: E.T. Writing—original draft: A.L.-B. and N.G.T. Writing—review and editing: All authors. Supervision: A.L.-B. and S.K. Project administration: S.K. Funding acquisition: S.K.
Footnotes
Acknowledgments
The authors would like to thank the clinical analysts who assisted with the curation of the gold-standard reference summaries. The authors also appreciate the support of the staff and patients who contributed to the de-identified medical records used in this study.
Author Disclosure Statement
A.L.-B.: Leadership: Massive Bio; Stock and Other Ownership Interests: Massive Bio; Consulting or Advisory Role: Massive Bio, Bayer, PSI, BrightInsight, Cardinal Health, Pfizer, AstraZeneca, Verily, Medscape; Speakers’ Bureau: Guardant Health, Ipsen, AstraZeneca/Daiichi Sankyo, Natera. N.G.T.: Employment: Capital Health; Patents, Royalties, Other Intellectual Property: Software copyright for Python programming code for RO-APM analysis. E.T.: Employment: Massive Bio. S.Y.: Employment: Massive Bio. S.K.: Leadership: Massive Bio; Stock and Other Ownership Interests: Massive Bio.
Funding Information
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
