Abstract
Objective:
Indeterminate cytology (Bethesda III and IV) represents 15–30% of biopsied thyroid nodules and requires additional diagnostic testing. Molecular testing (MT) is a commonly used diagnostic tool that evaluates malignancy risk through next-generation sequencing of fine-needle aspiration (FNA) samples. While MT achieves high sensitivity (97–100%) in ruling out malignancy, its specificity and positive predictive value (PPV) remain relatively low. This study proposes a multimodal deep learning model that integrates ultrasound (US) imaging with MT to improve risk stratification by enhancing PPV while maintaining high sensitivity. Combining these modalities leverages complementary information from molecular and imaging data, addressing limitations in current approaches and offering a robust framework for evaluating indeterminate nodules.
Methods:
We retrospectively analyzed 333 patients with indeterminate thyroid nodules (259 benign, 74 malignant) at UCLA Medical Center between 2016 and 2022. We evaluated four configurations: whole-frame US images, 256 × 256 patches, 128 × 128 patches, and an ensemble model combining the first three configurations. The clinical baseline consisted of Bethesda cytology and MT results. Models were assessed using five-fold cross-validation stratified by surgical outcomes.
Results:
The clinical baseline (Bethesda + MT) achieved an AUROC of 0.728 [0.68, 0.78] with sensitivity of 0.946 [0.88, 1.00], specificity of 0.664 [0.60, 0.73], and PPV of 0.448 [0.41, 0.48]. The proposed ensemble model demonstrated improved performance, achieving an AUROC of 0.831 [0.77, 0.89] with a sensitivity of 0.946 [0.88, 1.00], specificity of 0.703 [0.66, 0.75], and PPV of 0.477 [0.46, 0.50]. These improvements were statistically significant (p = 0.0008).
Conclusion:
Our multimodal model enhances MT performance by providing statistically significant improvements in PPV and specificity while maintaining high sensitivity. Our framework could be leveraged to reduce the number of benign thyroid resections in patients with indeterminate nodules. However, this study is limited by its single-center dataset, lack of external validation, and the use of binarized MT outputs rather than granular malignancy risk probabilities. Future work should validate these findings across diverse populations and larger external datasets for more comprehensive risk stratification.
Dear Editor:
Indeterminate cytology (Bethesda III and IV) is reported in 15 to 30% of thyroid nodules evaluated through fine-needle aspiration (FNA), often necessitating repeat biopsies or diagnostic surgeries. 1,2 Molecular testing (MT) is a diagnostic tool that performs next-generation sequencing on the FNA sample, which allows for a more precise determination of malignancy risk in patients with indeterminate cytology. 2–5 While MT has proven useful in stratifying malignancy risk, it is calibrated to maximize sensitivity while tolerating a relatively high number of false positive results, which lead to repeat biopsy or diagnostic surgery for ultimately benign pathology. In a randomized controlled trial published in 2020, MT achieved a sensitivity of 97 to 100% and a positive predictive value (PPV) of 53 to 63% in determining the presence of malignancy. 6 Subsequent reports have demonstrated comparable sensitivities across multiple MT platforms, including ThyroSeq, Afirma, and ThyraMIR, with PPVs ranging from 64% to 95%, 74% to 94%, and 38% to 65%, respectively, depending on the type of nodules assessed and the study performed. 5,7
Diagnostic assistance from convolutional neural networks applied to ultrasound (US) images of thyroid nodules has been extensively studied in various malignancy classification and nodule segmentation tasks. 8–11 Recently, Zhuang et al. demonstrated that attention multiple instance learning (AMIL) models, when applied to US image studies, can effectively classify patients with thyroid nodules as benign or malignant. 12 That work focused on malignancy prediction in patients with benign or malignant cytology on FNA, excluding those with indeterminate nodules. In our study, we specifically aimed to explore the use of a multimodal AMIL approach for classifying indeterminate nodules as malignant or benign. Our objective was to combine the analysis of US images with MT to reduce the number of false positive results from MT while preserving sensitivity. This approach could strengthen the use of MT in determining which patients warrant further invasive diagnostic procedures.
Methods
Dataset
Consecutive patients with indeterminate thyroid nodules at UCLA Medical Center from May 2016 through February 2022 were retrospectively reviewed. This study was approved by the UCLA Institutional Review Board (IRB#19-001535). Our analysis included patients with indeterminate cytology (Bethesda III and IV on FNA) who received MT. ThyroSeq (Sonic Healthcare, Rye Brook, NY) and Afirma (Veracyte, San Francisco, CA) were used for MT. For patients who underwent surgery, the surgical pathology was used to determine a label of benign or malignant. Patients who did not undergo surgery were assumed to have benign nodules, as they were determined to be at low risk of nodule malignancy based on clinical evaluation including a benign MT result, an approach that has been used in previous studies. 11 A significant number of patients had a pathological diagnosis of non-invasive follicular thyroid neoplasm with papillary-like nuclear features (NIFTP). These patients were grouped with those diagnosed with thyroid cancer, as NIFTP is currently considered pre-malignant and should be managed surgically. 6,13
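The labeling rule described above can be sketched as a small function. This is only an illustration of the stated rule: the function name and its boolean arguments are hypothetical and do not reflect the study's actual data schema.

```python
def assign_label(had_surgery: bool,
                 pathology_malignant: bool = False,
                 is_niftp: bool = False) -> int:
    """Return 1 (malignant) or 0 (benign) following the rule in the text.

    Argument names are hypothetical; they stand in for whatever fields
    the study database actually records.
    """
    if not had_surgery:
        # No surgical pathology available: the nodule is assumed benign.
        return 0
    # NIFTP is grouped with thyroid cancer as the positive class.
    return int(pathology_malignant or is_niftp)


# Example patients (illustrative only).
print(assign_label(had_surgery=False))                            # non-surgical
print(assign_label(had_surgery=True, is_niftp=True))              # NIFTP
print(assign_label(had_surgery=True, pathology_malignant=False))  # benign on surgery
```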
Experiments and setup
We developed a predictive framework leveraging AMIL to analyze US scans and predict surgical outcomes for thyroid nodules. 14 AMIL operates under the multiple instance learning (MIL) paradigm, which allows for weakly supervised learning where labels are assigned to groups (bags) rather than individual instances. In our context, all US scans from a given patient’s image study form a single bag labeled by the patient’s malignancy status, enabling outcome prediction without scan-level annotations. Traditional MIL models cannot specify which instances contribute to a bag’s label, but AMIL provides attention scores, highlighting the most influential regions in each scan.
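As a rough illustration of the attention pooling step, the sketch below uses the tanh attention form commonly used for attention-based MIL. Whether the study used this exact variant (rather than, for example, a gated one) is not stated, and the parameter names `V` and `w` as well as the toy numbers are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def attention_pool(bag: Sequence[Sequence[float]],
                   V: Sequence[Sequence[float]],
                   w: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Pool a bag of instance features into one vector.

    bag: K instance feature vectors, each of length D.
    V:   L x D projection matrix (hypothetical learned parameter).
    w:   length-L scoring vector (hypothetical learned parameter).
    Returns (pooled feature vector of length D, K attention weights).
    """
    scores = []
    for h in bag:
        # Hidden representation tanh(V h), then scalar score w . tanh(V h).
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * ui for wi, ui in zip(w, hidden)))

    # Numerically stable softmax over the instances in the bag.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]

    # Attention-weighted sum of the instance features.
    dim = len(bag[0])
    pooled = [sum(a * h[d] for a, h in zip(attention, bag)) for d in range(dim)]
    return pooled, attention


# Toy bag of three 2-dimensional instance features (illustrative values).
bag = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, -1.0]]
w = [2.0]
pooled, attention = attention_pool(bag, V, w)
print(attention)
```

The attention weights sum to one across the bag, so they can be read as the relative contribution of each scan or patch to the patient-level representation.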
During preprocessing, US images were cropped to remove extraneous elements such as text overlays and resized to ensure consistency. To analyze localized features, we extracted patches of 256 × 256 and 128 × 128 pixels, upscaling the latter to 256 × 256 for compatibility with the pre-trained ResNet101 backbone, from whose final hidden layer feature representations were extracted. Features from each patient’s scans were stacked into a matrix and passed through AMIL to generate an attention-weighted feature vector encoding all scan information. This vector was concatenated with the binary MT outcome and passed into a fully connected network for final predictions. Initial experiments also integrated binarized Bethesda III/IV scores with MT (MT + BE) for the clinical baseline. However, the baseline (clinical decision making) consistently relied solely on MT, so further experiments excluded Bethesda scores. Our framework was designed to reduce false positives by integrating MT results with US imaging. The classification threshold was set to match MT’s sensitivity, so that any gain in specificity and reduction in false positives was measured at an equivalent sensitivity level.
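One way to set a classification threshold that matches a target sensitivity is sketched below. The function names and the toy scores are assumptions for illustration, not the study's code; the rule shown (take the highest threshold that still reaches the target sensitivity) is one reasonable reading of the procedure.

```python
import math
from typing import Sequence, Tuple


def threshold_for_sensitivity(scores: Sequence[float],
                              labels: Sequence[int],
                              target_sensitivity: float) -> float:
    """Highest threshold t such that calling score >= t positive
    yields sensitivity >= target_sensitivity."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("at least one positive case is required")
    # Number of positives that must be at or above the threshold.
    needed = max(1, math.ceil(target_sensitivity * len(positives) - 1e-9))
    return positives[needed - 1]


def sensitivity_specificity(scores: Sequence[float],
                            labels: Sequence[int],
                            threshold: float) -> Tuple[float, float]:
    """Sensitivity and specificity when score >= threshold is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Toy example: four malignant and four benign cases (illustrative values).
scores = [0.9, 0.8, 0.4, 0.3, 0.7, 0.2, 0.1, 0.35]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
t = threshold_for_sensitivity(scores, labels, target_sensitivity=0.75)
print(t, sensitivity_specificity(scores, labels, t))
```

Fixing sensitivity first and then comparing specificity makes the comparison with MT a like-for-like one: any difference in false positives is not bought with additional missed malignancies.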
We evaluated three configurations (whole-frame images and patches of 256 × 256 and 128 × 128 pixels) to determine the optimal image resolution for prediction. The final model ensembled these approaches at test time to improve performance. Models were trained for 100 epochs with a batch size of one, using five-fold cross-validation stratified by surgical outcomes. Statistical significance relative to baseline MT was assessed using a one-sided Wilcoxon signed-rank test (α = 0.05). 6,15 For additional details, including figures and results, please refer to the Supplementary Data.
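The two evaluation mechanics mentioned here, stratified fold assignment and test-time ensembling, can be sketched as follows. The text does not say how the three models were combined, so simple averaging of predicted probabilities is an assumption; the function names and the seed are likewise hypothetical.

```python
import random
from typing import List, Sequence


def stratified_folds(labels: Sequence[int], k: int = 5, seed: int = 0) -> List[List[int]]:
    """Split sample indices into k folds with near-equal class proportions."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def ensemble_mean(per_model_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-patient probabilities across models (assumed ensembling rule)."""
    n = len(per_model_probs[0])
    if any(len(p) != n for p in per_model_probs):
        raise ValueError("all models must score the same patients")
    return [sum(column) / len(per_model_probs) for column in zip(*per_model_probs)]


# Cohort-sized label vector: 259 benign (0) and 74 malignant (1).
labels = [0] * 259 + [1] * 74
folds = stratified_folds(labels, k=5)
print([sum(labels[i] for i in fold) for fold in folds])  # malignant cases per fold

# Three hypothetical models scoring the same two patients.
print(ensemble_mean([[0.2, 0.8], [0.4, 0.6], [0.6, 0.7]]))
```

With this cohort size, each fold receives 14 or 15 malignant cases, so every test fold contains enough positives to estimate sensitivity.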
Results
A total of 333 patients with indeterminate thyroid nodules met this study’s inclusion criteria (259 benign and 74 malignant). A majority of patients had Bethesda III cytology (57.4% vs. 42.6% with Bethesda IV). Papillary thyroid cancer (PTC) was the most common malignancy (47% of patients), while NIFTP was present in 35% and follicular thyroid cancer in 15% (Table 1). The models’ predictions aggregated over all five folds were used to plot the receiver operating characteristic (ROC) curves shown in Figure 1a. The baseline MT + BE model achieved an area under the receiver operating characteristic curve (AUROC) of 0.728 [0.68, 0.78] with a sensitivity of 0.946 [0.88, 1.00] and specificity of 0.664 [0.60, 0.73], consistent with prior studies. 6 We tested multiple configurations, including whole-frame and patch-based models; detailed results for the baseline and each configuration are provided in Supplementary Data. The ensemble model outperformed all others, achieving the highest AUROC of 0.831 [0.77, 0.89], PPV of 0.477 [0.46, 0.50], and specificity of 0.703 [0.66, 0.75], while maintaining the same sensitivity as the clinical baseline. These improvements were statistically significant (p = 0.0008). Attention maps from the 256 × 256 patch model are shown in Figure 1b and c.
Table 1. Demographic Comparisons Between Patients with Benign and Malignant Thyroid Nodules
IQR, interquartile range; MT, molecular testing; NIFTP, non-invasive follicular thyroid neoplasm with papillary-like nuclear features.

Figure 1. Results of the AMIL framework.
Discussion
We developed an AMIL model combining US images and MT to classify indeterminate thyroid nodule malignancy, retaining MT’s high sensitivity while reducing false positives. Minimizing false negatives is critical to avoid delayed cancer diagnoses, while reducing false positives prevents unnecessary surgeries and associated morbidity. By matching MT’s sensitivity, our model aimed to reduce the number of benign nodules misclassified as malignant, which would lead to fewer false positives than MT alone in clinical practice. Although the PPV improvement is modest (0.477 [0.46, 0.50] vs. 0.448 [0.41, 0.48]), with over 120,000 indeterminate biopsies annually in the United States, 16,17 even small reductions in false positives could greatly benefit patients and health care systems.
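To see how a specificity gain at fixed sensitivity translates into PPV, the sketch below applies the standard relation between PPV, sensitivity, specificity, and prevalence to the reported point estimates. Using the pooled cohort prevalence (74 of 333) is an assumption of this example; it gives values of about 0.446 and 0.476, close to but not identical to the reported 0.448 and 0.477, which may reflect how the reported figures were aggregated across folds.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from sensitivity, specificity, and prevalence."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Pooled prevalence of malignancy in the cohort (assumption: pooled, not per fold).
prevalence = 74 / 333

baseline_ppv = ppv(sensitivity=0.946, specificity=0.664, prevalence=prevalence)
ensemble_ppv = ppv(sensitivity=0.946, specificity=0.703, prevalence=prevalence)

print(round(baseline_ppv, 3), round(ensemble_ppv, 3))
```

Because sensitivity is held fixed, the entire PPV difference comes from the drop in the false positive term, which is why the gain is bounded by the specificity improvement and the low prevalence of malignancy in this population.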
Several studies have utilized deep learning to predict malignancy in indeterminate thyroid nodules, focusing on either pathology slides or using only US studies. 18–20 To our knowledge, this is the first deep learning framework designed specifically to augment MT performance by reducing false positives. MT platforms report varying sensitivity and PPV depending on the test type and patient characteristics. In our cohort, the PPV of MT alone was relatively low (0.448 [0.41, 0.48]), but our framework improved PPV on the same group of patients. Given MT’s widespread clinical use, enhancing its performance represents a more achievable and clinically relevant goal than replacing it entirely with deep learning models. Combining the Bethesda score with MT increased the AUROC from 0.627 (MT alone) to 0.728 [0.68, 0.78], indicating that biopsy results complement MT. However, other metrics showed no significant improvement as the optimal decision point primarily relied on MT. While digitized cytology slides could offer more information that the Bethesda classification might overlook, they are not routinely digitized due to time and cost constraints.
This study has several key limitations. As a single-center study without external validation, results may reflect practices specific to our institution and lack generalizability. Additionally, we were restricted to analyzing patients for whom US images were available in our database, excluding those with outside imaging or imaging missing for other reasons. We also assumed that patients who did not undergo surgery had benign pathology, an assumption consistent with prior literature studying MT. However, this assumption introduces potential bias, as institutional differences in the rate of malignancy within each Bethesda class may affect the classification of indeterminate nodules. Some of these patients may have had malignant pathology but did not undergo surgery due to factors such as advanced age, medical comorbidities, or sub-centimeter PTCs. This is particularly relevant given that 34% of our “benign” patients had a suspicious MT result. Additionally, NIFTP nodules, which accounted for 35% of the malignant nodules in our cohort, differ from other cancers in presentation, clinical importance, and treatment, making it valuable to analyze them separately in future work. Finally, our dataset was limited to a binarized version of MT output (suspicious or benign). In clinical practice, recent MT reports often include probability ranges of malignancy risk that guide diagnostic decisions. However, reports from different companies use varying methods to compute and present these probabilities, and older reports lack this information entirely. Incorporating such granular data could enhance our model’s ability to build upon MT results and could be explored in future efforts using deep learning approaches.
Footnotes
Authors’ Contributions
S.A. contributed to conceptualization, formal analysis, methodology, investigation, visualization, and writing the original draft. A.M. and S.S.A.S. were responsible for software development, methodology, investigation, validation, and writing the original draft. V.I. and A.R. handled data curation, methodology, and writing—review and editing. V.R.S., C.M., H.Z., M.P., R.M., M.L., and M.Y. contributed to data curation, resources, validation, and writing—review and editing. C.W.A. and W.S. handled funding acquisition, supervision, project administration, and writing—review and editing.
Data Availability
The data underlying this article will be shared on reasonable request to the corresponding author.
Author Disclosure Statement
The authors have no disclosures or conflicts of interest.
Funding Information
This work was supported by the National Institute of Biomedical Imaging and Bioengineering of the National Institutes of Health under award number R21EB030691 and a UCLA Radiology Exploratory Research Grant.
Supplementary Material
Supplementary Data
