Combining Ultrasound Imaging and Molecular Testing in a Multimodal Deep Learning Model for Risk Stratification of Indeterminate Thyroid Nodules

Abstract

Objective:

Indeterminate cytology (Bethesda III and IV) accounts for 15–30% of biopsied thyroid nodules and requires additional diagnostic testing. Molecular testing (MT) is a commonly used diagnostic tool that evaluates malignancy risk through next-generation sequencing of fine-needle aspiration (FNA) samples. While MT achieves high sensitivity (97–100%) in ruling out malignancy, its specificity and positive predictive value (PPV) remain relatively low. This study proposes a multimodal deep learning model that integrates ultrasound (US) imaging with MT to improve risk stratification by enhancing PPV while maintaining high sensitivity. Combining these modalities leverages complementary information from both molecular and imaging data, addressing limitations in current approaches and offering a robust framework for evaluating indeterminate nodules.

Methods:

We retrospectively analyzed 333 patients with indeterminate thyroid nodules (259 benign, 74 malignant) at UCLA Medical Center between 2016 and 2022. We evaluated four configurations: whole-frame US images, 256 × 256 patches, 128 × 128 patches, and an ensemble model combining the first three configurations. The clinical baseline consisted of Bethesda cytology and MT results. Models were assessed using five-fold cross-validation stratified by surgical outcomes.

Results:

The clinical baseline (Bethesda + MT) achieved an AUROC of 0.728 [0.68, 0.78] with sensitivity of 0.946 [0.88, 1.00], specificity of 0.664 [0.60, 0.73], and PPV of 0.448 [0.41, 0.48]. The proposed ensemble model demonstrated improved performance, achieving an AUROC of 0.831 [0.77, 0.89] with a sensitivity of 0.946 [0.88, 1.00], specificity of 0.703 [0.66, 0.75], and PPV of 0.477 [0.46, 0.50]. These improvements were statistically significant (p = 0.0008).

Conclusion:

Our multimodal model enhances MT performance by providing statistically significant improvements in PPV and specificity while maintaining high sensitivity. Our framework could be leveraged to reduce the number of benign thyroid resections in patients with indeterminate nodules. However, this study is limited by its single-center dataset, lack of external validation, and the use of binarized MT outputs rather than granular malignancy risk probabilities. Future work should validate these findings across diverse populations and larger external datasets for more comprehensive risk stratification.

Dear Editor:

Indeterminate cytology (Bethesda III and IV) is reported in 15 to 30% of thyroid nodules evaluated through fine-needle aspiration (FNA), often necessitating repeat biopsies or diagnostic surgeries.^1,2 Molecular testing (MT) is a diagnostic tool that performs next-generation sequencing on the FNA sample, which allows for a more precise determination of malignancy risk in patients with indeterminate cytology.^2–5 While MT has proven to be useful in stratifying malignancy risk, it is calibrated to maximize sensitivity while tolerating a relatively higher number of false positive results, which lead to repeat biopsy or diagnostic surgery for ultimately benign pathology. MT achieved a sensitivity of 97 to 100% and positive predictive value (PPV) of 53 to 63% in determining the presence of malignancy in a recent randomized controlled trial in 2020.⁶ Subsequent reports have demonstrated comparable sensitivities of multiple MT platforms including ThyroSeq, Afirma, and ThyraMIR with PPVs ranging from 64% to 95%, 74% to 94%, and 38% to 65%, respectively, depending on the type of nodules assessed and study performed.^5,7

Diagnostic assistance from convolutional neural networks applied to ultrasound (US) images of thyroid nodules has been extensively studied in various malignancy classification and nodule segmentation tasks.^8–11 Recently, Zhuang et al. demonstrated that attention multiple instance learning (AMIL) models, when applied to US image studies, can effectively classify patients with thyroid nodules as benign or malignant.¹² This work focused on malignancy prediction in patients with benign or malignant cytology on FNA, excluding those with indeterminate nodules. In our study, we specifically aimed to explore the use of a multimodal AMIL approach for classifying indeterminate nodules as malignant or benign. Our objective is to combine the analysis of US images with MT to reduce the number of false positive results from MT while preserving sensitivity. This approach could be used to bolster the use of MT for the determination of which patients warrant further invasive diagnostic procedures.

Methods

Dataset

Consecutive patients with indeterminate thyroid nodules at UCLA Medical Center from May 2016 through February 2022 were retrospectively reviewed. This study was approved by the UCLA Institutional Review Board (IRB#19-001535). Our analysis included patients with indeterminate cytology (Bethesda III and IV on FNA) who received MT. ThyroSeq (Sonic Healthcare, Rye Brook, NY) and Afirma (Veracyte, San Francisco, CA) were used for MT. For patients who underwent surgery, the surgical pathology was used to determine a label of benign or malignant. Patients who did not undergo surgery were assumed to have benign nodules, as they were determined to be at low risk of nodule malignancy based on clinical evaluation including a benign MT result, an approach that has been used in previous studies.¹¹ A significant number of patients had a pathological diagnosis of non-invasive follicular thyroid neoplasm with papillary-like nuclear features (NIFTP). These patients were grouped with those diagnosed with thyroid cancer as NIFTP is currently considered pre-malignant and should be managed surgically.^6,13

Experiments and setup

We developed a predictive framework leveraging AMIL to analyze US scans and predict surgical outcomes for thyroid nodules.¹⁴ AMIL operates under the multiple instance learning (MIL) paradigm, which allows for weakly supervised learning where labels are assigned to groups (bags) rather than individual instances. In our context, all US scans from a given patient’s image study form a single bag labeled by the patient’s malignancy status, enabling outcome prediction without scan-level annotations. Traditional MIL models cannot specify which instances contribute to a bag’s label, but AMIL provides attention scores, highlighting the most influential regions in each scan.
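The attention pooling step can be sketched in a few lines. The following is a minimal, pure-Python illustration of the attention mechanism described by Ilse et al., not the implementation used in this study: the toy dimensions, the random weights, and the names `attention_pool`, `V`, and `w` are assumptions made for the example. Each instance feature vector receives a score, the scores are normalized with a softmax across the bag, and the bag representation is the weighted sum of the instance features.

```python
import math
import random


def attention_pool(bag, V, w):
    """Attention-weighted pooling of one bag of instance feature vectors.

    bag: list of K feature vectors (each of length D), one per US image.
    V:   projection matrix given as L rows of length D (hypothetical weights).
    w:   vector of length L (hypothetical weights).
    Returns (pooled vector of length D, attention weights of length K).
    """
    scores = []
    for h in bag:
        # Hidden representation tanh(V h), then scalar score w . tanh(V h).
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(a * b for a, b in zip(w, hidden)))

    # Softmax over instances, shifted by the maximum for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]

    # Bag embedding: attention-weighted sum of the instance features.
    dim = len(bag[0])
    pooled = [sum(a * h[d] for a, h in zip(attn, bag)) for d in range(dim)]
    return pooled, attn


rng = random.Random(0)
D, L, K = 8, 4, 5  # toy sizes chosen for illustration only
bag = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]
V = [[rng.gauss(0, 0.5) for _ in range(D)] for _ in range(L)]
w = [rng.gauss(0, 0.5) for _ in range(L)]

pooled, attn = attention_pool(bag, V, w)
print("attention weights:", [round(a, 3) for a in attn])
```

In a trained model `V` and `w` are learned jointly with the classifier, and the attention weights indicate which instances contributed most to the patient-level prediction.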

During preprocessing, US images were cropped to remove extraneous elements like text overlays and resized to ensure consistency. To analyze localized features, we extracted patches of 256 × 256 and 128 × 128 pixels, upscaling the latter to 256 × 256 for compatibility with the pre-trained ResNet101 backbone, which extracted robust feature representations from the final hidden layer. Features from each patient’s scans were stacked into a matrix and passed through AMIL to generate an attention-weighted feature vector encoding all scan information. This vector was concatenated with binary MT outcomes and passed into a fully connected network for final predictions. Initial experiments also integrated binarized Bethesda III/IV scores with MT (MT + BE) for the clinical baseline. However, the baseline (clinical decision making) consistently relied solely on MT, so further experiments excluded Bethesda scores. Our framework was designed to reduce false positives by integrating MT results with US imaging. The classification threshold was set to match MT’s sensitivity, so that any gain in specificity (fewer false positives) was measured at an equivalent sensitivity level.
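Setting a threshold to match a target sensitivity can be illustrated as follows. This is a hedged sketch rather than the procedure used in the study: the function names, the toy scores, and the target of 0.8 are all hypothetical. The idea is to pick the highest cut-off that still flags the required fraction of malignant cases, then read off the specificity at that cut-off.

```python
import math


def threshold_for_sensitivity(scores, labels, target_sensitivity):
    """Return the highest threshold whose sensitivity is >= the target.

    scores: predicted malignancy probabilities; labels: 1 = malignant, 0 = benign.
    A case is called positive when score >= threshold.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("at least one malignant case is required")
    # Minimum number of malignant cases that must be flagged (epsilon guards
    # against floating-point overshoot in the product).
    needed = max(1, math.ceil(target_sensitivity * len(positives) - 1e-9))
    return positives[needed - 1]


def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when calling score >= threshold positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg


# Hypothetical model outputs for 5 malignant and 7 benign patients.
scores = [0.91, 0.84, 0.77, 0.62, 0.30, 0.70, 0.55, 0.41, 0.33, 0.20, 0.15, 0.08]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
target = 0.8  # stands in for the sensitivity of the reference test

thr = threshold_for_sensitivity(scores, labels, target)
sens, spec = sensitivity_specificity(scores, labels, thr)
print(f"threshold={thr}, sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Fixing sensitivity first makes the comparison with the reference test one-dimensional: whichever method has the higher specificity at that operating point produces fewer false positives.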

We evaluated three configurations (whole-frame images and patches of 256 × 256 and 128 × 128 pixels) to determine the optimal image resolution for prediction. The final model ensembled these approaches at test time to improve performance. Models were trained over 100 epochs with a batch size of one, employing five-fold cross-validation stratified by surgical outcomes. Statistical significance relative to baseline MT was assessed using a one-sided Wilcoxon signed-rank test (α = 0.05).^6,15 For additional details, including figures and results, please refer to the Supplementary Data.
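Stratification keeps the benign-to-malignant ratio roughly constant across folds. The sketch below shows one simple way to do this using only the cohort sizes reported in this study (259 benign, 74 malignant); the function name `stratified_folds`, the round-robin assignment, and the random seed are assumptions for illustration and do not describe the authors' actual splitting code.

```python
import random
from collections import Counter


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each sample to a fold so class proportions stay roughly equal."""
    rng = random.Random(seed)
    fold_of = [None] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled samples of this class round-robin across folds.
        for position, i in enumerate(idx):
            fold_of[i] = position % n_folds
    return fold_of


labels = [0] * 259 + [1] * 74  # 0 = benign, 1 = malignant
fold_of = stratified_folds(labels)

fold_counts = []
for k in range(5):
    counts = Counter(y for y, f in zip(labels, fold_of) if f == k)
    fold_counts.append((counts[0], counts[1]))
    print(f"fold {k}: {counts[0]} benign, {counts[1]} malignant")
```

With these cohort sizes every fold holds 51 or 52 benign and 14 or 15 malignant patients, so each held-out fold contains enough malignant cases to estimate sensitivity.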

Results

There were 333 patients with indeterminate thyroid nodules who met this study’s inclusion criteria (259 benign and 74 malignant). A majority of patients had Bethesda III cytology (57.4% vs. 42.6% with Bethesda IV). Papillary thyroid cancer (PTC) was the most common malignancy (47% of patients), while NIFTP was present in 35% and follicular thyroid cancer in 15% (Table 1). The performance of the baseline and of each model configuration is reported in the Supplementary Data. The models’ predictions aggregated over all five folds were used to plot the receiver operating characteristic (ROC) curves shown in Figure 1a. The baseline MT + BE model achieved an area under the receiver operating characteristic curve (AUROC) of 0.728 [0.68, 0.78] with a sensitivity of 0.946 [0.88, 1.00] and specificity of 0.664 [0.60, 0.73], consistent with prior studies.⁶ We tested multiple configurations, including whole-frame and patch-based models, with detailed results provided in the Supplementary Data. The ensemble model outperformed all others, achieving the highest AUROC of 0.831 [0.77, 0.89], PPV of 0.477 [0.46, 0.50], and specificity of 0.703 [0.66, 0.75], while maintaining the same sensitivity as the clinical baseline. These improvements were statistically significant (p = 0.0008). Attention maps from the patch 256 model are shown in Figure 1b and c.

Table 1.

Demographic Comparisons Between Patients with Benign and Malignant Thyroid Nodules

Variable	Benign (n = 259)	Malignant (n = 74)
Male sex (%)	49 (19.1)	16 (21.1)
Age (median [IQR])	56 [43, 68]	47 [37, 60]
Surgical resection (%)	101 (39.0)	74 (100.0)
Bethesda IV (%)	42 (16.3)	20 (26.3)
MT suspicious (%)	86 (33.5)	71 (93.4)
Surgical pathology (%)
Benign	101 (39.0)	0 (0.0)
Papillary thyroid cancer	0 (0.0)	35 (47.3)
Follicular carcinoma	0 (0.0)	11 (14.9)
NIFTP	0 (0.0)	26 (35.1)
Other thyroid cancer	0 (0.0)	2 (2.7)

IQR, interquartile range; MT, molecular testing; NIFTP, non-invasive follicular thyroid neoplasm with papillary-like nuclear features.

FIG. 1.

Results of the AMIL framework. (a) ROC curves comparing different model configurations. (b, c) Patch 256 attention maps of representative US scans. In (b), there is a nodule in the right lobe of the thyroid with the attention map including the majority of the nodule. The deep border has a notable rim of hyperechogenicity that is partially contained within the attention map. In (c), there is a large thyroid nodule in the center of the frame, containing a central solid component with hypoechoic peripheral areas. Notably, the attention map is located directly over an area containing isoechoic and hypoechoic nodule components at the periphery as well as the hyperechoic interface between the nodule and the deep surrounding tissue. AMIL, attention multiple instance learning; BE, Bethesda score; ROC, receiver operating characteristic; US, ultrasound; WF, whole-frame images.

Discussion

We developed an AMIL model combining US images and MT to classify indeterminate thyroid nodule malignancy, retaining MT’s high sensitivity while reducing false positives. Minimizing false negatives is critical to avoid delayed cancer diagnoses, while reducing false positives prevents unnecessary surgeries and associated morbidity. By matching MT’s sensitivity, our model aimed to keep missed malignancies at the level of MT while reducing the number of benign nodules flagged as suspicious in clinical practice. Although the PPV improvement is modest (0.477 [0.46, 0.50] vs. 0.448 [0.41, 0.48]), with over 120,000 indeterminate biopsies annually in the United States,^16,17 even small reductions in false positives could greatly benefit patients and health care systems.
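The link between specificity and PPV at fixed sensitivity can be made concrete with a short calculation. The counts below are back-calculated from the rounded sensitivity and specificity reported above and the cohort sizes (74 malignant, 259 benign); they are an assumption for illustration, not figures reported by the study, so the resulting PPVs only approximate the published values.

```python
def ppv(true_positives, false_positives):
    """Positive predictive value: fraction of positive calls that are malignant."""
    return true_positives / (true_positives + false_positives)


n_malignant, n_benign = 74, 259

# Approximate counts implied by the rounded rates (assumed, not reported).
tp = round(0.946 * n_malignant)                    # same sensitivity for both
fp_baseline = n_benign - round(0.664 * n_benign)   # baseline specificity 0.664
fp_ensemble = n_benign - round(0.703 * n_benign)   # ensemble specificity 0.703

ppv_baseline = ppv(tp, fp_baseline)
ppv_ensemble = ppv(tp, fp_ensemble)

print(f"true positives: {tp}")
print(f"false positives: baseline {fp_baseline}, ensemble {fp_ensemble}")
print(f"PPV: baseline {ppv_baseline:.3f}, ensemble {ppv_ensemble:.3f}")
```

Under these assumed counts the true positives are unchanged while the false positives fall from 87 to 77, which is what moves the PPV from roughly 0.446 to roughly 0.476.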

Several studies have utilized deep learning to predict malignancy in indeterminate thyroid nodules, focusing on either pathology slides or using only US studies.^18–20 To our knowledge, this is the first deep learning framework designed specifically to augment MT performance by reducing false positives. MT platforms report varying sensitivity and PPV depending on the test type and patient characteristics. In our cohort, the PPV of MT alone was relatively low (0.448 [0.41, 0.48]), but our framework improved PPV on the same group of patients. Given MT’s widespread clinical use, enhancing its performance represents a more achievable and clinically relevant goal than replacing it entirely with deep learning models. Combining the Bethesda score with MT increased the AUROC from 0.627 (MT alone) to 0.728 [0.68, 0.78], indicating that biopsy results complement MT. However, other metrics showed no significant improvement as the optimal decision point primarily relied on MT. While digitized cytology slides could offer more information that the Bethesda classification might overlook, they are not routinely digitized due to time and cost constraints.

This study has several key limitations that warrant discussion. As a single-center study without external validation, results may reflect practices specific to our institution and lack generalizability. Additionally, we were restricted to analyzing patients for whom US images were available in our database, excluding those with outside imaging or imaging missing for other reasons. We also assumed that patients who did not undergo surgery had benign pathology, an assumption consistent with prior literature studying MT. However, this assumption introduces potential bias, as institutional differences in the rate of malignancy within each Bethesda class may affect the classification of indeterminate nodules. Some of these patients may have had malignant pathology but did not undergo surgery due to factors such as advanced age, medical comorbidities, or sub-centimeter PTCs. This is particularly relevant given that 34% of our “benign” patients had a suspicious MT result. Additionally, NIFTP nodules, which accounted for 35% of the malignant nodules in our cohort, differ from other cancers in terms of presentation, clinical importance, and treatment, making it valuable to analyze them separately in future work. Finally, our dataset was limited to a binarized version of MT output (suspicious or benign). In clinical practice, recent MT reports often include probability ranges of malignancy risk that guide diagnostic decisions. However, reports from different companies use varying methods to compute and present these probabilities, and older reports lack this information entirely. Incorporating such granular data could enhance our model’s ability to build upon MT results and could be explored in future efforts using deep learning approaches.

Footnotes

Authors’ Contributions

S.A. contributed to conceptualization, formal analysis, methodology, investigation, visualization, and writing the original draft. A.M. and S.S.A.S. were responsible for software development, methodology, investigation, validation, and writing the original draft. V.I. and A.R. handled data curation, methodology, and writing—review and editing. V.R.S., C.M., H.Z., M.P., R.M., M.L., and M.Y. contributed to data curation, resources, validation, and writing—review and editing. C.W.A. and W.S. handled funding acquisition, supervision, project administration, and writing—review and editing.

Data Availability

The data underlying this article will be shared on reasonable request to the corresponding author.

Author Disclosure Statement

The authors have no disclosures or conflicts of interest.

Funding Information

This work was supported by the National Institute of Biomedical Imaging and Bioengineering of the National Institutes of Health under award number R21EB030691 and a UCLA Radiology Exploratory Research Grant.

Supplementary Material

Supplementary Data

References

1. Cibas, Ali. The 2017 Bethesda system for reporting thyroid cytopathology. Thyroid, 2017; 27(11):1341–1346.

2. Durante, Grani, Lamartina, et al. The diagnosis and management of thyroid nodules: A review. JAMA, 2018; 319(9):914–924; doi: 10.1001/jama.2018.0898

3. Roth, Witt, Steward. Molecular testing for thyroid nodules: Review and current state. Cancer, 2018; 124(5):888–898; doi: 10.1002/cncr.30708

4. Khan, Zeiger. Thyroid nodule molecular testing: Is it ready for prime time? Front Endocrinol (Lausanne), 2020; 11:590128; doi: 10.3389/fendo.2020.590128

5. Patel, Carty, Lee. Molecular testing for thyroid nodules including its interpretation and use in clinical practice. Ann Surg Oncol, 2021; 28(13):8884–8891; doi: 10.1245/s10434-021-10307-4

6. Livhits, Zhu, Kuo, et al. Effectiveness of molecular testing techniques for diagnosis of indeterminate thyroid nodules: A randomized clinical trial. JAMA Oncol, 2021; 7(1):70–77.

7. Sipos, Ringel. Molecular testing in thyroid cancer diagnosis and management. Best Pract Res Clin Endocrinol Metab, 2023; 37(1):101680; doi: 10.1016/j.beem.2022.101680

8. Sharifi, Bakhshali, Dehghani, et al. Deep learning on ultrasound images of thyroid nodules. Biocybernetics and Biomedical Engineering, 2021; 41(2):636–655; doi: 10.1016/j.bbe.2021.02.008

9. Khachnaoui, Guetari, Khlifa. A Review on deep learning in thyroid ultrasound computer-assisted diagnosis systems. In: 2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS). IEEE: Sophia Antipolis, France; 2018; pp. 291–297; doi: 10.1109/IPAS.2018.8708866

10. Cao C-L, Li Q-L, Tong, et al. Artificial intelligence in thyroid ultrasound. Front Oncol, 2023; 13:1060702; doi: 10.3389/fonc.2023.1060702

11. […], Xu H-L, Cao Y-N, et al. The performance of deep learning on thyroid nodule imaging predicts thyroid cancer: A systematic review and meta-analysis of epidemiological studies with independent external test sets. Diabetes Metab Syndr, 2023; 17(11):102891; doi: 10.1016/j.dsx.2023.102891

12. Zhuang, Ivezic, Feng, et al. Patient-Level thyroid cancer classification using attention multiple instance learning on fused multi-scale ultrasound image features. AMIA Annu Symp Proc, 2023; 2023:1344–1353.

13. Haugen, Sawka, Alexander, et al. American thyroid association guidelines on the management of thyroid nodules and differentiated thyroid cancer task force review and recommendation on the proposed renaming of encapsulated follicular variant papillary thyroid carcinoma without invasion to noninvasive follicular thyroid neoplasm with papillary-like nuclear features. Thyroid, 2017; 27(4):481–483.

14. Ilse, Tomczak, Welling. Attention-based deep multiple instance learning. 2018; doi: 10.48550/ARXIV.1802.04712

15. Steward, Carty, Sippel, et al. Performance of a multigene genomic classifier in thyroid nodules with indeterminate cytology: A Prospective Blinded Multicenter Study. JAMA Oncol, 2019; 5(2):204–212.

16. Cabanillas, McFadden, Durante. Thyroid Cancer. The Lancet, 2016; 388(10061):2783–2795.

17. Sosa, Hanna, Robinson, et al. Increases in thyroid nodule fine-needle aspirations, operations, and diagnoses of thyroid cancer in the United States. Surgery, 2013; 154(6):1420–1427; doi: 10.1016/j.surg.2013.07.006

18. Conn Busch, Cozzi, Li, et al. Role of machine learning in differentiating benign from malignant indeterminate thyroid nodules: A literature review. Health Sciences Review, 2023; 7:100089; doi: 10.1016/j.hsr.2023.100089

19. Gild, Chan, Gajera, et al. Risk stratification of indeterminate thyroid nodules using ultrasound and machine learning algorithms. Clin Endocrinol (Oxf), 2022; 96(4):646–652; doi: 10.1111/cen.14612

20. Wang, Zheng, Wan, et al. Deep learning models for thyroid nodules diagnosis of fine-needle aspiration biopsy: A retrospective, prospective, multicentre study in China. Lancet Digit Health, 2024; 6(7):e458–e469; doi: 10.1016/S2589-7500(24)00085-2