Abstract
Background:
The rising incidence of thyroid cancer presents a growing diagnostic and therapeutic challenge. Various risk stratification systems have sought to integrate clinical, ultrasonographic, and, in some cases, cytological features to aid malignancy prognostication. This systematic review aims to critically evaluate multimodal risk stratification tools (RSTs) for patients with thyroid nodules, assessing their diagnostic performance and clinical utility in supporting surgical decision-making.
Methods:
PubMed, Embase, and Cochrane databases were searched from inception to April 13, 2026, identifying studies evaluating multivariable risk prediction models for adult patients undergoing assessment of thyroid nodules. Studies were excluded if the proposed tool failed to incorporate clinical features, ultrasound findings, and cytology results or was not validated against histology. Data extraction encompassed methodology of model development, performance metrics, and approaches to validation. Risk of bias was assessed using the PROBAST+AI tool.
Results:
Seven studies describing five distinct RSTs met inclusion criteria: Thyroid Nodule App (TNAPP), the McGill Thyroid Nodule Score (MTNS), CUT Score, Memorial Sloan Kettering Cancer Centre (MSKCC) nomogram, and Thyroid Prediction Score (TiPS). TiPS demonstrated the highest sensitivity (96.2%) and specificity (97.5%) with area under the curve (AUC) >0.9. The CUT score also showed strong performance (AUC >0.9), particularly in low-to-intermediate risk nodules. TNAPP underperformed (accuracy 50.5%; specificity 27.5%) despite broad clinical inputs. The MTNS and MSKCC, although promising for indeterminate cytology, lacked robust validation. Most models were derived from single-center, retrospective cohorts, limiting generalizability.
Conclusions:
RSTs integrating multimodal data may improve thyroid nodule risk stratification, particularly in cases of indeterminate cytology. However, methodological limitations and lack of external validation currently restrict clinical utility. Prospective evaluation in diverse populations is required to identify the most effective and generalizable tools. Until then, RSTs should be used as adjuncts to, not replacements for, clinical judgment and shared decision-making in thyroid nodule assessment.
Introduction
The rising global incidence of thyroid cancer poses significant challenges for international health systems and policy. 1 Improved access to ultrasonography and technological advancements have led to a surge in the detection of incidental nodules.2,3 However, less than 10% of thyroid nodules are malignant, and most of these are indolent microcarcinomas that carry favorable prognoses and do not impact overall survival.2,4,5 Despite the rising incidence of thyroid cancer, mortality rates have remained stable, implying a trend toward overdiagnosis and overtreatment. 6
Accurately distinguishing benign from malignant thyroid nodules prior to surgery remains a significant challenge. No single ultrasonographic parameter can reliably predict malignancy. However, combinations of sonographic characteristics (including taller-than-wide shape, irregular margins, and microcalcifications) correlate with an increased risk of cancer.7–9 These high-risk ultrasound features have been integrated into various international classification systems.10–13 While fine-needle aspiration (FNA) cytology augments estimations of malignancy risk for suspicious nodules, its diagnostic value is limited.14,15 Despite its high reported sensitivity and specificity, 16 FNA cannot conclusively rule out malignancy, especially for indeterminate results. 17 Clinical factors, including age,18,19 sex, 20 family history,21,22 biochemical markers,23,24 and prior radiation exposure,25,26 are associated with thyroid cancer but vary in their predictive value, and no single factor is definitive. 27 Consequently, many patients undergo thyroidectomy for nodules later proven benign on final histopathology.28,29
Several risk prediction models have been proposed, combining clinical, imaging, and cytological features into composite scoring systems. These models offer the potential to streamline decision-making, standardize risk assessment, support individualized care, reduce costs, and ultimately improve patient outcomes by reducing unnecessary interventions.30–32 However, many of these tools are derived from narrowly defined patient subsets and lack rigorous external validation; hence, none are currently advocated for routine clinical use in the UK. 11
This systematic review aims to evaluate clinically usable multimodal risk stratification tools (RSTs) for thyroid nodules that integrate clinical, ultrasonographic, and cytological variables. In addition to comparing reported diagnostic performance, this review also examines the methodological quality, heterogeneity, and real-world applicability of these models.
Methods
This systematic review was conducted in accordance with the 2020 Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. 33 A comprehensive search of PubMed, Embase, and Cochrane databases was conducted from inception to April 13, 2026 (Supplementary Data). Duplicate citations were removed after export using manual bibliographic matching within the reference management software. After removing duplicates, two blinded reviewers independently screened titles and abstracts, resolving all discrepancies through discussion. Full-text articles were assessed independently by the same two blinded reviewers, with any remaining discrepancies resolved by consultation with a third independent reviewer. Additional studies were identified through snowballing search strategies.
Inclusion criteria
This review included adult patients undergoing assessment for thyroid nodules. The intervention of interest was defined as thyroid RSTs, which combined ultrasound findings, cytology, and clinical risk factors as predictor variables to estimate individual risk of malignancy. The primary outcome was the presence of histologically confirmed malignancy.
Exclusion criteria
Studies were excluded if they involved pediatric patients, animals, or disease processes unrelated to thyroid cancer, such as toxic nodules. Case reports, literature reviews, and articles not published in English were also excluded. Multivariable models incorporating only two of the three key predictor domains (ultrasound findings, cytology, and clinical risk factors) and studies restricted to narrowly defined patient subsets (such as nodules with indeterminate cytology or specific ultrasound classification categories) were excluded from the formal review dataset and PRISMA synthesis but were considered separately in the narrative discussion for contextual comparison. Studies of RSTs designed to predict cervical lymph node metastasis, rather than the risk of primary thyroid malignancy, were not included. RSTs developed specifically to predict a single histological subtype of thyroid malignancy (e.g., papillary or medullary thyroid cancer), rather than overall thyroid cancer risk, were excluded from formal analysis. Finally, for the purposes of this review, a clinically usable tool was defined as one that provided a reproducible patient-level output using specified inputs and an explicit framework for risk estimation or decision-making, for example, a point-based score, nomogram, clinical calculator, or explicit decision rule. Studies reporting multivariable logistic regression alone, and machine learning models without a directly applicable clinical output, were therefore excluded.
Data extraction
Data were extracted regarding study design, methodology, population characteristics, predictor variables, model performance, and validation. Patient characteristics (including mean age, gender distribution, and prevalence of malignancy) were collated to evaluate the comparability of baseline populations. Data extraction was performed independently by both reviewers.
Data synthesis and analysis
Descriptive statistics were used to summarize model characteristics and identify the most frequently utilized predictor variables. Diagnostic performance was assessed using area under the curve (AUC), C-statistics, sensitivity, and specificity, where available. In accordance with established thresholds, AUC or C-statistic values of 0.5–0.6 indicated poor performance, 0.6–0.7 adequate performance, and values above 0.7 good performance. 34
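The performance bands described above can be written as a small helper. This is a minimal sketch rather than part of the review's methods: the function name `classify_auc` is hypothetical, and the handling of the exact boundary values (0.6 is treated as adequate, 0.7 as adequate because only values above 0.7 are described as good) and of values below 0.5 are assumptions, since the stated ranges do not specify them.

```python
def classify_auc(auc: float) -> str:
    """Map an AUC or C-statistic to the qualitative bands used in this review.

    Boundary handling is an assumption: 0.6 falls in "adequate" and 0.7 in
    "adequate", because only values above 0.7 are described as good.
    """
    if not 0.0 <= auc <= 1.0:
        raise ValueError("AUC must lie between 0 and 1")
    if auc < 0.5:
        # Assumption: the review does not define a band below 0.5.
        return "worse than chance"
    if auc < 0.6:
        return "poor"
    if auc <= 0.7:
        return "adequate"
    return "good"


if __name__ == "__main__":
    # Illustrative inputs only; 0.91 is the C-index reported for the MSKCC nomogram.
    for value in (0.55, 0.65, 0.91):
        print(value, classify_auc(value))
```

Under these thresholds, any discrimination value above 0.7 is labeled good, so the bands do not distinguish between, for example, an AUC of 0.75 and one above 0.9.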
Risk of bias assessment
Included studies were assessed for their risk of bias using the Prediction Model Risk Of Bias Assessment Tool (PROBAST+AI). 35 An overall low-risk rating was only assigned when all domains were judged as low risk. Two reviewers independently conducted the risk of bias assessments. For studies that developed or analyzed multiple similar models, a single risk of bias assessment was conducted.
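The overall rating rule can be sketched as follows. Only the first rule (overall low risk requires low risk in every domain) is stated in this review; treating any high-risk domain as an overall high rating, and otherwise returning unclear, follows the convention of the original PROBAST guidance and is an assumption here. The function name, domain names, and example ratings are hypothetical and do not represent any included study.

```python
from typing import Mapping

ALLOWED_RATINGS = {"low", "high", "unclear"}


def overall_risk_of_bias(domains: Mapping[str, str]) -> str:
    """Combine per-domain ratings into one overall risk of bias rating."""
    ratings = [rating.lower() for rating in domains.values()]
    if not ratings:
        raise ValueError("at least one domain rating is required")
    unknown = set(ratings) - ALLOWED_RATINGS
    if unknown:
        raise ValueError(f"unrecognized ratings: {sorted(unknown)}")
    # Rule stated in the review: low overall only if every domain is low.
    if all(rating == "low" for rating in ratings):
        return "low"
    # Assumption (original PROBAST convention): any high domain gives high overall.
    if "high" in ratings:
        return "high"
    return "unclear"


if __name__ == "__main__":
    # Hypothetical ratings for illustration only.
    example = {
        "participants": "unclear",
        "predictors": "low",
        "outcome": "low",
        "analysis": "high",
    }
    print(overall_risk_of_bias(example))
```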
Institutional Review Board waiver
Institutional Review Board approval was not required for this systematic review, as it does not involve human subjects or identifiable personal data.
Results
Study selection
In total, 3124 records were identified, as outlined in the PRISMA flow diagram (Fig. 1). Following removal of duplicates and articles published in languages other than English, 2830 articles were eligible for screening. Title and abstract screening excluded 2524 records, leaving 306 articles for full-text review. Following full-text review, the most common reasons for exclusion were studies limited to multivariable statistical analysis without describing an RST (n = 241), models incorporating only two of the three key predictor variables (n = 40), and studies focused on narrowly defined patient subgroups (n = 37).
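The record counts at each stage can be cross-checked with simple arithmetic. The sketch below uses only the totals reported above; the variable names are illustrative, and the final subtraction assumes that all seven included studies came from the full-text review stage, which may not hold if any were identified through snowballing.

```python
# Counts reported in the study selection paragraph.
records_identified = 3124
records_screened = 2830
excluded_at_screening = 2524
studies_included = 7

# Records removed before screening (duplicates and non-English articles).
removed_before_screening = records_identified - records_screened

# Articles that proceeded to full-text review.
full_text_reviewed = records_screened - excluded_at_screening

# Assumption: every included study passed through full-text review.
excluded_at_full_text = full_text_reviewed - studies_included

print("Removed before screening:", removed_before_screening)
print("Full-text articles reviewed:", full_text_reviewed)
print("Excluded at full-text review:", excluded_at_full_text)
```

The second figure reproduces the 306 full-text articles stated in the text.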

PRISMA flow diagram of exclusion criteria. PRISMA, Preferred Reporting Items for Systematic Reviews and Meta-Analyses.
Characteristics of included studies
Seven studies met criteria for inclusion, describing five distinct RSTs for thyroid cancer: TNAPP,36,37 the CUT Score, 38 the McGill Thyroid Nodule Score (MTNS),39,40 TiPS, 41 and the MSKCC nomogram 42 (Table 1).
Characteristics of Included Studies
MSKCC, Memorial Sloan Kettering Cancer Centre; RST, risk stratification tool; TiPS, Thyroid Prediction Score; TNAPP, Thyroid Nodule App.
TNAPP 36 and MTNS 39 were developed through expert consensus, rather than statistical modeling. TNAPP was validated on 95 cases submitted from multiple centers to test interface usability. 36 However, the absence of clear case selection criteria raises concerns about the potential for selection bias. The MTNS underwent more robust internal validation, with retrospective application to 844 patients who underwent surgery. Ianni et al. derived their CUT score following meta-analysis of over 30,000 nodules, with a small prospective validation (n = 110). 38 Most tools were developed or validated at single centers, raising concerns about generalizability and potential overfitting. 43
Malignancy rates varied markedly across studies, from 12.5% in the TiPS cohort 41 to 53–63% in CUT and MTNS cohorts,38–40 likely reflecting referral and selection biases. Age distributions were generally comparable (mean/median 53–55 years),37–40,42 except for TiPS (mean age 43 years) 41 and TNAPP, which did not report demographics. 36 A predominantly female population distribution was noted across all studies, although male representation ranged from 11.4% 41 to 33%, 42 which may influence predictive performance. Baseline nodule size, where reported, was also broadly comparable across cohorts. However, the MTNS development cohort reported by Sands et al. appeared to include relatively larger nodules, with 73% measuring >2 cm; this may partially account for the higher malignancy rate of 63% observed in that study. Although several other potentially relevant baseline features were captured in individual studies, including TSH, family history, prior irradiation, and multiplicity, their reporting was too sporadic to allow meaningful comparison across cohorts. Papillary thyroid carcinoma (PTC) formed the predominant malignancy in all cohorts, although the spectrum of histological subtypes differed. The MSKCC cohort included the broadest range of thyroid cancers, likely due to its tertiary referral setting. 42 Some studies did not report histological subtypes, limiting comparability and applicability to diverse patient populations.36,40
Eligibility criteria were heterogeneous and often restrictive. TNAPP excluded high-risk features to focus on low-to-moderate risk nodules.36,37 The MSKCC nomogram 42 and MTNS39,40 were restricted to surgical cohorts. While histopathology offers a definitive gold standard outcome, this limits the generalizability of RSTs for broader outpatient practice. TiPS 41 excluded patients who were pregnant, were thyrotoxic, or had previous thyroid cancer. In contrast, eligibility criteria for the CUT score were insufficiently reported, 38 limiting assessment of its external applicability.
Characteristics of risk prediction models
All five models incorporated clinical, ultrasound, and cytological variables, though with differing breadth and methodological approaches (Table 2). The MTNS produced a 22-item weighted scoring system based on expert consensus, while the MSKCC nomogram was the only model developed purely through multivariable logistic regression. TiPS represented the simplest approach, combining basic clinical data with established ACR-TIRADS and Bethesda classification systems. The TNAPP tool included the broadest range of clinical factors, while the CUT score limited clinical variables to those supported through prior meta-analysis. TSH was the most consistently used clinical parameter, employed in four out of five models.
Overview of Risk Stratification Tool Components
FNA, fine-needle aspiration; MSKCC, Memorial Sloan Kettering Cancer Centre; RST, risk stratification tool; TiPS, Thyroid Prediction Score; TNAPP, Thyroid Nodule App.
Ultrasound features including hypoechogenicity, irregular margins, and microcalcifications were common across all models, although weighting differed. TiPS relies entirely on ACR TI-RADS for its ultrasound component, which may improve interobserver consistency. By contrast, the CUT score assigns odds-ratio-based weights to each sonographic parameter. The MTNS adopts a broader approach, including PET positivity and lymphadenopathy, while the MSKCC nomogram narrows its focus to shape, echotexture, and vascularity, emphasizing cytology instead. TNAPP blends the AACE/AME and ACR TI-RADS frameworks into a simplified three-tier system.
When incorporating cytology, most models relied on established classification systems: TNAPP, MTNS, and TiPS integrate the Bethesda system to varying extents, while the CUT score uses the Italian Thy system. 44 The MSKCC nomogram was the only tool to incorporate detailed cytological parameters not formally aligned with any traditional classification system. The TNAPP and MTNS also incorporated selective molecular markers (like BRAF) and other immunohistochemical markers.
The primary outcome across all RSTs was the estimation of malignancy risk. The TNAPP tool provided the most granular output, including malignancy probability ranges and management recommendations. However, its malignancy probability estimates were partly based on data that predate the reclassification of Noninvasive Follicular Thyroid Neoplasm with Papillary-like Nuclear Features and may overestimate true malignancy rates. The CUT score stratifies nodules into low, intermediate, or high risk but without providing management guidance. The MTNS, TiPS, and MSKCC nomogram offer numerical risk estimates to support decision-making without recommending management.
Model performance
Considerable heterogeneity was observed in diagnostic performance across models (Table 3). Diagnostic performance varied according to the thresholds applied within each model. TiPS demonstrated excellent performance at a score threshold ≥6 with a sensitivity of 96.2% and excellent predictive values [negative predictive value (NPV) 99.5%, positive predictive value (PPV) 83.3%], outperforming ACR TI-RADS alone in the same cohort. The CUT score showed similarly high sensitivity (95% for scores >2.5) with AUC values >0.9, indicating strong discrimination. In contrast, TNAPP demonstrated lower overall accuracy (50.5%) and low specificity (27.5%), offering limited improvement over existing guideline-based approaches.
Summary of Comparative Risk Stratification Tool Performance
AUC, area under the curve; MSKCC, Memorial Sloan Kettering Cancer Centre; RST, risk stratification tool; TiPS, Thyroid Prediction Score; TNAPP, Thyroid Nodule App.
Specificity was highest for TiPS (97.5% for score thresholds ≥6). Similarly, the CUT score achieved specificities of up to 95% for higher thresholds, although this was accompanied by reduced sensitivity. In contrast, TNAPP displayed the weakest specificity (27.5%) and predictive values (PPV 60.4%, NPV 44.2%), suggesting utility primarily for ruling in rather than ruling out malignancy. The MTNS and MSKCC models were developed specifically for indeterminate cytology (Bethesda III/IV). MTNS provided clear risk gradients, with scores >9 corresponding to malignancy rates of 63% and scores of 7 corresponding to a 32% risk of malignancy. However, no formal sensitivity, specificity, accuracy, or receiver operating characteristic data were reported. By contrast, the MSKCC nomogram achieved excellent discrimination (C-index 0.91), outperforming cytology alone.
This comparative analysis highlights the differing clinical priorities of each system: ACR TI-RADS and TNAPP prioritize sensitivity, whereas TiPS, CUT, and the MSKCC nomogram provide more balanced performance profiles with higher specificity and overall diagnostic utility.
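Because the included cohorts had very different malignancy rates, it helps to see how predictive values depend on prevalence. The sketch below applies Bayes' theorem to the sensitivity and specificity reported for TiPS at a score threshold ≥6, using the 12.5% malignancy rate of the TiPS cohort and, for contrast, the 63% rate reported for the MTNS development cohort. The function name is hypothetical, and the back-calculated values are approximations that need not match the published PPV and NPV exactly, since those were derived from cohort counts.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float) -> tuple[float, float]:
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1")
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    if true_pos + false_pos == 0.0 or true_neg + false_neg == 0.0:
        raise ValueError("predictive values are undefined for these inputs")
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


if __name__ == "__main__":
    # TiPS sensitivity and specificity as reported in this review.
    for prevalence in (0.125, 0.63):
        ppv, npv = predictive_values(0.962, 0.975, prevalence)
        print(f"prevalence={prevalence:.3f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

At the TiPS cohort prevalence, this gives a PPV of roughly 85% and an NPV above 99%, close to the reported figures. The same sensitivity and specificity would yield a higher PPV and a lower NPV in a high-prevalence surgical cohort, which is one reason predictive values cannot be compared directly across these studies.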
External validation of model performance
External validation of MTNS has shown consistent performance in intermediate nodules, with accuracy up to 91.4% for higher score thresholds.45,46 Scheffler et al. proposed MTNS+ as an extended version of the score, demonstrating that the incorporation of thyroglobulin may further improve sensitivity by up to 10.5%, particularly at lower score thresholds >7.47,48 However, specificity and positive predictive value were not substantially improved.
The CUT score demonstrates mixed outcomes during external validation. While some cohorts demonstrate moderate discriminative ability,49,50 others report poor performance. 51 Thus, while the CUT score demonstrated promise in select clinical cohorts, its generalizability beyond the original Italian cohort appears limited, underscoring the need for population-specific external validation and contextual calibration for RSTs.
Risk of bias
The risk of bias and applicability of the seven included studies, which describe five prediction models, were independently assessed by two authors using the PROBAST+AI tool 35 (Fig. 2 and Tables 4 and 5). Most models exhibited an unclear risk of bias, with a high risk of bias predominantly in the analysis domain. Common issues included small sample sizes, inadequate handling of missing data, retrospective designs without blinding, and lack of appropriate internal or external validation. Several models did not adequately address potential overfitting or provide sufficient statistical justification for predictor selection and weighting. While predictor and outcome definitions were generally consistent and clinically appropriate (low risk), the use of case-cohort designs and exclusion of nonsurgical cases in several studies raised concerns surrounding representativeness.

Study-Level Risk of Bias Assessment Using PROBAST+AI for the Five Included Prediction Models
MSKCC, Memorial Sloan Kettering Cancer Centre; MTNS, McGill Thyroid Nodule Score; TiPS, Thyroid Prediction Score; TNAPP, Thyroid Nodule App.
Study-Level Applicability Assessment Using PROBAST+AI for the Five Included Prediction Models
MSKCC, Memorial Sloan Kettering Cancer Centre; MTNS, McGill Thyroid Nodule Score; TiPS, Thyroid Prediction Score; TNAPP, Thyroid Nodule App.
Discussion
Summary of main findings
By focusing specifically on clinically usable multimodal tools, rather than isolated predictors or broader predictive modeling studies, this review highlights the small number of models that have progressed to patient-level application and clarifies the key barriers preventing wider clinical implementation.
Across the five included RSTs, ultrasound features like hypoechogenicity, microcalcifications, irregular margins, and nodule shape were the most consistently used predictors. TSH emerged as the most frequently incorporated clinical variable, and cytological assessment remained central to most models. Among the models, TiPS demonstrated the strongest diagnostic performance, with a high Youden Index indicating an excellent balance between sensitivity and specificity. The CUT score performed well at lower thresholds but lost sensitivity as specificity increased. TNAPP demonstrated poor overall accuracy, with little benefit beyond existing guidelines. External validation was limited, with only TNAPP and MTNS evaluated outside of their development cohorts, but only in restricted, single-center populations.
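For reference, the Youden Index is calculated as sensitivity plus specificity minus one, so a value of 1 indicates perfect discrimination at the chosen threshold and a value of 0 indicates no better than chance. The sketch below applies this to the sensitivity and specificity reported in this review for TiPS and, for comparison, for the two-parameter BETH-TR score; the function name is hypothetical.

```python
def youden_index(sensitivity: float, specificity: float) -> float:
    """Youden's J statistic: sensitivity + specificity - 1."""
    for name, value in (("sensitivity", sensitivity), ("specificity", specificity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1")
    return sensitivity + specificity - 1.0


if __name__ == "__main__":
    # Values as reported in this review.
    reported = {
        "TiPS (threshold >=6)": (0.962, 0.975),
        "BETH-TR": (0.92, 0.74),
    }
    for tool, (sens, spec) in reported.items():
        print(f"{tool}: J = {youden_index(sens, spec):.3f}")
```

Because the index weights sensitivity and specificity equally, it does not reflect the different clinical costs of a missed malignancy and an unnecessary operation.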
Overall, TiPS appeared to be the most promising tool in terms of reported diagnostic performance. This may reflect its use of a smaller number of standardized, high-yield variables, which could improve reproducibility compared with more complex tools incorporating numerous heterogeneous parameters. However, this finding should be interpreted cautiously, given that the model is recently published, single-center, and has not yet undergone external validation. Its apparent superiority may therefore reflect cohort-specific performance, and prospective multicenter validation is required to establish its generalizability. More broadly, direct comparison of reported sensitivity, specificity, and AUC across models is limited because the included RSTs were developed for different clinical scenarios, target populations, and purposes, including general thyroid nodule assessment, indeterminate cytology, and decision-making. Apparent differences in performance may therefore reflect differences in case mix and intended use, rather than true superiority of one model over another.
Implications for practice
RSTs aim to improve cancer diagnosis and management by providing personalized, objective estimates of malignancy risk, with benefits proven across multiple other medical specialties.52–56 These tools purport to enhance patient communication, reduce overtreatment, support shared decision-making, and ensure timely intervention for high-risk cases.57–59 However, real-world implementation faces challenges, including model complexity, data input requirements, patient acceptability, and integration into clinical workflows. 57
Risk stratification holds particular value in overstretched cancer pathways, where the rising incidence of thyroid cancer demands more efficient, balanced approaches to diagnosis. 60 Such models could help avoid unnecessary FNAs and operations, reduce health care expenditure, and minimize patient harm.61,62 Beyond individual patient management, RSTs could inform broader population health strategies, 57 improve cost-effectiveness, 31 and serve as valuable tools for medical education and research. 63
However, applicability may differ across health care systems. In the United States, molecular testing is now increasingly integrated into the management of indeterminate thyroid nodules and may reduce the incremental use of nonmolecular RSTs in routine practice.64,65 While molecular testing can improve diagnostic stratification, its widespread availability may also encourage reliance on expensive adjunctive testing in situations where structured risk assessment and clinical judgment may otherwise be sufficient. By contrast, in resource-limited settings where molecular testing is less available, less affordable, or not routinely reimbursed, multimodal RSTs may offer greater practical value.66–68 This further emphasizes the importance of robust external validation across different health care systems and resource settings.
Risk stratification tools incorporating multimodal inputs in restricted patient cohorts
Eleven other multimodal RSTs were identified but excluded from formal analysis due to their narrowly defined patient cohorts, despite incorporating all three predictor domains.69–79 Most models focused on patients with indeterminate cytology, demonstrating moderate to good discrimination (AUC 0.721–0.757) 70 and high negative predictive values (99.5%). 69
Other studies incorporated more detailed cytological criteria or used machine learning techniques, again demonstrating moderate to good discrimination (AUC 0.784–0.84).71,72,79 Integrating molecular testing further improved accuracy. Models combining clinical, sonographic, cytological, and molecular data achieved superior discrimination, with AUCs up to 0.88, which outperformed molecular testing in isolation.76,78 This underscores the potential role of multimodal RSTs in triaging indeterminate nodules suitable for molecular analysis. A smaller subset of RSTs targeted specific histological subtypes with mixed results. While one study found no reliable predictors for differentiating follicular adenoma from carcinoma, 73 another machine-learning model achieved high diagnostic accuracy when predicting follicular thyroid carcinoma (AUC 0.97). 75
Risk stratification tools using two key parameters
This systematic review identified over 30 RSTs incorporating only two parameters. Most combined ultrasound features with select clinical variables like age, TSH, autoimmune status, or family history. These models generally demonstrated good discriminative performance (AUCs 0.80–0.95), implying the integration of select clinical risk factors can meaningfully enhance ultrasound-based risk stratification and reduce unnecessary FNAs.80–82 This moderate-to-strong performance is reflected in more recent studies, with the strongest RSTs integrating age as a clinical variable (AUC 0.84–0.948).83–85 A few studies paired cytology with ultrasound to refine risk assessment following FNA.86–89 For instance, the BETH-TR score integrated ACR-TIRADS with Bethesda scores, achieving strong diagnostic performance (92% sensitivity, 74% specificity, AUC 0.88). 90
Despite promising diagnostic performance, these high AUCs should be interpreted cautiously given most models were retrospective, single-center, and lacked external validation. Many RSTs were trained on high-risk surgical cohorts, which likely inflated diagnostic accuracy.77,91,92 Advanced machine-learning or multimodal approaches often relied on niche biomarkers or nonroutine imaging (such as elastography or Doppler), limiting real-world applicability.77,92–95 Extremely high AUCs (>0.95) reported in some studies likely reflect overfitting, small sample sizes, and restricted cohort selection.43,77,92 Overall, while combining clinical data with ultrasound can improve predictive performance, two-parameter models should be used cautiously and primarily as decision-support tools until validated prospectively across diverse outpatient populations.
Limitations of existing models
Current best practice for evaluating suspicious thyroid nodules involves ultrasound-guided FNA, although its utility is limited by interobserver variability, 96 operator subjectivity, 97 and modest diagnostic accuracy. 40 Cytology alone cannot reliably differentiate benign from malignant thyroid nodules, often necessitating diagnostic surgery. 16
RSTs seek to address these limitations by integrating clinical, ultrasonographic, and cytological data. However, diagnostic accuracy alone does not establish clinical utility. For an RST to influence practice meaningfully, it must also demonstrate adequate calibration, provide clinically actionable and validated decision thresholds, and show downstream benefit in patient management, particularly in reducing unnecessary diagnostic surgery. These features were inconsistently reported across the included studies, limiting confidence in the practical applicability of otherwise promising models.
Their widespread use is further constrained by limited external validation, reliance on single-center retrospective cohorts, and uncertain calibration and performance across different health care contexts. Some models, like TNAPP, offer little improvement over established classification systems or depend on extensive clinical inputs, which may hinder routine use.
Although many models may improve malignancy risk estimation, their true clinical value hinges on defining actionable thresholds to ensure real-world applicability and meaningful integration into clinical decision-making. For instance, although models like the MTNS provide stratified malignancy risk estimates, there was no consensus on what thresholds should trigger intervention. Without clearly defined and clinically validated cut-off values, these estimates risk being academically interesting yet practically ambiguous.
Strengths and limitations
This systematic review employed a rigorous and transparent methodology and contemporary risk of bias assessment to comprehensively examine RSTs for thyroid cancer that incorporated clinical, ultrasound, and cytological features. By restricting inclusion to models with explicit patient-level application, the review moves beyond descriptive predictor studies and focuses on tools with potential relevance to clinical decision-making.
However, several limitations must be acknowledged. First, this review excluded machine learning models without interpretable outputs, which may have omitted emerging tools with significant predictive potential. The restriction to English-language publications introduces the risk of language bias, in addition to publication bias, as studies reporting poor model performance may be underreported in the literature. In addition, the included RSTs span a broad publication period, during which diagnostic pathways and practice patterns in thyroid nodule assessment have changed substantially. This temporal heterogeneity makes direct comparison more difficult and may limit the present-day applicability of older models.
A further challenge lies in the substantial heterogeneity of study design, inclusion criteria, outcome measures, statistical analysis, and predictor variables across included papers, rendering direct comparison of diagnostic performance difficult between RSTs. In addition, the clinical relevance of nonmolecular RSTs may vary by health care setting. In the United States, widespread access to molecular testing for indeterminate nodules may limit their incremental utility, whereas in resource-limited systems such tools may be of greater practical importance. This variability complicates the generalizability of findings, posing a challenge when considering which model is best suited for broad clinical implementation across diverse settings.
Future directions and research priorities
The significant heterogeneity identified across studies suggests that a single, universally applicable RST for all thyroid nodules and malignant subtypes may be difficult to achieve. Future research may therefore be more productive if directed toward clinically specific decision contexts, particularly indeterminate cytology and follicular-patterned lesions, where the main challenge is uncertainty around the need for diagnostic surgery. In this setting, the principal clinical value of RSTs is likely to lie in reducing unnecessary diagnostic hemithyroidectomy.
Robust, prospective, multicenter external validation is essential to establish the reliability of existing RSTs. Formal head-to-head comparisons of multiple RSTs within the same study population are needed to determine relative performance. Future tools should prioritize usability, providing clear, actionable guidance that goes beyond probabilistic estimates to support evidence-based clinical recommendations. Beyond diagnostic performance, future research should also assess the broader health economic impact of RSTs, including their role in resource optimization and cost-effectiveness.
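As an illustration of what a head-to-head comparison on a shared cohort involves, the following minimal Python sketch computes sensitivity, specificity, and AUC for two tools scored on the same histology-confirmed nodules. The function names, the 0.5 decision threshold, and the toy values are hypothetical and are not drawn from any included study; a real comparison would also require confidence intervals and a paired statistical test.

```python
from typing import Sequence, Tuple


def sens_spec(scores: Sequence[float], truth: Sequence[int],
              threshold: float) -> Tuple[float, float]:
    """Sensitivity and specificity when score >= threshold is called malignant.

    truth uses 1 for histologically malignant and 0 for benign.
    Assumes the cohort contains at least one malignant and one benign nodule.
    """
    tp = sum(1 for s, t in zip(scores, truth) if t == 1 and s >= threshold)
    fn = sum(1 for s, t in zip(scores, truth) if t == 1 and s < threshold)
    tn = sum(1 for s, t in zip(scores, truth) if t == 0 and s < threshold)
    fp = sum(1 for s, t in zip(scores, truth) if t == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores: Sequence[float], truth: Sequence[int]) -> float:
    """AUC as the probability that a malignant nodule outscores a benign one.

    Computed over all malignant-benign pairs; ties count as one half.
    """
    pos = [s for s, t in zip(scores, truth) if t == 1]
    neg = [s for s, t in zip(scores, truth) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy cohort: the same eight nodules scored by two tools.
histology = [1, 1, 1, 0, 0, 0, 0, 1]
tool_a = [0.91, 0.72, 0.40, 0.35, 0.10, 0.55, 0.20, 0.80]
tool_b = [0.60, 0.65, 0.30, 0.45, 0.50, 0.70, 0.15, 0.58]

for name, scores in (("Tool A", tool_a), ("Tool B", tool_b)):
    sensitivity, specificity = sens_spec(scores, histology, threshold=0.5)
    print(f"{name}: sensitivity={sensitivity:.2f}, "
          f"specificity={specificity:.2f}, AUC={auc(scores, histology):.2f}")
```

Because both tools are evaluated on identical patients against the same histological reference, differences in these metrics are not confounded by case mix, which is the main advantage of a within-cohort comparison over comparing figures reported in separate studies.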
In parallel, there is potential for the development of user-friendly, digital decision-support tools to integrate validated models into everyday clinical workflows.98 These platforms could allow for dynamic updates as new evidence emerges, including molecular markers. Artificial intelligence (AI) represents a promising avenue for enhancing risk prediction, as noted in other medical fields.99–102 As AI-based models become more robust, interpretable, and clinically validated, they may alter future diagnostic pathways and reduce reliance on conventional rule-based RSTs based on current clinical, ultrasound, and cytological criteria. In addition, AI-driven image analysis of ultrasound images and cytology slides may uncover patterns not readily discernible with human assessment.103 However, these approaches also face important limitations, including reduced interpretability, dependence on large high-quality datasets, risks of overfitting, and uncertainty regarding generalizability across different health care settings.102,104,105 Such approaches should seek to complement, rather than replace, clinical judgment to support shared decision-making.
Take-Home Messages
Multimodal RSTs may improve thyroid nodule malignancy risk estimation, particularly for indeterminate cytology. Current models are limited by retrospective design, single-center derivation, and limited external validation. Simpler tools with standardized inputs may offer better reproducibility, but high reported performance requires cautious interpretation. At present, RSTs should be used as adjuncts to clinical judgment, rather than stand-alone decision tools. Prospective multicenter validation and comparison with molecular and AI-based approaches are needed.
Conclusion
As the health care landscape continues to evolve, simple, reliable, and widely applicable RSTs represent a pragmatic, evidence-based approach to optimize thyroid cancer care while addressing the growing demands on health systems. Overall, the limitations of both conventional diagnostic pathways and existing RSTs highlight the need for prospective, external, multicenter validation studies in demographically diverse populations. While current RSTs show promise, further robust external validation is essential before they can be recommended for routine clinical use.
Authors’ Contributions
E.W.: Conceptualization (lead), methodology, formal analysis (lead), investigation (literature screening, narrative analysis, risk of bias assessment), and writing – original draft. Z.S.: Investigation (literature screening, risk of bias assessment) and writing – review and editing. K.B.: Conceptualization, methodology (lead), writing – review and editing, and supervision. N.S.: Writing – review and editing (lead) and supervision (lead).
Footnotes
Author Disclosure Statement
No competing financial interests exist.
Funding Statement
No funding was received for this article.
Supplemental Material
References
