Abstract
Background:
The use of artificial intelligence (AI) in health care has grown exponentially with the promise of facilitating biomedical research and enhancing diagnosis, treatment, monitoring, disease prevention, and health care delivery. We aim to examine the current state, limitations, and future directions of AI in thyroidology.
Summary:
AI has been explored in thyroidology since the 1990s, and currently, there is an increasing interest in applying AI to improve the care of patients with thyroid nodules (TNODs), thyroid cancer, and functional or autoimmune thyroid disease. These applications aim to automate processes, improve the accuracy and consistency of diagnosis, personalize treatment, decrease the burden for health care professionals, improve access to specialized care in areas lacking expertise, deepen the understanding of subtle pathophysiologic patterns, and accelerate the learning curve of less experienced clinicians. There are promising results for many of these applications. Yet, most are in the validation or early clinical evaluation stages. Only a few are currently adopted for risk stratification of TNODs by ultrasound and determination of the malignant nature of indeterminate TNODs by molecular testing. Challenges of the currently available AI applications include the lack of prospective and multicenter validations and utility studies, small training data sets with low diversity, differences in data sources, lack of explainability, unclear clinical impact, inadequate stakeholder engagement, and inability to be used outside of the research setting, which might limit the value of their future adoption.
Conclusions:
AI has the potential to improve many aspects of thyroidology; however, addressing the limitations affecting the suitability of AI interventions in thyroidology is a prerequisite to ensure that AI provides added value for patients with thyroid disease.
Introduction
Artificial intelligence (AI) was born in 1956 under the premise that “Any aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.” However, AI only began to show promising progress in the medical field in the early 2000s with the progress of computational capacity and the digitalization of health care. It was not until 2017 that the U.S. Food and Drug Administration (FDA) approved the first AI-based application for use. 1 AI has the potential to improve clinical effectiveness, access to care, and biomedical research by optimizing disease diagnosis, treatment, monitoring, prevention, and health care delivery. 2,3
In thyroidology, the earliest published use of AI was in 1991 when researchers attempted to create diagnostic networks to interpret the thyroid function tests. 4,5 Since then, the interest in potential applications of AI has extended to almost all areas of thyroidology. In this review, we aimed to provide thyroid clinicians and researchers with a framework to understand the latest development of AI in thyroidology by comprehensively characterizing emerging AI applications in several fields of thyroidology such as thyroid nodules (TNODs), thyroid cancer, and functional or autoimmune thyroid disease. Specifically, our framework provides a broad narrative overview of the process from conceiving to adopting AI applications, the current applications and techniques of AI in thyroidology, and the challenges that need to be addressed to facilitate future interventions as well as their added value for treating thyroid disease.
To guide this perspective review, we conducted a nonsystematic search of any study published until December 2022 that included the conception, development, validation, utility testing, or adoption of AI algorithms in thyroid conditions (Fig. 1) (Supplementary Appendix SA1).

Fig. 1. Applications of AI in thyroidology grouped by year of publication, disease of interest, and stage of development. AI, artificial intelligence.
General Concepts
AI is a computing technology capable of mimicking or surpassing human intelligence. 6 Today's AI algorithms take advantage of vast amounts of data (input) to identify complex and subtle patterns that might be difficult for humans to detect. AI encompasses many interrelated techniques that require diverse levels of human supervision (supervised, unsupervised, semisupervised), and varying degrees of complexity for data processing. 7 In brief, machine learning (ML) is an area of AI that allows computers to learn from data and make predictions. 8 Traditional ML models rely more on humans to identify useful features, which is a critical step in developing good predictive models (Fig. 2A). Later, deep learning (DL), an emerging area of ML that leverages neural network (NN) architectures, was proposed to enable machines to learn useful features themselves (Fig. 2B). 9 DL represents a modern revamping of NNs, which were originally inspired by the interactions of biological neurons.

Graphic representation of
In an NN, data are combined in different ways through a series of progressive and hierarchical layers (including hidden layers) to establish relations from complex patterns. 10 In the medical domain, detailed patient information is often captured in clinical narratives. Natural language processing (NLP) is the key AI technology that enables machines to recognize and extract information from unstructured text in the electronic medical records (EMR), thus facilitating the use of clinical text in ML models (Fig. 2C). 9
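To make the layered structure concrete, the following is a minimal sketch of a forward pass through a network with a single hidden layer. The weights, biases, and input values are arbitrary, hand-picked numbers used only for illustration; they do not come from any model discussed in this review, and a real network would learn its parameters from training data.

```python
import math

def sigmoid(x):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))

def dense_layer(inputs, weights, biases):
    """One fully connected layer: weighted sum of the inputs plus a bias, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def forward(features, hidden_w, hidden_b, output_w, output_b):
    """Pass the input through the hidden layer, then through a single output neuron."""
    hidden = dense_layer(features, hidden_w, hidden_b)
    return dense_layer(hidden, [output_w], [output_b])[0]

# Hypothetical parameters (illustration only; not learned from data).
HIDDEN_W = [[0.8, -0.4, 0.3], [-0.5, 0.9, 0.2]]  # 2 hidden neurons x 3 inputs
HIDDEN_B = [0.0, -0.1]
OUTPUT_W = [1.2, -0.7]                            # 1 output neuron x 2 hidden neurons
OUTPUT_B = 0.05

# Three made-up, already normalized input features for one case.
score = forward([0.6, 0.1, 0.9], HIDDEN_W, HIDDEN_B, OUTPUT_W, OUTPUT_B)
print(f"model output (a value between 0 and 1): {score:.3f}")
```

Each layer only sees the outputs of the layer before it, which is what allows deeper networks to build increasingly abstract features from raw inputs.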
AI Applications: From Conception to Adoption
AI innovation follows several maturing stages from conception to successful clinical adoption (Fig. 3). 11 In the conception stage, the research team defines a specific clinically relevant problem and primary outcome that reflect the current knowledge and burden of the disease of interest. This stage also involves building a multidisciplinary team with clinical and computational expertise and engaging with different stakeholders. In the data collection stage, the potential data sources and variables of interest are identified. Data are then reviewed and annotated with varying degrees of human input. In the training stage, collected data are processed and used to train the AI algorithm based on the chosen methodology. The results are periodically evaluated throughout the training stage, and the model is adjusted to improve performance.

In the validation stage, the model's performance is evaluated and further retraining or model tuning is performed where there is discordance with the initial training performance. In the utility testing phase, the model is evaluated in a real-world setting outside the research environment. The primary goal of this stage is to assess the benefits of the proposed model in real-world clinical practice, usually within the institution in which it was created. Sometimes, collaborations are established with other institutions to evaluate the model's generalizability. For a model to be considered adopted, multicenter and prospective validations, regulatory agencies' approval, and availability for use in clinical care are required. Once a model is adopted, periodic evaluation should continue, with retraining or tuning if substandard performance occurs.
Medical AI models should be evaluated using a combination of different metrics to prevent over- or underestimation of the results. 12 Traditional statistical measures such as sensitivity, specificity, negative predictive value, and positive predictive value (also referred to as precision), all derived from the counts of true positives, true negatives, false positives, and false negatives, are used. Other commonly used metrics include accuracy, the fraction of correctly classified outcomes among all outcomes, and the area under the receiver operating characteristic curve (AUC), an aggregate measure of the model's predictive performance across classification thresholds. 13 A model with an AUC closer to one is deemed to have excellent predictive performance, while a model with an AUC closer to 0.5 is considered to perform no better than chance.
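As an illustration of how these metrics relate to one another, the sketch below computes them from a small set of binary labels and model scores. The labels, scores, and the 0.5 decision threshold are invented for this example and do not correspond to any study in this review. The AUC is computed through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = disease present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")

def classification_metrics(y_true, y_pred):
    """Threshold-dependent metrics derived from the confusion matrix."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),  # positive predictive value (precision)
        "npv": ratio(tn, tn + fn),  # negative predictive value
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }

def auc(y_true, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return ratio(wins, len(positives) * len(negatives))

# Hypothetical data: 4 malignant (1) and 6 benign (0) cases with model scores.
Y_TRUE = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
SCORES = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]
THRESHOLD = 0.5  # assumed cutoff; in practice this is chosen per clinical use case

y_pred = [1 if s >= THRESHOLD else 0 for s in SCORES]
metrics = classification_metrics(Y_TRUE, y_pred)
metrics["auc"] = auc(Y_TRUE, SCORES)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike the other metrics, the AUC does not depend on the chosen threshold, which is one reason it is reported alongside, rather than instead of, sensitivity and specificity.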
Application of AI in TNODs
TNODs are the most common thyroid disease, with an estimated prevalence of up to 70% with ultrasound (US) and an overall risk of malignancy of 10%. 14 The widespread use of different imaging modalities has led to increased detection of TNODs. 15 In our review, 64% of the original studies of AI in thyroidology focused on several aspects of thyroid nodular disease, including optimizing resource utilization and providing a more accurate and personalized workup and management strategy. The available applications of AI in TNODs can be summarized based on their use of various imaging modalities and laboratory-based methods.
Thyroid US
Thyroid US is the most widely used diagnostic tool for evaluating TNODs. 16 While accessible, safe, and cost-effective, US interpretation is subject to significant interoperator variability. 17 Over the last decade, multiple AI models have aimed at automating interpretation, improving and facilitating risk stratification of TNODs from thyroid US images or their derived characteristics, and reducing the rate of unnecessary fine needle aspiration biopsies (FNABs), 18 –20 with performance similar or superior to that of commonly used TNOD reporting and risk stratification systems (AUC: 0.76–0.98). 18,19,21 –34
Given the large heterogeneity in the models' design and training data sets, comparing their performance and evaluating their real-world impact are challenging. Mature models with external validation have demonstrated mixed stand-alone performance results. Some outperformed experienced radiologists (AUC: 0.88–0.94), 35 –37 others had specificity comparable to but sensitivity lower than that of senior physicians, and sensitivity similar to but specificity higher than that of junior physicians (AUC: 0.65–0.98), 22,38 –42 and others had an overall comparable performance regardless of the physicians' level of expertise (AUC: 0.82–0.96). 25,43 –45 Moreover, some models increased the radiologist's performance when used as a supplementary diagnostic aid (AUC: 0.8–0.94). 38,46
In addition, there is promising research on the incorporation of nontraditional data into the risk stratification process, such as the use of radiomics (quantitative extraction of high-dimensional image features from routine imaging) to clarify the nature of indeterminate TNODs (AUC: 0.75–0.88), 47 –50 the prediction of BRAFV600E mutation without requiring molecular testing (AUC: 0.64), 51 the use of video clips rather than static US images for TNOD characterization, 52 and the automatic incorporation of color Doppler features to enhance the TNOD risk stratification prediction (AUC: 0.89). 53 Finally, there have been preliminary studies to specifically differentiate follicular thyroid carcinoma (FTC) from adenoma (AUC: 0.80–0.96), 54,55 and to clarify the malignant versus benign nature of TNODs previously classified as indeterminate by FNAB (Accuracy: 77.4%). 56
Despite the large number of AI models for US risk stratification of TNODs that have been developed with promising results, few of them have had sufficient external multicenter validation or prospective evaluation, only four have been approved by the FDA (S-detect, AmCAD-UT, Koios DS, MEDO-Thyroid), and none of the commercially available models is widely used. 7
There are several limitations to the use of these AI models in real-world clinical practice. Some models are limited by suboptimal performance, particularly due to the use of static images 57,58 and training data sets that have few benign nodules, 37 indeterminate nodules, 18,40,43 TIRADS 3 nodules, 26 and malignancies other than papillary thyroid cancer (PTC). 23,40 In addition, consistency can be affected by differences in the levels of provider expertise in semiautomatic models that require manual input, 59 and by marked fluctuations in data quality due to diversity in data sources (US equipment, radiology protocols, and image segmentation methods).
Molecular testing
FNAB with cytology interpretation is the standard for preoperative diagnosis of TNODs 60 ; however, the utility of FNAB can be limited by indeterminate or nondiagnostic results. 61 Thus, there has been extensive research on clarifying the benign versus malignant nature of these nodules to avoid unnecessary surgical or other invasive interventions. Over the last decade, several AI models were designed to predict the probability of malignancy of TNODs based on their molecular profile with adequate and incremental performance (AUC: 0.88–0.94). 62 –64 Commercially available molecular diagnostic tests such as Afirma, 65 –67 ThyroSeq, 68,69 RosettaGX Reveal, 70 and ThyraMIR 71 include AI-based classifiers and are perhaps the most widely used AI applications in thyroidology at this time.
Importantly, despite the high performance of these models in validation studies, their clinical utility might be limited by patient selection bias and by significant interinstitutional variations in performance metrics, because predictive values depend on each institution's pretest probability of malignancy in cytologically indeterminate TNODs. 72
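The dependence of predictive values on pretest probability follows directly from Bayes' theorem and can be illustrated with a short calculation. In the sketch below, the sensitivity, specificity, and prevalence values are hypothetical and were chosen only to show the effect; they are not the published characteristics of any commercial test.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given pretest probability.

    Assumes 0 < prevalence < 1 and a test that is not perfectly uninformative,
    so that both denominators are non-zero.
    """
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# Hypothetical test characteristics, held fixed across institutions.
SENSITIVITY = 0.90
SPECIFICITY = 0.70

# Hypothetical pretest probabilities of malignancy at three institutions.
for prevalence in (0.10, 0.25, 0.40):
    ppv, npv = predictive_values(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.2f}, NPV {npv:.2f}")
```

With the test characteristics held constant, the positive predictive value rises and the negative predictive value falls as the pretest probability increases, so a classifier validated at one institution can behave quite differently at another.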
Cytology
Cytological categorization of FNAB is user dependent, and higher quality improves care. 73 AI-based systems have been developed to improve FNAB interpretation and prediction of malignancy risk by automating the analysis of cytology images and identifying subtle cytology patterns. Early explorations of digital cytomorphologic evaluation through ML and DL have shown promising results in the automatic classification of TNOD cytology (AUC: 0.75–0.93), 74 –79 including indeterminate cytology TNODs (AUC: 0.75–0.96). 75,80 These models achieved performance comparable or superior to that of a pathologist when used stand-alone (Precision: 0.87), and improved pathologists' accuracy when used as a supplementary diagnostic aid (Precision: 0.81–0.88). 81
In addition, there have been preliminary explorations of models to automatically identify regions of interest (ROIs) on whole slide cytology images to expedite the cytopathologists' review process with adequate concordance compared with manual ROI identification, 82 to predict BRAF-RAS gene expression and identify follicular-patterned thyroid neoplasms based on automatic evaluation of cytologic patterns (AUC: 0.98–0.99), 83,84 and to use adjuvant NLP-extracted features (demographic, US, and biochemical characteristics) to improve cytologic classification of indeterminate TNODs (AUC: 0.85). 85
The performance and generalizability of most of these models are limited by small training data sets with a modest number of indeterminate or borderline TNODs (such as samples that express characteristics from various categories).
Other diagnostics
There are other early exploratory initiatives to further improve several aspects of the diagnosis and management of TNODs, including the use of NLP for automatic identification and workup tracking of thyroid incidentalomas on computed tomography (CT) reports (AUC: 0.99), 86 the use of CT or magnetic resonance imaging (MRI) images for TNOD risk stratification (AUC: 0.85–0.87), 87 –91 the incorporation of demographic, ultrasonographic, biochemical, and cytologic characteristics into an AI-based decision tree model aimed at decreasing the false-negative rate of TNODs that undergo FNAB (Accuracy: 95.5%), 92 the automatic analysis of intraoperative TNOD's frozen sections (Precision: 16.7–96.7%), 93 and the prediction of TNOD's volume reduction response with radiofrequency ablation (Accuracy: 85.1%). 94
The performance of these models might be limited by variations in documentation style and completeness in NLP-extracted data, inadequate quality of CT or MRI images, imbalanced cohorts of benign versus malignant TNODs, and a limited number of non-PTC histologies in training data sets.
Application of AI in Thyroid Cancer
In recent years, the incidence of thyroid cancer has been on the rise, particularly among women in the United States. 95,96 As such, there is an urgent need to develop better tools for risk stratification, recurrence prediction, and response to therapies. While the previous section described the applications for identification, evaluation, and risk stratification of TNODs, this section further expands on the additional aspects of risk stratification and management in patients with established thyroid cancer, including prediction of nodal and distant metastases, recurrence, prognosis, and treatment.
Preoperative risk stratification
Several models incorporated clinical, biochemical, anatomical, pathology, and US features to predict the presence of cervical lymph node metastasis (LNM) in patients with PTC (AUC: 0.67–0.91) with comparable performance to radiologists' interpretation of neck US. 97 –101 Lee et al. evaluated the automatic detection of cervical LNM on neck CT of patients with PTC with promising results (AUC: 0.95). 102 Their model performed similarly to experienced radiologists but better than junior radiologists and trainees. 103 Other models used neck CT or US radiomics to automatically identify cervical LNM in patients with PTC with performance superior to that of experienced radiologists (AUC: 0.70–0.93). 104,105
Finally, researchers used demographic and clinicopathological variables (including histology and staging) to predict distant metastasis in patients with PTC or FTC (AUC: 0.85–0.91). 106 –108 Despite promising results, these models' utility in clinical practice has not been demonstrated, and only two are readily accessible as online tools outside of the research setting. 99,102
In addition, US and CT are commonly used to diagnose extrathyroidal extension (ETE) preoperatively, but their sensitivity and specificity are limited. 109 US radiomic models performed better than regular US interpretation in predicting ETE given the ability to capture risk factors not usually accessible to the human eye (such as PTC density and enhanced tissue heterogeneity; AUC: 0.83), 110 and CT radiomic models performed similarly to experienced radiologists (AUC: 0.75). 111 In addition, CT or MRI radiomics and US-based models have shown better performance in predicting thyroid capsule invasion (AUC: 0.82), 112 and in preoperatively predicting advanced or aggressive PTC (AUC: 0.85–0.96). 113 –115
In general, these models are still in the training or utility testing stages and their generalizability is limited by small training data sets, inadequate data set heterogeneity with very modest presence of non-PTC histologies, low rate of events of interest, lack of incorporation of additional clinical data such as molecular or biochemical markers into the predictive models, and scarce testing on images with diverse quality and segmentation techniques, or obtained with different diagnostic equipment.
Prognosis and recurrence risk
Several models use the patient's age and specific malignant disease characteristics (tumor size, metastatic involvement, and nodal disease) to predict the staging of well-differentiated thyroid cancer with similar performance to the 8th American Joint Committee on Cancer (AJCC) staging system (AUC: 0.85–0.98). 116 –118 Furthermore, there are models that used clinicopathological, biochemical, and molecular data to predict the risk of thyroid cancer recurrence (Accuracy: 95.7%). 119,120 A similar model by Kim et al. additionally used radiation and systemic therapy data to predict survival in patients with distant metastasis after thyroidectomy. 119 These models are limited by selection bias and missing or nonstandardized data due to the retrospective nature of the single-center training databases, insufficient long-term survival or recurrence data, and lack of external or prospective validation.
Treatment
Researchers have evaluated the use of AI to predict treatment responses and potential complications. For example, Lubin et al. used ML to identify clinical factors (including tumor focality, preoperative staging, and biochemical markers) predictive of radioactive iodine (RAI) failure. 121 Similarly, Seib et al. used preoperative patient and malignant disease characteristics to predict postoperative complications (such as hypocalcemia, recurrent laryngeal nerve injury, or hematoma; AUC: 0.72). 122 Lastly, Liu et al. used data from quality-of-life questionnaires and sociodemographic and clinical characteristics to predict reduction of quality of life in patients with thyroid cancer 3 months after thyroidectomy (AUC: 0.89). 123 The generalizability of these models could be limited by their small training data sets and by their failure to account for important risk factors such as prediagnostic psychological health when assessing postsurgical quality of life.
Other studies have evaluated the use of AI to improve the efficacy and safety of the different treatment strategies. Gong et al. developed a model for real-time identification and measurement of the recurrent laryngeal nerve using computer vision during thyroidectomy (Precision: 75.6%). 124 Their model showed promising results and demonstrated feasibility to augment intraoperative decision-making. In addition, another early model from Lin et al. aimed to optimize radiotherapy precision in patients with metastatic thyroid cancer through the use of positron emission tomography CT (PET-CT) with encouraging results. 125 Due to the small training data sets, these models could have suboptimal performance with anatomical variations, inconsistent image quality, and the presence of indeterminate or challenging diagnostic findings.
Application of AI in Autoimmune and Functional Disease
Researchers have also explored the use of AI to understand the intricate aspects of hypo- and hyperthyroidism pathophysiology, automate diagnostic workflow, and enhance current diagnostic and therapeutic approaches.
Hypothyroidism
Two computer-assisted diagnosis (CAD) models trained using DL have been evaluated for the automatic diagnosis of Hashimoto's thyroiditis (HT) from thyroid US data analysis. 126,127 One of these models achieved excellent discrimination (AUC: 0.94), demonstrated consistency on external validation, and had higher performance than radiologists regardless of their level of expertise. 127 These models are subject to patient selection bias, and their performance in the presence of confounders such as TNODs has not been tested. In addition, some models have been trained to predict thyroid dysfunction and patient-specific thyrotropin (TSH) levels based on demographic, clinical, and biochemical data with mixed performance (AUC: 0.61–0.87). 128,129 These models are limited by retrospective training data sets with missing data, imbalanced patient subpopulations, and low rates of events of interest.
Thyrotoxicosis
Several ML models have been developed to aid in the diagnosis of hyperthyroidism. Some models performed well in classifying common thyroid scintigraphy uptake patterns and differentiating between entities such as Graves' disease (GD) and subacute thyroiditis (ST; Accuracy: 87.7–99.3%). 130,131 An NN model from Ma et al. distinguished between GD, ST, and HT with very high accuracy using single-photon emission CT (SPECT) images (Accuracy: 99–99.6%). 132 In addition, some models were developed to predict the presence and the etiology of thyrotoxicosis based on patient characteristics and biochemical data from the EMR. 133,134 Other researchers have used AI to personalize treatment for GD by predicting patients' responses to nonsurgical therapies such as antithyroid drugs (ATDs) and RAI. Orunesu et al. trained an NN model that used baseline patient characteristics to predict outcomes after discontinuation of ATDs with high sensitivity and specificity (84.6% and 77.2%, respectively). 135 Duan et al. used similar characteristics to predict the chance of post-RAI hypothyroidism with suboptimal performance (AUC: 0.72). 136
The generalizability of some of these models is limited due to considerable discrepancies in their training and validation performances. In addition, training data sets were small, contained missing data on key variables such as TRAb levels, and were subject to subpopulation imbalances or low rate of events of interest.
Furthermore, AI has emerged as an alternative to deepen the understanding of the complex pathophysiology of different diseases, including hyperthyroidism. While single genes have been associated with GD, Shen et al. trained an ML model to identify different multigene associations that could be involved in the pathogenesis of GD (AUC: 0.9). 137 Although this model has not yet reached clinical practice, its outcomes could facilitate future targeted therapeutic development and strategies for early disease identification and genetic counseling.
Thyroid eye disease
Accurate evaluation and severity assessment are important for treating thyroid eye disease (TED) with emerging therapies. However, current tools such as the clinical activity score; the vision, inflammation, strabismus, and appearance classification; and the European Group on Graves' Orbitopathy classification have limitations, leading to misclassifications or missed diagnoses. 138,139 As a solution, AI-based alternatives are being explored to improve diagnosis, severity assessment, and monitoring of TED progression and treatment response. 140 –143 A model by Song et al. diagnosed TED through the screening of orbital CT images with outstanding results (AUC: 0.91). 140 Wen et al. trained a model that used specific local and remote brain functional connectivity abnormalities on functional brain MRI to diagnose TED with an accuracy of 78.5%. 142 In addition, this model provided further insight into the mechanisms of cognitive and visual symptoms of patients with TED. Likewise, a model by Huang et al. identified several features of TED based on the analysis of patients' facial images with promising performance (AUC: 0.6–0.93). 144
Similarly, Lin et al. estimated disease activity in patients with TED using orbital MRI images with higher performance than clinicians' assessment (AUC: 0.92). 141 For TED management, Hu et al. predicted response to systemic glucocorticoids through the integration of orbital MRI radiomics and the disease duration, with adequate performance (AUC: 0.85–0.91). 143 The performance of these models is limited by small data sets with imbalanced subpopulations of interest, and by variations in image quality or radiology protocols. In addition, they lack utility testing in real clinical practice.
Discussion
Challenges and future directions
Notwithstanding the advantages the reviewed AI applications may offer, their implementation in real-world clinical practice is challenging due to several limitations. Figure 3 displays the current stage of development of the reviewed applications based on the available published data as well as the general challenges faced and milestones achieved at each stage.
As already described, each model might have specific limitations depending on its aim and design. In addition, general challenges must be addressed to improve the successful adoption of AI in thyroidology (Table 1). The performance of an AI model is intrinsically tied to the quality of the data. Small, homogeneous, single-center, biased, or retrospective data sets may impair the algorithm's performance, particularly when applied to external institutions or real-world scenarios. Therefore, it is crucial to use larger, more diverse, multicentric, and prospective data sets to ensure the performance and generalizability of the AI model. 145 –148 In addition, models that use demographic variables can be subject to discriminatory bias when trained with databases that reflect historical health inequities in underrepresented groups (differences by race, gender, and socioeconomic backgrounds). 148
Table 1. Challenges and Future Directions
AI, artificial intelligence; CDA, clinical diagnostic aid; EMR, electronic medical record; NLP, natural language processing.
Thus, training data sets should be representative of the general population, and conscious efforts should be made to identify and correct performance variations when the model is applied to subpopulations with diverse demographics. Furthermore, a lack of interpretability of the model's reasoning process (the “black box” effect) could prevent high-performing algorithms from being adopted or achieving their highest potential. For instance, even if a model can predict the risk of thyroid cancer recurrence with high accuracy, it might not be widely trusted and adopted by the medical community if there is not some insight into the reasoning and weight of variables behind the model's prediction.
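One simple way to look for such performance variations is to compute the same metric separately within each subpopulation and flag large gaps. The sketch below does this for sensitivity. The group labels, the records, and the 0.10 gap threshold are all hypothetical and serve only to illustrate the idea; an acceptable gap would need to be defined for each clinical application.

```python
from collections import defaultdict

def sensitivity_by_group(records):
    """Compute sensitivity per subgroup.

    records: iterable of (group, true_label, predicted_label) with labels 0 or 1.
    Only groups with at least one positive case appear in the result.
    """
    detected = defaultdict(int)
    positives = defaultdict(int)
    for group, truth, prediction in records:
        if truth == 1:
            positives[group] += 1
            if prediction == 1:
                detected[group] += 1
    return {group: detected[group] / positives[group] for group in positives}

def flag_gaps(per_group, max_gap=0.10):
    """Return the groups whose sensitivity trails the best group by more than max_gap."""
    best = max(per_group.values())
    return sorted(g for g, value in per_group.items() if best - value > max_gap)

# Hypothetical records: (subgroup, true label, model prediction).
RECORDS = [
    ("A", 1, 1), ("A", 1, 1), ("A", 1, 1), ("A", 1, 1), ("A", 1, 0), ("A", 0, 0),
    ("B", 1, 1), ("B", 1, 1), ("B", 1, 0), ("B", 1, 0), ("B", 1, 0), ("B", 0, 0),
]

per_group = sensitivity_by_group(RECORDS)
print("sensitivity by group:", per_group)
print("groups needing review:", flag_gaps(per_group))
```

The same pattern applies to specificity, predictive values, or AUC; sensitivity is used here only because missed malignancies are a common concern.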
Incorporating known pathophysiology into the model can provide valuable context for clinicians and give the model some degree of self-explainability. For instance, providing the key features that contribute to the model's prediction can help increase the model's transparency and trustworthiness, promote the discovery of new or subtle clinical patterns, and facilitate continuous improvement by identifying factors that impact the precision of the algorithm. 149,150
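As one example of how key contributing features can be surfaced, the sketch below implements permutation importance, a common model-agnostic technique: each feature is shuffled in turn, and the resulting drop in accuracy indicates how much the model relies on it. This is offered as an illustration of the general idea and is not necessarily the method used by the studies cited here. The toy model, its two features, and the data are hypothetical.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, n_features, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled across rows."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)
            # Rebuild each row with feature j replaced by a shuffled value.
            shuffled = [row[:j] + (value,) + row[j + 1:] for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

def toy_model(row):
    """Hypothetical rule that uses only the first feature and ignores the second."""
    return 1 if row[0] > 0.5 else 0

# Hypothetical data: each row is (feature_0, feature_1); labels follow feature_0.
ROWS = [(0.9, 0.2), (0.8, 0.7), (0.7, 0.1), (0.6, 0.9),
        (0.4, 0.3), (0.3, 0.8), (0.2, 0.5), (0.1, 0.6)]
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]

importances = permutation_importance(toy_model, ROWS, LABELS, n_features=2)
for index, value in enumerate(importances):
    print(f"feature_{index}: mean accuracy drop {value:.3f}")
```

Because the toy model ignores the second feature, shuffling it leaves accuracy unchanged, whereas shuffling the first feature degrades performance. A clinician reviewing such output can check whether the features a model relies on are clinically plausible.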
Moreover, differences in data sources from variations in diagnostic equipment, the structure in which clinical data are documented in the EMR, and institutional practice changes over time can affect the data quality and the performance of the model. 151 Thus, it is important to evaluate the performance of the models when using different data sources and retrain the model when needed. Furthermore, there should be standardization protocols that accommodate fluctuations originating from the ever-changing nature of health care data and account for different diagnostic and laboratory equipment. This would improve the models' consistency and facilitate the development of high-quality data sets that could be used in further model training. 145 –148
In addition to addressing the current limitations, future research on AI in thyroidology could expand to other areas that are yet to be explored, especially nonimage-based models, applications aimed at generating new knowledge rather than simply automating processes, and predictive models to facilitate “precision medicine.” Examples of current research interest include the use of physiologic data captured through wearable devices to facilitate diagnosis, follow-up, or medication management; the use of NLP models that leverage large volumes of unstructured data from the EMR for augmenting or enriching uncoded or not well-coded data elements; multimodal approaches that incorporate different types of inputs; and more complex pathophysiologic models that include genomics, proteomics, and metabolomics.
Implementation of AI applications in the real world
Developers of AI applications should ensure that the model has clearly defined, clinically relevant, and actionable outcomes that reflect the burden of the problem being addressed. For instance, while a model could show high performance in diagnosing HT from US data, this might not provide added benefit when applied to current real-world practice. Thus, before full adoption into routine clinical practice, external validation, prospective and multicenter clinical trials, and correlation between the model's accuracy and its real-world clinical efficacy are imperative. 148,152 In addition, incorporating these AI models into the routine clinical workflow is limited by the availability of expert resources, lack of real-time access to the models outside of the research setting, difficult incorporation of the models into the EMR, and high costs associated with software or hardware acquisition. 22
Thus, stakeholder engagement through the distinct phases of development and deployment is fundamental for successful adoption. Implementing any AI strategy requires careful understanding of the available financial and expert resources and a multidisciplinary approach that accounts for the expectations and fears of all the affected stakeholders. 2,153
Furthermore, cross-sector collaborations and coordinated efforts between domain experts (i.e., researchers and clinical experts), technology experts (i.e., technology firms and AI vendors), lawmakers, regulatory agencies, health care decision makers (i.e., health system and insurance executives), and patients are fundamental for the advancement of AI in thyroidology and the successful incorporation of AI models into routine clinical care. 2 Finally, current and future AI models for thyroid conditions should consider appropriate regulatory frameworks that ensure AI interventions' safe and ethical use while accounting for their dynamic learning and progressive improvement over time. 154
Critical appraisal of AI literature
Despite the current large body of AI literature and its expected exponential growth over the next couple of years, there are currently no validated and widely accepted critical appraisal tools. Several authors have described different frameworks to understand AI reports. 155 –158 Table 2 summarizes important considerations for thyroid clinicians and researchers when analyzing an AI article in thyroidology.
Table 2. Considerations for Critical Appraisal of the Literature
Conclusions
The integration of AI into thyroid-related research and daily clinical practice marks the beginning of a new era in thyroidology. AI has the potential to improve the consistency and accuracy of diagnosis, decrease health care professionals' workload, predict response to therapy, and facilitate the development of clinical decision support systems. In addition, AI can increase access to specialized care, identify subtle risk patterns, and promote personalized care. However, several limitations preclude most current models from being used in real-world clinical practice. To fully realize the potential of AI in thyroidology, rigorous methodological planning and suitability testing are necessary to identify and address obstacles and increase the likelihood of successful adoption of AI interventions.
Footnotes
Acknowledgment
Figures were created with BioRender.com.
Authors' Contributions
D.T.T.: conceptualization, methodology, literature review, and article preparation (initial draft, review, and editing). R.L.T. and M.D.: literature review and article preparation (initial draft and review). J.P.B.: conceptualization, methodology, and article preparation (review and editing). J.W.F., N.S.O., and Y.W.: article preparation (review and editing).
Disclaimer
The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the Patient-Centered Outcomes Research Institute.
Author Disclosure Statement
None of the authors has any relevant disclosures.
Funding Information
J.P.B. and N.S.O. were supported by the National Cancer Institute of the National Institutes of Health under Award Numbers R37CA272473 and K08CA248972, respectively. Y.W. was supported by the National Institute on Aging of the National Institutes of Health under Award Number R56AG069880, and the Patient-Centered Outcomes Research Institute under Award number ME-2018C3-14754.
Supplementary Material
Supplementary Appendix SA1
