Abstract
Background:
Artificial intelligence (AI) is broadly defined as the ability of machines to apply human-like reasoning to problem solving. Recent years have seen a rapid growth of AI in many disciplines. This review will focus on AI applications in the assessment of thyroid nodules.
Summary:
AI encompasses two related computational techniques: machine learning, in which computers learn by observing data provided by humans, and deep learning, which employs neural networks that mimic brain structure and function to analyze data. Some experts believe the way AI systems reach a conclusion should be transparent, or explainable, while others disagree. Most AI platforms in thyroid disease have focused on malignancy risk stratification of nodules. To date, four have been approved by the United States Food and Drug Administration. While the results of validation studies have been mixed, there is ample evidence that AI can exceed the performance of some humans, particularly physicians with less experience. AI has also been applied to assessment of lymph nodes and cytopathology specimens.
Conclusions:
Adoption of AI in thyroid disease will require vendors to demonstrate that their software works as intended, is readily usable in real-world settings, and is cost effective. AI platforms that perform best in head-to-head comparisons will dominate and spur wider adoption.
Introduction
Once restricted to works of science fiction, artificial intelligence (AI) has in recent years been applied in a broad variety of disciplines and industries, including health care. In thyroidology, AI is now being used to evaluate ultrasound images of nodules to estimate their malignancy risk; related applications, such as assessment of cytopathology specimens and lymph nodes, have also been introduced. As with any novel technology, however, AI's promise has sometimes exceeded its real-world benefits. Moreover, deciding which software to purchase is often complicated by physicians' lack of familiarity with how AI works. We will begin by explaining at a high level how AI algorithms evaluate sonograms. We will go on to present four software platforms that have received approval by the United States Food and Drug Administration (FDA), discuss other AI applications, and conclude with our view on outlook and future directions.
Review
Consider the challenge of writing a program to play chess. In one approach, the programmer would anticipate all the possible positions and moves and then translate them into computer code. However, this task would be extremely laborious, and it would be practically impossible to anticipate all the possibilities. Instead, what if the computer could observe human chess experts and develop the rules that govern gameplay on its own? The latter approach represents how AI works: the computer in effect “writes” the program after reviewing inputs and outputs. Broadly, AI encompasses two techniques: machine learning and deep learning, where the latter is a subset of the former (Fig. 1).

Venn diagram showing the relationship between AI, machine learning, and deep learning. AI, artificial intelligence.
The term machine learning was coined by IBM researcher Arthur Samuel in 1959. 1 In this approach, computers learn by observation of human-provided data rather than being explicitly programmed. For example, a machine learning model may predict cancer in thyroid nodules based on the presence or absence of six ultrasound characteristics: taller-than-wide, margins, microcalcifications, macrocalcifications, hypoechogenicity, and solid composition. In this approach, a human reader first reviews the ultrasound images, determines which features are present, and enters this information into a database. The data, along with diagnostic results from biopsy or surgery, are then used to develop a mathematical model that determines the likelihood of malignancy.
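As a minimal sketch of the kind of mathematical model described above, the following Python snippet applies a logistic function to the six reader-entered features. The coefficients, the baseline value, and the feature names (including `irregular_margins` as a stand-in for the margin characteristic) are hypothetical and chosen only for illustration; a real model would estimate them from nodules with biopsy or surgical results.

```python
import math

# Hypothetical coefficients for illustration only; a real model would
# estimate them from labeled nodules (biopsy or surgical results).
WEIGHTS = {
    "taller_than_wide": 1.2,
    "irregular_margins": 1.0,
    "microcalcifications": 1.5,
    "macrocalcifications": 0.4,
    "hypoechogenicity": 0.9,
    "solid_composition": 0.7,
}
BIAS = -3.0  # hypothetical baseline log-odds when no feature is present


def malignancy_probability(features):
    """Return a probability in (0, 1) from reader-entered binary features."""
    log_odds = BIAS + sum(
        WEIGHTS[name] for name, present in features.items() if present
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


nodule = {
    "taller_than_wide": True,
    "irregular_margins": False,
    "microcalcifications": True,
    "macrocalcifications": False,
    "hypoechogenicity": True,
    "solid_composition": True,
}
print(f"Estimated probability: {malignancy_probability(nodule):.2f}")
```

The point of the sketch is the division of labor: the human reader supplies the features, and the model only combines them into a risk estimate.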
Radiomics
As above, machine learning models may be based on the same ultrasound features physicians use to classify nodules. In contrast, radiomics evaluates images mathematically to glean patterns that are discriminatory diagnostically, but are not readily apparent to human observers. However, while radiomics has been shown to be valuable for machine or deep learning algorithms, it may contribute to the impression that the way the software reaches its conclusion is essentially unknowable and mysterious (see the Explainability section below).
Artificial neural networks
Deep learning uses constructs that mimic the structure and function of the human brain. Instead of neurons, artificial neural networks (ANNs) are composed of linked computational units called nodes that are arranged in layers. The information to be evaluated, such as a sonogram of a nodule, is presented in digital format to an input layer, and then is fed through a series of hidden layers comprising additional nodes. At each one, an incoming signal in numerical form is multiplied by a weighting factor. If the result is above a threshold, the node “fires” and passes the product of the input and weight to the next node. Eventually, the signals reach the output layer, where the result (predicting whether the nodule is likely benign or malignant, for example) is determined.
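The flow of signals described above can be sketched in a few lines of Python. This is a simplified illustration rather than the architecture of any particular product: the weights, thresholds, and layer sizes are hypothetical, and each node simply passes on its weighted sum when that sum exceeds the node's threshold.

```python
def node_output(inputs, weights, threshold):
    """Weighted sum of incoming signals; passed on only if above threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return total if total > threshold else 0.0


def forward(inputs, layers):
    """Feed signals through successive layers.

    Each layer is a list of (weights, threshold) pairs, one per node.
    """
    signals = inputs
    for layer in layers:
        signals = [node_output(signals, w, t) for w, t in layer]
    return signals


# Hypothetical numbers: three input values, one hidden layer of two nodes,
# and a single output node.
hidden = [([0.5, -0.2, 0.1], 0.0), ([0.3, 0.8, -0.5], 0.2)]
output = [([1.0, -1.0], 0.0)]
print(forward([1.0, 0.5, 2.0], [hidden, output]))
```

In this example the second hidden node stays silent because its weighted sum falls below its threshold, so only the first hidden node contributes to the output.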
Initially, all the weights and thresholds are randomly assigned, so the output is likely incorrect. However, during training, the numbers are adjusted iteratively to achieve the desired response (the ground truth), similar to the way humans learn from experience. Unlike conventional machine learning, deep learning is able to extract features from the input data without human intervention to develop a model (Fig. 2). This is accomplished using either a supervised or unsupervised approach.

Simplified schematic of deep learning. Labeled images are fed into the deep learning algorithm. During training, the initial randomized weights are tuned to generate predictions consistent with the ground truth.
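The iterative tuning of randomly initialized weights can be illustrated with a single artificial neuron. The sketch below assumes gradient descent on a cross-entropy loss, which is one common training method and is not named in the text; the toy data, learning rate, and number of passes are hypothetical.

```python
import math
import random


def predict(weights, bias, x):
    """Single artificial neuron with a sigmoid output between 0 and 1."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(weights, bias, data):
    """Average cross-entropy between predictions and ground-truth labels."""
    total = 0.0
    for x, y in data:
        p = predict(weights, bias, x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


def train(data, epochs=5000, learning_rate=0.5, seed=0):
    """Start from random weights and adjust them on every pass over the data."""
    rng = random.Random(seed)
    n_features = len(data[0][0])
    weights = [rng.uniform(-0.1, 0.1) for _ in range(n_features)]
    bias = rng.uniform(-0.1, 0.1)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in data:
            error = predict(weights, bias, x) - y  # prediction minus ground truth
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        # Nudge each weight against its average error gradient.
        weights = [
            w - learning_rate * g / len(data) for w, g in zip(weights, grad_w)
        ]
        bias -= learning_rate * grad_b / len(data)
    return weights, bias


# Toy data: the label is 1 only when both hypothetical features are present.
toy = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train(toy)
print([round(predict(w, b, x)) for x, _ in toy])
```

A deep network repeats the same idea across many nodes and layers, but the principle is unchanged: predictions are compared with the ground truth and the weights are moved in the direction that reduces the error.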
Supervised versus unsupervised learning
In supervised learning, AI algorithms are trained on data labeled with a diagnosis. For example, to create a model to predict whether a thyroid nodule is benign or malignant, the computer is supplied with many ultrasound images and diagnostic labels based on a physician's assessment, stability over time, cytopathology, or histology. The latter two are preferred because they are more likely to be accurate, and therefore a more suitable ground truth, which represents the best estimate of the nodule's true state. In unsupervised learning, data are presented to the algorithm without any labels. 2 The software then clusters the data based on similarities between the images. For example, thyroid nodules could be grouped based on hypoechoic or isoechoic echogenicity, or they could be classified depending on whether they contain punctate echogenic foci. Unsupervised learning may be effective when there is not enough labeled data. Alternatively, this approach may be used to pretrain a model that can later be fine-tuned with labeled data.
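To make the idea of label-free grouping concrete, the sketch below clusters nodules by a single number using a basic two-cluster k-means procedure. The choice of k-means, the use of mean gray level as a proxy for echogenicity, and the values themselves are assumptions made for illustration.

```python
def kmeans_1d(values, iterations=20):
    """Split values into two clusters without using any labels."""
    centers = [min(values), max(values)]  # simple deterministic start
    assignments = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to the nearest cluster center.
        assignments = [
            min(range(2), key=lambda k: abs(v - centers[k])) for v in values
        ]
        # Move each center to the mean of its assigned values.
        for k in range(2):
            members = [v for v, a in zip(values, assignments) if a == k]
            if members:
                centers[k] = sum(members) / len(members)
    return centers, assignments


# Hypothetical mean gray levels (0-255) measured inside eight nodules.
gray_levels = [42, 47, 51, 45, 118, 125, 110, 121]
centers, groups = kmeans_1d(gray_levels)
print(centers, groups)
```

No diagnosis is supplied at any point; the algorithm separates darker from brighter nodules purely from similarity, and a human would still have to decide what, if anything, the clusters mean clinically.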
Object detection and image segmentation
The first steps in estimating the likelihood that a nodule is benign or malignant are to determine whether it exists (detection), and then to define the contour of the nodule's interface with adjacent thyroid parenchyma (segmentation, Fig. 3). AI models can help with both steps. Object detection algorithms draw a bounding box around a nodule, whereas segmentation algorithms identify the area occupied by the nodule. The nodule's three dimensions also can be used to calculate the nodule's volume.

Image segmentation. The sonogram on the left shows a hypoechoic nodule. In the right image, AI has segmented the nodule, depicting it in yellow.
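The difference between detection and segmentation output, and the volume calculation, can be sketched as follows. The mask and measurements are hypothetical, and the ellipsoid formula (π/6 × length × width × height) is a commonly used approximation that is assumed here; the text does not specify which formula a given platform applies.

```python
import math


def bounding_box(mask):
    """Smallest rectangle (row_min, col_min, row_max, col_max) around the mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]


def segmented_area(mask):
    """Number of pixels labeled as nodule."""
    return sum(sum(row) for row in mask)


def ellipsoid_volume(length_cm, width_cm, height_cm):
    """Ellipsoid approximation: pi/6 x length x width x height (result in mL)."""
    return math.pi / 6.0 * length_cm * width_cm * height_cm


# Hypothetical 5 x 6 segmentation mask (1 = nodule pixel).
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(bounding_box(mask), segmented_area(mask))
print(round(ellipsoid_volume(2.0, 1.5, 1.0), 2))
```

The bounding box encloses 12 pixels while the segmented nodule occupies only 8, which illustrates why segmentation gives a tighter description of the lesion than detection alone.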
Explainability
The importance of explainability becomes clear when one considers how a physician or an AI platform would justify the conclusion that a nodule carries a high risk of malignancy. A human observer can usually point out which ultrasound features they used to arrive at a conclusion. Not so with AI platforms, which often function as “black boxes” that are opaque to physicians and patients. Some authors argue that explainability is a must from medical, ethical, legal, and patient perspectives. 3,4 However, others believe that current techniques are incapable of achieving explainability at the individual patient level. 5 Instead, people in the latter camp recommend rigorous internal and external performance validation before AI models are deployed.
Various techniques have been employed to explain how AI algorithms reach a conclusion. The so-called saliency maps, also known as class activation heat maps, can be superimposed on images to show the areas responsible for the prediction using color coding (Fig. 4). For example, this method could highlight areas with microcalcifications to show why a nodule was classified as malignant, but this alone may not be sufficient. Studies have also shown that this method could lead to confirmation bias and, in some cases, be misleading. 6

Saliency maps. Sonograms on the left show a malignant nodule (top) and a benign nodule (bottom). In the color-shaded images, red denotes the areas used by the software to make the prediction.
Another way to address explainability is to generate diagnoses based on matches to similar images. 7 In some respects, this is similar to how some clinical guidelines present sonograms with various patterns that relate to a nodule's malignancy risk. However, using this method as an explainability tool may be dependent on the image database from which similar images are drawn, as well as the user's experience. Another approach is to generate reports that follow specific risk stratification systems (RSSs) (see Applications in thyroid nodule assessment). 8
Applications in thyroid nodule assessment
Most AI applications in thyroid disease have focused on estimation of the malignancy risk of nodules, although computational techniques have also been applied to the evaluation of cytology specimens, as we will describe. We will begin with a brief review of noncomputational approaches to risk assessment.
The high prevalence of predominantly benign thyroid nodules in the general population, coupled with increasing recognition that identification of malignancies constitutes overdiagnosis in many cases, has led to the development of RSSs based on ultrasound. 9 –14 Although RSSs take different approaches to translating a nodule's sonographic appearance to its cancer probability, they are all based on the presence of similar ultrasound features, also known as descriptors, in five broad categories: composition, echogenicity, shape, margin, and calcifications. The availability of RSSs, many of which were endorsed and/or promulgated by professional organizations, has led to improvement and consistency in thyroid nodule reporting and management, most notably in reducing the number of fine needle aspiration (FNA) biopsies of benign lesions. 15
Unfortunately, all RSSs have been shown to suffer from considerable interobserver variability in assigning risk levels, mostly attributable to lack of agreement in assigning their underlying features. 16 Surprisingly, this has been shown for some descriptors that might be expected to be unambiguous, such as echogenicity. However, a nodule's apparent backscatter may appear quite different depending on various factors, including insonation frequency, scan angle, and the presence of adjacent normal parenchyma to serve as a basis for comparison. This is also true of features like shape, which is assigned dichotomously as either wider-than-tall or taller-than-wide, the latter being associated with a higher probability of malignancy. 13 This likely results from uncertainty regarding the precise location of the interface between a nodule and adjacent tissues, which may be subject to considerable error if the margin is poorly defined.
Some investigators have called for changes to the definitions of some descriptors to reduce variability. For example, Grani et al proposed increasing the height-to-width ratio from 1.0 to 1.2 to avoid overcalling the taller-than-wide feature, while others have highlighted the need for additional training for physicians and sonographers who use ultrasound to classify thyroid nodules. 15,17 Despite these recommendations, however, it is likely that inconsistency in assigning features will continue to be problematic. As well, some practitioners tend to rely on expertise gained from experience, rather than applying an RSS, when assigning malignancy risk. In an international survey regarding RSS usage, although almost 95% of 724 respondents indicated at least some familiarity with RSSs, 62% said they favored reporting the features they felt were most relevant. 18
This raises the question of whether AI-related techniques can help, either by augmenting image analysis and decision making by humans, or perhaps eventually supplanting them completely. As of late 2022, we believe the former goal has been achieved to a limited extent, with an important caveat that wide deployment is probably years away. The latter, even if eventually feasible, is unlikely in the next five years. Software that augments human readers is known as CAD (sometimes called CADx), for Computer-Aided Diagnosis. These systems are semiautomated because they currently require operators with thyroid nodule expertise to select images suitable for further analysis by the software. We will describe four systems that have gained 510(k) approval by the FDA to date (Table 1).
Thyroid Artificial Intelligence Algorithm Features Based on Food and Drug Administration 510(k) Submissions
ACR, American College of Radiology; ATA, American Thyroid Association; RSS, risk stratification system; TI-RADS, Thyroid Imaging Reporting And Data System.
AmCAD-UT (AmCAD Biomed, Taipei, Taiwan) is a Microsoft Windows (Microsoft, Redmond, WA) application that initially received FDA approval in 2013. 19 (AmCAD-UT attained CE marking approval in the European Union the following year.) The physician loads a representative ultrasound image in Digital Imaging and Communications in Medicine (DICOM) or Joint Photographic Experts Group (JPEG) format and the software draws a region of interest (ROI) around the nodule. It then analyzes the nodule's characteristics using statistical pattern recognition and quantification and provides malignancy risk estimates for several leading RSSs.
In a study of 300 proven thyroid nodules, of which 55% were benign and 45% were malignant, a clinical expert using the American Thyroid Association (ATA) system achieved a sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) of 87.0%, 91.2%, 90.5%, and 90.9%, respectively. 20 AmCAD-UT's performance was highest when it applied the score derived from the ATA Guidelines, with a sensitivity, specificity, PPV, and NPV of 87.0%, 68.8%, 64.5%, and 86.3%, respectively. Another study by Lu et al included a retrospective analysis of 234 nodules and a prospective evaluation of 220 lesions. 21 All the patients in the latter group had been scheduled for biopsy based on prior evaluation, and so the majority of nodules were malignant. AmCAD-UT correctly classified 173/178 cancers and 33/42 benign nodules. In another retrospective study, the performance of all 19 readers was improved with CAD, with eight exhibiting a significant increase. 22 This platform has also shown utility in evaluating cytologically indeterminate nodules. 23,24
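For readers who wish to convert counts such as those reported by Lu et al into the usual performance metrics, the following sketch applies the standard 2 × 2 table definitions to the prospective cohort (173 of 178 cancers and 33 of 42 benign nodules classified correctly). The function name is hypothetical; the resulting predictive values reflect the high proportion of cancers in that cohort and would differ in a lower-prevalence population.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard 2 x 2 table metrics, returned as fractions."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Counts from the prospective cohort described above:
# 173 of 178 cancers and 33 of 42 benign nodules classified correctly.
metrics = diagnostic_metrics(tp=173, fn=178 - 173, tn=33, fp=42 - 33)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```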
S-Detect 1 and 2 (Samsung Medison, Seoul, South Korea), which received FDA approval in 2018, is the most heavily studied application to date. 25 Similar to other systems, following manual selection of an ROI around the nodule, S-Detect traces its contour. The software assesses several characteristics, including composition, echogenicity, shape, margin, and calcifications (the last only in S-Detect version 2), and then classifies the nodule as possibly benign or possibly malignant based on an RSS of the operator's choosing.
In one retrospective study, 106 patients with 218 surgically proven thyroid nodules who underwent sonography or ultrasound-guided FNA before surgery were evaluated. 26 Sensitivity and specificity, respectively, were 81.4% and 68.2% for S-Detect 2, 84.9% and 96.2% for an experienced thyroid radiologist, and 93.0% and 67.4% for the same radiologist assisted by the CAD system. The difference in specificity between S-Detect 2 and the radiologist was statistically significant. Notably, the software was inaccurate in categorizing calcifications and margins.
Another study evaluated 204 nodules in 181 patients who underwent biopsy and/or surgery. An experienced radiologist chose the images for further assessment by S-Detect, with European Thyroid Imaging Reporting And Data System used as the RSS on which the software made its classification. 27 As well, the images were analyzed by four other radiologists with varying experience in thyroid ultrasound ranging from one to 20 years. The sensitivity and specificity for S-Detect and the most experienced radiologist were not statistically different, but S-Detect exhibited higher specificity than the two radiologists with the least experience. Notably, when they were asked to reassess each nodule after viewing the results of S-Detect, their performance improved.
A recent meta-analysis considered 17 studies that had evaluated S-Detect, including the one by Wei noted above. 28 In aggregate, these studies included 1595 and 1118 benign and malignant nodules, respectively. The pooled sensitivity and specificity were 0.87 and 0.74, with positive and negative likelihood ratios of 3.37 and 0.18, respectively.
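The relationship between these summary measures can be checked with a short calculation. The sketch below derives likelihood ratios directly from the pooled sensitivity and specificity; the results are close to, but not identical with, the published values, presumably because the meta-analysis pooled each measure separately and because of rounding.

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios from sensitivity and specificity."""
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative


# Pooled estimates quoted above for S-Detect.
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.87, specificity=0.74)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

A positive likelihood ratio above 1 means a "possibly malignant" result raises the odds of cancer, whereas a negative likelihood ratio below 1 means a "possibly benign" result lowers them.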
Koios DS (Koios Medical, Chicago, IL) is a web application that is accessed by a compatible client (Fig. 5). As with the other CAD systems, the user is responsible for selecting an ROI around a nodule. The ROI image data are sent to the Koios server for processing, and results are then provided to the user. The software assesses ultrasound features in the five descriptor categories used in American College of Radiology (ACR) Thyroid Imaging Reporting And Data System (TI-RADS): composition, echogenicity, shape, margin, and echogenic foci, all of which are user-editable. It then maps the nodule to a risk category based on ACR TI-RADS or the ATA Guidelines. An optional component called the AI Adapter uses two orthogonal views of the nodule and deep learning to modify the risk assessment by these two RSSs.

Screenshot of Koios software showing the ACR TI-RADS description and output from the Koios AI adapter, indicating that the nodule is not suspicious (image courtesy of Koios Medical). ACR, American College of Radiology; TI-RADS, Thyroid Imaging Reporting And Data System.
Although no independent investigations of Koios DS for thyroid nodules have been published as of this writing, supporting information was provided to the FDA in the vendor's submission. 29 In a retrospective clinical study of 650 nodules (500 from the United States and 150 from Europe) that ranged from ACR TI-RADS TR1 through TR5, 15 readers' performance using ultrasound alone and ultrasound plus Koios DS was compared. According to the data in the document, the CAD software with the AI Adapter improved average sensitivity and specificity of FNA by 0.084 and 0.140, respectively. All cases used in this study were pathology proven or had a minimum of one year follow-up. When recommending biopsy, Koios's sensitivity was 0.644 (0.545, 0.744) and specificity was 0.612 (0.566, 0.658).
MEDO-Thyroid (Exo, Santa Clara, CA) is a cloud-based platform that uses AI to evaluate DICOM images to segment and measure the linear size and volume of the thyroid gland and user-identified nodules. 30 It also provides an ACR TI-RADS nodule classification based on physician input. As such, it is primarily a reporting tool, although its capabilities may improve in the future.
AI-augmented RSSs
In addition to software that aids in the classification of thyroid nodules, AI has been used to inform changes to existing RSSs. For example, Wildman-Tobriner et al used an AI method known as a genetic algorithm to optimize ACR TI-RADS by changing the point values awarded to some ultrasound features. 31 In the modified RSS, which they termed AI TI-RADS, they reduced the point values for mixed cystic and solid composition, hyperechoic and isoechoic echogenicity, nonclassifiable echogenicity, and macrocalcifications from 1 to 0.
Nonclassifiable composition dropped from 2 to 0 points, and taller-than-wide shape changed from 3 to 1 point. Solid or almost completely solid composition rose from 2 to 3 points. Mean specificity for eight nonexpert radiologists increased from 47.7% to 55.3%, while mean sensitivity was not significantly different. Similarly, a study by Watkins et al comparing ACR TI-RADS, AI TI-RADS, and the British Thyroid Association Guidelines showed little performance difference between the first two RSSs. 32 Nevertheless, AI techniques may prove useful as current RSSs undergo revision over the next few years.
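The effect of the revised point values can be illustrated with a partial lookup table. The sketch below includes only the features whose point values are described above, with the solid or almost completely solid descriptor treated as a composition feature; features not mentioned in the text are omitted, and the example nodule is hypothetical.

```python
# Point values for the features named above, before and after the change.
# This is a partial table: features the text does not mention are omitted.
ACR_POINTS = {
    "mixed cystic and solid composition": 1,
    "hyperechoic or isoechoic echogenicity": 1,
    "nonclassifiable echogenicity": 1,
    "macrocalcifications": 1,
    "nonclassifiable composition": 2,
    "taller-than-wide shape": 3,
    "solid or almost completely solid composition": 2,
}

AI_POINTS = dict(ACR_POINTS)
AI_POINTS.update({
    "mixed cystic and solid composition": 0,
    "hyperechoic or isoechoic echogenicity": 0,
    "nonclassifiable echogenicity": 0,
    "macrocalcifications": 0,
    "nonclassifiable composition": 0,
    "taller-than-wide shape": 1,
    "solid or almost completely solid composition": 3,
})


def total_points(features, table):
    """Sum the points awarded to each feature present in the nodule."""
    return sum(table[name] for name in features)


# Hypothetical nodule: solid, isoechoic, with macrocalcifications.
nodule = [
    "solid or almost completely solid composition",
    "hyperechoic or isoechoic echogenicity",
    "macrocalcifications",
]
print(total_points(nodule, ACR_POINTS), total_points(nodule, AI_POINTS))
```

For this nodule the total falls from 4 to 3 points, showing how the revised weights can shift a lesion toward a lower risk level even though solid composition itself is weighted more heavily.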
Cine image evaluation
A significant limitation of most AI systems is their use of static images to evaluate cancer risk. This is problematic because not all the relevant ultrasound features may be optimally apparent on just one or two images. Moreover, the person who selects them must be able to choose the most representative images based on familiarity with one or more RSSs. As well, physicians customarily rely on multiple sonograms and, increasingly, on cine or video clips, which provide a perspective that is closer to visualization afforded by hands-on real-time scanning.
With this in mind, Yamashita et al used deep learning to develop a system they called Cine-CNNTrans, which evaluated cine clips of 192 nodules that were segmented by a radiologist. 33 Binary output from their model (low or high risk of malignancy) was then used to change the ACR TI-RADS recommendations by one level. For example, nodules that were classified as having a low cancer risk by their software but were recommended to undergo biopsy by ACR TI-RADS were downgraded to follow-up. The revised system achieved higher specificity than ACR TI-RADS (79.4% vs. 26.9%, p < 0.001) with no significant difference in sensitivity (70.6% vs. 82.4%).
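The one-level adjustment can be sketched as a simple rule. The ordering of recommendations below and the symmetric upgrade for high-risk output are assumptions made for illustration; the text confirms only the downgrade example, and the function name is hypothetical.

```python
# Assumed ordering of management recommendations, least to most aggressive.
LEVELS = ["no follow-up", "follow-up", "biopsy"]


def adjust_recommendation(rss_recommendation, model_risk):
    """Shift the RSS recommendation by one level based on a binary model output.

    model_risk is "low" or "high". Shifts are clamped at the ends of LEVELS.
    """
    index = LEVELS.index(rss_recommendation)
    if model_risk == "low":
        index = max(index - 1, 0)
    elif model_risk == "high":
        index = min(index + 1, len(LEVELS) - 1)
    else:
        raise ValueError("model_risk must be 'low' or 'high'")
    return LEVELS[index]


# A nodule recommended for biopsy but rated low risk is moved to follow-up.
print(adjust_recommendation("biopsy", "low"))
```

Because the model can move a recommendation by only one step, the RSS remains the primary determinant of management, which may explain why specificity improved without a significant change in sensitivity.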
Prediction of mutations
Software has also shown utility in predicting genetic mutations in thyroid cancer. In a study by Yoon et al, a deep learning neural network was used to evaluate images of 469 malignant nodules. 34 Sensitivity, specificity, PPV, and NPV for predicting the BRAF V600E mutation were 85.3%, 41.6%, 59.4%, and 73.9%, respectively. Given the value of mutations in diagnosing thyroid cancer and possibly serving as a biomarker for tumor aggressiveness, this capability may prove useful to predict which nodules harbor them and may be candidates for genomic analysis.
Lymph node assessment
CAD software has demonstrated value in predicting cervical lymph node metastases from thyroid cancer. 35 Lee et al developed a system to evaluate lateral compartment nodes preoperatively and both thyroid bed and lateral nodes after thyroidectomy. 36 Their deep learning model exhibited a sensitivity, specificity, and accuracy of 79.5%, 87.5%, and 83.0%, respectively, in a test data set.
Cytopathology
AI techniques have been applied to cytologic diagnosis of specimens from thyroid nodules. A retrospective study used an ANN trained with images of smears from nonpapillary lesions (colloid goiter, follicular neoplasms, and lymphocytic thyroiditis) and from papillary cancers diagnosed by cytology. 37 The model was evaluated using 66 noncancer foci and 21 papillary cancers. The software achieved a sensitivity, specificity, PPV, NPV, and accuracy of 90.5%, 83.3%, 63.3%, 96.5%, and 85.1%, respectively. Another study evaluated 120 cases of benign lesions and 159 papillary thyroid cancers using deep convolutional neural networks. 38 One of the two models they tested resulted in sensitivity, specificity, PPV, NPV, and accuracy of 100%, 94.9%, 95.8%, 100%, and 97.7%, respectively. These results are promising, but to the best of our knowledge, no AI-based software for thyroid cytopathology has yet received FDA approval.
Discussion
Like most emerging technologies, AI passes through several stages before entering the mainstream. At first, excitement leads to unrealistically high expectations, followed by a plateau and a decline in interest, eventually culminating in acceptance. Since the term AI was coined by John McCarthy in 1956, the field has been subjected to several rises and falls in enthusiasm. 39 Initial hype in the late 1950s and the 1960s, termed the first summer of AI, gave way to funding cuts toward the end of the latter decade, a downturn that was characterized as the first AI winter. Predictably, this was followed by a revival in interest, the second summer, and then a second winter, which lasted until the mid-1990s. Given massive recent interest in AI techniques by government institutions, private industry, and academic researchers, it is fair to say that we are in the midst of another AI summer. To an extent, this has been bolstered by massive public interest, fanned by real-world applications and depictions of AI in literature, television, and film.
Medical applications of AI, including the ones described above, have followed this pattern. Speaking at a conference in 2016, machine learning pioneer Geoffrey Hinton infamously stated “We should stop training radiologists now. It's just completely obvious that within five years, deep learning is going to do better than radiologists.” 40 Since then, his prediction has been tempered by real-world performance of AI applications. Still, it is clear that their use will continue to increase in radiology, endocrinology, surgery, and other fields that work with images, including specialties that deal with thyroid disease, particularly nodules. The rate at which AI platforms for risk stratification will be deployed will depend on three factors: efficacy, usability, and cost.
Efficacy
The AI systems that have received FDA approval and undergone independent verification are sometimes able to exceed the performance of observers with limited experience in distinguishing benign from malignant nodules. This capability may prove to be helpful in clinical departments staffed by nonexpert practitioners. Even in settings with highly trained interpreting physicians, software may serve as a useful “over-the-shoulder” check on their work. The performance of AI platforms will no doubt improve as they are trained with larger data sets, similar to the way additional experience benefits human capabilities.
Usability
Even high-performing software will see limited use or fail altogether if it is difficult to access and use. Availability of AI systems on ultrasound machines may help, although running the software on a scanner reduces the time that scanner is available for imaging. However, in some work settings, AI platforms will mostly be applied after ultrasound images have been acquired, running on an imaging workstation. In both cases, reducing or eliminating manual steps by human operators, such as selecting key images and segmenting nodules, will be critical to achieve operational efficiency. The more time it takes to activate and navigate an AI user interface, and the more mouse clicks it requires, the less willing physicians will be to adopt it.
Cost
For cash-strapped hospitals and outpatient facilities, whether to acquire an AI platform will rest on cost-effectiveness. It remains to be seen to what extent, if any, Medicare and third-party payors will reimburse practices for the expense of acquiring AI software. More likely, any expenses will be offset through improvements in efficiency and boosts in patient throughput. Enhanced diagnostic performance will also affect decision making, although improved long-term outcomes will be harder to demonstrate. To date, none of these has been shown.
Guidance for selecting AI tools
So far there has been no head-to-head comparison of all four FDA-approved risk stratification algorithms. Before selecting a system for implementation, it is important to ensure that the training data and images used to develop and test the algorithm are representative of the target population. An AI model trained on images from one ultrasound machine vendor may not work well on images acquired from another vendor's machine. As noted, the ground truth used in training the model should be either surgical pathology or cytopathology if it is predicting malignancy. In this scenario, neither an RSS classification nor stability over time is sufficient. Close attention should be paid to performance evaluation. In a disease with a low prevalence like thyroid cancer, accuracy may not be the best metric. Since the NPV and PPV are dependent on the prevalence of the disease in the tested population, that prevalence should be comparable to the prevalence in the target population. The risk of false positives and false negatives should be evaluated in the practice's clinical scenario.
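The dependence of predictive values on prevalence follows directly from their definitions, as the sketch below shows. It reuses the pooled sensitivity and specificity quoted earlier for S-Detect (0.87 and 0.74); the two prevalence values are hypothetical and stand for a low-prevalence screening population and a preselected biopsy cohort.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Same test characteristics, two hypothetical populations.
for prevalence in (0.05, 0.45):
    ppv, npv = predictive_values(0.87, 0.74, prevalence)
    # Accuracy of a useless rule that calls every nodule benign.
    trivial_accuracy = 1.0 - prevalence
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}, "
          f"'always benign' accuracy {trivial_accuracy:.0%}")
```

With identical sensitivity and specificity, the PPV is far lower in the low-prevalence population, and a rule that labels every nodule benign would still appear highly accurate there. This is why accuracy alone can mislead and why the study population should resemble the practice's own.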
Outlook and future directions
In the short-to-medium term, acceptance and growth of AI applications will be hampered by the multiplicity of algorithms that exhibit different diagnostic performance, particularly if they are dependent on the ultrasound scanner used to acquire the images. Regardless, we believe deployment of AI applications in thyroidology, particularly nodule evaluation, will increase as studies demonstrate which platforms deliver the greatest clinical value by reducing the number of biopsies and perhaps use of molecular markers. Practices will purchase or license the software that most closely meets their needs, perhaps using an acquisition model similar to mobile “app stores.”
At the outset, the benefits of AI will be most evident in settings where providers lack extensive experience by augmenting assessment of preselected nodules. Within several years, though, we expect software will automate identification of lesions that require further attention, whether on a workstation or an ultrasound scanner, providing accuracy and efficiency rivaling or even exceeding that of some expert observers. In the future, we expect AI applications will also be able to generate complete thyroid ultrasound reports that include nodule classification and management recommendations based on guidelines from professional organizations, perhaps integrating clinical data.
Footnotes
Authors' Contributions
F.N.T. and J.T.: All aspects of article preparation.
Author Disclosure Statement
J.T. is CEO of AIBx, Inc. F.N.T. serves as an advisor to AIBx, Inc.
Funding Information
No funding was received for this article.
