Abstract
Background:
Risk stratification systems for thyroid nodules are limited by low specificity. The fine-needle aspiration (FNA) biopsy size thresholds and stratification criteria are based on evidence from the literature and expert consensus. Our aims were to investigate the optimal FNA biopsy size thresholds in the American College of Radiology (ACR) Thyroid Imaging Reporting and Data System (TI-RADS) and artificial intelligence (AI) TI-RADS and to revise the stratification criteria in AI TI-RADS.
Methods:
A total of 2596 thyroid nodules (in 2511 patients) on ultrasound examination with definite pathological diagnoses were retrospectively identified from January 2017 to September 2021 in 6 participating Chinese hospitals. The modified criteria for ACR TI-RADS were as follows: (1) no FNA for TR3; (2) FNA threshold for TR4 increased to 2.5 cm. The modified criteria for AI TI-RADS were as follows: (1) 6-point nodules upgraded to TR5; (2) no FNA for TR3; (3) FNA threshold for TR4 increased to 2.5 cm. The diagnostic performance and the unnecessary FNA rate (UFR) of modified versions were compared with the original ACR TI-RADS.
Results:
Compared with the original ACR TI-RADS, the modified ACR (mACR) TI-RADS yielded higher specificity (73% vs. 46%), accuracy (74% vs. 51%), area under the receiver operating characteristic curve (AUC; 0.80 vs. 0.70), and lower UFR (25% vs. 48%; all p < 0.001), although the sensitivity was slightly decreased (87% vs. 93%, p = 0.057). Compared with the original ACR TI-RADS, the modified AI (mAI) TI-RADS yielded higher specificity (73% vs. 46%), accuracy (75% vs. 51%), AUC (0.81 vs. 0.70), and lower UFR (24% vs. 48%; all p < 0.001), although the sensitivity tended to be slightly decreased (89% vs. 93%, p = 0.13). There was no significant difference between the mACR TI-RADS and mAI TI-RADS in the diagnostic performance and UFR (all p > 0.05).
Conclusions:
The revised FNA thresholds and the stratification criteria of the mACR TI-RADS and mAI TI-RADS may be associated with improvements in specificity and accuracy, without significantly sacrificing sensitivity for malignancy detection.
Introduction
The detection rate of thyroid nodules has increased rapidly with the widespread use of ultrasound (US) since the 1990s. 1,2 Fine-needle aspiration cytology (FNAC) remains the gold standard for diagnosis. 3 Although FNA is a relatively safe and cost-effective procedure, performing FNA for all nodules is impractical, inappropriate, and unnecessary because only about 10% of patients presenting with thyroid nodules are at risk of malignancy. 4 Overdiagnosis and overtreatment of thyroid nodules are current concerns worldwide, and up to 77% of detected thyroid cancers may be clinically insignificant. 2 It is noteworthy that thyroid cancer-related mortality rates have not increased substantially despite the sharp increase in incidence. 5 Excessive examination and intervention may not only cause anxiety and economic burden for patients but also waste medical resources. Therefore, determining how to reduce unnecessary biopsies while maintaining appropriate sensitivity for malignancy detection is an issue that requires further investigation.
Various risk stratification systems based on US features have been proposed globally and used to determine which nodules should be subjected to FNA. Size thresholds vary across guidelines, leading to differences in their diagnostic performance and unnecessary biopsy rate. Previous comparative studies found that the Thyroid Imaging Reporting and Data System (TI-RADS) published by the American College of Radiology (ACR) had the highest specificity and lowest unnecessary biopsy rate compared with other guidelines, 6,7 which was attributed to the larger size thresholds of the ACR guidelines. 8 However, a recent retrospective cohort study reported that 57.4% of biopsied thyroid nodules were benign, 9 which indicates that efforts should be made to improve the diagnostic performance of ACR TI-RADS. Smaller FNA size thresholds may lead to excessive FNAs, while larger thresholds may decrease the sensitivity.
In an effort to achieve higher specificity, Wildman-Tobriner et al. 10 applied artificial intelligence (AI) to optimize TI-RADS by assigning new scores for eight US features in 2019. Our previous study 11 showed that the AI TI-RADS significantly improved specificity (70.2% vs. 49.2%) despite a slight decrease in sensitivity compared with the ACR TI-RADS (82.2% vs. 86.7%). AI TI-RADS assigned lower risk levels to 54 malignant nodules, resulting in 29 missed papillary carcinomas smaller than 1.5 cm. Therefore, it is important to investigate how to modify the stratification criteria of AI TI-RADS to compensate for the sacrifice in sensitivity.
The FNA size thresholds and the stratification criteria in TI-RADS are based on evidence from the literature and expert consensus, which could be optimized to improve the performance of the system. This study aimed to investigate the optimal FNA size thresholds in the ACR TI-RADS and AI TI-RADS and to revise the stratification criteria in AI TI-RADS.
Materials and Methods
This study was approved by the institutional review boards of all participating institutions (approval No. B2021-021-Y01). The requirement for informed consent was waived by the institutional review boards because of the retrospective study design.
Study patients
Between January 2017 and September 2021, a total of 4001 thyroid nodules from 3517 consecutive patients who underwent thyroid US at 6 different hospitals in China were retrospectively identified. The eligibility criteria were as follows: (1) age ≥18 years; (2) maximum nodule diameter ≥1.0 cm; (3) nodules with definitive cytology results (Bethesda category II or VI), definitive core-needle biopsy (CNB) results, or surgical resection. US-guided FNA was performed according to the ACR TI-RADS recommendations or, for TR1 and TR2 nodules causing compressive or cosmetic symptoms, before thermal ablation. CNB was usually performed in nodules with prior inconclusive FNA results. The exclusion criteria were as follows: (1) nodules with inconclusive final diagnoses (n = 634); (2) nodules that had undergone prior treatment (n = 92); (3) nodules with incomplete or poor US images (n = 19).
Thus, a total of 3256 thyroid nodules were eligible, including 2336 benign nodules and 920 malignant nodules with a malignancy rate of 28.3%. It was reported that ∼10% of patients who present with thyroid nodules are at risk of malignancy. 4 To evaluate the diagnostic performance of the modified criteria proposed in this study in the general population, malignant nodules were included using simple random sampling, while benign nodules were included consecutively. Among eligible malignant nodules, all characteristics were comparable between the exclusion cohort and inclusion cohort (Supplementary Table S1). A total of 2596 thyroid nodules from 2511 patients were included, including 2336 benign nodules and 260 malignant nodules with a malignancy rate of 10.0% (Fig. 1). We previously reported on 601 of the included in our study evaluating the efficacy of AI TI-RADS. 11
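The size of the malignant sample follows from the target malignancy rate: with all 2336 benign nodules retained, a rate r requires m = r × 2336 / (1 − r) malignant nodules. The sketch below reproduces that arithmetic and draws a simple random sample from the 920 eligible malignant nodules. The function name, the integer indices standing in for nodule identifiers, and the fixed seed are illustrative assumptions and do not describe the procedure actually used in the study.

```python
import random

def malignant_sample_size(n_benign: int, target_rate: float) -> int:
    """Number of malignant nodules m such that m / (n_benign + m) is closest to target_rate."""
    # Solve m / (n_benign + m) = r  =>  m = r * n_benign / (1 - r)
    return round(target_rate * n_benign / (1.0 - target_rate))

n_benign = 2336          # benign nodules, all included consecutively
n_malignant_pool = 920   # eligible malignant nodules
m = malignant_sample_size(n_benign, 0.10)

rng = random.Random(0)   # fixed seed is an assumption, used only for reproducibility
# Simple random sampling without replacement; indices stand in for nodule IDs (hypothetical).
sampled_ids = rng.sample(range(n_malignant_pool), m)

rate_percent = round(100 * m / (n_benign + m), 1)
print(m, rate_percent)
```

Under these assumptions the computed sample size is 260 malignant nodules, which matches the reported malignancy rate of 10.0% among 2596 nodules.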

Flowchart of the included patients and number of thyroid nodules. n, number of thyroid nodules; US, ultrasound.
US examinations and image analysis
All nodules underwent US examination within two weeks before biopsy or operation. US examinations were performed using high-frequency linear probes and a real-time US system. The US systems used included GE Logiq 9, Logiq E9, Logiq S8 (GE Medical Systems, Milwaukee, WI, USA); Aixplorer (Supersonic Imagine, Paris, France); Philips IU22, EPIQ 7 (Philips Medical Systems, Best, the Netherlands); Siemens ACUSON Juniper, Sequoia, S2000 (Siemens Medical Solutions, Mountain View, CA, USA); Toshiba Aplio 400 (Toshiba Medical Systems Corp., Tokyo, Japan); Hitachi Aloka ProSound ALPHA 10 (Hitachi-Aloka Medical, Tokyo, Japan); Esaote MyLab 70 (Esaote, Genoa, Italy); Mindray Resona 7T, DC-8 (Mindray Medical International, Shenzhen, China). All US-guided procedures were performed by radiologists with at least 5 years of experience in US.
US image analysis was performed by two experienced radiologists (C.P. and Y.L., with 7 and 8 years of experience, respectively, in thyroid imaging). Blinded to the clinical and pathological data, they independently reviewed all US images and assessed the US features of thyroid nodules according to the ACR TI-RADS, including nodule maximum diameter, composition, echogenicity, shape, margin, and echogenic foci. Supplementary Figure S1 shows the scoring system for the ACR TI-RADS and AI TI-RADS. When grading a nodule, the reviewer selected one feature from each of the five categories, and the total score determined the nodule's TI-RADS risk level. Recommendations for FNA or US follow-up were based on a nodule's TI-RADS level and its maximum diameter. Images were reassessed by an expert (J.Z., with 22 years of experience in thyroid imaging) when the two reviewers disagreed.
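The grading logic described above (sum the points across the five feature categories, map the total to a risk level, then compare the maximum diameter with the level's FNA threshold) can be sketched as follows for the original ACR TI-RADS. The point bands and the 2.5, 1.5, and 1.0 cm FNA thresholds follow the published ACR TI-RADS scheme. The function names, the dictionary layout, and the handling of a 1-point total are illustrative assumptions, and the echogenic foci category is simplified to a single point value.

```python
def acr_level(total_points: int) -> int:
    """Map an ACR TI-RADS point total to a risk level (TR1 to TR5)."""
    if total_points >= 7:
        return 5
    if total_points >= 4:
        return 4
    if total_points == 3:
        return 3
    if total_points == 2:
        return 2
    return 1  # 0 points; a total of 1 is treated as TR1 here (assumption)

# Original ACR FNA size thresholds in cm; TR1 and TR2 carry no FNA recommendation.
ACR_FNA_THRESHOLD_CM = {3: 2.5, 4: 1.5, 5: 1.0}

def fna_recommended(category_points: dict, max_diameter_cm: float) -> bool:
    """Return True when the nodule meets the FNA size threshold of its risk level."""
    level = acr_level(sum(category_points.values()))
    threshold = ACR_FNA_THRESHOLD_CM.get(level)
    return threshold is not None and max_diameter_cm >= threshold

# Hypothetical solid (2 points), hypoechoic (2 points) nodule with no other suspicious feature.
nodule = {"composition": 2, "echogenicity": 2, "shape": 0, "margin": 0, "echogenic_foci": 0}
level = acr_level(sum(nodule.values()))
print(level, fna_recommended(nodule, 1.2), fna_recommended(nodule, 1.6))
```

In this hypothetical case the nodule scores 4 points (TR4), so FNA is recommended at 1.6 cm but not at 1.2 cm.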
Exploration of modified criteria to ACR TI-RADS and AI TI-RADS
According to our previous study evaluating the efficacy of AI TI-RADS, 11 the malignancy rate of 6-point nodules in AI TI-RADS was 43.1%, which was significantly higher than the malignant risk level (5–20%) of TR4 suggested in the ACR TI-RADS. Therefore, we hypothesized that upgrading 6-point nodules from AI TR4 to AI TR5 could improve the sensitivity of AI TI-RADS (hereinafter referred to as “TR4-adjusted AI TI-RADS”).
To explore the optimal nodule size thresholds for FNA recommendation, the thresholds of the ACR TI-RADS, AI TI-RADS, and TR4-adjusted AI TI-RADS were each adjusted. Five new versions of each guideline were hypothetically established (Table 1). Version 1 raised the FNA size threshold for TR4 from 1.5 to 2.0 cm. Version 2 changed the recommendation for TR3 from a 2.5 cm threshold to no FNA. Version 3 changed TR3 from 2.5 cm to no FNA and raised the TR4 threshold from 1.5 to 2.0 cm. Version 4 changed TR3 from 2.5 cm to no FNA and raised the TR4 threshold from 1.5 to 2.5 cm. Version 5 changed TR3 from 2.5 cm to no FNA and raised the TR4 threshold from 1.5 to 3.0 cm. The diagnostic performance and the unnecessary FNA rate (UFR) of all new versions were calculated and compared with those of the original ACR TI-RADS.
New Versions of the American College of Radiology TI-RADS and Artificial Intelligence TI-RADS with Fine-Needle Aspiration Size Threshold Adjustment
Indicates FNA size threshold adjustment.
Refers to 6-point nodules being upgraded from AI TR4 to AI TR5, while other rules were the same as the original AI TI-RADS.
ACR, American College of Radiology; AI, artificial intelligence; FNA, fine-needle aspiration; TI-RADS, Thyroid Imaging Reporting and Data System; TR, TI-RADS category.
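The five threshold versions can be written as a small lookup table, which makes it easy to see how a given nodule's FNA indication changes from one version to the next. This is a minimal sketch: the version keys and the function name are hypothetical, and the TR5 threshold of 1.0 cm is assumed to be unchanged in every version because only the TR3 and TR4 thresholds were adjusted.

```python
from typing import Dict, Optional

# FNA size thresholds in cm per risk level; None means "no FNA".
# TR5 (1.0 cm) is assumed unchanged, since only TR3 and TR4 were adjusted.
ORIGINAL: Dict[int, Optional[float]] = {3: 2.5, 4: 1.5, 5: 1.0}
VERSIONS: Dict[str, Dict[int, Optional[float]]] = {
    "original": ORIGINAL,
    "version1": {**ORIGINAL, 4: 2.0},
    "version2": {**ORIGINAL, 3: None},
    "version3": {**ORIGINAL, 3: None, 4: 2.0},
    "version4": {**ORIGINAL, 3: None, 4: 2.5},
    "version5": {**ORIGINAL, 3: None, 4: 3.0},
}

def fna_indicated(level: int, max_diameter_cm: float, version: str) -> bool:
    """Return True when a nodule of the given risk level and size is indicated for FNA."""
    threshold = VERSIONS[version].get(level)  # TR1 and TR2 are absent, so they map to None
    return threshold is not None and max_diameter_cm >= threshold

# A 3.0 cm TR3 nodule and a 2.2 cm TR4 nodule under each version.
for name in VERSIONS:
    print(name, fna_indicated(3, 3.0, name), fna_indicated(4, 2.2, name))
```

The same table applies to each of the three guidelines; what differs between them is how a nodule's points are mapped to a risk level before the threshold is looked up.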
Statistical analyses
Patient demographics were compared using descriptive statistics. Quantitative data are summarized as the mean ± standard deviation and were compared using the Mann–Whitney U test. Categorical data are summarized as percentages and were compared using the chi-square test. The thyroid nodules were dichotomized into two groups, FNA indicated or not indicated, based on the FNA criteria of each TI-RADS category and their new versions. The diagnostic performance in the detection of thyroid cancer was evaluated by sensitivity, specificity, accuracy, positive predictive value (PPV), negative predictive value (NPV), and the area under the receiver operating characteristic curve (AUC) in each guideline, along with the 95% confidence intervals (CI). The UFR was defined as the percentage of FNA-indicated benign nodules among the total number of nodules included. The McNemar test was used to assess differences in these measures of diagnostic performance, and the DeLong test was applied to compare AUCs.
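Once nodules are dichotomized into FNA indicated versus not indicated, each measure above reduces to a ratio of cells in a 2 × 2 table. The sketch below computes them from counts, treating an FNA-indicated malignant nodule as a true positive and an FNA-indicated benign nodule as a false positive. The function name and dictionary keys are illustrative, the example uses the counts reported in the Results for the original ACR TI-RADS (241 of 260 malignant and 1253 of 2336 benign nodules indicated for FNA), and confidence intervals are not computed.

```python
def diagnostic_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Point estimates of diagnostic performance from a 2 x 2 table of counts."""
    total = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / total,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        # UFR: FNA-indicated benign nodules over ALL included nodules.
        "ufr": fp / total,
    }

# Original ACR TI-RADS, counts taken from the Results.
acr = diagnostic_metrics(tp=241, fn=260 - 241, fp=1253, tn=2336 - 1253)
for name, value in acr.items():
    print(f"{name}: {100 * value:.1f}%")
```

Note that the UFR denominator is the whole cohort rather than the FNA-indicated nodules, so the UFR differs from 1 − PPV.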
Statistical analyses were performed using SPSS 22.0 (IBM, Armonk, NY) and R software 4.2.2 (R Foundation, Vienna, Austria). A p-value <0.05 was considered to indicate a statistically significant difference.
Results
Study patients and nodule characteristics
Table 2 summarizes the patient demographics and nodule characteristics. Of the 2511 patients included in the study, 1914 (76.2%) were women and 597 (23.8%) were men. Of the total 2596 nodules, 2336 were benign and 260 were malignant. Among the 2336 benign nodules, 1922 were confirmed by cytology, 173 were confirmed by CNB, and 241 were confirmed by surgical resection. Among the 260 malignant nodules, 40 were confirmed by cytology, 7 were confirmed by CNB, and 213 were confirmed by surgical resection. The detailed pathological diagnoses are summarized in Supplementary Table S2.
Patient Demographics and Nodule Characteristics
Unless otherwise specified, data are reported as number of nodules, with percentages in parentheses.
Data are reported as mean ± standard deviation for continuous variables.
Nodules could have more than one type of echogenic focus.
Diagnostic performance and UFR of original and new versions
Supplementary Tables S3–S5 show the diagnostic performance and UFR of original and new versions of each guideline. Among original versions, the highest sensitivity was observed with the ACR TI-RADS and TR4-adjusted AI TI-RADS (both, 93% [CI = 89–96%]). The highest specificity was observed in the AI TI-RADS (56% [CI = 54–58%]). The lowest UFR was observed in the AI TI-RADS (39% [CI = 37–41%]).
We evaluated the impact on diagnostic performance and UFR of applying higher size thresholds for FNA recommendation to each guideline (Fig. 2). As the nodule size thresholds were raised, the specificity, accuracy, PPV, and AUC of each new version gradually increased, while the sensitivity gradually decreased compared with the corresponding original version. The UFR also decreased markedly. Because its decrease in sensitivity was not statistically significant, version 4 of the ACR TI-RADS and of the TR4-adjusted AI TI-RADS offered the best relative performance, and these were selected as the modified ACR TI-RADS (mACR TI-RADS; Fig. 3) and modified AI TI-RADS (mAI TI-RADS; Fig. 4), respectively. The final modified criteria for mACR TI-RADS were as follows: (1) TR3 nodules were not recommended for FNA; (2) FNA threshold for TR4 increased to 2.5 cm. The modified criteria for mAI TI-RADS were as follows: (1) 6-point nodules were upgraded from TR4 to TR5; (2) TR3 nodules were not recommended for FNA; (3) FNA threshold for TR4 increased to 2.5 cm. For patients with symptoms or cosmetic issues, FNA was indicated before any treatment.
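The combined effect of the mAI TI-RADS rules (the 6-point upgrade, no FNA for TR3, and a 2.5 cm threshold for TR4) can be illustrated with a short sketch that compares the original and modified AI TI-RADS decisions for the same nodule. Only the 4 to 6 point band for AI TR4 and the listed modifications come from the text. The remaining point bands are assumed to mirror the ACR TI-RADS, the TR5 threshold of 1.0 cm is assumed unchanged, and all names are hypothetical.

```python
def ai_level(total_points: int, upgrade_six_points: bool = False) -> int:
    """Map an AI TI-RADS point total to a risk level.

    Bands other than TR4 (4 to 6 points) are assumed to mirror ACR TI-RADS.
    With upgrade_six_points=True, 6-point nodules move from TR4 to TR5.
    """
    tr5_cutoff = 6 if upgrade_six_points else 7
    if total_points >= tr5_cutoff:
        return 5
    if total_points >= 4:
        return 4
    if total_points == 3:
        return 3
    if total_points == 2:
        return 2
    return 1

ORIGINAL_CM = {3: 2.5, 4: 1.5, 5: 1.0}  # original FNA thresholds (cm)
MODIFIED_CM = {4: 2.5, 5: 1.0}          # TR3 removed (no FNA); TR5 assumed unchanged

def ai_fna_indicated(total_points: int, max_diameter_cm: float, modified: bool) -> bool:
    """FNA indication under the original (modified=False) or modified AI TI-RADS."""
    level = ai_level(total_points, upgrade_six_points=modified)
    threshold = (MODIFIED_CM if modified else ORIGINAL_CM).get(level)
    return threshold is not None and max_diameter_cm >= threshold

# A 1.2 cm 6-point nodule gains an FNA indication; a 2.0 cm 5-point nodule loses it.
print(ai_fna_indicated(6, 1.2, modified=False), ai_fna_indicated(6, 1.2, modified=True))
print(ai_fna_indicated(5, 2.0, modified=False), ai_fna_indicated(5, 2.0, modified=True))
```

The sketch covers only the size-based rules; the indication for FNA before treatment in patients with symptoms or cosmetic issues would need to be handled separately.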

Diagnostic performance and UFR of new versions of ACR TI-RADS and AI TI-RADS with adjustment (defined in Table 1). The TR4-adjusted AI TI-RADS refers to 6-point nodules being upgraded from AI TR4 to AI TR5, while other rules were the same as the original AI TI-RADS.

Chart shows the comparison of ACR TI-RADS and mACR TI-RADS scheme, including nodule size threshold adjustments in TR3 and TR4. mACR TI-RADS, modified ACR TI-RADS.

Chart shows the comparison of AI TI-RADS and mAI TI-RADS scheme, including nodule size threshold and stratification criteria adjustments. mAI TI-RADS, modified AI TI-RADS.
Compared with the original ACR TI-RADS, the mACR TI-RADS yielded higher specificity (73% vs. 46%, p < 0.001), accuracy (74% vs. 51%, p < 0.001), AUC (0.80 vs. 0.70, p < 0.001), and lower UFR (25% vs. 48%, p < 0.001), although the sensitivity was slightly decreased without a significant difference (87% vs. 93%, p = 0.057). Compared with the original ACR TI-RADS, the mAI TI-RADS yielded higher specificity (73% vs. 46%, p < 0.001), accuracy (75% vs. 51%, p < 0.001), AUC (0.81 vs. 0.70, p < 0.001), and lower UFR (24% vs. 48%, p < 0.001), although the sensitivity was slightly decreased without significant difference (89% vs. 93%, p = 0.13). There was no significant difference between the mACR TI-RADS and mAI TI-RADS in the diagnostic performance and UFR (Table 3).
Comparison of Diagnostic Performance Among Four TI-RADS Versions
Data in parentheses are 95% CIs, with numerators and denominators in brackets. P1 represents the comparison with ACR TI-RADS. P2 represents the comparison with AI TI-RADS. P3 represents the comparison with mACR TI-RADS.
p < 0.05 indicates statistically significant differences.
AUC, area under the receiver operating characteristic curve; CI, confidence interval; mACR TI-RADS, modified ACR TI-RADS; mAI TI-RADS, modified AI TI-RADS; NPV, negative predictive value; PPV, positive predictive value; UFR, unnecessary FNA rate.
Original versions versus modified versions
Table 4 summarizes the risk stratification and indication of FNA among the four TI-RADS versions. When the ACR TI-RADS and mACR TI-RADS were applied, the malignancy risk of most categories was consistent with those recommended in the ACR TI-RADS white paper, except for TR3 and TR4, where the malignancy risk was slightly lower. A total of 1494 nodules were recommended for FNA according to the ACR TI-RADS, of which 1253 (83.9%) were benign and 241 (16.1%) were malignant. Compared with the original ACR TI-RADS, the mACR TI-RADS reduced FNA in 631 nodules, of which 617 (97.8%) were benign. Despite the decrease in the FNA rate of the mACR TI-RADS, the malignancy detection rate was higher (26% [227/863] vs. 16% [241/1494]). A total of 1241 nodules were recommended for FNA according to the AI TI-RADS, of which 1019 (82.1%) were benign and 222 (17.9%) were malignant. After modification, the mAI TI-RADS spared 391 benign nodules from FNA and recommended FNA for 8 additional malignant nodules, compared with the original AI TI-RADS.
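The counts in this paragraph can be reconciled with a few lines of arithmetic. Because the mACR TI-RADS only raises thresholds, the nodules it recommends for FNA are a subset of those recommended by the original ACR TI-RADS, so the modified counts follow by subtraction. The variable names below are illustrative.

```python
# Nodules recommended for FNA by the original ACR TI-RADS.
acr_fna = {"benign": 1253, "malignant": 241}

# Nodules no longer recommended for FNA under the mACR TI-RADS: 631 in total, 617 benign.
spared = {"benign": 617, "malignant": 631 - 617}

macr_fna = {kind: acr_fna[kind] - spared[kind] for kind in acr_fna}

def detection_rate(fna_counts: dict) -> float:
    """Share of FNA-recommended nodules that are malignant."""
    return fna_counts["malignant"] / (fna_counts["benign"] + fna_counts["malignant"])

print(macr_fna, sum(macr_fna.values()))
print(f"ACR: {100 * detection_rate(acr_fna):.1f}%  mACR: {100 * detection_rate(macr_fna):.1f}%")
```

The subtraction yields 636 benign and 227 malignant nodules (863 in total) recommended for FNA under the mACR TI-RADS, consistent with the 227/863 reported above.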
Comparison of Risk Stratification and Indication of Fine-Needle Aspiration Among Four TI-RADS Versions
Unless otherwise specified, data are reported as number of nodules, with percentages in parentheses.
Numbers in parentheses represent percentage of total nodules of each risk level (TR1 to TR5) in total 2596 nodules.
Numbers in parentheses represent percentage of benign nodules of each risk level (TR1 to TR5) in 2336 benign nodules.
Numbers in parentheses represent percentage of malignant nodules of each risk level (TR1 to TR5) in 260 malignant nodules.
Numbers in parentheses represent percentage of benign nodules among nodules indicated for FNA within each risk level (TR1 to TR5).
Numbers in parentheses represent percentage of malignant nodules among nodules indicated for FNA within each risk level (TR1 to TR5).
NA, Suggested risk of malignancy was not provided according to Wildman-Tobriner's research. 10
Of the 2336 benign nodules, the AI TI-RADS and mAI TI-RADS downgraded 113 nodules from ACR TR3 to AI/mAI TR2 and 45 nodules to AI/mAI TR1. Among ACR TR4 nodules, 94 nodules were downgraded to AI/mAI TR3, 22 to AI/mAI TR2, and 32 to AI/mAI TR1. Among ACR TR5 nodules, 128 nodules were downgraded to AI TR4 (13 to mAI TR4 instead) and 4 to AI/mAI TR3. Ultimately, the new risk level assignments and size threshold adjustments resulted in 1700 and 1708 benign nodules being spared from FNA with the application of mACR TI-RADS and mAI TI-RADS, respectively (Fig. 5).

The application value of modified versions in downgrading risk level.
Discussion
We found that compared with the original ACR TI-RADS, the mACR TI-RADS and mAI TI-RADS had higher specificities (72.8%, 73.1% vs. 46.4%), AUCs (0.800, 0.808 vs. 0.695), and lower UFRs (24.5%, 24.2% vs. 48.3%, all p < 0.001), while the sensitivities were slightly but not significantly decreased (87.3% vs. 92.7%, p = 0.057; 88.5% vs. 92.7%, p = 0.13).
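For a dichotomized test (FNA indicated or not), the ROC curve has a single operating point, and the area under the resulting two-segment curve equals the mean of sensitivity and specificity. The sketch below checks this identity against the counts reported in the Results (241, 227, and 230 of 260 malignant nodules indicated for FNA; 1083, 1700, and 1708 of 2336 benign nodules spared). It is offered as a consistency check under that assumption and is not a description of how the AUCs were computed in the study; the function name is hypothetical.

```python
def auc_single_cutoff(sensitivity: float, specificity: float) -> float:
    """Trapezoidal area under the ROC curve through (0,0), (1-spec, sens), (1,1)."""
    fpr = 1.0 - specificity
    left = 0.5 * fpr * sensitivity                   # triangle from (0,0) to the operating point
    right = 0.5 * (1.0 - fpr) * (sensitivity + 1.0)  # trapezoid from the operating point to (1,1)
    return left + right                              # algebraically (sens + spec) / 2

N_MALIGNANT, N_BENIGN = 260, 2336
counts = {
    "ACR":  (241, 2336 - 1253),  # (malignant indicated for FNA, benign spared from FNA)
    "mACR": (227, 1700),
    "mAI":  (222 + 8, 1708),
}
aucs = {
    name: auc_single_cutoff(tp / N_MALIGNANT, tn / N_BENIGN)
    for name, (tp, tn) in counts.items()
}
for name, value in aucs.items():
    print(name, round(value, 3))
```

Under this reading, the gain in AUC for the modified versions comes almost entirely from the rise in specificity, partly offset by the smaller drop in sensitivity.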
In the past decade, several associations have issued guidelines based on US features and nodule size to grade the risk of malignancy of thyroid nodules. Previous comparative studies revealed that the ACR TI-RADS showed the highest specificity and lowest unnecessary biopsy rate compared with the other guidelines. 6,7 Ha et al. 8 showed that the main reason lay in the larger size thresholds of the ACR guidelines (mildly suspicious nodules, 2.5 cm; moderately suspicious nodules, 1.5 cm) when compared with the American Thyroid Association (ATA) and Korean Thyroid Association/Korean Society of Thyroid Radiology (KTA/KSThR) guidelines (1.5 and 1.0 cm, respectively). As the nodule size thresholds of the ATA and KTA/KSThR guidelines were raised, the diagnostic performance and unnecessary biopsy rates became similar to those seen with the ACR guideline. 8
A recent study 12 proposed that the recommended FNA size threshold could be raised to 15 mm for Kwak TI-RADS 4b nodules and that FNA need not be considered for 4a nodules; for the ATA guideline, the threshold for intermediate suspicion nodules could be raised to 15 or 20 mm, and FNA need not be considered for low and very low suspicion nodules. As far as we know, this is the first study to explore whether the size thresholds of ACR TI-RADS could be optimized to improve the diagnostic performance and decrease the unnecessary biopsy rates.
In our previous study, 11 ACR TR3 nodules accounted for 31.1% of the total nodules, with a malignancy rate of 0.5%. In this study, ACR TR3 nodules accounted for 21.3% of the total nodules, with a malignancy rate of 1.4%. The most recent ATA guideline suggests that observation without FNA is a reasonable option for nodules in the very low suspicion category with a risk of malignancy <3%. 13 In this study, TR3 nodules were mainly mixed cystic/solid (28.5%) or solid, hyper/isoechoic (71.3%). Mixed cystic/solid nodules account for one-third to one-half of all US-detected thyroid nodules, 14–16 and their malignancy rate varies but is generally low (∼5%), especially in predominantly cystic nodules. 14,15
Malignant nodules typically show an eccentric solid component with moderately or highly suspicious characteristics such as decreased echogenicity, lobulation, or punctate echogenic foci. 17–19 Hyperechogenicity and isoechogenicity suggest benign disease. 20 Rosario et al. 21 previously reported a rate of malignancy of only 1.5% for solid, iso- or hyperechoic nodules without suspicious US features, which agreed with the rate of <3% reported by other studies. 22–24 Therefore, Rosario et al. 21 suggested that FNA was less necessary in the case of iso- or hyperechoic nodules that did not show suspicious US characteristics, provided that the patient was closely followed by US. Although ACR TR3 nodules account for a relatively high proportion, their malignancy rate is quite low. Increasing the FNA recommendation threshold could substantially reduce the biopsy of benign nodules (Fig. 6A, B).

US images of three thyroid nodules.
ACR TR4 nodules accounted for 30.8% of the total nodules, with a malignancy rate of 4.5%. Nguyen et al. 25 analyzed 112,128 patients and concluded that the risk of local invasion, nodal metastases, or distant metastases was low for differentiated thyroid carcinoma tumors <4 cm, and there was no size threshold associated with a sharp rise in adverse outcomes. Increasing tumor size did not affect survival until a threshold of 2.5 cm. Furthermore, the dimension of nodules on US has been reported to be larger than their size at gross pathology by 5 mm on average. 26,27 These findings suggest that increased FNA size thresholds may not lead to significantly increased risk of morbidity and mortality (Fig. 6C, D).
When AI TI-RADS is applied, the simplification retains only the features that are important in the differential diagnosis, such as solid nodule composition. A large number of benign nodules are downgraded, widening the gap between their scores and those of malignant nodules. The malignancy rate of 6-point nodules in AI TR4 was higher than that of 4-point and 5-point nodules and was closer to that of TR5. Accordingly, after 6-point nodules were upgraded from AI TR4 to AI TR5, the sensitivity of the TR4-adjusted AI TI-RADS matched that of the original ACR TI-RADS (both 93%), which addressed the missed diagnosis problem of the original AI TI-RADS (Fig. 6E, F).
This study has several limitations. First, it is a retrospective study and, therefore, selection bias may be inevitable. Also, nodules were selected based on the specific risk stratification system (ACR TI-RADS) or clinically significant issues such as compressive symptoms. To minimize this limitation, we conducted a multicenter study involving a large sample. Second, nodules with inconclusive final diagnoses were excluded. It is difficult to assess the malignancy rate and TI-RADS performance among this subgroup. Inclusion criteria for future studies will add a follow-up criterion to study nodules that lack pathological diagnoses but remain stable over time (considered benign). Third, the composite reference standard including FNAC and CNB histology used in our study may lead to false-negative and false-positive results.
In conclusion, the mACR TI-RADS and mAI TI-RADS, based on adjusted FNA thresholds and stratification rules, may significantly improve the specificity and accuracy without significantly sacrificing sensitivity compared with the original ACR TI-RADS. Further validation is required in a larger prospective longitudinal study.
Footnotes
Authors' Contributions
Guarantors of integrity of entire study: X.L., C.P., and J.Z. Study concepts/study design or data acquisition or data analysis/interpretation: all authors. Article drafting or article revision for important intellectual content: X.L. and L.C. Approval of final version of submitted article: all authors. Agrees to ensure any questions related to the work are appropriately resolved: all authors. Literature research: Y.H., L.Y., and Y.Y. Clinical studies: Y.L., H.Z., W.H., Q.L., and N.T. Experimental studies: Y.L., H.Z., W.H., Q.L., and N.T. Statistical analyses: X.L. Article editing: all authors.
Author Disclosure Statement
All authors disclosed no relevant relationships. Activities related to the present article: disclosed no relevant relationships.
Funding Information
No funding was received for this article.
Supplementary Material
Supplementary Figure S1
Supplementary Table S1
Supplementary Table S2
Supplementary Table S3
Supplementary Table S4
Supplementary Table S5
