Abstract
Purpose:
Artificial intelligence (AI) models have shown promise in predicting malignant thyroid nodules in adults; however, research on deep learning (DL) for pediatric cases is limited. We evaluated the applicability of a DL-based model for assessing thyroid nodules in children.
Methods:
We retrospectively identified two pediatric cohorts (n = 128; mean age 15.5 ± 2.4 years; 103 girls) who had thyroid nodule ultrasonography (US) with histological confirmation at two institutions. The AI-Thyroid DL model, originally trained on adult data, was tested on pediatric nodules in three scenarios: axial US images, longitudinal US images, and both. We conducted a subgroup analysis based on the two pediatric cohorts and age groups (≥14 years vs. <14 years) and compared the model’s performance with radiologist interpretations using the Thyroid Imaging Reporting and Data System (TIRADS).
Results:
Of the 156 nodules analyzed, 47 (30.1%) were malignant. AI-Thyroid demonstrated area under the receiver operating characteristic curve (AUROC), sensitivity, and specificity values of 0.913–0.929, 78.7–89.4%, and 79.8–91.7%, respectively. The AUROC values did not significantly differ across the image planes (all p > 0.05) or between the two pediatric cohorts (p = 0.804). No significant differences were observed between age groups in terms of sensitivity and specificity (all p > 0.05), while the AUROC values were higher for patients aged <14 years than for those aged ≥14 years (all p < 0.01). AI-Thyroid yielded the highest AUROC values, followed by ACR-TIRADS and K-TIRADS (p = 0.016 and p < 0.001, respectively).
Conclusion:
AI-Thyroid demonstrated high performance in diagnosing pediatric thyroid cancer. Future research should focus on optimizing AI-Thyroid for pediatric use and exploring its role alongside tissue sampling in clinical practice.
Introduction
Thyroid cancer is the most common endocrine malignancy in children and adolescents. 1–3 Although it is less frequent in children and adolescents than in adults, its incidence is increasing worldwide, and thyroid cancer is the second most common cancer in adolescents aged 15–19 years. 4,5 The increases in early and advanced forms of thyroid cancer highlight the need for prompt detection and appropriate management in pediatric populations. 6,7 Although several Thyroid Imaging Reporting and Data Systems (TIRADS) assist in the evaluation of thyroid nodules on ultrasonography (US) in adults, these classification systems were not designed for the pediatric population, and their performance in children has been inconsistent. 8–12 The adult-based TIRADS biopsy criteria have not demonstrated the sensitivity required for pediatric use. 8–12 Therefore, clinicians could benefit from tools that more accurately identify pediatric patients at high risk of malignancy to prioritize surgery, while limiting the risks of surgery in patients with benign nodules.
Artificial intelligence (AI), particularly deep learning (DL), has shown promise in the evaluation of thyroid nodules among adults. 13,14 Increased computational power and large datasets have facilitated the development, external validation, and web-based accessibility of several AI-based systems. 13,15 However, pediatric thyroid nodules present distinct characteristics, including a higher risk of malignancy (22–26% vs. 5–10%) and distant metastases (30% vs. 5%), compared with adults in previous studies. 16–21 Therefore, AI-based systems require validation with pediatric datasets to reliably guide decisions regarding cases that require clinical monitoring with repeat US versus cases that require fine-needle aspiration (FNA) biopsy. To date, few studies have investigated the use of DL to assess thyroid nodules in pediatric populations. 22 We evaluated the applicability of a DL-based model for the evaluation of thyroid nodules in pediatric populations.
Methods
Study design and two datasets
Datasets were obtained from two tertiary hospitals: Ajou University Medical Center (AUMC) in South Korea and Stanford Lucile Packard Children’s Hospital (LPCH) in the United States. The institutional review boards at both hospitals approved the study protocol, and the requirement for informed consent was waived due to the retrospective study design (AUMC: AJOUIRB-DB-2024-124, LPCH: IRB-68847). A research collaboration agreement (RRA 527653) was established between the two institutions.
We included 150 consecutive pediatric patients (aged ≤18 years) with thyroid nodules ≥5.0 mm who underwent US and FNA at the two institutions (January 2013 to December 2022 at AUMC and January 2019 to December 2022 at LPCH). Most examinations were performed because of an abnormal physical examination or for follow-up of an incidentaloma discovered on imaging performed for other indications. A minority were performed for surveillance of specific predispositions (e.g., cancer predisposition syndrome). Pediatric thyroid experts (pediatricians, radiologists, endocrinologists, or surgeons) reviewed each patient’s clinical presentation and US images before recommending FNA. Of these patients, 22 were excluded due to the lack of a definitive diagnosis (Bethesda category I, III–V) on biopsy without surgical confirmation (n = 18) or suboptimal image quality (n = 4). Patients with definitive benign or malignant diagnoses based on surgical specimens or biopsies (Bethesda category II or VI) were included. 23 We finally analyzed 156 thyroid nodules (benign, n = 109; malignant, n = 47) from 128 consecutive patients (mean age 15.5 ± 2.4 years; 103 girls) (Fig. 1).

Flowchart showing the study participants. FNA, fine-needle aspiration; CNB, core-needle biopsy.
US examination and image analysis
All US examinations were performed using a 10–12 or 5–14 MHz linear probe. US images were obtained from four different manufacturers: Samsung (Seoul, South Korea), Philips (Amsterdam, The Netherlands), GE Healthcare (Chicago, IL, USA), and Siemens (Munich, Germany). All US examinations at AUMC were performed by radiologists. At LPCH, they were performed by ultrasound technicians under the supervision of radiologists. A staff radiologist (E.J.H.), who had 17 years of clinical experience in performing and evaluating thyroid US images, retrospectively reviewed all US images and assessed nodule features, including composition, echogenicity, margin, orientation, calcification, and spongiform appearance (presence or absence). The radiologist was blinded to the outcome. In total, 156 thyroid nodules were assessed according to the relevant adult guidelines (American College of Radiology [ACR]- and Korean [K]-TIRADS). 2,3
Evaluation of system performance using a pediatric dataset
We used the AI-Thyroid software, a DL model trained on an adult population, to identify malignant thyroid nodules. The software is freely available online (http://cdss.co.kr/). The AI algorithm was trained on 19,711 US images and validated on 11,185 US images from multicenter adult datasets (24 hospitals). 13 The algorithm is a VGG19-based convolutional neural network that uses binary cross-entropy as its loss function. It can process Digital Imaging and Communications in Medicine (DICOM) or JPEG images (both axial and longitudinal) with a manual annotation box. The algorithm outputs a gradient-weighted class activation map and an abnormality score (0–1.00) indicating the probability of malignancy. A predefined cut-off value of 0.472 was used to determine the presence of malignancy. 13
We tested AI-Thyroid on two pediatric cohorts across three scenarios, including (1) nodules with axial US images, (2) nodules with longitudinal US images, and (3) nodules with both axial and longitudinal US images. We used the higher value between the two images in scenario 3. We performed a subgroup analysis according to the two different pediatric cohorts (AUMC vs. LPCH) and pediatric age groups (≥14 years vs. <14 years).
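The scoring rule described above can be sketched in a few lines. This is only an illustration, not the AI-Thyroid implementation: the function names are hypothetical, and treating a score equal to the cut-off as malignant is an assumption, since the text does not say whether the comparison is inclusive.

```python
from typing import Optional

CUTOFF = 0.472  # predefined cut-off value reported for AI-Thyroid


def nodule_score(axial: Optional[float], longitudinal: Optional[float]) -> float:
    """Return the nodule-level abnormality score.

    With one plane available (scenarios 1 and 2) that score is used;
    with both planes (scenario 3) the higher of the two is used.
    """
    scores = [s for s in (axial, longitudinal) if s is not None]
    if not scores:
        raise ValueError("at least one plane score is required")
    return max(scores)


def is_malignant(score: float, cutoff: float = CUTOFF) -> bool:
    """Assumption: a score at or above the cut-off is called malignant."""
    return score >= cutoff
```

For instance, a nodule scoring 0.31 on the axial image and 0.52 on the longitudinal image would be called benign in scenario 1 and malignant in scenarios 2 and 3. Taking the maximum can only add positive calls, so it tends to raise sensitivity and lower specificity.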
Comparison of system performance with adult-based TIRADS classification systems
We compared AI-Thyroid performance with adult-based TIRADS classification systems. 2,3 We used biopsy criteria (TIRADS category and nodule size) to evaluate the performances of ACR- and K-TIRADS. Nodules that met biopsy criteria were classified as malignant; nodules that did not meet such criteria were considered benign. 13
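To illustrate how biopsy criteria turn a TIRADS category and a nodule size into a binary call, the sketch below uses the FNA size thresholds of the 2017 ACR TI-RADS white paper (TR3 ≥2.5 cm, TR4 ≥1.5 cm, TR5 ≥1.0 cm; no FNA for TR1 and TR2). That this version was the one applied here is an assumption, K-TIRADS is not shown because its thresholds differ between versions, and the function name is hypothetical.

```python
# Assumed FNA size thresholds in mm per ACR TI-RADS level (2017 white paper).
# TR1 and TR2 have no FNA threshold, so they are absent from the mapping.
ACR_FNA_THRESHOLD_MM = {3: 25.0, 4: 15.0, 5: 10.0}


def acr_meets_biopsy_criteria(tr_level: int, size_mm: float) -> bool:
    """Return True if the nodule meets the assumed ACR TI-RADS FNA criteria.

    A nodule that meets the criteria is counted as test-positive
    (malignant); one that does not is counted as test-negative (benign).
    """
    if tr_level not in (1, 2, 3, 4, 5):
        raise ValueError("tr_level must be an integer from 1 to 5")
    threshold = ACR_FNA_THRESHOLD_MM.get(tr_level)
    return threshold is not None and size_mm >= threshold
```

Under these assumed thresholds, a 12 mm TR5 nodule is test-positive, whereas a 12 mm TR4 nodule is test-negative regardless of its final pathology.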
Statistical analyses
Patient demographics and US grayscale features were compared using the chi-square test or Fisher’s exact test. The Kruskal–Wallis rank-sum test was used to compare quantitative variables. The frequency and malignancy risk of each nodule category were expressed as percentages. The diagnostic performances of the adult-based TIRADS and AI-Thyroid in detecting thyroid cancer were evaluated using sensitivities, specificities, positive predictive values (PPVs), and negative predictive values (NPVs), and the area under the receiver operating characteristic (AUROC) curves. AUROC curves were calculated with 95% confidence intervals (CIs). Diagnostic performances were compared using a proportion test with Yates’ continuity correction for proportions and the DeLong test within the R package “pROC” (version 4.1.2; R Foundation for Statistical Computing, Vienna, Austria). p-values <0.05 were considered statistically significant.
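The point estimates listed above follow directly from a 2 × 2 table. The sketch below is an illustration in Python rather than the R code used for the analysis, and the names are hypothetical. The example counts are back-calculated from the 47 malignant and 109 benign nodules and the scenario 1 sensitivity and specificity given in the Results; they are not counts reported in the text.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Confusion:
    tp: int  # malignant nodules called malignant
    fn: int  # malignant nodules called benign
    tn: int  # benign nodules called benign
    fp: int  # benign nodules called malignant


def diagnostic_metrics(c: Confusion) -> Dict[str, float]:
    """Compute standard diagnostic accuracy measures from a 2 x 2 table."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
        "accuracy": (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp),
    }


# Back-calculated (assumed) counts for scenario 1: 37 of 47 malignant and
# 93 of 109 benign nodules classified correctly.
example = Confusion(tp=37, fn=10, tn=93, fp=16)
metrics = diagnostic_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.1f}%")
```

With these assumed counts, the sensitivity, specificity, and accuracy round to 78.7%, 85.3%, and 83.3%, matching the scenario 1 values in the Results.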
Results
Table 1 summarizes the clinical and radiological characteristics of patients with thyroid nodules. The overall prevalence of malignancy was 30.1% (47/156). Malignant nodules included papillary thyroid carcinomas (n = 40, including 3 follicular variants), follicular carcinomas (n = 3), medullary carcinomas (n = 3), and poorly differentiated thyroid carcinoma (n = 1). The prevalence of malignancy was significantly higher at AUMC (41.0%; 34/83) than at LPCH (17.8%; 13/73) (p = 0.003).
Clinicoradiological Characteristics of Thyroid Malignancies vs. Benign Nodules
The numbers in parentheses are percentages.
AUMC, Ajou University Medical Center; LPCH, Stanford Lucile Packard Children’s Hospital.
Patients with thyroid cancer (mean age 16.2 ± 1.7 [12–18] years) were significantly older than patients with benign nodules (mean age 15.2 ± 2.5 [6–18] years) (p = 0.009). Sex distribution did not significantly differ between groups (p = 1.000). The mean size of the nodules was 24.0 ± 15.1 mm; malignant nodules were significantly larger (30.7 ± 16.9 mm) than benign nodules (21.3 ± 13.3 mm; p < 0.001). US features revealed significant between-group differences in composition, echogenicity, orientation, margins, and calcifications (all p < 0.001). Malignant nodules were more likely to exhibit solid composition, hypoechogenicity, nonparallel orientation, spiculated margins, and microcalcifications.
Diagnostic performance of AI-Thyroid applied to the pediatric datasets
Table 2 and Supplementary Table S1 summarize the diagnostic performances of AI-Thyroid according to different scenarios. AI-Thyroid demonstrated respective AUROC, sensitivity, specificity, and accuracy values of 0.913 (CI: 0.867–0.958), 78.7% (CI: 66.7–90.5), 85.3% (CI: 77.8–91.7), and 83.3% (CI: 76.9–89.1) in scenario 1; 0.929 (CI: 0.885–0.974), 78.7% (CI: 66.7–90.2), 91.7% (CI: 85.7–96.3), and 87.8% (CI: 82.1–92.3) in scenario 2; and 0.927 (CI: 0.891–0.954), 89.4% (CI: 79.5–97.6), 79.8% (CI: 72.1–86.9), and 82.7% (CI: 76.3–88.5) in scenario 3. No statistically significant differences in AUROC values of AI-Thyroid were observed across the image planes (all p > 0.05, Fig. 2).
Diagnostic Performance of AI-Thyroid in Pediatric Cohorts for Discriminating Malignant from Benign Thyroid Nodules
The numbers in parentheses are 95% confidence intervals.
AUROC, area under the receiver operating characteristic curve; NPV, negative predictive value; PPV, positive predictive value.

Diagnostic performance of AI-Thyroid applied to pediatric datasets across different image planes (A: Total cohort, B: AUMC, C: LPCH). AUMC, Ajou University Medical Center; AUROC, area under the receiver operating characteristic; CI, confidence interval; LPCH, Stanford Lucile Packard Children’s Hospital.
Table 2 and Supplementary Table S2 summarize the diagnostic performances of AI-Thyroid according to different pediatric cohorts (AUMC vs. LPCH). AI-Thyroid performed similarly in the two cohorts, with respective AUROC, sensitivity, specificity, and accuracy values of 0.945–0.954, 84.6–92.3%, 76.7–90.0%, and 79.5–89.0% for the LPCH dataset and 0.918–0.933, 73.5–88.2%, 85.7–93.9%, and 84.3–86.8% for the AUMC dataset. No statistically significant differences in AUROC values were observed between the cohorts (p = 0.804; Fig. 3).

Diagnostic performance of AI-Thyroid applied to two pediatric datasets. AUMC, Ajou University Medical Center; AUROC, area under the receiver operating characteristic; CI, confidence interval; LPCH, Stanford Lucile Packard Children’s Hospital.
Diagnostic performance of AI-Thyroid according to different age groups
Figure 4 visualizes the distribution of benign and malignant thyroid nodules according to age in the pediatric population. The prevalence of thyroid cancer differed significantly within the pediatric population: it was higher in patients aged ≥14 years than in patients aged <14 years (36.1% vs. 8.8%, p = 0.002). Table 3 summarizes the diagnostic performances of AI-Thyroid according to different age groups. AI-Thyroid demonstrated respective AUROC, sensitivity, specificity, and accuracy values of 0.899–0.926, 77.3–88.1%, 75.8–92.3%, and 80.6–86.9% among pediatric patients aged ≥14 years, and 1.000, 100.0%, 83.3–93.5%, and 85.2–94.1% among pediatric patients aged <14 years. The AUROC values were higher for patients aged <14 years than for those aged ≥14 years (all p < 0.01). However, no statistically significant differences were observed between the age groups in terms of sensitivity, specificity, PPV, NPV, and accuracy (all p > 0.05).

Diagnostic performance of AI-Thyroid across different age groups and distribution of benign and malignant thyroid nodules according to age. AUROC, area under the receiver operating characteristic; CI, confidence interval; NPV, negative predictive value; PPV, positive predictive value.
Diagnostic Performance of AI-Thyroid by Age Group for Discriminating Malignant from Benign Thyroid Nodules
The numbers in parentheses are 95% confidence intervals.
AUROC, area under the receiver operating characteristic curve; NPV, negative predictive value; PPV, positive predictive value.
Comparison of the diagnostic performance with the TIRADS-based classification
Table 4 summarizes the diagnostic performances of AI-Thyroid and radiologists using TIRADS-based classifications. The respective AUROC, sensitivity, specificity, and accuracy values of the radiologists were 0.684–0.831, 87.2%, 49.5–78.9%, and 60.9–81.4%. AI-Thyroid yielded the highest AUROC values, followed by ACR-TIRADS and K-TIRADS (p = 0.016 and p < 0.001, respectively). AI-Thyroid demonstrated sensitivity and specificity comparable to those of ACR-TIRADS (p = 0.748 and p = 0.991, respectively), whereas it showed comparable sensitivity (p = 0.755) but higher specificity (p < 0.001) than K-TIRADS.
Comparison of Diagnostic Performance Between AI-Thyroid and Radiologists Using TIRADS-Based Classifications
p values indicate the comparison results between AI-Thyroid and K-TIRADS.
p values indicate the comparison results between AI-Thyroid and ACR-TIRADS.
AUROC, area under the receiver operating characteristic curve; NPV, negative predictive value; PPV, positive predictive value.
Discussion
Our study demonstrated that AI-Thyroid performs well in diagnosing pediatric thyroid cancer. AI-Thyroid achieved AUROC values ranging from 0.913 to 0.929, with no significant differences observed across different image planes or between the two pediatric cohorts. Notably, AI-Thyroid yielded the highest AUROC values and demonstrated comparable or even higher sensitivity and specificity than radiologist interpretations using TIRADS. These findings suggest that the diagnostic performances of the adult-based AI-Thyroid model are acceptable in the pediatric population.
Thyroid nodules are becoming increasingly prevalent in pediatric US examinations. 5–7 A recent population-based study conducted in the United States showed that the rates of both smaller, early-stage tumors and larger, later-stage tumors increased significantly over time. 6 Although US and FNA cytology offer primary assessments of thyroid nodule malignancy risk, both carry a significant risk of indeterminate results and may fail to accurately exclude malignancy. A recent meta-analysis of a pediatric cytopathology series revealed that 5–28% of FNA specimens were nondiagnostic, whereas 3.3–38% were cytologically indeterminate. 24 Several recent pediatric case series have reported a malignancy rate of 51–64% among surgically managed thyroid nodule patients, highlighting the high rate of unnecessary surgery and the limitations of current preoperative diagnostics in terms of accurately identifying high-risk patients. 18,25,26 In addition, the lack of specific US-based risk stratification systems for pediatric populations complicates the assessment and management of these lesions. 8–11,27 In this respect, AI-based systems may have a role in identifying nodules at high risk of malignancy while reducing unnecessary biopsies by identifying nodules with acceptably low risk. This study evaluated the diagnostic performance of AI-Thyroid on two pediatric datasets. AI-Thyroid yielded AUROC values of 0.913–0.929 without significant differences across image planes (axial vs. longitudinal) or between the two cohorts, despite variations in malignancy prevalence (41.0% vs. 17.8%) and geographic settings (South Korea vs. the United States). We anticipate that the AUMC and LPCH cohorts may have different baseline environmental factors, epigenetic/genetic backgrounds, and US protocols. Nonetheless, the model’s consistent performance across both settings is a significant strength.
The results were also similar to those from studies involving adult populations, which reported AUROC values ranging from 0.922 to 0.938. 13 These findings suggest that AI-Thyroid, validated in adults using large-scale multicenter data, may also be applicable to pediatric populations. 13
Even within pediatric populations, thyroid cancers can exhibit different characteristics depending on the age at diagnosis. 28,29 Harach et al. reported distinct characteristics between thyroid cancers in children under and over the age of ten, including variations in histological types, a male predominance, and a higher mortality rate in younger children compared to older ones. 28 Sugino et al. suggested that patients aged under 15 years present with distinct clinical manifestations, indicating that an age cutoff of <15 years might be more appropriate than <19 years for pediatric thyroid cancers. 29 In this study, we found a higher prevalence of malignant nodules among pediatric patients aged ≥14 years (36.1%) compared to those aged <14 years (8.8%). Despite the lower prevalence of malignant nodules among patients aged <14 years, no significant differences in sensitivity and specificity were observed between age groups. However, we consider that increasing the diagnostic cutoff value for patients aged <14 years may enhance diagnostic performance, given the lower prevalence of malignant nodules in this younger group. Future studies with larger, age-based pediatric datasets are needed to assess the clinical implications of AI-Thyroid for this population and to guide optimal management strategies for thyroid nodules, which may differ between young children and adolescents.
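The effect of prevalence on predictive values, which motivates the suggestion about the cut-off value, can be made concrete with Bayes' rule. The sketch below is a hypothetical illustration: it applies the pooled scenario 3 sensitivity and specificity (89.4% and 79.8%) to the two age-group prevalences, which is an assumption made for the example and not the age-specific performance reported in Table 3.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def npv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Negative predictive value from sensitivity, specificity, and prevalence."""
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)


# Assumed inputs: pooled scenario 3 operating point, age-group prevalences.
SENS, SPEC = 0.894, 0.798
ppv_older = ppv(SENS, SPEC, prevalence=0.361)    # patients aged >=14 years
ppv_younger = ppv(SENS, SPEC, prevalence=0.088)  # patients aged <14 years
print(f"PPV at 36.1% prevalence: {ppv_older:.3f}")
print(f"PPV at 8.8% prevalence:  {ppv_younger:.3f}")
```

At a fixed operating point, the lower prevalence yields a markedly lower PPV (roughly 0.30 versus 0.71 under these assumptions). A higher cut-off would generally trade some sensitivity for specificity, which reduces false positives where malignancy is uncommon.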
The lack of pediatric-specific risk stratification systems has led to the frequent use of adult-based TIRADS for assessing pediatric thyroid nodules. However, this practice is controversial due to insufficient evidence supporting these systems in children, particularly concerning biopsy indications. A recent study applying ACR-TIRADS to 404 thyroid nodules in 314 pediatric patients found that biopsy was not indicated for 22% of the 77 malignant thyroid nodules. 30 Another study evaluated five adult-based US risk stratification systems across 277 nodules in 221 children, revealing diagnostic performance that varied from 70% to 78% for sensitivity and from 42% to 78% for specificity. 31 Similarly, a meta-analysis indicated that ACR-TIRADS, the American Thyroid Association guidelines, and European-TIRADS demonstrated moderate diagnostic performance for pediatric thyroid nodules, while K-TIRADS’s diagnostic efficacy was lower than expected. 11 Consequently, there is ongoing confusion regarding the use of adult-based TIRADS for managing pediatric thyroid nodules. 32 Although the source of heterogeneity may mainly be the size threshold for suggesting biopsy, it remains unclear whether this is due to malignant nodules exhibiting different US morphology, carrying a higher risk on a per-nodule basis, a combination of both, or some other factor in the pediatric population. In our study, we found that the sensitivity and specificity of adult-based TIRADS were 87.2% and 49.5–78.9%, respectively. AI-Thyroid demonstrated comparable sensitivity and either comparable or higher specificity compared to adult-based TIRADS and achieved the highest AUROC values, followed by ACR-TIRADS and K-TIRADS. These findings suggest that AI-Thyroid could serve as a valuable decision-making aid by accurately identifying pediatric patients at high risk of malignancy, thus prioritizing FNA for those cases while reducing unnecessary biopsies for patients with benign nodules.
Few studies have explored the application of AI for predicting malignancy in pediatric thyroid nodules. Radebe et al. reported the first attempt at developing an interpretable machine learning-based clinical tool using 67 pediatric thyroid nodules. 33 Yang et al. developed a DL algorithm to differentiate between benign and malignant thyroid nodules using 139 pediatric thyroid nodules, reporting a sensitivity of 87.5% and a specificity of 36.1%. 22 Although these studies demonstrated the potential of AI applications in pediatric thyroid nodules, the models were trained on relatively small patient populations due to the rarity of pediatric thyroid cancer. In this study, we utilized AI-Thyroid, a DL model originally trained on an adult population, to identify malignant thyroid nodules in a pediatric population. 13 Given the limited number of pediatric thyroid cancer cases available for developing new models, we assessed the applicability of an adult-based DL model for evaluating thyroid nodules in pediatric populations. Our study demonstrated the high performance of AI-Thyroid in diagnosing pediatric thyroid cancer. Based on these results, we consider that this DL-based model is applicable for assessing thyroid nodules in children.
Our study had several limitations. First, the sample size was relatively small, particularly in the younger age group (<14 years). Nevertheless, this study represents one of the largest cohorts to date evaluating the diagnostic performance of an AI system in pediatric thyroid nodules. Further validation in larger cohorts is warranted to confirm our findings. Additionally, all patients underwent pathological confirmation, which may have introduced selection bias. Second, the data selection process, including the exclusion of low-quality images and the selection of the best representative axial and longitudinal images for the AI-Thyroid test, could have introduced bias. Third, we did not directly compare AI-Thyroid outcomes with physician-based outcomes. Additionally, comparing AI-Thyroid performance with adult-based TIRADS classification systems using biopsy criteria may not be entirely appropriate. Clinical diagnosis is a complex process, often informed by the real-time assessment of multiple images, which makes a direct comparison with AI output inappropriate. The algorithm did not incorporate clinical data (such as pertinent family history); incorporating clinical factors in a future version may improve the robustness of the neural network. Similarly, the algorithm is not intended to produce a patient-facing summary or generate treatment recommendations, although these are also possible future directions. Fourth, the assignment of US features has repeatedly been shown to exhibit considerable inter-observer variability. It would have been preferable if the images had been reviewed by at least two experts in consensus, possibly with a third expert as a tie-breaker in cases of disagreement. Fifth, the Bethesda classification categorizes FNA cytology into six groups of varying malignant potential. Combining the AI-Thyroid results with cytology for Bethesda III, IV, and V nodules could be a direction for future research.
In conclusion, we demonstrated the robust performance of AI-Thyroid in diagnosing pediatric thyroid cancer. Future investigations should focus on optimizing AI-Thyroid as an adjunct to tissue sampling in clinical practice and on developing AI-based and/or conventional systems tailored to children.
Footnotes
Acknowledgment
A preliminary version of this study was presented in abstract form at the 2023 Annual Meeting of the American Thyroid Association.
Authors’ Contributions
E.J.H., K.W.Y., and K.D.M.: Acquisition of imaging data and data collection; E.J.H.: Article writing; J.H.L.: Statistical analysis and article review; K.W.Y., K.D.M., N.M., A.K.D., and E.T.: Article review. All authors read and approved the article.
Author Disclosure Statement
The authors of this article declare no relationships with any companies whose products or services may be related to the subject matter of the article.
Funding Information
This work was supported by the National Research Foundation of Korea (NRF) grant by the Korea government (MSIT) (#2021R1C1C100698711).
Supplementary Material
Supplementary Tables
References
