Diagnostic Performance of Four Ultrasound Risk Stratification Systems: A Systematic Review and Meta-Analysis

Abstract

Background:

Several ultrasound (US)-based risk stratification systems have been increasingly used for the optimal management of thyroid nodules. However, there are considerable discrepancies across these systems. This study aimed to summarize and compare the category-based diagnostic performance in the detection of thyroid cancer of different US-based risk stratification systems from four societies: the American College of Radiology-Thyroid Imaging Reporting and Data System (ACR-TIRADS), the American Thyroid Association (ATA), the Korean Thyroid Association/Korean Society of Thyroid Radiology (KTA/KSThR; K-TIRADS), and the European Thyroid Association (EU-TIRADS).

Methods:

MEDLINE/PubMed and EMBASE databases were searched to identify original articles investigating the category-based diagnostic performance according to at least one of the following guidelines: ACR-TIRADS, ATA, K-TIRADS, and EU-TIRADS. Pooled sensitivity and specificity were calculated using a bivariate random-effects model. A subgroup analysis on nodules of 1 cm or larger and a meta-regression analysis to identify factors associated with the diagnostic performance were performed.

Results:

A total of 29 articles including 33,748 thyroid nodules met the eligibility criteria and were included in the analysis. For ACR-TIRADS, the pooled sensitivity and specificity were, respectively, 66% and 91% for category 5 and 95% and 55% for category 4 or 5. For ATA, the pooled sensitivity and specificity were, respectively, 74% and 88% for category 5 and 91% and 64% for category 4 or 5. For K-TIRADS, the pooled sensitivity and specificity were, respectively, 55% and 95% for category 5 and 89% and 64% for category 4 or 5. For EU-TIRADS, the pooled sensitivity and specificity were, respectively, 82% and 90% for category 5 and 96% and 52% for category 4 or 5. Study location, proportion of female patients and malignant nodules, and study design were associated with study heterogeneity.

Conclusions:

The overall diagnostic performance of the four US-based risk stratification systems was comparable.

Introduction

As ultrasound (US) is a highly sensitive diagnostic modality for the characterization of thyroid nodules (1), neck US is recommended as the primary imaging workup test for the diagnosis of thyroid cancer (2–4). This has resulted in increasing use of US examinations, fine-needle aspiration biopsy (FNAB), and core needle biopsy (CNB) and has consequently contributed to the increase in the recorded incidence of thyroid cancer (5,6). However, given that thyroid cancer frequently shows a less aggressive nature, not all lesions require invasive diagnostic and therapeutic procedures. In this context, several representative US-based risk stratification systems have been increasingly used to risk stratify thyroid nodules and minimize unnecessary biopsies (7–10) and also to assess the requirement for ablative treatment of benign nodules (11–13).

However, there are considerable discrepancies across these risk stratification systems with respect to the imaging features used for the risk categories, expected malignancy risk, diagnostic performance, and size cutoffs for biopsy. There have been many attempts to externally validate and compare the systems (14–18), but comprehensive interpretation is still difficult because of heterogeneity in study designs and populations. Castellana et al. (19) conducted a meta-analysis on the selection of thyroid nodules for FNAB according to five well-known risk stratification systems; however, category-based diagnostic performance, subgroup analyses, and meta-regression analyses were not reported.

This study aimed to summarize and compare the diagnostic performance in the detection of thyroid cancer of the different US-based risk stratification systems from four societies: the American College of Radiology-Thyroid Imaging Reporting and Data System (ACR-TIRADS) (7), the American Thyroid Association (ATA) (8), the Korean Thyroid Association/Korean Society of Thyroid Radiology (KTA/KSThR; K-TIRADS) (9), and the European Thyroid Association (EU-TIRADS) (10). In addition, we performed subgroup and meta-regression analyses to identify any factors associated with the diagnostic performance.

Materials and Methods

This study was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (20).

Search strategy and eligibility criteria

A literature search of the MEDLINE/PubMed and EMBASE databases was conducted using pertinent MeSH or EMTREE terms with common keywords for relevant articles up to August 5, 2019. The search terms were as follows: ((thyroid)) AND ((thyroid imaging reporting and data system) OR (TIRADS) OR (TI-RADS) OR (guideline)) AND ((American Thyroid Association) OR (ATA) OR (American College of Radiology) OR (ACR) OR (Europe*) OR (EU-TIRADS) OR (Korea*) OR (K-TIRADS)). The search was limited to English-language publications but was not restricted by study subject (human or animal) or by publication date.

After eliminating duplicate publications, articles were screened according to their title and abstract. Full-text articles were then thoroughly assessed according to the following eligibility criteria: (a) population: patients who underwent US examinations for thyroid nodules; (b) index test: US-based risk stratification systems according to at least one of the following guidelines: ACR-TIRADS (7), ATA (8), K-TIRADS (9), and EU-TIRADS (10); (c) reference standard: pathological diagnosis or imaging follow-up; (d) outcomes: sensitivity and specificity of the US-based risk stratification systems for diagnosing malignant thyroid nodules; and (e) study design: not limited. Studies were excluded if any of the following criteria were met: (a) review articles; (b) case reports or case series including fewer than 10 patients; (c) conference abstracts; (d) letters, editorials, and commentaries; (e) animal studies; (f) studies with a partially overlapping patient cohort (for studies with an overlapping study population, the publication with the largest population was selected); (g) studies conducted with a pediatric population; (h) studies using a cytopathology reporting system other than the Bethesda classification system (21); or (i) studies not providing sufficient details to extract 2 × 2 tables.

The literature search and application of the criteria were conducted independently by two authors (P.H.K. and C.H.S.; with 3 and 8 years of experience in performing thyroid US and interventional procedures, respectively), and any discrepancies were resolved through discussion and consensus with a third author (J.H.B.; 21 years of experience in performing thyroid US and interventional procedures).

Data extraction and quality assessment

A standardized database form was used to obtain the following information from the selected studies: (a) study characteristics: institution, study period, study design (prospective vs. retrospective; single-center vs. multicenter), consecutive or nonconsecutive enrollment, reference standard, and blinding to the reference standard; (b) demographic and clinical characteristics: total number of patients, total number of nodules and malignant nodules, mean age (range), and sex; (c) characteristics associated with the US examinations: vendor, model, transducer frequency, and the number of participating readers and their length of experience; and (d) diagnostic performance of US-based FNAB criteria for diagnosing malignant thyroid nodules in the form of a 2 × 2 table. The quality of the selected studies was assessed using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool (22).

Data synthesis and analysis

Two-by-two tables were constructed for each study, choosing the results with the highest performance if the diagnostic performance was separately evaluated for different radiologists. The criteria for a positive test result were set to be (a) category 5 or (b) category 4 or 5 of each risk stratification system. For example, with category 5 as the cutoff, true-positive nodules were those classified as category 5 on US that turned out to be malignant, false-positive nodules were those classified as category 5 that turned out to be benign, false-negative nodules were those classified as category 1 to 4 that turned out to be malignant, and true-negative nodules were those classified as category 1 to 4 that turned out to be benign. Similarly, with category 4 or 5 as the cutoff, true-positive nodules were those classified as category 4 or 5 on US that turned out to be malignant. The diagnostic performance with category 3, 4, or 5 as the cutoff was additionally evaluated, as provided in the Supplementary Data. We followed the reference standard set in each study. Since the ATA system (categorizing sonographic pattern as benign, very low, low, intermediate, and high suspicion) is not called TIRADS, we treated the benign to high suspicion patterns in ATA as categories 1 to 5, respectively.
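The cutoff rule described above can be illustrated with a minimal Python sketch. The function names and the per-category counts are hypothetical and are not taken from any included study; the sketch only shows how one set of category counts yields a different 2 × 2 table for each cutoff.

```python
def two_by_two(malignant_by_category, benign_by_category, positive_from):
    """Collapse per-category nodule counts into (TP, FP, FN, TN).

    Categories are numbered 1 to 5. A nodule counts as test-positive when its
    category is >= positive_from (5 for "category 5", 4 for "category 4 or 5").
    """
    tp = sum(n for c, n in malignant_by_category.items() if c >= positive_from)
    fn = sum(n for c, n in malignant_by_category.items() if c < positive_from)
    fp = sum(n for c, n in benign_by_category.items() if c >= positive_from)
    tn = sum(n for c, n in benign_by_category.items() if c < positive_from)
    return tp, fp, fn, tn


def sensitivity_specificity(tp, fp, fn, tn):
    """Return (sensitivity, specificity) from the four cells of a 2 x 2 table."""
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical counts for a single study (illustration only).
malignant = {1: 0, 2: 1, 3: 4, 4: 30, 5: 65}
benign = {1: 20, 2: 80, 3: 150, 4: 110, 5: 40}

for cutoff in (5, 4):
    sens, spec = sensitivity_specificity(*two_by_two(malignant, benign, cutoff))
    print(f"positive if category >= {cutoff}: "
          f"sensitivity {sens:.3f}, specificity {spec:.3f}")
```

With these illustrative counts, lowering the cutoff from category 5 to category 4 raises sensitivity and lowers specificity, which is the same direction of trade-off seen in the pooled estimates.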

The pooled sensitivity and specificity and their 95% confidence intervals [CIs] were calculated using a bivariate random-effects model, and a coupled forest plot was constructed (23–27). In addition, a hierarchical summary receiver operating characteristics (HSROC) curve with 95% confidence and prediction regions was plotted. Heterogeneity was assessed using the Higgins inconsistency index (I²), with a value >50% indicating the presence of heterogeneity, and a coupled forest plot was used to graphically assess the presence of a threshold effect (a positive correlation between sensitivity and false-positive rate among the selected studies) (23). Deeks' funnel plot was constructed to test for publication bias, with statistical significance being assessed using Deeks' asymmetry test. In addition, indirect (comparison using the pooled estimates from separate studies) and direct comparisons (meta-analytic pooling of head-to-head comparison studies) of sensitivity and specificity between the guidelines were performed when possible. Finally, we investigated unnecessary biopsy rates (defined as the proportion of benign nodules among FNAB-requiring nodules) for each system.
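As a rough illustration of the inconsistency index, the following Python sketch computes I² for sensitivity from Cochran's Q on the logit scale. It is a univariate simplification, not the bivariate random-effects model fitted here with Stata and R; the function names, the 0.5 continuity correction, and the study counts are assumptions made for the example.

```python
import math


def logit_sensitivity(tp, fn, correction=0.5):
    """Return logit(sensitivity) and its approximate variance for one study.

    A continuity correction is added to both cells (an assumption of this
    sketch) so that studies with a zero cell remain usable.
    """
    tp_c, fn_c = tp + correction, fn + correction
    return math.log(tp_c / fn_c), 1.0 / tp_c + 1.0 / fn_c


def i_squared(estimates, variances):
    """Higgins I² (in %) from Cochran's Q with inverse-variance weights."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    if q <= 0 or df <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical (TP, FN) pairs for three studies, illustration only.
studies = [(90, 10), (50, 50), (20, 80)]
logits, variances = zip(*(logit_sensitivity(tp, fn) for tp, fn in studies))
print(f"I² = {i_squared(logits, variances):.1f}%")
```

Under the 50% threshold used in this study, a value above 50 from such a calculation would be read as indicating heterogeneity.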

A subgroup analysis of the studies with respect to nodule size of 1 cm or larger and a meta-regression were performed to explore the sources of heterogeneity. The following variables were considered for the bivariate meta-regression model: study design (prospective vs. retrospective), single-center vs. multicenter, study location (East Asia vs. other countries), proportion of female patients (cutoff at 79%; mean value of the proportions reported by the included studies), proportion of malignant nodules (cutoff at 28%; mean value of the proportions reported by the included studies), and inclusion of follow-up as a reference standard.
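The mean-based cutoff can be checked against the malignancy percentages listed in Table 1. The Python sketch below is only illustrative: the variable names are ours, and the use of a strict "greater than" comparison to split the studies is an assumption, since the handling of a value exactly at the cutoff is not stated.

```python
# Malignancy percentages of the 29 included studies, in Table 1 row order.
malignancy_pct = [
    27.2, 46.3, 14.3, 21.2, 31.3, 66.1, 22.7, 29.5, 37.2, 25.8,
    18.1, 30.1, 3.9, 56.2, 22.2, 10.3, 16.4, 14.0, 39.2, 48.0,
    5.7, 24.3, 10.6, 53.0, 40.8, 18.1, 13.2, 33.2, 29.8,
]

mean_pct = sum(malignancy_pct) / len(malignancy_pct)
cutoff = round(mean_pct)  # rounds to the 28% cutoff quoted in the text

# Dichotomize the covariate for the meta-regression (assumed rule: > cutoff).
high = [p for p in malignancy_pct if p > cutoff]
low = [p for p in malignancy_pct if p <= cutoff]

print(f"mean = {mean_pct:.1f}%, cutoff = {cutoff}%")
print(f"studies above cutoff: {len(high)}, at or below: {len(low)}")
```

No study in Table 1 reports a malignancy proportion of exactly 28%, so the choice between a strict and a non-strict comparison does not change the grouping for this covariate.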

Statistical analyses were conducted by one of the authors (C.H.S., with six years of experience in performing systematic reviews and meta-analyses) using the “metandi” and “midas” modules in Stata 15.0 (StataCorp, College Station, TX), the “mada” package in R version 3.4.1, and MedCalc version 18.11 (MedCalc Software, Ostend, Belgium). A value of p < 0.05 was taken to indicate statistical significance.

Results

Literature search

A flowchart summarizing the publication selection process is presented in Figure 1. A total of 411 nonduplicate studies were identified. Of these, 307 articles were excluded on the basis of their titles and abstracts for the following reasons: (a) not in the field of interest (n = 232); (b) guidelines (n = 63); (c) reviews (n = 8); (d) case reports (n = 2); (e) erratum (n = 1); or (f) animal study (n = 1). Subsequently, 104 potentially eligible full-text articles were assessed according to the eligibility criteria, and a further 75 studies were excluded for the following reasons: (a) articles included nonconsecutive nodules (n = 29); (b) articles did not use any of the four risk stratification systems of interest (ACR-TIRADS, ATA, K-TIRADS, or EU-TIRADS; n = 11); (c) articles used data included in subsequent articles (n = 10); (d) articles not in the field of interest (n = 9); (e) articles included inseparable adult and pediatric patients (n = 6); (f) articles not using each guideline's category as a standard for positive test results (n = 4); (g) articles with insufficient details to derive a 2 × 2 table (n = 4); (h) articles using a cytopathologic reporting system other than the Bethesda system (n = 1); and (i) articles not including histopathology as a reference standard (n = 1). Consequently, a total of 29 articles including 33,748 thyroid nodules met the eligibility criteria and were included in the analysis (14–18,28–51).
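The counts in the selection process are internally consistent, as the short Python check below shows. The dictionary keys are abbreviated labels of our own; the numbers are those reported in the paragraph above.

```python
identified = 411  # nonduplicate records

title_abstract_exclusions = {
    "not in field of interest": 232,
    "guidelines": 63,
    "reviews": 8,
    "case reports": 2,
    "erratum": 1,
    "animal study": 1,
}

full_text_exclusions = {
    "nonconsecutive nodules": 29,
    "none of the four systems used": 11,
    "data included in subsequent articles": 10,
    "not in field of interest": 9,
    "inseparable adult and pediatric patients": 6,
    "category not used as positivity standard": 4,
    "insufficient details for 2 x 2 table": 4,
    "non-Bethesda cytopathology system": 1,
    "no histopathology reference standard": 1,
}

full_text_assessed = identified - sum(title_abstract_exclusions.values())
included = full_text_assessed - sum(full_text_exclusions.values())

print(f"full-text articles assessed: {full_text_assessed}")
print(f"articles included: {included}")
```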

FIG. 1.

Flowchart of the publication selection process.

Characteristics of the studies included

The detailed study characteristics are summarized in Table 1. Seven of the 29 studies had a prospective design (28,29,35,36,39,42,44), and 6 were multicenter studies (15,16,29,33,40,45). The number of included patients ranged from 52 to 3190, and the mean patient age ranged from 43 to 59 years. The proportion of female patients in each study ranged from 61.2% to 94.9%. The proportion of malignant nodules in each study ranged from 3.9% to 66.1%. The diagnostic performances of ACR-TIRADS, ATA, K-TIRADS, and EU-TIRADS were reported in 21 (72.4%) (14–18,29,30,32–38,40,42,43,46,47,49,51), 13 (44.8%) (14,15,17,20,22,23,30,36,38,39,41,47,48), 8 (27.6%) (14–16,28,33,34,49,50), and 5 studies (17.2%) (16,17,44,45,49), respectively, and 30,280, 15,504, 12,659, and 7549 nodules were analyzed for evaluating the diagnostic performances of ACR-TIRADS, ATA, K-TIRADS, and EU-TIRADS, respectively. In 10 studies, follow-up imaging was considered as the reference standard in parallel with pathology (14–16,28,30,33,34,41,42,47), while in the other 19 studies, only pathology from FNAB, CNB, or surgery was considered as the reference standard (17,18,29,31,32,35–40,43–46,48–51), with 7 of these 19 studies considering only postsurgical histopathology as the reference standard (17,18,31,32,39,44,45).

Table 1.

Characteristics of Studies Included

Author (year of publication)	Country	Study period (month year)	No. of patients	Mean age (range), years	M:F	No. of nodules (malignancy %)	Study design	Minimum nodule size for inclusion, mm	System: ACR	System: ATA	System: K	System: EU	Reference standard: surgery	Reference standard: biopsy	Reference standard: follow-up^a
Ahmadi et al. (2019)	USA	January 2016–January 2017	213	55 (NA)	58:161	323 (27.2)	Single-center, retrospective	5	Yes	Yes	No	No	Yes	No	No
Bae et al. (2018)	South Korea	November 2015–February 2016	190	NA	NA	201 (46.3)	Single-center, prospective	10	No	No	Yes	No	Yes	Yes	Yes
Basha et al. (2019)	Egypt	May 2017–December 2018	380	45.3 (18–71)	66:314	948 (14.3)	Multicenter, prospective	10	Yes	No	No	No	No	Yes	No
Chen et al. (2019)	China	March 2016–August 2017	195	56 (19–78)	48:155	203 (21.2)	Single-center, retrospective	10	Yes	No	No	No	Yes	Yes	Yes
Chng et al. (2018)	Singapore	January 2010–June 2015	150	NA	21:129	160 (31.3)	Single-center, retrospective	10	No	Yes	No	No	Yes	No	No
Gao et al. (2019)	China	January 2015–December 2015	1758	NA	NA	2544 (66.1)	Single-center, retrospective	All	Yes	Yes	No	No	Yes	No	No
Ha et al. (2018)	South Korea	January 2010–May 2011	1802	51.2 (13–79)	415:1398	2000 (22.7)	Multicenter, retrospective	10	Yes	Yes	Yes	No	Yes	Yes^b	Yes
Ha et al. (2018)	South Korea	June 2013–May 2015	750	NA (9–81)	156:594	902 (29.5)	Multicenter, retrospective	5	Yes	Yes	Yes	No	Yes	Yes^b	Yes
Ha et al. (2017)	South Korea	January 2013–December 2014	954	50.8 (13–86)	NA	1112 (37.2)	Single-center, retrospective	5	Yes	Yes	Yes	No	Yes	Yes^b	Yes
Ha et al. (2019)	South Korea	January 2013–December 2013	3190	43.5 (14–94)	673:2517	3323 (25.8)	Single-center, retrospective	All	Yes	Yes	Yes	No	Yes	Yes^b	Yes
Jabar et al. (2019)	India	December 2017–August 2018	127	NA	17:110	127 (18.1)	Single-center, prospective	NA	Yes	No	No	No	Yes	Yes	No
Jin et al. (2019)	China	July 2017–December 2018	316	NA	95:221	332 (30.1)	Single-center, prospective	10	Yes	No	No	No	No	Yes	No
Koseoglu Atilla et al. (2018)	Turkey	2010–2014	2614	NA	351:2263	2614 (3.9)	Single-center, retrospective	5	Yes	No	No	No	Yes	Yes	No
Li et al. (2019)	China	December 2016–March 2018	128	47.8 (17–68)	27:101	130 (56.2)	Single-center, retrospective	5	Yes	No	No	No	Yes	Yes	No
Macedo et al. (2018)	Brazil	July 2014–August 2015	178	59 (49–66)	9:169	45 (22.2)	Single-center, prospective	15	No	Yes	No	No	Yes	No	No
Middleton et al. (2017)	USA	August 2006–May 2010	NA	NA	NA	3422 (10.3)	Multicenter, retrospective	NA	Yes	No	No	No	Yes	Yes	No
Pang et al. (2019)	Canada	2011–2018	152	55.6 (22–88)	34:118	189 (16.4)	Single-center, retrospective	NA	No	Yes	No	No	No	Yes	No
Rosario et al. (2018)	Brazil	NA	1106	NA	NA	1490 (14.0)	Single-center, prospective	10	Yes	No	No	No	Yes	Yes	Yes
Ruan et al. (2019)	China	May 2016–December 2017	918	45.7 (14–78)	356:562	1001 (39.2)	Single-center, retrospective	5	Yes	Yes	No	No	Yes	No	No
Shen et al. (2019)	China	January 2012–December 2017	1568	NA	412:1156	1612 (48.0)	Single-center, retrospective	5	Yes	Yes	No	Yes	Yes	No	No
Skowrońska et al. (2018)	Poland	June 2016–January 2018	52	55	8:44	140 (5.7)	Single-center, prospective	NA	No	No	No	Yes	Yes	No	No
Trimboli et al. (2019)	Switzerland, France, UK	January 2013–December 2017	495	NA	114:381	1058 (24.3)	Multicenter, retrospective	5	No	No	No	Yes	Yes	No	No
Wildman-Tobriner et al. (2019)	USA	August 2006–May 2010	1264	52.9 (18–93)	238:1026	1425 (10.6)	Single-center, retrospective	NA	Yes	No	No	No	Yes	Yes	No
Wu et al. (2019)	China	April 2016–March 2017	894	NA	NA	1000 (53.0)	Single-center, retrospective	All	Yes	Yes	No	No	Yes	Yes^b	Yes
Xu et al. (2019)	China	January 2014–October 2017	2031	47.68	415:1616	2465 (40.8)	Multicenter, retrospective	All	Yes	No	Yes	Yes	Yes	Yes	Yes
Yoon et al. (2016)	South Korea	November 2013–July 2014	1241	50.8	205:1036	1293 (18.1)	Single-center, retrospective	10	No	Yes	No	No	Yes	Yes	No
Yoon et al. (2019)	South Korea	January 2011–December 2016	1836	55.1 (9–92)	342:1494	2274 (13.2)	Single-center, retrospective	10	Yes	No	Yes	Yes	Yes	Yes	No
Zhao et al. (2019)	China	May 2014–August 2017	308	43.2 (20–66)	86:222	382 (33.2)	Single-center, retrospective	All	No	No	Yes	No	Yes	Yes	No
Zheng et al. (2018)	China	January 2015–December 2016	1013	45.3 (15–81)	308:725	1033 (29.8)	Single-center, retrospective	5	Yes	No	No	No	Yes	Yes	No

^a In all studies using follow-up as a reference standard, thyroid nodules with initial benign results of biopsy and decreased or stable size at follow-up US more than 12 months later were considered as benign.

^b FNAB and CNB were used to obtain the specimens. In the other included studies, only FNAB was used.

ACR, 2017 American College of Radiology; ATA, 2015 American Thyroid Association Management Guidelines for Adult Patients with Thyroid Nodules and Differentiated Thyroid Cancer; CNB, core needle biopsy; EU, 2017 European Thyroid Association; FNAB, fine-needle aspiration biopsy; K, 2016 Korean Thyroid Association/Korean Society of Thyroid Radiology; US, ultrasound.

Quality assessment

The results of the quality assessment based on the QUADAS-2 criteria are shown in Supplementary Figure S1. Two (17,29) of the 29 studies had a high risk, and 9 studies (18,31,32,34,35,40,45,48,51) had an unclear risk of bias in patient selection because of nonconsecutive enrollment. One study (31) had a high risk, and 12 studies had an unclear risk (28,29,33–36,42,44,48–51) of bias in the index test domain because of no or unclear blinding to the reference standard during the US examinations. One study (29) had a high risk, and 28 studies (14–18,28,30–51) had an unclear risk of bias in the reference standard domain because of no or unclear blinding to the index test during pathologic evaluation. Additionally, five studies (15,31,33,34,49) had a high risk, and one study (35) had an unclear risk of bias in the flow and timing domain because of inconsistency or unclear consistency on the reference standard for diagnosing benign nodules across the study population. Six studies (16,35,37,39,48,49) had a high concern, and three studies (18,40,42) had an unclear concern on the applicability of the index test because of single or unreported numbers of readers for the US images. One study (35) had an unclear concern on the applicability of the reference standard because of no information on how the tissue specimens were examined. There were no concerns on the applicability of patient selection.

Diagnostic performance of different US risk stratification systems

The pooled diagnostic performances of each risk stratification system for diagnosing malignant nodules are summarized in Table 2, Supplementary Figure S2 (category 5 as positive), and Supplementary Figure S3 (category 4 or 5 as positive). For ACR-TIRADS, the pooled sensitivity and specificity were, respectively, 66% [CI 56–75%] and 91% [CI 87–94%] for category 5 and 95% [CI 92–97%] and 55% [CI 45–64%] for category 4 or 5. For ATA, the pooled sensitivity and specificity were, respectively, 74% [CI 62–84%] and 88% [CI 82–93%] for category 5 and 91% [CI 84–95%] and 64% [CI 54–74%] for category 4 or 5. For K-TIRADS, the pooled sensitivity and specificity were, respectively, 55% [CI 38–70%] and 95% [CI 90–98%] for category 5 and 89% [CI 83–93%] and 64% [CI 60–69%] for category 4 or 5. For EU-TIRADS, the pooled sensitivity and specificity were, respectively, 82% [CI 71–89%] and 90% [CI 77–96%] for category 5 and 96% [CI 92–98%] and 52% [CI 37–66%] for category 4 or 5. When considering category 3, 4, or 5 as positive test results, the pooled sensitivity reached almost 100% and the pooled specificity decreased to 3–23% for each system (Supplementary Table S1). HSROC curves are presented in Supplementary Figure S4 (category 5 as positive) and Supplementary Figure S5 (category 4 or 5 as positive).

Table 2.

Pooled Sensitivity and Specificity for Malignant Thyroid Nodules for Each Risk Stratification System

Guideline	Sensitivity, % [CI], category 5 as positive	Specificity, % [CI], category 5 as positive	Area under the HSROC curve [CI], category 5 as positive	Sensitivity, % [CI], category 4 or 5 as positive	Specificity, % [CI], category 4 or 5 as positive	Area under the HSROC curve [CI], category 4 or 5 as positive
ACR	66 [56–75]	91 [87–94]	0.89 [0.86–0.92]	95 [92–97]	55 [45–64]	0.88 [0.84–0.90]
ATA	74 [62–84]	88 [82–93]	0.90 [0.87–0.92]	91 [84–95]	64 [54–74]	0.85 [0.82–0.88]
K	55 [38–70]	95 [90–98]	0.88 [0.85–0.91]	89 [83–93]	64 [60–69]	0.78 [0.74–0.82]
EU	82 [71–89]	90 [77–96]	0.91 [0.89–0.93]	96 [92–98]	52 [37–66]	0.90 [0.87–0.92]

Significant heterogeneity (I² > 50%) was noted for all meta-analytic calculations described in this table.

Neither indirect (comparison using pooled estimates from separate studies) nor direct comparisons (meta-analytic pooling of head-to-head comparison studies) identified any statistical differences in pooled diagnostic performance between the four guidelines.

CI, 95% confidence interval; HSROC, hierarchical summary receiver operating characteristics.

Deeks' funnel plot and asymmetry test did not show a significant probability of publication bias, except for the diagnostic performance of K-TIRADS for category 5 (p < 0.01). Indirect comparisons did not identify any statistical differences in the pooled diagnostic performance between any of the guidelines. Direct comparisons between ACR-TIRADS and ATA were available in nine studies for category 5 (14,15,17,18,32–34,43,47) and eight studies for category 4 or 5 (14,15,17,18,32–34,43,47), but these comparisons did not identify any statistical differences between the guidelines. Additionally, we investigated unnecessary biopsy rates across the systems. Among the included studies, unnecessary biopsy rates were available in eight studies for ACR-TIRADS (14–16,33,35,43,47,49), five for ATA (14,15,33,43,47), five for K-TIRADS (14–16,33,49), and two for EU-TIRADS (16,49). The reported unnecessary biopsy rates ranged from 17% to 40% (median, 25.5%) for ACR-TIRADS, from 35% to 61% (median, 52%) for ATA, from 32% to 66% (median, 59%) for K-TIRADS, and from 25% to 53% (median, 39%) for EU-TIRADS.
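The unnecessary biopsy rate, as defined in the Methods (benign nodules among FNAB-requiring nodules), can be sketched in a few lines of Python. The function name and the per-study counts below are hypothetical and serve only to show how a range and median across studies would be summarized.

```python
from statistics import median


def unnecessary_biopsy_rate(benign_among_indicated, total_indicated):
    """Proportion of benign nodules among nodules meeting the FNAB criteria."""
    if total_indicated <= 0:
        raise ValueError("total_indicated must be positive")
    return benign_among_indicated / total_indicated


# Hypothetical (benign, total FNAB-indicated) counts for three studies.
studies = [(30, 120), (45, 150), (20, 100)]
rates = [unnecessary_biopsy_rate(b, t) for b, t in studies]

print(f"range: {min(rates):.0%} to {max(rates):.0%}, median: {median(rates):.0%}")
```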

Subgroup analysis

A subgroup analysis was performed on nodules of 1 cm or larger, and the results are summarized in Table 3. For ACR-TIRADS, the pooled sensitivity and specificity were, respectively, 66% [CI 52–77%] and 93% [CI 88–96%] for category 5 and 95% [CI 88–98%] and 60% [CI 41–77%] for category 4 or 5. For ATA, the pooled sensitivity and specificity were, respectively, 76% [CI 52–90%] and 89% [CI 74–95%] for category 5 and 87% [CI 76–93%] and 64% [CI 48–77%] for category 4 or 5. For K-TIRADS, the pooled sensitivity and specificity were, respectively, 41% [CI 24–60%] and 98% [CI 94–99%] for category 5 and 84% [CI 80–88%] and 72% [CI 67–76%] for category 4 or 5.

Table 3.

Pooled Sensitivity and Specificity for Malignant Thyroid Nodules ≥1 cm for Each Guideline

Guideline	Sensitivity, % [CI], category 5 as positive	Specificity, % [CI], category 5 as positive	Area under the HSROC curve [CI], category 5 as positive	Sensitivity, % [CI], category 4 or 5 as positive	Specificity, % [CI], category 4 or 5 as positive	Area under the HSROC curve [CI], category 4 or 5 as positive
ACR	66 [52–77]	93 [88–96]^a	0.90 [0.87–0.92]	95 [88–98]^b	60 [41–77]	0.91 [0.88–0.93]
ATA	76 [52–90]	89 [74–95]	0.90 [0.87–0.93]	87 [76–93]	64 [48–77]	0.84 [0.80–0.87]
K	41 [24–60]	98 [94–99]^a	0.87 [0.83–0.89]	84 [80–88]^b	72 [67–76]	0.85 [0.81–0.88]

A meta-analysis for EU-TIRADS was not possible because only two studies reported sensitivity and specificity using EU-TIRADS.

Significant heterogeneity (I² > 50%) was noted for all meta-analytic calculations described in this table.

^a When ACR-TIRADS and K-TIRADS were indirectly compared with category 5 as a positive test result, the specificity of K-TIRADS was higher with borderline significance (98% [CI 94–99%] vs. 93% [CI 88–96%]; p = 0.05).

^b When ACR-TIRADS and K-TIRADS were indirectly compared with category 4 or 5 as a positive test result, the sensitivity of ACR was higher with borderline significance (95% [CI 88–98%] vs. 84% [CI 80–88%]; p = 0.05). Otherwise, indirect comparisons did not identify any statistical differences in pooled diagnostic performance between the four guidelines. A meta-analysis on the direct comparisons was not possible.

TIRADS, Thyroid Imaging Reporting and Data System.

When K-TIRADS and ACR-TIRADS were indirectly compared (comparison between the pooled estimates from separate studies), the specificity of K-TIRADS for category 5 was higher than that of ACR-TIRADS, with a statistical trend (98% [CI 94–99%] vs. 93% [CI 88–96%]; p = 0.05). Conversely, the sensitivity of K-TIRADS for category 4 or 5 was lower than that of ACR-TIRADS, again with a statistical trend (84% [CI 80–88%] vs. 95% [CI 88–98%]; p = 0.05). Otherwise, the indirect comparisons did not identify any statistically significant differences in the pooled diagnostic performance between the four systems. It was not possible to perform a meta-analysis on the direct comparisons.

Meta-regression

The results of the meta-regression analyses are summarized in Supplementary Tables S2, S3, S4, and S5. For ACR-TIRADS with both category 5 and category 4 or 5 as positive and ATA with category 5 as positive, the study location (East Asia vs. other countries; p = 0.01 for ACR-TIRADS for both category 5 and category 4 or 5; p = 0.04 for ATA category 5), the proportion of female patients (cutoff set to 79%; p < 0.01 for ACR-TIRADS for both category 5 and category 4 or 5; p = 0.01 for ATA category 5), and the proportion of malignant nodules (cutoff set to 28%; p < 0.01 for ACR-TIRADS for both category 5 and category 4 or 5; p < 0.01 for ATA category 5) were significantly associated with study heterogeneity. For ATA with category 4 or 5 as positive, only the proportion of female patients was associated with study heterogeneity (p < 0.01). For K-TIRADS (both category 5 and category 4 or 5 as positive), the study design (prospective vs. retrospective; p < 0.01 for category 5; p = 0.02 for category 4 or 5) and the proportion of female patients (p < 0.01 for both category 5 and category 4 or 5) were associated with study heterogeneity. For EU-TIRADS (for both category 5 and category 4 or 5 as positive), the study location (p < 0.01 for both category 5 and category 4 or 5) and the proportion of malignant nodules (p = 0.04 for category 5; p < 0.01 for category 4 or 5) were significantly associated with study heterogeneity.

Discussion

The present meta-analysis investigated the diagnostic performance of four risk stratification systems using 29 studies including 33,748 thyroid nodules. Diagnostic performance was evaluated using two thresholds for a positive test result: category 5, and category 4 or 5. The current meta-analysis demonstrated that diagnostic performance was comparable between the four risk stratification systems at both thresholds. In the subgroup analysis of nodules of 1 cm or larger, ACR-TIRADS showed lower specificity for category 5 (93% vs. 98%; p = 0.05), but higher sensitivity for category 4 or 5 (95% vs. 84%; p = 0.05) compared with K-TIRADS. In the meta-regression, the study location (East Asia vs. other countries), the proportion of female patients, and the proportion of malignant nodules were common sources of study heterogeneity. To the best of our knowledge, this is the first systematic review and meta-analysis to include the four representative US-based risk stratification systems (ACR-TIRADS, ATA, K-TIRADS, and EU-TIRADS) and perform subgroup and meta-regression analyses to evaluate the diagnostic performance of each system in varying clinical settings. We believe that our results can not only guide clinical practice and future research but also provide information that is important when developing multinational guidelines.

This meta-analysis showed that the category-based diagnostic accuracies of the four guidelines were closely comparable. In addition, the sensitivity and specificity of each system varied depending on which category was set as a positive result (category 5 vs. 4 or 5) and on the population in which the examinations were performed. Considering these results, US practitioners may flexibly adapt each risk stratification system to the clinical setting in which they practice, taking into account the proportion of malignant thyroid nodules, the proportion of female patients, and geographic characteristics. Specifically, if the clinical setting is prone to increased sensitivity but decreased specificity (e.g., a lower proportion of female patients and a higher proportion of malignant nodules when using ACR-TIRADS), clinicians can select a noninvasive strategy such as active surveillance for nodules of a similar category, and can select FNAB rather than active surveillance in the opposite situation. For the proportion of malignant nodules, although in theory sensitivity and specificity should not change according to disease prevalence, in real settings, they often do vary with disease prevalence (52). For a future international TIRADS, a balanced worldwide collection of data with high and low cancer proportions will be necessary to create a clinically applicable system for both primary care and referral center settings.

Recently, the emphasis has shifted toward the unnecessary biopsy rate for thyroid nodules, rather than diagnostic performance alone. As it is well known that most thyroid cancers have a less aggressive natural history, it is important to minimize not only false-negative rates resulting in delayed diagnosis but also unnecessary biopsy rates resulting in increased health care burden, patient anxiety, and unnecessary interventions. In this context, the concept of “active surveillance” for low-risk thyroid cancer has been increasingly recognized in the medical field (53), and one recent meta-analysis reported the pooled proportion of tumor growth (increase in maximum diameter by ≥3 mm) to be only 4.4% in low-risk papillary thyroid cancer (T1a/b, N0, M0) (54). Therefore, US practitioners, especially those working in primary care settings, need to understand this concept well. The reported unnecessary biopsy rates tended to be lower in ACR-TIRADS, but these rates should be interpreted with caution since the inclusion criteria for minimum nodule size varied across the studies. In addition, national or institutional biopsy policies might act as confounders.

This meta-analysis has several limitations. First, the majority of the included studies (75.9%; 22 of 29) were retrospective, implying a risk of categorization error arising from insufficient and unstandardized image acquisition during the examination. Second, all the included studies were performed at referral centers, which limits the applicability of our results to the primary care setting; further studies conducted in primary care settings are required. Third, although category-based comparisons of diagnostic performance are intuitive to interpret, they are inherently limited because the malignancy risks assigned to the categories vary across guidelines. Fourth, substantial between-study heterogeneity was consistently observed in the meta-analytic calculations; we performed subgroup and meta-regression analyses, but this limitation remained unresolved because individual patient data were not available and the number of covariates and parameters to be estimated outweighed the sample size. Finally, a meta-analysis accounting for nodule size, which is an important factor when triaging nodules for biopsy, was not possible because most of the included studies emphasized only US features and not nodule size. This likely accounts for the differences between the pooled data presented here and the results of well-conducted studies in which nodule size was considered. Further investigation is necessary to address this issue.

In conclusion, the overall diagnostic performance of the four risk stratification systems of the representative society guidelines was comparable.

Footnotes

Author Disclosure Statement

No competing financial interests exist.

Funding Information

No funding was received for this article.

Supplementary Material

Supplementary Figure S1

Supplementary Figure S2

Supplementary Figure S3

Supplementary Figure S4

Supplementary Figure S5

Supplementary Table S1

Supplementary Table S2

Supplementary Table S3

Supplementary Table S4

Supplementary Table S5

References

1. , Lim, Yoon, Baek, Do, Choi, Lee, Na DG; Korean Society of Radiology and National Evidence-Based Healthcare Collaborating Agency. 2018. Primary imaging test and appropriate biopsy methods for thyroid nodules: guidelines by Korean Society of Radiology and National Evidence-Based Healthcare Collaborating Agency. Korean J Radiol, 19:623–631.

2. Moon, Baek, Jung, Kim, Kwak, Lee, Na, Park, Park SW; Korean Society of Thyroid Radiology (KSThR); Korean Society of Radiology. 2011. Ultrasonography and the ultrasound-based management of thyroid nodules: consensus statement and recommendations. Korean J Radiol, 12:1–14.

3. Gharib, Papini, Paschke, Duick, Valcavi, Hegedus, Vitti P; AACE/AME/ETA Task Force on Thyroid Nodules. 2010. American Association of Clinical Endocrinologists, Associazione Medici Endocrinologi, and European Thyroid Association Medical guidelines for clinical practice for the diagnosis and management of thyroid nodules. Endocr Pract, 16(Suppl 1):1–43.

4. Perros, Boelaert, Colley, Evans, Gerrard Ba, Gilbert, Harrison, Johnson, Giles, Moss, Lewington, Newbold, Taylor, Thakker, Watkinson, Williams; British Thyroid Association. 2014. Guidelines for the management of thyroid cancer. Clin Endocrinol (Oxf), 81(Suppl 1):1–122.

5. Vaccarella, Dal Maso, Laversanne, Bray, Plummer, Franceschi. 2015. The impact of diagnostic changes on the rise in thyroid cancer incidence: a population-based study in selected high-resource countries. Thyroid, 25:1127–1136.

6. Trimboli, Giovanella. 2018. Reliability of core needle biopsy as a second-line procedure in thyroid nodules with an indeterminate fine-needle aspiration report: a systematic review and meta-analysis. Ultrasonography, 37:121–128.

7. Tessler, Middleton, Grant, Hoang, Berland, Teefey, Cronan, Beland, Desser, Frates, Hammers, Hamper, Langer, Reading, Scoutt, Stavros. 2017. ACR Thyroid Imaging, Reporting and Data System (TI-RADS): white paper of the ACR TI-RADS Committee. J Am Coll Radiol, 14:587–595.

8. Haugen, Alexander, Bible, Doherty, Mandel, Nikiforov, Pacini, Randolph, Sawka, Schlumberger, Schuff, Sherman, Sosa, Steward, Tuttle, Wartofsky. 2016. 2015 American Thyroid Association Management Guidelines for Adult Patients with thyroid nodules and differentiated thyroid cancer: the American Thyroid Association Guidelines Task Force on thyroid nodules and differentiated thyroid cancer. Thyroid, 26:1–133.

9. Shin, Baek, Chung, Ha, Kim, Lee, Lim, Moon, Na, Park, Choi, Hahn, Jeon, Jung, Kim, Kwak, Lee, Park, Sung JY; Korean Society of Thyroid Radiology (KSThR) and Korean Society of Radiology. 2016. Ultrasonography diagnosis and imaging-based management of thyroid nodules: revised Korean Society of Thyroid Radiology Consensus Statement and Recommendations. Korean J Radiol, 17:370–395.

10. Russ, Bonnema, Erdogan, Durante, Ngu, Leenhardt. 2017. European Thyroid Association guidelines for ultrasound malignancy risk stratification of thyroid nodules in adults: the EU-TIRADS. Eur Thyroid J, 6:225–237.

11. Garberoglio, Aliberti, Appetecchia, Attard, Boccuzzi, Boraso, Borretta, Caruso, Deandrea, Freddi, Gallone, Gandini, Gasparri, Gazzera, Ghigo, Grosso, Limone, Maccario, Mansi, Mormile, Nasi, Orlandi, Pacchioni, Pacella, Palestini, Papini, Pelizzo, Piotto, Rago, Riganti, Rosato, Rossetto, Scarmozzino, Spiezia, Testori, Valcavi, Veltri, Vitti, Zingrillo. 2015. Radiofrequency ablation for thyroid nodules: which indications? The first Italian opinion statement. J Ultrasound, 18:423–430.

12. Kim, Baek, Lim, Ahn, Baek, Choi, Chung, Ha, Hahn, Jung, Kim, Lee, Park, Shin, Suh, Sung, Sim, Youn, Choi, Na DG; Guideline Committee for the Korean Society of Thyroid Radiology (KSThR) and Korean Society of Radiology. 2018. 2017 Thyroid radiofrequency ablation guideline: Korean Society of Thyroid Radiology. Korean J Radiol, 19:632–655.

13. Mauri, Pacella, Papini, Solbiati, Goldberg, Ahmed, Sconfienza. 2019. Image-guided thyroid ablation: proposal for standardization of terminology and reporting criteria. Thyroid, 29:611–618.

14. , Baek, Na, Suh, Chung, Choi, Lee. 2019. Diagnostic performance of practice guidelines for thyroid nodules: thyroid nodule size versus biopsy rates. Radiology, 291:92–99.

15. , Na, Moon, Lee, Choi. 2018. Diagnostic performance of ultrasound-based risk-stratification systems for thyroid nodules: comparison of the 2015 American Thyroid Association guidelines with the 2016 Korean Thyroid Association/Korean Society of Thyroid Radiology and 2017 American Congress of Radiology guidelines. Thyroid, 28:1532–1537.

16. , Wu, Zhang, Gu, Ye, Tang, Xu, Liu, Wu. 2019. Validation and comparison of three newly-released Thyroid Imaging Reporting and Data Systems for cancer risk determination. Endocrine, 64:299–307.

17. Shen, Liu, He, Wu, Chen, Wan, Gao, Cai, Ding, Fu. 2019. Comparison of different risk-stratification systems for the diagnosis of benign and malignant thyroid nodules. Front Oncol, 9:378.

18. Ahmadi, Oyekunle, Sara, Scheri, Perkins, Stang, Roman, Sosa. 2019. A direct comparison of the ATA and TI-RADS ultrasound scoring systems. Endocr Pract, 25:413–422.

19. Castellana, Castellana, Treglia, Giorgino, Giovanella, Russ, Trimboli. 2020. Performance of five ultrasound risk stratification systems in selecting thyroid nodules for FNA. A meta-analysis. J Clin Endocrinol Metab, 105:dgz170.

20. Liberati, Altman, Tetzlaff, Mulrow, Gotzsche, Ioannidis, Clarke, Devereaux, Kleijnen, Moher. 2009. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration. Ann Intern Med, 151:W65–W94.

21. Cibas, Ali. 2017. The 2017 Bethesda system for reporting thyroid cytopathology. Thyroid, 27:1341–1346.

22. Whiting, Rutjes, Westwood, Mallett, Deeks, Reitsma, Leeflang, Sterne, Bossuyt; QUADAS-2 Group. 2011. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med, 155:529–536.

23. Kim, Lee, Choi, Huh, Park. 2015. Systematic review and meta-analysis of studies evaluating diagnostic test accuracy: a practical review for clinical researchers-Part I. General guidance and tips. Korean J Radiol, 16:1175–1187.

24. Lee, Kim, Choi, Huh, Park. 2015. Systematic review and meta-analysis of studies evaluating diagnostic test accuracy: a practical review for clinical researchers-Part II. Statistical methods of meta-analysis. Korean J Radiol, 16:1188–1196.

25. Reitsma, Glas, Rutjes, Scholten, Bossuyt, Zwinderman. 2005. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol, 58:982–990.

26. Rutter, Gatsonis. 2001. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med, 20:2865–2884.

27. Suh, Park. 2016. Successful publication of systematic review and meta-analysis of studies evaluating diagnostic test accuracy. Korean J Radiol, 17:5–6.

28. Bae, Hahn, Shin, Ko. 2018. Inter-exam agreement and diagnostic performance of the Korean thyroid imaging reporting and data system for thyroid nodule assessment: real-time versus static ultrasonography. Eur J Radiol, 98:14–19.

29. Basha MAA, Alnaggar, Refaat, El-Maghraby, Refaat, Abd Elhamed, Abdalla AAEHM, Aly, Hanafy, Mohamed AEM, Afifi AHM, Harb. 2019. The validity and reproducibility of the Thyroid Imaging Reporting and Data System (TI-RADS) in categorization of thyroid nodules: multicentre prospective study. Eur J Radiol, 117:184–192.

30. Chen, Zhan, Diao, Liu, Shi, Chen, Zhan. 2019. Additional value of superb microvascular imaging for Thyroid Nodule Classification with the Thyroid Imaging Reporting and Data System. Ultrasound Med Biol, 45:2040–2048.

31. Chng, Tan, Too, Lim, Chiam PPS, Zhu, Nadkarni, Lim AYY. 2018. Diagnostic performance of ATA, BTA and TIRADS sonographic patterns in the prediction of malignancy in histologically proven thyroid nodules. Singapore Med J, 59:578–583.

32. Gao, Xi, Jiang, Yang, Wang, Zhu, Lai, Zhang, Zhao, Zhang. 2019. Comparison among TIRADS (ACR TI-RADS and KWAK-TI-RADS) and 2015 ATA Guidelines in the diagnostic efficiency of thyroid nodules. Endocrine, 64:90–96.

33. , Na, Baek, Sung, Kim, Kang. 2018. US fine-needle aspiration biopsy for thyroid malignancy: diagnostic performance of seven society guidelines applied to 2000 Thyroid Nodules. Radiology, 287:893–900.

34. , Ahn, Baek, Ahn, Chung, Cho, Park. 2017. Validation of three scoring risk-stratification models for thyroid nodules. Thyroid, 27:1550–1557.

35. Jabar ASS, Koteshwara, Andrade. 2019. Diagnostic reliability of the thyroid imaging reporting and data system (TI-RADS) in routine practice. Polish J Radiol, 84:274–280.

36. Jin, Yu, Mo, Su. 2019. Clinical study of the prediction of malignancy in thyroid nodules: modified score versus 2017 American College of Radiology's Thyroid Imaging Reporting and Data System Ultrasound Lexicon. Ultrasound Med Biol, 45:1627–1637.

37. Koseoglu Atilla, Ozgen Saydam, Erarslan, Diniz Unlu, Yilmaz Yasar, Ozer, Akinci. 2018. Does the ACR TI-RADS scoring allow us to safely avoid unnecessary thyroid biopsy? Single center analysis in a large cohort. Endocrine, 61:398–402.

38. , Hou, Du, Wu, Wang, Zhou. 2019. Virtual Touch Tissue Imaging and Quantification (VTIQ) combined with the American College of Radiology Thyroid Imaging Reporting and Data System (ACR TI-RADS) for malignancy risk stratification of thyroid nodules. Clin Hemorheol Microcirc, 72:279–291.

39. Macedo, Izquierdo, Golbert, Meyer ELS. 2018. Reliability of Thyroid Imaging Reporting and Data System (TI-RADS), and ultrasonographic classification of the American Thyroid Association (ATA) in differentiating benign from malignant thyroid nodules. Arch Endocrinol Metab, 62:131–138.

40. Middleton, Teefey, Reading, Langer, Beland, Szabunio, Desser. 2017. Multiinstitutional analysis of thyroid nodule risk stratification using the American College of radiology thyroid imaging reporting and data system. Am J Roentgenol, 208:1331–1341.

41. Pang, Margolis, Menezes, Maan, Ghai. 2019. Diagnostic performance of 2015 American Thyroid Association guidelines and inter-observer variability in assigning risk category. Eur J Radiol Open, 6:122–127.

42. Rosario, Da Silva, Nunes, Borges MAR. 2018. Risk of malignancy in thyroid nodules using the American College of Radiology Thyroid Imaging Reporting and Data System in the NIFTP era. Horm Metab Res, 50:735–737.

43. Ruan, Yang, Liu, Liang, Han, Xu, Luo. 2019. Fine needle aspiration biopsy indications for thyroid nodules: compare a point-based risk stratification system with a pattern-based risk stratification system. Eur Radiol, 29:4871–4878.

44. Skowrońska, Milczarek-Banach, Wiechno, Chudziński, Żach, Mazurkiewicz, Miśkiewicz, Bednarczuk. 2018. Accuracy of the European Thyroid Imaging Reporting and Data System (EU-TIRADS) in the valuation of thyroid nodule malignancy in reference to the post-surgery histological results. Polish J Radiol, 83:e577–e584.

45. Trimboli, Ngu, Royer, Giovanella, Bigorgne, Simo, Carroll, Russ. 2019. A multicentre validation study for the EU-TIRADS using histological diagnosis as a gold standard. Clin Endocrinol, 91:340–347.

46. Wildman-Tobriner, Buda, Hoang, Middleton, Thayer, Short, Tessler, Mazurowski. 2019. Using artificial intelligence to revise ACR TI-RADS risk stratification of thyroid nodules: diagnostic accuracy and utility. Radiology, 292:112–119.

47. , Du, Wang, Jin, Sui, Yang, Lin, Luo, Fu, Li, Teng. 2019. Comparison and preliminary discussion of the reasons for the differences in diagnostic performance and unnecessary FNA biopsies between the ACR TIRADS and 2015 ATA guidelines. Endocrine, 65:121–131.

48. Yoon, Lee, Kim, Moon, Kwak. 2016. Malignancy risk stratification of thyroid nodules: comparison between the Thyroid Imaging Reporting and Data System and the 2014 American Thyroid Association management guidelines. Radiology, 278:917–924.

49. Yoon, Na, Gwon, Paik, Kim, Song, Shim. 2019. Similarities and differences between thyroid imaging reporting and data systems. Am J Roentgenol, 213:W76–W84.

50. Zhao, Liu, Lei, Cheng, Li, Wu, Ma. 2019. Impact of thyroid nodule sizes on the diagnostic performance of Korean thyroid imaging reporting and data system and contrast-enhanced ultrasound. Clin Hemorheol Microcirc, 72:317–326.

51. Zheng, Xu, Kang, Zhan. 2018. A single-center retrospective validation study of the American College of Radiology Thyroid Imaging Reporting and Data System. Ultrasound Q, 34:77–83.

52. Leeflang, Rutjes, Reitsma, Hooft, Bossuyt. 2013. Variation of a test's sensitivity and specificity with disease prevalence. CMAJ, 185:E537–E544.

53. Zanocco, Hershman, Leung. 2019. Active surveillance of low-risk thyroid cancer. JAMA, 321:2020–2021.

54. Saravana-Bawan, Bajwa, Paterson, McMullen. 2020. Active surveillance of low-risk papillary thyroid cancer: a meta-analysis. Surgery, 167:46–55.