Abstract
Background:
To minimize potential harm from overuse of fine-needle aspiration, Thyroid Imaging Reporting and Data Systems (TIRADSs) were developed for thyroid nodule risk stratification. The purpose of this study was to validate three scoring risk-stratification models for thyroid nodules based on ultrasonography features: a web-based malignancy risk-stratification system, the model developed by the Korean Society of Thyroid Radiology, and the model developed by the American College of Radiology.
Methods:
Using ultrasonography images, radiologists assessed thyroid nodules according to the following criteria: internal content, echogenicity of the solid portion, shape, margin, and calcifications. A total of 954 patients (M age = 50.8 years; range 13–86 years) with 1112 nodules were evaluated at the authors' institute from January 2013 to December 2014. The discrimination ability of the three models was assessed by estimating the area under the receiver operating characteristic curve. Additionally, Hosmer–Lemeshow goodness-of-fit statistics (calibration ability) were used to evaluate the agreement between the observed and expected number of nodules that were benign or malignant.
Results:
Thyroid malignancy was present in 37.2% (414/1112) of nodules. According to the 14-point web-based scoring risk-stratification system, malignancy risk ranged from 4.5% to 100.0% and was positively associated with an increase in risk scores. The areas under the receiver operating characteristic curve of the validation set were 0.884 in the web-based model, 0.891 in the Korean Society of Thyroid Radiology model, and 0.875 in the American College of Radiology model. The Hosmer–Lemeshow goodness-of-fit test indicated that the web-based scoring system showed the best-calibrated result, with a p-value of 0.078.
Conclusion:
The three scoring risk-stratification models using the ultrasonography features of thyroid nodules to stratify malignancy risk showed acceptable predictive accuracy and similar areas under the curve. The web-based scoring system demonstrated the strongest agreement in calibration ability analysis. The easily accessible automated web-based scoring risk-stratification system may overcome the complexity of the various Thyroid Imaging Reporting and Data System guidelines and provide simplified guidance on personalized and optimal management in real practice.
Introduction
A
With the development of various guidelines and their widespread use, the role of US in the personalized management of patients with thyroid nodules has been further emphasized, which necessitates the validation of these scoring risk-stratification models. The purpose of this study was to validate the web-based prediction system and the scoring risk-stratification models developed by the KSThR and the ACR as risk-stratification tools for thyroid nodules.
Patients and Methods
This retrospective study was approved by the Institutional Review Board of the Chung-Ang University Hospital; the requirement for informed consent was waived for data evaluation. Written informed consent for routine thyroid US and US-guided procedures was obtained from all patients before each US examination. The patient cohort was validated using the US characteristics of thyroid nodules to stratify the risk of malignancy using the scoring system developed by Choi et al. (11), an online automatically calculated scoring system (
Study population
The patient cohort was retrospectively collected from patients assessed from January 2013 to December 2014. Only patients who met at least one of the following criteria were included: (i) nodule size >5 mm; (ii) patients who underwent surgery or core-needle biopsy (CNB) after thyroid US; (iii) patients who underwent US-guided FNA cytology for benign thyroid lesions at least twice within a one-year interval (Bethesda category 2); and (iv) patients who underwent initial US-guided FNA cytology and US follow-up (>12 months after US-guided FNA cytology) for benign thyroid lesions. Nodules with indeterminate or non-diagnostic results were excluded from the study unless they were followed by diagnostic FNA or surgery. Finally, 954 patients (M age = 50.8 years; range 13–86 years) with 1112 nodules were included.
US findings for potential diagnostic determinants
US images for the evaluation of thyroid nodules were obtained using an iU22 ultrasound system (Philips Healthcare, Bothell, WA) equipped with a 50 mm linear array transducer with a bandwidth of 7–12 MHz. The scanning protocol in all cases included both transverse and longitudinal real-time imaging of thyroid nodules. Two thyroid radiologists, with 8 and 10 years of clinical experience in performing and evaluating thyroid US, reviewed all of the US images and reached a consensus.
When analyzing the US images, the radiologists were asked to assess thyroid nodules using criteria obtained from the literature (13–16), including internal content, echogenicity of the solid portion, shape, margin, and calcifications. Although the presence of intranodular vascularity might increase the risk of malignancy, there are no consistent results regarding the association of an intranodular vascularity pattern with the risk of malignancy (17–19). Accordingly, the three scoring systems did not recommend its inclusion, on the basis of inconsistent literature about its value in differentiating malignant from benign nodules. Therefore, vascularity was excluded as a criterion. The internal content of a nodule was categorized according to the ratio of the cystic to the solid portion within a nodule, that is, solid (≤10% cystic), predominantly solid (>10% cystic and ≤50% cystic), predominantly cystic (>50% cystic), and spongiform appearance. A spongiform appearance was defined as the aggregation of multiple microcystic components consisting of >50% of the total nodule volume (15), and the solid component of spongiform nodules was not assessed. The echogenicity of the solid portion was classified as hyper- or isoechogenic, hypoechogenic, or marked hypoechogenic. When the echogenicity of the nodule was similar to that of the surrounding thyroid parenchyma, it was classified as isoechogenic. Hypoechogenicity was defined as decreased echogenicity compared to the thyroid parenchyma. Marked hypoechogenicity was defined as decreased echogenicity compared to that of the strap muscles (14).
The nodule shape was categorized as follows: ovoid to round (when the anteroposterior diameter of the nodule was equal to or less than its transverse diameter on a transverse or longitudinal plane), taller-than-wide (when the anteroposterior diameter of a nodule was longer than its transverse diameter on a transverse or longitudinal plane), or irregular (when a nodule was neither ovoid to round nor taller-than-wide). Margins were classified as well-defined smooth, microlobulated or spiculated, or ill-defined (15). Calcifications were categorized as microcalcifications, macrocalcifications, rim calcifications, or none. Microcalcifications were defined as calcifications ≤1 mm in diameter and were visualized as tiny, punctate, hyperechoic foci with or without acoustic shadows. If tiny, bright reflectors with a clear-cut, comet-tail artifact were observed with conventional US, these were considered to be colloid. Macrocalcifications were defined as hyperechoic foci >1 mm, and rim calcifications were defined as nodules with peripheral curvilinear or eggshell calcifications (13,20). When nodules had both types of calcifications, that is, macrocalcifications including rim calcifications intermingled with microcalcifications, the nodule was considered to have microcalcifications.
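The categorization rules above can be sketched in code. This is a minimal illustration only, not the authors' software: the function names and arguments are hypothetical, the `spongiform` and `irregular` flags stand in for the reader's visual judgment, and the precedence of rim calcifications over other macrocalcifications is an assumption (the text states only that microcalcifications take precedence in mixed cases).

```python
def classify_internal_content(cystic_fraction, spongiform=False):
    """Map the cystic share of a nodule (0.0 to 1.0) to a content category."""
    if not 0.0 <= cystic_fraction <= 1.0:
        raise ValueError("cystic_fraction must be between 0 and 1")
    if spongiform:  # judged visually by the reader, not derived from the ratio
        return "spongiform"
    if cystic_fraction <= 0.10:
        return "solid"
    if cystic_fraction <= 0.50:
        return "predominantly solid"
    return "predominantly cystic"


def classify_shape(ap_mm, transverse_mm, irregular=False):
    """Shape from anteroposterior vs. transverse diameter on one plane."""
    if irregular:  # assumption: the reader's 'irregular' call overrides the ratio
        return "irregular"
    if ap_mm > transverse_mm:
        return "taller-than-wide"
    return "ovoid to round"


def classify_calcification(focus_sizes_mm, rim=False):
    """Calcification category; comet-tail (colloid) foci must be excluded by the caller."""
    if any(size <= 1.0 for size in focus_sizes_mm):
        return "microcalcifications"  # mixed cases count as microcalcifications
    if rim:  # assumption: rim is reported ahead of other macrocalcifications
        return "rim calcifications"
    if any(size > 1.0 for size in focus_sizes_mm):
        return "macrocalcifications"
    return "none"
```

For example, `classify_internal_content(0.30)` gives `"predominantly solid"`, and a nodule with foci of 0.5 mm and 3 mm is reported as having microcalcifications.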
The US feature definitions differed among the three scoring risk-stratification models. Therefore, during image analysis, each US feature was categorized according to the definitions of each scoring model; for example, an irregular margin in the ACR TIRADS was regarded as a spiculated margin in the web-based scoring system. Additionally, extrathyroidal extension was evaluated according to the definitions in the ACR TIRADS (12). An online automatically calculated scoring system (
Reference standard
For each thyroid nodule, the final diagnosis was determined by either histopathology or radiological follow-up. For malignant nodules, the pathological diagnosis was confirmed by surgery or CNB. For benign nodules, the pathological diagnosis was confirmed by surgery or CNB, FNA repeated at least twice with benign results, or a benign result on FNA and no change or reduced size on follow-up US (>12 months).
Data and statistical analysis
Multivariate logistic regression analysis was performed to estimate the risk of thyroid cancer associated with US findings. A risk score was assigned to each US feature of a thyroid nodule, and the total risk score was calculated according to the web-based model developed by Choi et al. (11) and the models developed by the KSThR (8) and the ACR (12) (Table 1). Validation of the models was performed separately by measuring the discrimination and calibration abilities. First, the discrimination ability of the models was assessed by estimating the area under the receiver operating characteristic (ROC) curve. Second, Hosmer–Lemeshow goodness-of-fit statistics (calibration ability) were used to evaluate the agreement between the observed and expected number of nodules that were benign or malignant. p-Values of <0.05 were considered statistically significant. All statistical analyses were carried out using IBM SPSS Statistics for Windows v23.0 (IBM Corp., Armonk, NY).
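As an illustration of the discrimination measure, the area under the ROC curve for an integer risk score equals the probability that a randomly chosen malignant nodule has a higher score than a randomly chosen benign one, with ties counted as one half. The sketch below is not the SPSS procedure used in the study; the function name and the toy data are hypothetical.

```python
def auc_from_scores(scores, labels):
    """AUC via pairwise comparison of malignant (1) and benign (0) scores."""
    malignant = [s for s, y in zip(scores, labels) if y == 1]
    benign = [s for s, y in zip(scores, labels) if y == 0]
    if not malignant or not benign:
        raise ValueError("both malignant and benign nodules are required")
    wins = 0.0
    for m in malignant:
        for b in benign:
            if m > b:
                wins += 1.0   # malignant nodule correctly ranked higher
            elif m == b:
                wins += 0.5   # tied scores count as half
    return wins / (len(malignant) * len(benign))


# Toy data (hypothetical): six nodules with total risk scores and final diagnoses.
toy_scores = [0, 1, 2, 3, 5, 7]
toy_labels = [0, 0, 1, 0, 1, 1]
toy_auc = auc_from_scores(toy_scores, toy_labels)  # 8 of 9 pairs ranked correctly
```

The pairwise form is quadratic in the number of nodules, which is acceptable for a cohort of this size; a rank-based formula gives the same value more efficiently.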
US, ultrasound; KSThR, Korean Society of Thyroid Radiology; ACR, American College of Radiology.
Results
Thyroid malignancy was present in 37.2% (414/1112) of the nodules in the present series, of which 78.7% (326/414) were papillary thyroid carcinomas, 1.9% (8/414) were follicular carcinomas, 13.0% (54/414) were follicular variant papillary thyroid carcinomas, and 6.3% (26/414) were other malignancies (Supplementary Table S1; Supplementary Data are available online at
Table 2 presents the risk of malignancy according to the 14-point web-based scoring risk-stratification system, which was 4.5% in thyroid nodules without suspicious malignant US features (a score of 0). The malignancy risk increased as the risk score increased and peaked at 100.0% in all scoring risk-stratification models (odds ratio [OR] = 1.808 [confidence interval (CI) 1.690–1.934], p < 0.001 for the web-based model; OR = 1.815 [CI 1.691–1.949], p < 0.001 for the KSThR model; OR = 1.750 [CI 1.645–1.861], p < 0.001 for the ACR TIRADS). The area under the curve (AUC) of the web-based scoring risk-stratification system was 0.884, with a CI of 0.863–0.905 (p < 0.001). According to the KSThR scoring risk-stratification model, the risk of malignancy was 4.1% in thyroid nodules without suspicious malignant US features and ranged from 4.1% to 100.0%, with an AUC of 0.891 ([CI 0.871–0.911]; p < 0.001; Table 3). According to the ACR TIRADS, the malignancy risk ranged from 4.5% to 100.0%, with an AUC of 0.875 ([CI 0.853–0.896]; p < 0.001; Table 4). Among category 2 (not suspicious) and category 5 (highly suspicious) nodules, about 7.0% (78/1112) of all nodules were above the risk threshold. Overall, 93.0% (1034/1112) of all nodules were below the established ACR TIRADS specified risk thresholds. The ROC curves for each model are shown in Figure 1. The Hosmer–Lemeshow goodness-of-fit test indicated that the web-based scoring risk-stratification model showed the best-calibrated result, with a p-value of 0.078, indicating the strongest agreement between the observed and model-predicted number of nodules that were malignant or benign across all of the strata, whereas the others showed p-values of <0.05 (Table 5).
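The calibration comparison can be illustrated with a minimal sketch of the Hosmer–Lemeshow statistic computed over score strata. It assumes that each stratum supplies its nodule count, its observed number of malignant nodules, and the model-predicted malignancy risk; the function name, the input layout, and the conventional choice of (number of strata − 2) degrees of freedom are assumptions, and the toy numbers are hypothetical rather than study data.

```python
def hosmer_lemeshow(strata):
    """Return (statistic, degrees_of_freedom) for grouped calibration data.

    strata: iterable of (n_nodules, n_malignant_observed, predicted_risk),
    with predicted_risk strictly between 0 and 1.
    """
    statistic = 0.0
    n_strata = 0
    for n, observed, risk in strata:
        if n <= 0 or not 0.0 < risk < 1.0:
            raise ValueError("each stratum needs n > 0 and 0 < risk < 1")
        expected = n * risk
        # Combined contribution of the malignant and benign cells of the stratum.
        statistic += (observed - expected) ** 2 / (expected * (1.0 - risk))
        n_strata += 1
    if n_strata < 3:
        raise ValueError("at least three strata are required")
    return statistic, n_strata - 2


# Hypothetical strata: observed counts match the predictions exactly.
well_calibrated = [(100, 5, 0.05), (100, 50, 0.50), (100, 90, 0.90)]
hl_stat, hl_df = hosmer_lemeshow(well_calibrated)
```

A p-value follows from the upper tail of the chi-square distribution with the returned degrees of freedom (for instance `scipy.stats.chi2.sf(hl_stat, hl_df)` if SciPy is available). A larger p-value means less evidence of disagreement between observed and predicted counts. Strata with a predicted risk of exactly 0% or 100% would need to be merged with a neighbor before this formula applies.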

Figure 1. Scoring system performances of the web-based, KSThR, and ACR systems with areas under the receiver operating characteristic (ROC) curves of 0.884, 0.891, and 0.875, respectively. KSThR, Korean Society of Thyroid Radiology; ACR, American College of Radiology.
Data indicate the number of lesions.
AUC, area under the curve; CI, confidence interval.
The web-based scoring risk-stratification model determined the malignancy rate of thyroid nodules with suspicious US features according to various malignancy risk-stratification systems. In the present study population, the web-based system accurately predicted the malignancy rate in 87.4% of nodules (250/286), showing higher positive predictive value than the various TIRADS guidelines applied by this system (i.e., French TIRADS 74.0% [322/435]; ATA guidelines 81.4% [293/360]; and K-TIRADS 81.7% [286/350]; Supplementary Table S3). The ATA pattern-based system could not classify about 20.6% (229/1112) of the nodules.
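The percentages quoted above follow directly from the reported counts. The short check below assumes that each pair is (malignant nodules, nodules classified as suspicious) for that system, which is how a positive predictive value is usually formed; the variable and function names are hypothetical.

```python
# Counts as reported in the text: (numerator, denominator) per system.
reported_counts = {
    "web-based": (250, 286),
    "French TIRADS": (322, 435),
    "ATA guidelines": (293, 360),
    "K-TIRADS": (286, 350),
}


def percent(numerator, denominator):
    """Proportion expressed as a percentage, rounded to one decimal place."""
    return round(100.0 * numerator / denominator, 1)


ppv = {name: percent(*pair) for name, pair in reported_counts.items()}
unclassified_by_ata = percent(229, 1112)  # nodules the ATA patterns could not classify
```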
Discussion
The current validation study has revealed that the web-based, KSThR, and ACR scoring risk-stratification models show acceptable predictive accuracy with similar AUCs. In particular, the web-based scoring system showed the highest agreement in calibration ability. Furthermore, an advantage of this web-based scoring system is its speed: its online automatic calculation overcomes the complexity of previous scoring risk-stratification models.
Several TIRADS have been developed for malignancy risk stratification (4,6,7) that incorporate US features to categorize thyroid nodules and recommend cytological diagnosis. Two previous studies described the US patterns of thyroid nodules and the related pattern-associated rates of malignancy, but these patterns were not applicable to some thyroid nodules with multiple US features (4,6). Kwak et al. (7) used suspicious US features and related the risk of malignancy to the number of suspicious US features, but they assigned the same risk of malignancy to each suspicious US feature, and one category (category 4c) was associated with a wide range of malignancy risk (21.0–91.9%). To integrate the combination of specific US features with different odds ratios for malignancy risk, quantitative grading systems were established that stratify malignancy risk by combining the suspicious US features possessed by thyroid nodules and categorizing the results (16). One such model, the Korean TIRADS, was recently published by the KSThR (16) and was validated prospectively in a multicenter study (22). Recently, the ATA management guidelines for thyroid nodules also stratified the risk of malignancy into five categories (5). Meanwhile, for malignancy risk stratification, several attempts have been made to convert this “pattern-based” approach to a “score-based” approach. Previously, Park et al. (9) proposed an equation for predicting the probability of malignancy based on 12 US features, but it was difficult to apply the proposed equation to each thyroid nodule in clinical practice. To overcome the associated complexity, the KSThR (8) developed a prediction model that assigned a different risk score to each suspicious US feature and obtained the malignancy risk by summing the scores. The score-based approach has the advantage of achieving more personalized management, with >10 subdivided ranges for the malignancy risk scores of the thyroid nodules.
However, in real practice, the complexity and lack of congruence of these previous systems have limited their adoption and may have been more cumbersome for those more used to pattern-based analysis. Due to a greater emphasis on personalized management of patients with thyroid nodules and to overcome the low reproducibility and practicality, Choi et al. (11) developed a web-based automatic scoring risk-stratification system using US characteristics. This system also classifies nodules according to various guidelines, such as the French TIRADS (23), ATA guidelines (5), and Korean TIRADS (16). More recently, the ACR developed TIRADS (12) by allocating points to more suspicious features, summing the points, and determining the TIRADS category of nodules. However, validation of these scoring risk-stratification models remains to be conducted.
All scoring risk-stratification models investigated in the present study proved to be acceptable tools for effective malignancy risk stratification in clinical practice, with an AUC range of 0.875–0.891. A recent study evaluating thyroid nodules according to the 2015 ATA guidelines yielded an AUC ranging from 0.721 to 0.839 with respect to thyroid nodule size (24). Moreover, the web-based scoring system showed agreement in both discrimination and calibration abilities. This may be due to the involvement of multiple centers in the design of this web-based predictive model, and the results may thus be more reproducible and generalizable. Choi et al. (11) collected data from nine affiliated hospitals and used true external validation set data from different time periods and institutions. In contrast, the KSThR (8) used split-sample development and validation sets. In addition, the strongest advantage of the web-based scoring system is that it is linked to an online risk calculator (
Clinicians are often interested in developing points-based risk scoring systems and value simple scoring systems, as well as the ease with which they can be used in routine clinical practice (30). The recently developed ACR TIRADS determines the final category according to the following process: nodule detection, evaluation of the score of each US feature, summation of scores, determination of the category, size matching, and FNA decision. In the current era of mobile computing and web-based risk calculators, this scoring risk-stratification model can be more readily applied in real practice due to straightforward implementation of the automatic score-calculating system (11). In accordance with this trend, Cheng et al. (31) have recently developed an automatic score-calculating system based on the ACR TIRADS (12). In line with the trend toward personalized medicine, the three scoring risk-stratification models provide >10 ranges of malignancy risk scores. Indeed, the score of the ACR TIRADS can be up to 16. In a recent validation study of the ACR TIRADS (10), the cancer risk of each category was 0.3%, 1.5%, 4.8%, 9.1%, and 35.0%. However, category 5 (highly suspicious), allocated a median score of ≥7, comprises a relatively low malignancy risk (35.0%) and a wide range of malignancy probability. This may be because this scoring risk-stratification model was based on a review of the literature, expert opinion, and preliminary analysis of patients, with arbitrary score assignment rather than multivariate analysis to estimate the thyroid cancer risk associated with each US feature. For risk analysis, the odds ratios for each US feature from the logistic regression model were used (8,11), and this statistical aspect may permit more effective risk stratification with future revisions of the ACR TIRADS. Regarding the lowest malignancy risk assigned to benign nodules without any suspicious malignant features or a score of 0, a lower biopsy rate can be expected.
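The score-based process described above (score each US feature, sum the scores, determine the category, match against size, decide on FNA) can be sketched as a small pipeline. This is an illustration under stated assumptions, not an implementation of the ACR TIRADS or of the calculator by Cheng et al.: all function names are hypothetical, and the point table, category cutoffs, and size thresholds are illustrative stand-ins that must be replaced with the values published in reference 12. Only the cutoff of ≥7 for category 5 is taken from the text.

```python
def total_score(features, points):
    """Sum the points of the observed finding in each US feature group."""
    return sum(points[group][finding] for group, finding in features.items())


def category_for(score, cutoffs):
    """cutoffs: (minimum_score, category) pairs in ascending order."""
    chosen = None
    for minimum, category in cutoffs:
        if score >= minimum:
            chosen = category  # keep the highest band the score reaches
    if chosen is None:
        raise ValueError("score is below the lowest cutoff")
    return chosen


def fna_recommended(category, size_mm, size_thresholds_mm):
    """Size matching: FNA only if the category has a threshold and it is met."""
    threshold = size_thresholds_mm.get(category)
    return threshold is not None and size_mm >= threshold


# Illustrative stand-in tables (NOT authoritative; consult reference 12).
EXAMPLE_POINTS = {
    "content": {"cystic": 0, "solid": 2},
    "echogenicity": {"isoechoic": 1, "hypoechoic": 2},
    "shape": {"wider-than-tall": 0, "taller-than-wide": 3},
    "margin": {"smooth": 0, "irregular": 2},
    "calcification": {"none": 0, "micro": 3},
}
EXAMPLE_CUTOFFS = [(0, 1), (2, 2), (3, 3), (4, 4), (7, 5)]
EXAMPLE_SIZE_THRESHOLDS_MM = {3: 25, 4: 15, 5: 10}

nodule = {"content": "solid", "echogenicity": "hypoechoic",
          "shape": "taller-than-wide", "margin": "irregular",
          "calcification": "micro"}
nodule_score = total_score(nodule, EXAMPLE_POINTS)
nodule_category = category_for(nodule_score, EXAMPLE_CUTOFFS)
```

Passing the tables in as arguments keeps the logic separate from any one guideline, so the same three functions could be reused with the point assignments of a different scoring model.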
According to the web-based scoring risk-stratification model, the overall risk of malignancy for score 0 nodules was <5% (11). Therefore, the number of unnecessary biopsies may be reduced. In the current results, the malignancy risk of score 0 nodules based on the web-based scoring system was 4.5%. In contrast, the risk of malignancy was 7.3% in the training set and 6.2% in the validation set of the KSThR (8), which are higher than in the benign category of The Bethesda System for Reporting Thyroid Cytopathology (32).
Recently, an artificial intelligence-adapted US machine was developed for thyroid nodule characterization (S-Detect for Thyroid). Choi et al. (33) validated the system and demonstrated that it shows satisfactory diagnostic performance (sensitivity 90.7%; specificity 74.6%) for thyroid malignancy. The thyroid US computer-aided diagnosis system using artificial intelligence investigated in this study was installed within the US system, allowing real-time decision making regarding the need for FNA (33). It is expected that future implementation of the web-based scoring risk-stratification model and the artificial intelligence-adapted US machine will guide and simplify personalized management and reduce analysis time.
The present study has some limitations. First, due to its retrospective design, a selection bias is inevitable, and it could not be determined whether the patients are representative of the general population. The usefulness of this scoring system requires confirmation by prospective studies with large cohorts representing samples from the general population. Second, the interobserver variability in the interpretation of US images between the two radiologists was not evaluated. Third, many benign nodules (90.1%) were not confirmed by surgery. Fourth, the malignancy rate of thyroid nodules included in the analysis was relatively high (37.2%), which may be due to the fact that the authors' institution is a referral center and many nodules with suspicious US features warranted biopsy; many patients with indeterminate results who were lost to follow-up were omitted. Lastly, about 42.1% of nodules measured <1 cm in this study. The fact that very small nodules were subject to biopsy can be questioned. However, the aim of this study was to test the probability for malignancy as independently as possible. As discussed above, there are differing opinions on the management of small thyroid nodules, and patient preferences also have to be considered. A recently published study (34) demonstrated that biopsy should be considered for thyroid nodules <1 cm prior to active surveillance to prevent unnecessary active surveillance and patient anxiety. For these reasons, size has been included as an essential data point for the establishment of the management plan, but not for risk estimation. Previous literature has shown that the risk of malignancy is not dependent on nodule size, but the management (observation, lobectomy, and thyroidectomy) can depend on the nodule size, among other factors (13,35,36).
Future studies combining US findings and other factors such as size, clinical characteristics, and family history of cancer may refine management plans offered through this web-based system.
In summary, the easily accessible and reliable automated scoring risk-stratification system described herein will support clinical decision making, increase FNA efficacy, compensate for the complexity of various TIRADS, and enable more personalized and optimized management. The current study is the first to evaluate the diagnostic efficacy of various scoring risk-stratification models, and it is expected that it will be followed by a future prospective study with a large cohort.
Footnotes
Author Disclosure Statement
The authors declare no conflicts of interest.
