Abstract
Background:
To establish a practical and simplified method for analyzing thyroid nodules in a clinical setting, the development of a new practical prediction model was required. This study aimed to construct and validate a simple and reliable web-based predictive model using the ultrasonography characteristics of thyroid nodules to stratify the risk of malignancy.
Methods:
To analyze ultrasonography images, radiologists were asked to assess thyroid nodules according to the following criteria: internal content, echogenicity of the solid portion, shape, margin, and calcifications. Multivariate logistic regression was performed to predict whether nodules were diagnosed as malignant or benign. The developmental data set included 849 nodules (January–June 2003). The validation set included different data (n = 453, June 2008–February 2009).
Results:
Ultrasonography features, including solid content, taller-than-wide shape, spiculated margin, ill-defined margin, hypoechogenicity, marked hypoechogenicity, microcalcifications, and rim calcifications, were selected as predictors for malignant nodules in the development set. A 14-point risk scoring system was developed. Malignancy risk ranged from 3.8% to 97.4%, and the risk of malignancy was positively associated with increases in risk scores. The areas under the receiver operating characteristic curve of the development and validation sets were 0.903 and 0.897, respectively.
Conclusion:
A simple and reliable web-based predictive model was designed using ultrasonography characteristics to stratify thyroid nodules according to the probability of malignancy.
Introduction
Recently, the thyroid imaging reporting and data system (TI-RADS) was developed for use in thyroid nodule risk stratification using various ultrasonography (US) features derived from the breast imaging reporting and data system (10–13). However, Horvath et al. and Cheng et al. (10,11) have described US patterns that were not applicable to all thyroid nodules. Park et al. (12) used a complex equation that could be difficult to apply to each thyroid nodule, while Kwak et al. (13) weighted each suspicious US feature equally and used a category that covered a wide range of malignancy risks, which could make it difficult to stratify the risk of malignancy. Hence, the clinical use of previous TI-RADS remains limited and its practicality in clinical practice has been questioned. To overcome the drawbacks of these previous studies, the Korean Society of Thyroid Radiology (KSThR) has proposed a simple diagnostic prediction model using the US features of thyroid nodules (14). However, in that study, no thyroid nodule was estimated to be benign; all were classified as indeterminate or malignant because the lowest malignancy risk was 7.3%, so unnecessary biopsies of benign lesions could not be avoided. Additionally, the calculation and summation of the malignancy risk is complex, making it difficult to assign a score to each thyroid nodule.
To establish a practical and simplified method for analyzing thyroid nodules in a clinical setting that would reduce unnecessary biopsy of benign thyroid nodules and improve patient management, the development of a new practical prediction model was required. This study aimed to develop and validate a simple, reproducible, web-based prediction model using US characteristics of thyroid nodules to stratify the risk of malignancy.
Materials and Methods
This retrospective study was approved by the Institutional Review Board of the Asan Medical Center. Informed consent was waived for data evaluation. Written informed consent for routine thyroid US and US-guided biopsy procedures was obtained from all patients before each US examination. First, a prediction model was developed using US characteristics of thyroid nodules to stratify the risk of malignancy. Next, the model was internally validated using bootstrap techniques, and a simple scoring system was developed (the development and internal validation study). Finally, this system was externally validated using data from a comparable patient cohort (the validation study).
Study population
Development set data were retrospectively collected from patients enrolled in the Korean Thyroid Study Group multicenter retrospective study of US differentiation between benign and malignant thyroid nodules (15), and were used to build a prediction model. These data were chosen for the development set because the published study is one of the most cited articles on thyroid nodule US findings, the data have been validated by several studies, and the data could be readily applied to clinical practice (6,14). The requirements for a contributing nodule were: (i) a nodule ≥5 mm in maximum diameter, and (ii) the patient had undergone US-guided fine-needle aspiration (FNA) at their medical institution between January and June 2003. A series of 8024 consecutive patients with thyroid nodules who had undergone thyroid US at one of nine university-affiliated hospitals were considered for inclusion in the development set. Additionally, only patients who met at least one of the following criteria were included: (i) patients who underwent surgery or core-needle biopsy (CNB) after thyroid US; (ii) patients who underwent US-guided FNA cytology for benign thyroid lesions (except for adenomas as a pre-Bethesda cytologic diagnosis) at least twice within a one-year interval; and (iii) patients who underwent initial US-guided FNA cytology and US follow-up (>12 months after US-guided FNA cytology) for benign thyroid lesions (except for adenomas). For follicular and Hürthle cell neoplasms, patients without a final diagnosis after surgery were excluded. Finally, a total of 831 patients (M age = 51.1 years; range 6–84 years) with 849 nodules were included in the development set.
To determine the generalizability of the derived prediction rules for new patients, the rules were applied to a new data set (a validation set). The validation set included patients from a different data set who were studied between June 2008 and February 2009 at a different institution. The inclusion criteria were identical to those used for the development set. Finally, 429 patients (M age = 51.9 years; range 21–74 years) with 453 nodules were included in the validation set. Using validation set data, external validation was performed by estimating the area under the curve (AUC) of the scoring system that was developed.
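The AUC used for external validation can be read as the probability that a randomly chosen malignant nodule receives a higher score than a randomly chosen benign one. The following is a minimal Python sketch of that definition, not the study's own code (the analyses were run in SPSS and R); the function name `auc` and the toy scores and labels are assumptions made for illustration only.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: risk scores, one per nodule (higher = more suspicious)
    labels: 1 for malignant, 0 for benign
    Ties between a malignant and a benign nodule count as half a win.
    """
    malignant = [s for s, y in zip(scores, labels) if y == 1]
    benign = [s for s, y in zip(scores, labels) if y == 0]
    if not malignant or not benign:
        raise ValueError("need at least one malignant and one benign nodule")
    wins = 0.0
    for m in malignant:
        for b in benign:
            if m > b:
                wins += 1.0
            elif m == b:
                wins += 0.5
    return wins / (len(malignant) * len(benign))


# Hypothetical toy data, not taken from the study.
toy_scores = [0, 2, 3, 5, 7, 9]
toy_labels = [0, 0, 1, 0, 1, 1]
toy_auc = auc(toy_scores, toy_labels)
```

The pairwise form is quadratic in the number of nodules, which is acceptable for data sets of a few hundred nodules; a rank-based formulation would be preferable for much larger samples.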
US findings for potential diagnostic determinants
US images for the evaluation of thyroid nodules were obtained using various US units and with a high-frequency linear array transducer. The scanning protocol in all cases included both transverse and longitudinal real-time imaging of thyroid nodules, with the use of representative Digital Imaging and Communications in Medicine (DICOM) images. A faculty thyroid radiologist with 15 years of clinical experience in performing and evaluating thyroid US data reviewed all of the development set US images in the DICOM images. In the validation study, two faculty thyroid radiologists with 20 and 13 years of clinical experience, respectively, performed and evaluated the thyroid US images.
In analyzing the US images, the radiologists were asked to assess thyroid nodules using criteria obtained from published reports (2,6–9,15), including internal content, echogenicity of the solid portion, shape, margin, and calcifications. Vascularity was excluded as a criterion.
The internal content of a nodule was categorized according to the ratio of the cystic to the solid portion within a nodule, that is, solid (≤10% cystic), predominantly solid (>10% cystic and ≤50% cystic), predominantly cystic (>50% cystic), and spongiform appearance. There were only three cystic cases in our study because a cyst is rarely indicated for biopsy (16). Therefore, these cases were included among the predominantly cystic cases. A spongiform appearance was defined as the aggregation of multiple microcystic components consisting of >50% of the total nodule volume (15), and the solid component of spongiform nodules was not assessed.
Echogenicity of the solid portion was classified as hyper- or iso-echogenicity, hypoechogenicity, or marked hypoechogenicity. When the echogenicity of the nodule was similar to that of the surrounding thyroid parenchyma, it was classified as isoechogenicity. Hypoechogenicity was defined as decreased echogenicity compared with the thyroid parenchyma. Marked hypoechogenicity was defined as decreased echogenicity compared with that of the strap muscles (9).
The nodule shape was categorized as follows: ovoid to round (when the anteroposterior diameter of the nodule was equal to or less than its transverse diameter on a transverse or longitudinal plane); taller than wide (when the anteroposterior diameter of a nodule was longer than its transverse diameter on a transverse or longitudinal plane); or irregular (when a nodule was neither ovoid to round nor taller than wide).
Margins were classified as well-defined smooth, microlobulated or spiculated, or ill-defined (15).
Calcifications were categorized as microcalcifications, macrocalcifications, rim calcifications, or none. Microcalcifications were defined as calcifications ≤1 mm in diameter, and were visualized as tiny, punctate, hyperechoic foci either with or without acoustic shadows. If tiny, bright reflectors with a clear-cut, comet-tail artifact were observed by conventional US, these were considered to be colloid. Macrocalcifications were defined as hyperechoic foci >1 mm, and rim calcifications were defined as a nodule with peripheral curvilinear or eggshell calcifications (6). When nodules had both types of calcifications, that is, macrocalcifications including rim calcifications intermingled with microcalcifications, the nodule was considered to have microcalcifications.
Reference standard
For each thyroid nodule, the final diagnosis was determined by either histopathology or radiological follow-up. For malignant nodules, the pathological diagnosis was confirmed by surgery or CNB. For benign nodules, the pathological diagnosis was confirmed by surgery or CNB, FNA repeated at least twice with benign results, or a benign result on FNA and no change or reduced size seen on follow-up US (>12 months).
Data and statistical analysis
Multivariate logistic regression analysis was performed to estimate the risk of thyroid cancer associated with US findings. Thyroid nodule US characteristics, including internal content, shape, margin, echogenicity, and calcification, were selected for the final model by backward elimination. The Hosmer–Lemeshow goodness-of-fit statistic was used to evaluate the agreement between the observed and expected numbers of malignant and benign nodules across strata of the malignancy probabilities estimated from the prediction model. Internal validation was performed using the bootstrap validation algorithm (17,18). Bootstrap resampling started with fitting the logistic model in a bootstrap sample of the same number of nodules as the original sample (n = 849), which was drawn with replacement from the original sample. Performance, measured as the AUC of the receiver operating characteristic (ROC) curve, was averaged over >200 repetitions. The performance in a bootstrapped sample represents an estimation of the apparent performance, and the performance in the original sample represents the test performance. The difference between these performances is an estimate of the optimism in the apparent performance. The optimism is subtracted from the apparent performance to estimate the internally validated performance: estimated performance = apparent performance – average (bootstrap performance – test performance) (17–19). Based on the selected variables that were internally validated, a simple scoring system was developed using the model parameter estimates according to Sullivan et al. (20). Using validation set data, external validation was performed by estimating the AUC of the scoring model that was developed. The scoring system AUC was calculated to estimate its performance. Values of p < 0.05 were considered to be statistically significant, and all statistical analyses were carried out using SPSS v20.0 (IBM Corp., Armonk, NY) and R v3.0.2 (
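The bootstrap optimism correction described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the study's implementation: the function names (`optimism_corrected`, `fit_threshold`, `accuracy`) and the toy data are hypothetical, and a one-variable threshold rule scored by accuracy stands in for the logistic model scored by AUC, purely to keep the example self-contained.

```python
import random


def optimism_corrected(data, fit, performance, n_boot=200, seed=0):
    """Estimated performance = apparent - mean(bootstrap - test).

    fit(rows) returns a model; performance(model, rows) returns a number.
    """
    rng = random.Random(seed)
    # Apparent performance: model fitted and evaluated on the original sample.
    apparent = performance(fit(data), data)
    optimism = 0.0
    for _ in range(n_boot):
        # Bootstrap sample of the original size, drawn with replacement.
        sample = [rng.choice(data) for _ in data]
        model = fit(sample)
        boot_perf = performance(model, sample)  # performance in the bootstrap sample
        test_perf = performance(model, data)    # same model on the original sample
        optimism += boot_perf - test_perf
    optimism /= n_boot
    return apparent - optimism


def fit_threshold(rows):
    """Toy 'model': midpoint between the class means of a single feature."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    if not pos or not neg:  # a bootstrap sample may contain only one class
        return sum(x for x, _ in rows) / len(rows)
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(threshold, rows):
    """Fraction of rows where 'x above threshold' agrees with the label."""
    return sum((x > threshold) == (y == 1) for x, y in rows) / len(rows)


# Hypothetical toy data: (feature value, 1 = malignant / 0 = benign).
toy_rows = [(0.2, 0), (0.4, 0), (0.5, 1), (0.6, 0), (0.7, 1),
            (0.9, 1), (0.3, 0), (0.8, 1), (0.55, 0), (0.45, 1)]
corrected = optimism_corrected(toy_rows, fit_threshold, accuracy)
```

Passing `fit` and `performance` as callables keeps the resampling loop independent of the model, so a logistic regression and an AUC function could be substituted without changing the correction itself.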
Results
Thyroid malignancy was present in 42.4% (360/849) and 45.3% (205/453) of nodules in the development and validation sets, respectively. Of the benign nodules, 35.4% (173/489) of the development set and 36.7% (91/248) of the validation set were confirmed with surgical pathology. The development and validation sets ranged in nodule size from 0.5 to 6.4 cm (M = 1.7 ± 1.1 cm) and from 0.5 to 10.0 cm (M = 1.9 ± 1.5 cm), respectively. The proportion of nodules >1 cm in diameter was 68.9% (585/849) and 64.5% (292/453) in the development and validation sets, respectively. Associations between US features and malignancy are listed in Table 1. By multivariate logistic regression analysis, solid content, taller-than-wide shape, spiculated margin, ill-defined margin, hypoechogenicity, marked hypoechogenicity, microcalcifications, and rim calcifications were US features that showed significant differences between groups. Marked hypoechogenicity showed a high odds ratio of >8. All of the thyroid nodules (n = 43) with a spongiform appearance were benign. None of the cystic cases (n = 3) was accompanied by other US findings, so they were assigned the lowest malignancy risk.
Table 1 footnotes: Data are expressed as n (%). By multiple logistic regression analysis. US, ultrasonography; CI, confidence interval.
A 14-point risk scoring system was developed. Table 2 indicates the risk of malignancy according to the US score. In this scoring system, the risk of malignancy was 3.8% in thyroid nodules without suspicious malignant US features (a score of 0). Additionally, the malignancy risk became greater as the risk scores increased and was highest at 97.4% in thyroid nodules showing solid content, taller-than-wide shape, spiculated margin, marked hypoechogenicity, and microcalcifications. The AUC of the prediction model was 0.903 with a confidence interval (CI) of 0.884–0.925. Bootstrap internal validation showed excellent performance (AUC 0.910 [CI 0.891–0.929]; p < 0.001). Therefore, the optimism-corrected performance was 0.903, which indicates the reliability of this prediction model. The AUC of the risk scoring system was 0.903 [CI 0.883–0.922], indicating the robust transformation of the prediction model to this scoring system. The Hosmer–Lemeshow goodness-of-fit test indicated that the prediction model and scoring system were both well calibrated (p = 0.405 and 0.826, respectively).
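A points-based system of this kind is typically derived by dividing each regression coefficient by a constant that represents one point, rounding to the nearest integer, and mapping the total score back to a probability through the logistic function. The sketch below illustrates that general idea in Python. All coefficient values, the intercept, and the names `BETAS`, `INTERCEPT`, `UNIT`, `to_points`, and `risk` are hypothetical and are not the study's estimates; the study's actual points and risks are those reported in Table 2.

```python
import math

# Hypothetical log-odds coefficients for three features; NOT the study's values.
BETAS = {
    "solid_content": 0.9,
    "taller_than_wide": 1.3,
    "microcalcifications": 1.8,
}
INTERCEPT = -3.0          # hypothetical log-odds when no feature is present
UNIT = min(BETAS.values())  # assumption: the smallest coefficient is worth one point


def to_points(betas, unit):
    """Convert each coefficient to an integer number of points."""
    return {name: round(beta / unit) for name, beta in betas.items()}


def risk(total_points, intercept, unit):
    """Approximate malignancy probability for a total score."""
    log_odds = intercept + unit * total_points
    return 1.0 / (1.0 + math.exp(-log_odds))


POINTS = to_points(BETAS, UNIT)

# Example: a nodule with solid content and microcalcifications.
example_score = POINTS["solid_content"] + POINTS["microcalcifications"]
example_risk = risk(example_score, INTERCEPT, UNIT)
```

Rounding to integer points trades a small amount of accuracy for ease of use, which is why the agreement between the AUC of the regression model and the AUC of the scoring system is worth checking after the conversion.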
The scoring system showed good performance in external validation tests (AUC 0.897 [CI 0.867–0.924]). This suggested its suitability for use in routine clinical practice. The ROC curves for this model are shown in Figure 1. For simplified calculation and summation of the scoring system for malignancy risk stratification using US thyroid characteristics, an online automatically calculated scoring system was developed (

Figure 1. Performance of the prediction model, novel scoring system, and validation set based on the risk score. Areas under the receiver operating characteristic (ROC) curves were 0.903, 0.903, and 0.897 in the prediction model, scoring system, and validation set, respectively.

An online resource is available at
Discussion
The results of the present study can be summarized as follows. First, a simple and easily accessible, web-based, diagnostic scoring system was developed using the US characteristics of thyroid nodules to stratify the risk of malignancy. Second, this scoring system showed an excellent predictive accuracy, with AUCs of approximately 0.9 in an external validation set. This web-based scoring system could be used to calculate and estimate the risk of malignancy of thyroid nodules based on US findings in routine practice.
Several previous studies have focused on the TI-RADS and the point malignancy rating scale for reporting thyroid nodules using US (10–13). However, these TI-RADS systems have proven difficult to apply in routine clinical practice. Horvath et al. and Cheng et al. described US patterns of thyroid nodules and the related pattern-associated rate of malignancy, but these patterns were not applicable to all thyroid nodules (10,11). Park et al. proposed an equation for predicting the probability of malignancy in thyroid nodules based on 12 US features. Despite its ability to stratify malignancy, it can be difficult to apply the proposed equation to each thyroid nodule in clinical practice (12). Kwak et al. used suspicious US features and related the risk of malignancy to the number of suspicious features. That study was limited, however, by the fact that each suspicious US feature was weighted equally, even though each US feature is associated with a different probability of malignancy, and by the use of one category (category 4c) that covered a wide range of malignancy risks (21–91.9%), which was not useful for malignancy stratification (13).
To overcome these drawbacks, including low reproducibility, limited practicality, and the wide range of malignancy risk in each proposed category, a predictive model using well-known US characteristics was developed by the KSThR, which was similar to that used in the present study (14). In the previous prediction model, each suspicious US feature was assigned a different risk score, and the risk of malignancy could be obtained by calculating the total score, which was similar to the system used herein. The present study also showed that the predicted probability of malignancy tended to increase with increasing risk scores, as seen in previous reports in which the risk of malignancy increased proportionally to the number of suspicious malignant US features. However, the KSThR system reported that the lowest risk of malignancy was 7.3% in the training set and 6.2% in the validation set, which was higher than that for the benign category of The Bethesda System for Reporting Thyroid Cytopathology (21). In other words, all thyroid nodules were estimated to be indeterminate or malignant nodules, rather than benign, even when they had no suspicious malignant features, that is, a score of 0. Therefore, nodules required biopsy if they were larger than the threshold size to indicate this procedure (7).
However, in the current study, the risk of malignancy for nodules with a score of 0 was 3.8% according to the scoring system, 4.8% in the development set, and 0% in the validation set. The overall risk of malignancy for score 0 nodules was <5%. The proportion of score 0 nodules in the present study was 17.2% in the development set (146/849) and 5.5% in the validation set (25/453). Therefore, the need for biopsy would be reduced accordingly if the present system were applied. Split-sample development and validation sets were used in the KSThR scoring system, but this split-sample approach is inferior to a true external validation set from a unique sample. In the present study, true external validation set data were used, which came from a different time period and a different institution. For the KSThR scoring system, it was difficult to calculate and sum the score, so it was deemed impractical.
To overcome this complexity, the present model was linked to an online risk calculator (
Papillary thyroid cancer represents the vast majority of all thyroid cancers. It is well known that most papillary thyroid cancers usually have an indolent behavior, and the five-year cancer-specific survival rate has been estimated at ∼99% (26). This indicates that a conservative approach is a reasonable strategy for managing thyroid nodules. In the scoring system presented here, the cutoff score for biopsy could be determined and modified, considering the indolent behavior of and conservative management for thyroid cancer. Additional research and validation studies will be required in the future to provide a more acceptable and reasonable cutoff value for biopsy in patients with thyroid nodules.
This study has some notable limitations. First, the study population was not free from a selection bias, and it could not be determined whether the patients were representative of the general population. However, a larger patient population was included to compensate for any selection bias. Second, the potential for interobserver variability in interpretations of US images among the three radiologists was not evaluated. However, as the radiologists who participated in the study had more than 10 years of clinical experience in thyroid imaging, the results may be more reproducible than those obtained by less clinically experienced radiologists. Third, many of the benign nodules were not confirmed by surgery (64.2%). However, this limitation must be considered in the context of the recommendations of The Bethesda System, in which clinical follow-up is recommended for benign FNA readings. Additionally, in previous studies, the proportion of surgically confirmed benign thyroid nodules was quite low (4.3–15.6%) (13,27). Fourth, in the small subgroups (scores of 1 and 12), malignancy risk was based on extrapolation from other data because of the small numbers in each subgroup. Finally, there was an approximate five-year gap between the data for the development and validation sets, so there was some difference in terms of US image resolution or quality. However, the image evaluation method was consistent for the two data sets.
In summary, a simple and reliable scoring system has been devised using the US characteristics of thyroid nodules to stratify these nodules according to the probability of malignancy. An online risk calculator (
Footnotes
Author Disclosure Statement
No competing financial interests exist.
