Abstract
Background:
Over 50% of newly diagnosed thyroid nodules are either cytologically benign or presumed to be benign on the basis of low-suspicion sonographic findings. The strategies used for their long-term surveillance are based mainly on the estimated residual risk of malignancy calculated with various ultrasonographic classification systems (e.g., Thyroid Imaging Reporting and Data Systems [TIRADS]). We conducted a longitudinal study to evaluate the temporal stability of the initial risk estimates computed with five widely used systems and to determine whether risk-class increases during follow-up are indeed predictive of malignancy.
Methods:
We re-analyzed data prospectively collected at a single academic referral center on 232 patients (age: 54.1 ± 13.7 years) with 432 asymptomatic, sonographically or cytologically benign thyroid nodules at baseline (T0) and 122 new nodules that were present five years later (T5). At both time points, the sonographically estimated risk of malignancy was calculated as recommended by the American Association of Clinical Endocrinologists/American College of Endocrinology/Associazione Medici Endocrinologi, the American College of Radiology's TIRADS, the American Thyroid Association's 2015 practice guidelines, the European Thyroid Association's TIRADS (EU-TIRADS), and the TIRADS of the Korean Society of Thyroid Radiology (K-TIRADS).
Results:
For 57 to 127 (13.2–29.4%) of the original nodules, depending on the system used, the estimated malignancy risk increased over the 5-year interval. Of the nodules whose baseline risk had not warranted cytological assessment, very few (6.3–8.3%) met the criteria for cytology at the 5-year evaluation. Biopsy was indicated for only 4 to 8 (3.3–6.6%) of the new nodules based on T5 risk estimates. Despite these changes, none of the 232 patients was diagnosed with a cancer as a result of a risk-class increase.
Conclusions:
Ultrasound-based risk classes of presumably benign thyroid nodules remain fairly stable over time, and changes warranting biopsy are rare. The appearance of new nodules is a frequent event, but very few (<5%) are classified as high risk, and only 3–7% meet the criteria for cytological assessment. Collectively, these findings support the view that patients with presumably benign thyroid nodules can be safely followed with less intensive protocols.
Introduction
Almost half of all nodules subjected to this sonographic work-up have benign or low-suspicion ultrasound patterns (20), or, if biopsied, prove to be benign (21). In both cases, the patient is enrolled in a long-term surveillance program, although the optimal frequency of examinations and duration of the follow-up are not well defined. Moreover, no attempt has been made to characterize the possible evolution over time of the initial risk of malignancy estimated with any of the ultrasound-based systems mentioned above, and there is no clear indication about the risk-class changes that warrant FNA (18). To address these issues, we retrospectively analyzed prospectively collected data on a cohort of patients with presumably benign thyroid nodules. Our aims were: (1) to assess the temporal stability of the nodules' initial risk estimates computed with five widely used ultrasonographic classification systems, and (2) to determine whether risk-class increases that occur during follow-up are predictive of malignancy.
Materials and Methods
Cases
The cases described herein represent a subset of those contributed by our thyroid cancer unit to a larger multicenter cohort that was prospectively analyzed to explore the natural history of benign thyroid nodules (5). The original study was conducted with institutional review board approval and participants' written informed consent for multiple analyses of their data and publication of the results. In that study, each participating center consecutively enrolled all patients presenting between January 1, 2006, and January 31, 2008, with one to four asymptomatic, presumably benign thyroid nodules, each with a maximum diameter of 4–40 mm. The lesions' “presumably benign” status was determined by the absence at baseline of ultrasound features widely regarded as suspicious at the time of the study (i.e., hypoechogenicity, irregular margins, taller-than-wide shape, intranodular vascular spots, and microcalcifications) (22) or, in the presence of one or more suspicious ultrasound features, by a benign FNA cytology report. All enrolled cases were managed with active surveillance as long as there was no evidence of malignancy. The protocol included yearly color Doppler ultrasound examinations of the thyroid and neck, with systematic documentation of the intraglandular location of each nodule, its dimensions, and its ultrasound features. FNA was proposed for nodules that grew and/or developed one or more new suspicious ultrasound features, as described above (1,22).
As shown in Figure 1, cases in the original cohort were excluded from the present analysis unless complete reports and ultrasound images were available for the ultrasound studies performed at baseline (T0) and after five years of surveillance (T5).

Composition of the study cohort. *Seven of the 8 thyroidectomies were performed for cosmetic reasons or to alleviate compressive symptoms. All 7 patients had nodules that were histologically confirmed to be benign. The eighth patient had a single nodule that was cytologically benign at T0, although its ultrasound features had been highly suspicious. Because of the latter findings, fine-needle aspiration was repeated 1 year later and the results were suspicious for malignancy. The surgical diagnosis was follicular-variant papillary thyroid cancer. **The exclusion of these 9 nodules did not result in any exclusion at the patient level. All of the patients with nodules that disappeared had other nodules that were still present at T5. T0, baseline; T5, after 5 years of surveillance.
Ultrasound examinations and FNA cytology
All ultrasound examinations considered in this study had been performed by a single examiner, with 10 years of experience in thyroid ultrasound (C.D.), using an Esaote MyLab 25 ultrasound system (Esaote SpA, Genoa, Italy) with a high-frequency linear transducer. FNAs were obtained under ultrasound guidance, and the smears were analyzed by experienced thyroid cytopathologists. The slides were interpreted and classified according to the criteria published in the Italian Consensus for Thyroid Cytopathology (23,24).
Review of ultrasound images and description of individual nodule features
Ultrasound images of each nodule at T0 and T5 were converted to and stored as deidentified bitmap image files. The stored files were visualized on a liquid crystal display monitor and reviewed jointly by two clinicians (G.G. and L.L.), each with six years of experience in thyroid ultrasound imaging. The images were viewed in random order, and readers were blinded to the identity of the patient, the date of the scan, and all other clinical information regarding the case. Using standardized rating forms (25) based on published recommendations (26,27), the two readers recorded their consensus judgment on the sonographic features of each nodule. The decision to use joint examinations and consensus judgments was made to avoid the interobserver variability that we have documented in assessments of the single sonographic features of thyroid nodules (28). In that work, agreement between the readers was highest regarding the presence/absence of macrocalcifications (Krippendorff alpha 0.83), and lower for evaluation of margins (alpha 0.44), composition (alpha 0.5), and echogenicity (alpha 0.66). Although use of the classification systems yielded better interobserver agreement than assessment of single suspicious features, all nodules were nonetheless reviewed jointly.
The readers rated the following nodule features: margins (well defined, microlobulated or irregular, ill defined, peripheral halo); structure (solid, cystic, mixed); echogenicity (hyperechoic, isoechoic, hypoechoic—all relative to the surrounding thyroid parenchyma—or markedly hypoechoic; i.e., less echoic than the adjacent strap muscle); calcifications (absent, microscopic, macroscopic—the latter including eggshell calcifications); other hyperechoic foci (comet-tail artifacts or indeterminate, the latter including areas of fibrosis); and suspected extrathyroidal extension (loss of the echogenic thyroid border, abutment, or contour bulging). For mixed-content nodules, the location of the solid component (non-nodular, eccentric, central) was also rated. Echogenicity and structure were not evaluated in nodules with complete rim calcification. Images whose quality precluded proper evaluation of the above features were labeled nonassessable.
Three additional nodule features had been documented by the original examiner but were excluded from the readers' rating: size (because the original measurements were visible on the stored images); shape (which depended strictly on the nodule diameters); and vascularity, which could not be adequately assessed on static images and was in any case a classification element only in one of the systems considered in the study.
T0-T5 image pairing
When the above features had been recorded for all nodules, the readers re-examined the stored ultrasound images and attempted to pair the T0 image for each nodule with its T5 counterpart (based on careful analysis of lesion location and features). If the stored image at one of the time-points had been judged nonassessable, the nodule was excluded from the study cohort. If instead the T0 or the T5 image was physically missing, the decision to exclude the nodule was based on the original ultrasound reports. If the T5 image was unavailable and the nodule was not described in the T5 report, the nodule was classified as no longer present. If the T0 image was unavailable and the nodule was not mentioned in the T0 report, the nodule was classified as new.
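The pairing rules above amount to a small decision procedure. The Python sketch below is one possible reading of them, offered only as an illustration: the function name, the `ImageStatus` values, and the outcome labels are hypothetical and do not come from the study. The text does not spell out what was done when an image was missing but the nodule was still described in the corresponding report, so that branch is marked as an assumption.

```python
from enum import Enum


class ImageStatus(Enum):
    ASSESSABLE = "assessable"
    NONASSESSABLE = "nonassessable"  # stored, but quality precluded rating
    MISSING = "missing"              # no stored image at this time point


def pairing_outcome(t0_image, t5_image, in_t0_report, in_t5_report):
    """Return a (hypothetical) label for one nodule after T0-T5 pairing."""
    # A nonassessable image at either time point excludes the nodule.
    if ImageStatus.NONASSESSABLE in (t0_image, t5_image):
        return "excluded"
    # Missing T5 image and no mention in the T5 report: no longer present.
    if t5_image is ImageStatus.MISSING and not in_t5_report:
        return "no_longer_present"
    # Missing T0 image and no mention in the T0 report: new nodule.
    if t0_image is ImageStatus.MISSING and not in_t0_report:
        return "new"
    # Assumption: the image is missing but the report describes the nodule.
    # The text says only that the decision rested on the original report.
    if ImageStatus.MISSING in (t0_image, t5_image):
        return "decide_from_report"
    return "paired"
```

Under this reading, a nodule with an assessable T0 image, no T5 image, and no mention in the T5 report is labeled as no longer present, whereas the mirror-image case at T0 is labeled as new.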
Classification of nodules using five sonographic risk stratification systems
For each nodule retained in the study cohort, the consensus ratings of each ultrasound feature (plus the original examiner's recorded data on nodule size, shape, and vascularity) were used to classify the risk of malignancy at T0 and T5 (or, for new nodules, at T5 alone) using the following five systems: the guidelines of the American Association of Clinical Endocrinologists/American College of Endocrinology/Associazione Medici Endocrinologi (AACE/ACE/AME) (12); the Thyroid Imaging Reporting and Data System (TIRADS) developed by the American College of Radiology (ACR) (19); the 2015 guidelines of the American Thyroid Association (ATA) (11); the EU-TIRADS system proposed by the European Thyroid Association (17); and the Korean Society of Thyroid Radiology's K-TIRADS system (18).
Identification of nodules requiring FNA
With each of the above systems, the likelihood of malignancy for a given thyroid nodule is indicated by its risk class, which is defined by a set of ultrasound features. In addition, within each risk class, the advisability of FNA is indicated based on lesion size. For nodules in high-risk categories, FNA is usually indicated if the maximum diameter is 1 cm or more; for those in low-risk categories, the size threshold is generally higher (range: 1.5–3.0 cm, depending on the system). Using each system, we identified all nodules with indications for FNA at T0 and/or T5 (or, for new nodules, at T5 alone) based on the size threshold for the assigned risk class.
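The two-step logic described above (risk class first, then a class-specific size threshold) can be sketched in a few lines of Python. The class names and threshold values below are placeholders chosen only to be compatible with the ranges quoted in the text; they are not the published cutoffs of any of the five systems, and the function names are hypothetical.

```python
# Placeholder size thresholds in millimeters, keyed by a generic risk class.
# These are NOT the published cutoffs of any of the five systems.
FNA_THRESHOLD_MM = {
    "high": 10.0,    # text: usually 1 cm or more for high-risk categories
    "low": 15.0,     # text: 1.5-3.0 cm depending on the system; 1.5 cm assumed
    "benign": None,  # assumption: no size at which FNA is indicated
}


def fna_indicated(risk_class, max_diameter_mm, thresholds=FNA_THRESHOLD_MM):
    """True if the nodule meets the size criterion for its risk class."""
    threshold = thresholds[risk_class]
    return threshold is not None and max_diameter_mm >= threshold


def new_fna_indication(t0, t5):
    """t0 and t5 are (risk_class, max_diameter_mm) pairs.

    True if FNA was not indicated at T0 but is indicated at T5.
    """
    return not fna_indicated(*t0) and fna_indicated(*t5)
```

With these placeholder values, a nodule can acquire an indication for FNA either because it grew within the same class, for example `new_fna_indication(("low", 12.0), ("low", 16.0))`, or because its class was upgraded at an unchanged size, for example `new_fna_indication(("low", 12.0), ("high", 12.0))`.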
Analysis of variation in ultrasound-based malignancy risk between T0 and T5
For each nodule (except those classified as “new”), we compared the risk classes and biopsy indications assigned at the two time points with each system to determine whether and how each had changed during the five-year surveillance period. The same analysis was performed at the patient level: in this case, the comparison involved the risk classes and biopsy indications for the most suspicious nodule at T0 and T5. For the patient-level analysis only, nodules assigned to the “unclassifiable” category in the ATA system were considered “intermediate suspicion” nodules, which is consistent with the measured risks reported for such nodules by some authors (16–18%) (29,30). This solution allowed us to avoid excluding patients whose only nodule was ATA-unclassifiable.
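As an illustration of the nodule-level and patient-level comparisons, the following Python sketch treats risk classes as positions on an ordinal scale. The scale itself is generic and hypothetical (each real system has its own categories), and the function names are illustrative. The only rule taken directly from the text is that, at the patient level, an ATA-unclassifiable nodule is handled as intermediate suspicion.

```python
# Generic ordinal scale, lowest to highest risk (illustrative only).
RISK_ORDER = ["benign", "very_low", "low", "intermediate", "high"]
RANK = {name: i for i, name in enumerate(RISK_ORDER)}


def _direction(rank_t0, rank_t5):
    if rank_t5 > rank_t0:
        return "increased"
    if rank_t5 < rank_t0:
        return "decreased"
    return "unchanged"


def nodule_change(t0_class, t5_class):
    """Compare the classes assigned to one nodule at T0 and T5."""
    return _direction(RANK[t0_class], RANK[t5_class])


def _patient_rank(risk_class):
    # Patient-level rule from the text: unclassifiable -> intermediate.
    if risk_class == "unclassifiable":
        risk_class = "intermediate"
    return RANK[risk_class]


def patient_level_change(t0_classes, t5_classes):
    """Compare the most suspicious nodule at T0 with the most suspicious at T5."""
    worst_t0 = max(_patient_rank(c) for c in t0_classes)
    worst_t5 = max(_patient_rank(c) for c in t5_classes)
    return _direction(worst_t0, worst_t5)
```

For example, a patient whose nodules are `["low", "unclassifiable"]` at T0 and `["low", "intermediate"]` at T5 is counted as unchanged, because the most suspicious nodule is ranked as intermediate at both time points.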
Results
As shown in Figure 1, 26 of the 265 patients enrolled by our center in the original cohort were excluded from the present study, because they were lost to follow-up (n = 18) or underwent thyroidectomy (n = 8) before the T5 assessment. As a result, the readers' image review included 462 nodules in 239 patients. Thirty of the 462 nodules were excluded during the T0–T5 image pairing process because the stored image for one of the time points had been judged non-assessable (n = 21) or because the nodule was classified as no longer present at T5 (n = 9). The final study cohort comprised 232 patients (mean age: 54.1 ± 13.7 years; 81.5% females) with a total of 432 nodules at T0 and T5 (see Supplementary Table S1 for baseline ultrasound profiles; Supplementary Data are available online at
As shown in Table 1, for most nodules, the risk class assigned on the basis of the T5 images was the same as or lower than that assigned at T0. However, for some nodules (from 13.2% to 29.4%, depending on the classification system used), the T5 images were consistent with a higher class of risk than that based on the features observed at T0. As shown in Supplementary Table S2, similar trends were observed for the subgroup of nodules measuring less than 1 centimeter. Risk-class increases were most common for nodules assigned to the intermediate suspicion categories at T0 (Supplementary Table S3). The sonographic features whose appearance led to the increases are shown in Supplementary Tables S4 and S5.
Excluding the 39 nodules that could not be classified with the ATA system (i.e., iso- or hyperechoic nodules with high-suspicion features, such as irregular margins, microcalcifications, taller-than-wide shape, disrupted rim calcifications with a small extrusive hypoechoic soft tissue component, or evidence of extrathyroidal extension).
AACE/ACE/AME, American Association of Clinical Endocrinologists/American College of Endocrinology/Associazione Medici Endocrinologi; ACR, American College of Radiology; ATA, American Thyroid Association; EU-TIRADS, European Thyroid Imaging Reporting and Data Systems; K-TIRADS, Korean Thyroid Imaging Reporting and Data Systems; T0, baseline time point; T5, after five years of surveillance; TIRADS, Thyroid Imaging Reporting and Data Systems.
As shown in Table 2, based on the T5 reclassifications, FNA was not indicated for the vast majority of the nodules (including some that had met the criteria for aspiration at T0). However, for some nodules (6.3–8.3% of the total, depending on the system used), the T0 ultrasound images did not justify performing an FNA, but the picture had changed by T5 (Table 2). In roughly half of these cases, the indication for biopsy at T5 was based exclusively on the size of the nodule (i.e., the sonographically defined risk class was not upgraded, but the maximum diameter of the nodule had increased and now met the risk class's size criterion for FNA).
The risk class at T5 for these nodules was increased vs. that assigned at T0. Regardless of whether or not the maximum diameter of the nodule had changed since T0, the size of the nodule at T5 met the new risk class–specific size criterion for FNA.
The risk class for these nodules at T5 was not upgraded from T0, but the maximum diameter of the nodule had increased. As a result, at T5 the nodule met the risk class–specific size criterion for FNA.
Excluding the 39 nodules that could not be classified with the ATA system (as specified in Table 1).
FNA, fine-needle aspiration.
Analysis of new nodules
As noted above, the T5 images also revealed 122 new nodules that had not been present at baseline, most (n = 99, 81.1%) measuring 1 cm or less. In some of the 88 patients with new nodules (13.6–31.8%), the appearance of the latter lesions led to an increase in the patient-level risk class (i.e., that based on the ultrasound features of the most suspicious nodule) (Table 3). As shown in Table 4, very few of the new nodules met the criteria for FNA, and only 3 caused an increase in the patient-level risk estimated with at least one of the classification systems. Of the 7 patients harboring the 8 nodules flagged for aspiration by at least one system, 2 had refused biopsy altogether in the original study. In 3 other cases, the new nodule had been biopsied, and the aspirate was cytologically diagnosed as benign (n = 2) or nondiagnostic (n = 1). In the remaining two cases, FNA had been performed on another nodule considered clinically more significant than the new lesion.
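The proportions reported for new nodules can be reproduced from the counts given in the text. The short Python check below uses only those counts; the helper name `pct` is illustrative.

```python
def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return f"{100 * part / whole:.1f}"


NEW_NODULES = 122   # new nodules seen at T5
PATIENTS = 232      # patients in the final study cohort

print(pct(99, NEW_NODULES))  # new nodules measuring 1 cm or less -> 81.1
print(pct(88, PATIENTS))     # patients with at least one new nodule -> 37.9
print(pct(4, NEW_NODULES), pct(8, NEW_NODULES))  # flagged for FNA -> 3.3 6.6
```

The 37.9% of patients with new nodules corresponds to the rounded figure of 38%, and the 3.3–6.6% of new nodules flagged for biopsy corresponds to the rounded range of 3–7%.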
For this analysis only, nodules assigned to the ATA's “unclassifiable” category were considered “intermediate suspicion” nodules. This solution, which is consistent with reports showing measured risks for such nodules of 16–18% (29,30), allowed us to avoid excluding patients with a single nodule that was ATA-unclassifiable.
Excluding 10 of the 122 new nodules that could not be classified with the ATA system (as specified in Table 1).
Four patients underwent thyroidectomy after more than 5 years of surveillance. Three had opted for surgery for cosmetic reasons or personal preferences, and all of their nodules were histologically benign. The fourth case was that of an older woman (age 73 at T0) with a single nodule at baseline. Surgery was performed after 7 years of surveillance because of an indeterminate cytology report, and the nodule proved to be follicular-variant PTC. At T0, it had been classified as a low- to intermediate-risk lesion (American Association of Clinical Endocrinologists, intermediate; ACR TIRADS, 3; ATA, low suspicion; EU-TIRADS, 4; K-TIRADS, 3), and the classifications assigned at T5 were identical to those at baseline.
Discussion
The aim of the present study was to evaluate the changes that occur over time in the ultrasonographic features of presumably benign thyroid nodules and the impact of these changes on the risk of malignancy estimated during surveillance using five widely endorsed sonographic classification systems. At the 5-year visit, the risk class assigned to more than 70% of the 432 nodules present at baseline was the same as or lower than that recorded at baseline. For 13–30% of the nodules, the risk class increased at T5, but the increase was rarely (4–6%) sufficient to satisfy the systems' requirements for FNA.
In patients with cytologically benign or sonographically non-suspicious nodules, changes in the lesions' ultrasound appearance are by no means specific for malignancy: none of our patients was diagnosed with a cancer as a result of such changes. A recent prospective study by Rosário et al. (7) showed that sonographic features are the best guide for repeating FNA on nodules with initially benign cytology. They reported re-aspiration of 26 nodules owing to the appearance of suspicious ultrasound features (with or without growth), and the results revealed three additional cases of malignancy (11.5%), all three involving nodules whose margins had become irregular (7). This finding raises the possibility that risk reassessment during follow-up might yield more accurate results if it were based on a smaller set of suspicious ultrasound features (i.e., some of those with the highest specificity [13–15]) than that used for the baseline assessment.
In our cohort, none of the risk-class increases was associated with a diagnosis of malignancy, suggesting that sonographic classification systems whose estimated risks of malignancy remain relatively consistent over time can be considered more reliable. Within each system, individual features are differentially weighted, and the weight assigned to a given feature frequently varies across systems. Robust evidence supporting the superiority of one of the proposed systems over the others is not currently available. The choice must be based on multiple aspects of the system [diagnostic accuracy (31,32), interobserver reliability (28), reliability over time] and of the setting in which it is to be used (equipment, staff experience, etc.).
As data on these aspects accumulate, a single system with broader consensus may one day emerge (as, for example, the Bethesda system did for the purpose of reporting thyroid cytology). In our study, the systems most likely to indicate higher malignancy risks at T5 (vs. those recorded at baseline) were the ACR system and EU-TIRADS (29.4% and 24.8% of the nodules, respectively). Echogenicity is the most important variable for EU-TIRADS risk estimates, whereas marked hypoechogenicity, punctate echogenic foci, and taller-than-wide shape are major determinants in the ACR system (each of which contributes three points to the final score). Our experience suggests that certain ultrasound features, including those considered key findings such as taller-than-wide shape and hypoechogenicity, are inadequately defined and described in most systems. As a result, estimates display substantial inter- and intra-observer variability. For instance, acquisition of a taller-than-wide shape was the cause of up to 47.4% of the risk-class increases recorded at T5 in our cohort and up to 63.1% of the new indications for FNA. This feature was defined, in strict accordance with the literature (18,26), as an anteroposterior diameter greater than the transverse one, with no specification of a minimum magnitude for the difference. Substantial variability in repeated ultrasound measurements is well documented, and it diminishes the reliability of reports of small differences in diameters (33). Decreases in echogenicity were also responsible for a number of risk-class increases in our cohort, and evaluations of this parameter are also associated with only fair to moderate interobserver concordance (Cohen kappa values of about 0.40–0.50) (28,34–36).
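The fragility of a taller-than-wide definition with no minimum difference can be made concrete with a short Python sketch. The definition in `taller_than_wide` follows the text when `min_margin_mm` is zero; the optional margin, the `error_mm` parameter, and the function names are hypothetical additions for illustration, and no particular measurement error is implied by the study.

```python
def taller_than_wide(ap_mm, transverse_mm, min_margin_mm=0.0):
    """True if the anteroposterior diameter exceeds the transverse one
    by more than min_margin_mm (zero reproduces the definition in the text)."""
    return ap_mm - transverse_mm > min_margin_mm


def flips_under_error(ap_mm, transverse_mm, error_mm, min_margin_mm=0.0):
    """True if shifting each diameter by up to +/- error_mm could change
    the taller-than-wide label.

    The difference between the two diameters can move by up to 2 * error_mm
    in either direction; the label is unstable if that range straddles the
    margin.
    """
    diff = ap_mm - transverse_mm
    lowest, highest = diff - 2 * error_mm, diff + 2 * error_mm
    return lowest <= min_margin_mm < highest
```

For example, a nodule measuring 10.2 mm by 10.0 mm is labeled taller-than-wide under the zero-margin definition, yet `flips_under_error(10.2, 10.0, error_mm=0.5)` is true, whereas a 15.0 mm by 10.0 mm nodule keeps its label under the same assumed error.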
Real variations in composition and echogenicity are also possible: in partially cystic nodules, production or reabsorption of colloid fluid, as well as degenerative changes, can occur over time and alter the sonographic appearance in variable ways (37–39). In our cohort, however, such changes did not result in new cancer diagnoses.
The appearance of new nodules during the first five years of follow-up was observed in 38% of our patients, although in 9 out of 10 of these cases the new nodules measured 1 cm or less, a figure consistent with other reports (40,41). As previously shown, these new lesions almost never cause significant increases in the volume of the thyroid as a whole (42), and based on our experience in the present study, very few meet the requirements for cytological evaluation (3–7%, depending on the system used).
Our study involved retrospective reevaluation of prospectively collected data on a selected, closely followed population of individuals with presumably benign thyroid nodules, and as such it has several limitations. In addition, the cohort we studied is relatively small when compared with those used to validate some of the ultrasound classification systems (43–45). However, those studies were cross-sectional, whereas ours is the first longitudinal assessment of these systems' performance. Our findings thus provide novel insights into the behavior of sonographically estimated risks of malignancy in the context of repeated assessments. They indicate that, regardless of the system used to define it, the ultrasound-based risk class of a presumably benign thyroid nodule shows little change during ongoing surveillance, and changes that warrant FNA are rare.
The appearance of new nodules is a frequent event, but fewer than 5% of these lesions will be classified as high risk, and FNA cytology will be warranted for only 3–7%. Current ATA guidelines suggest that thyroid ultrasound be repeated after 24 months (or not at all) for very-low-suspicion nodules measuring more than 1 cm, and no follow-up study at all is indicated for smaller nodules of this type (9,11). Collectively, our findings support the safety of less intensive follow-up for patients with presumably benign thyroid nodules, suggesting that the intervals indicated above could also be extended.
Footnotes
Acknowledgments
G.G., L.L., R.F., and V.R. contributed to this article as part of their PhD studies in biotechnologies and clinical medicine at the University of Rome, Sapienza. Writing support was provided by Marian Everett Kent, BSN.
Author Disclosure Statement
No competing financial interests exist.
