Abstract
Objective:
The primary purpose of this study was to validate the proposed modified 2009 American Thyroid Association Risk Stratification System (M-2009-RSS) in patients with thyroid cancer and to compare the findings with those of the 2009 ATA Risk of Recurrence (2009 ATA-RR) classification and the ongoing risk of recurrence (ongoing RR) system. The secondary purpose was to assess which risk stratification system best predicted structural incomplete response or no evidence of disease (NED) status at the end of follow-up.
Subjects and Methods:
This retrospective review included 149 patients with differentiated thyroid cancer who had low and intermediate 2009 ATA-RR and were treated at a single experienced center and followed up for a median of 6 years (range 3–12 years). Each patient was risk stratified using both the 2009 ATA-RR and the M-2009-RSS. The primary endpoints were 1) the best response to initial therapy, defined as either excellent response, biochemical or structural incomplete response, or indeterminate response; 2) clinical status at final follow-up, defined as either NED, biochemical incomplete response, structural incomplete response, indeterminate response, or recurrence (biochemical or structural disease identified after a period of NED); and 3) ongoing RR, defined as low or high risk according to several outcomes after initial treatment.
Results:
Mean age of included patients was 45.3±13 years. Both the 2009 ATA-RR and the M-2009-RSS provided clinically meaningful graded estimates with regard to the status of NED at the end of follow-up in low-risk patients (84% for 2009 ATA-RR and 74% for M-2009-RSS) or the likelihood of having persistent structural disease (0% for 2009 ATA-RR and 3.6% for the M-2009-RSS). When patients were classified as low risk, the positive predictive value (PPV) and negative predictive value (NPV) to predict structural disease were 0% and 88.7% for the 2009 ATA-RR, 3.6% and 86.5% for the M-2009-RSS, and 1.6% and 68.2% for the ongoing RR (p=0.022 and 0.055 of chi-square test for PPV and NPV, respectively).
Conclusions:
Despite expanding the definition of low risk to include small-volume lymph node metastases, minor extrathyroidal extension, and minimally invasive follicular thyroid cancer, the M-2009-RSS predicts clinical outcomes (structural incomplete response and NED at the end of follow-up) that are very similar to the previously validated 2009 ATA RR classification system.
Introduction
After validation of the ATA risk of recurrence classification, new data indicated that patients with low-risk DTC could be a larger group than previously considered (9–11). Due to the current changes observed in the worldwide classifications of the risk of recurrence in patients with DTC, we aimed to recategorize patients according to these new variables that appeared after the validation of the 2009 ATA RR classification (Haugen et al., 2015 ATA Management Guidelines for Patients with Thyroid Nodules and Differentiated Thyroid Cancer, in review; 12,13). As an example, several studies have shown that the presence of fewer than five metastatic lymph nodes and/or the presence of micrometastasis (<2 mm) regardless of the number of affected lymph nodes and/or minimal extrathyroidal extension (T3) carries a probability of recurrence not higher than 5–8% (14–23). The 2015 ATA guidelines will probably propose that this group of patients be considered together as low risk of recurrence (Haugen et al., in review).
Although most of the published studies endorse this new classification (14–23), the modified 2009 ATA risk stratification system (M-2009-RSS) has not yet been validated. Therefore, the aim of the present study was to describe both early and late clinical outcomes in the same cohort of low and intermediate risk of recurrence DTC patients who were risk stratified according to the 2009 ATA RR classification compared to the modified 2009 ATA risk stratification system from the ATA 2015 guidelines (24). Secondarily, we aimed to evaluate the impact on prediction of the final status after using the ongoing risk of recurrence classification obtained after the initial response to treatment.
Materials and Methods
We retrospectively reviewed our database containing 563 file records of patients with DTC who had been followed from January 2001 to December 2013. To be included in the analysis, patients were required to have undergone total thyroidectomy with or without lymph node dissection, to have received remnant ablation with radioiodine (RAI) after thyroid hormone withdrawal (THW), to have been classified as low or intermediate risk of recurrence according to the 2009 ATA RR classification (Table 1), and to have at least 3 years of follow-up after initial treatment (1). Of 288 DTC patients with low and intermediate risk of recurrence evaluated at our center, 139 were excluded because the follow-up was less than 3 years. With these criteria, 149 DTC patients were included in the study.
ETE, extrathyroidal extension; M0, absence of distant metastasis; N0, absence of lymph node metastasis; TB, thyroid bed; Tg, thyroglobulin.
Risk of recurrence classifications
The 2009 ATA RR classification is summarized in Table 1. The M-2009-RSS for low and intermediate risk of recurrence, as described in the ATA draft guidelines (Haugen et al., in review) can be seen in Figure 1. In addition, we classified patients with minor extrathyroidal extension as low risk even though they are considered to be borderline between intermediate and low risk and will probably be classified as intermediate risk in the final version of the M-2009-RSS.

Proposed 2009 American Thyroid Association Modified Risk Stratification System compared to 2009 American Thyroid Association risk of recurrence classification (frequency of structural disease reported in the literature between parentheses). FTC, follicular thyroid cancer; LN, lymph node; PTC, papillary thyroid cancer.
Ablation protocol
Our ablation protocol used fixed RAI activities based on the extent of initial disease. Patients typically received 3.70 GBq (100 mCi) 131I for low risk (2009 ATA RR) disease or 5.55 GBq (150 mCi) for intermediate risk (2009 ATA RR) disease. A low-iodine diet was prescribed from 1 week before RAI administration through 2 days afterwards. THW comprised at least 3 weeks without thyroid hormone, starting from thyroidectomy or THW for the diagnostic studies. RAI was administered following that interval, in all cases with thyrotropin (TSH) levels above 50 mIU/L. A posttherapy whole-body scan (WBS) was performed 5–7 days after therapeutic RAI administration.
Thyroglobulin and thyroglobulin antibody measurement
Samples for thyroglobulin (Tg) and thyroglobulin antibody (TgAb) measurement were taken on the day of ablative RAI administration. Tg and TgAb levels were assessed in one of two reference laboratories from Argentina using one of two commercial immunometric assays; the same laboratory and assay were used throughout a patient's follow-up. Tg assays comprised the Elecsys Tg Electrochemiluminescence Immunoassay (Roche Diagnostics GmbH, Mannheim, Germany), which has a 0.5 μg/L detection limit, or the Immulite 2000 Tg Chemiluminescence Assay (Siemens Corp., Los Angeles, CA), with a 0.9 μg/L functional sensitivity. TgAb assays comprised the Elecsys Anti-Tg Electrochemiluminescence Immunoassay (RSR Ltd., Pentwyn, Cardiff, UK), or the Immulite 2000 Anti-TG Ab chemiluminescent immunometric assay method (Siemens). For both TgAb assays, values >20 IU/mL were considered to be positive and to render Tg measurements uninterpretable. Patients with positive TgAb were excluded from the study.
Clinical management during follow-up
Clinical status in response to initial therapy was assessed using THW-stimulated (n=98) or recombinant human (rh)TSH-stimulated (n=51) Tg testing and neck ultrasonography (US) in all patients and diagnostic WBS in intermediate-risk patients (150 MBq [4 mCi] activity) performed 9–18 (mean 12±3) months after ablation. Neck US using an 11 MHz linear array transducer was performed every 6 months after ablation. Patients with measurable stimulated or unstimulated Tg, suspicious neck US findings, or both during follow-up underwent morphological or functional imaging or both, including computed tomography (CT) (n=19 [12%]) or [18F]fluorodeoxyglucose positron emission tomography (FDG-PET) (n=15 [10%]). All US suspicious nodules ≥1 cm in diameter underwent fine-needle aspiration biopsy (FNAB) with measurement of Tg in the washout of the aspirate.
After ablation, all patients were kept on a suppressed TSH level until January 2008 when all patients started thyroid hormone therapy according to the LATS recommendations for each risk of recurrence group (target TSH: <0.1 mIU/L for intermediate risk; 0.4–1 mIU/L for low risk; and thyroid hormone replacement for very low risk LATS classification) (2).
Clinical outcome definitions
The primary endpoint of the study was the best response to initial therapy (surgery+RAI ablation) assessed at the 12-month (±3 months) follow-up visit based on stimulated Tg values, neck US, diagnostic WBS, and risk-appropriate additional functional and cross-sectional imaging (5,6,10). Excellent response to therapy was defined as a stimulated Tg <1 μg/L in the absence of TgAb, plus absent or <0.1% thyroid bed uptake on diagnostic WBS (if done), with a normal postoperative neck ultrasound. Patients demonstrating a stimulated Tg value between 1 and 10 μg/L without structural evidence of disease, or having nonspecific findings in the ultrasound or persistent measurable TgAb, were classified as having an indeterminate response. Those patients who showed a stimulated Tg level >10 μg/L or detectable Tg levels under thyroid hormone therapy without any findings in US were classified as having biochemical incomplete response. Patients with structural evidence of disease (with or without abnormal Tg values) were classified as having structural persistent disease.
The second endpoint of the study was clinical status at time of final follow-up (5,10). Patients were classified as having no evidence of disease (NED) if at the time of final follow-up the suppressed Tg was <1 μg/L, Tg antibodies were negative, neck US was free of suspicious signs, and there were no pathological findings on any other imaging studies performed for clinically indicated reasons (WBS, radiography, CT, FDG-PET, or any other modality) or in any biopsy specimen. Patients with persistent disease at the time of final follow-up were classified as either indeterminate or biochemical or structural incomplete response using the same definitions used in the evaluation of response to initial therapy. Patients who had structural or biochemical evidence of disease identified following a period of NED were classified as having recurrent disease. Disease sites were classified as local (thyroid bed), lymph node metastasis confirmed by FNAB with positive cytology, and/or distant metastasis confirmed by biopsy and/or imaging.
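The response categories defined above can be read as a small decision rule. The following Python sketch is illustrative only: the field names, the order in which the rules are checked, and the handling of cases the text does not spell out (for example, a stimulated Tg <1 μg/L with WBS uptake ≥0.1%) are assumptions, not the study's procedure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InitialAssessment:
    """Findings at the 12 (±3) month visit; all field names are hypothetical."""
    stimulated_tg: float                     # stimulated Tg, μg/L
    tgab_positive: bool = False              # persistent measurable TgAb
    structural_disease: bool = False         # structural evidence of disease
    nonspecific_us: bool = False             # nonspecific ultrasound findings
    detectable_tg_on_therapy: bool = False   # detectable Tg under thyroid hormone therapy
    wbs_uptake_pct: Optional[float] = None   # thyroid bed uptake on diagnostic WBS; None if not done


def classify_initial_response(a: InitialAssessment) -> str:
    # Structural disease takes priority, with or without abnormal Tg.
    if a.structural_disease:
        return "structural incomplete"
    # Stimulated Tg >10 μg/L, or detectable Tg on therapy, without structural findings.
    if a.stimulated_tg > 10 or a.detectable_tg_on_therapy:
        return "biochemical incomplete"
    # Assumption: the WBS criterion counts as met when no scan was done.
    wbs_ok = a.wbs_uptake_pct is None or a.wbs_uptake_pct < 0.1
    if a.stimulated_tg < 1 and not a.tgab_positive and not a.nonspecific_us and wbs_ok:
        return "excellent"
    # Assumption: every remaining case (Tg 1 to 10 μg/L, nonspecific US,
    # persistent TgAb) falls into the indeterminate category.
    return "indeterminate"
```

For instance, `classify_initial_response(InitialAssessment(stimulated_tg=5.0))` falls through to the indeterminate category under these assumptions.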
Ongoing risk of recurrence classification
After defining the initial response to treatment, we reclassified the patients according to the ongoing RR classification, using the following criteria. Low ongoing risk of recurrence was assigned to patients with 1) an initial excellent response; 2) an indeterminate response without any suspicious US finding, or with stable detectable stimulated Tg levels between 1 and 10 μg/L, or with stable or decreasing TgAb; or 3) a biochemical incomplete response with stable or decreasing stimulated Tg levels >10 μg/L or stable or decreasing Tg levels <5 μg/L under thyroid hormone therapy. High ongoing risk of recurrence was assigned to patients with 1) structural persistence as the initial response to treatment; 2) an indeterminate response with suspicious US findings (suspicious lymph nodes <1 cm in the largest diameter that were not evaluated with FNAB); 3) a biochemical incomplete response with detectable Tg levels >5 μg/L under thyroid hormone therapy or increasing Tg levels during follow-up (whether on thyroid hormone therapy or after TSH stimulation); or 4) a biochemical incomplete response due to increasing TgAb titers during follow-up (Table 2).
TgAb, thyroglobulin antibody; US, ultrasonography.
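The ongoing RR criteria can be sketched as a function of the initial response plus a few follow-up findings. This is a minimal illustration under stated assumptions: the parameter names and trend labels are hypothetical, a Tg of exactly 5 μg/L on therapy (not covered by the criteria) is treated as low risk, and an indeterminate response is split only on the US finding.

```python
from typing import Optional


def ongoing_risk(initial_response: str,
                 suspicious_us: bool = False,
                 tg_on_therapy: Optional[float] = None,
                 tg_trend: str = "stable",
                 tgab_trend: str = "stable") -> str:
    """Return "low" or "high" ongoing risk of recurrence.

    Trends are one of "decreasing", "stable", "increasing" (hypothetical labels).
    """
    if initial_response == "excellent":
        return "low"
    if initial_response == "structural incomplete":
        return "high"
    if initial_response == "indeterminate":
        # Suspicious lymph nodes <1 cm not evaluated with FNAB move the patient to high risk.
        return "high" if suspicious_us else "low"
    if initial_response == "biochemical incomplete":
        rising = tg_trend == "increasing" or tgab_trend == "increasing"
        # Assumption: Tg of exactly 5 μg/L on therapy is treated as low risk.
        high_tg = tg_on_therapy is not None and tg_on_therapy > 5
        return "high" if (rising or high_tg) else "low"
    raise ValueError(f"unknown initial response: {initial_response!r}")
```

Keeping the initial response as an explicit input mirrors the two-step design described in the text: the static classification is assigned first, and the ongoing category is derived only after the response to therapy is known.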
Statistical analysis
Epidemiological data are presented as the mean±SEM, with median and range when appropriate. To evaluate significant differences in data frequency, we analyzed two-way contingency tables by the Fisher exact test or 2×3 contingency tables by the chi-squared test.
The agreement between different risk stratification systems was calculated using Cohen's k coefficient; a value of 1 implies perfect agreement and a value <1 implies less than perfect agreement. It was evaluated using the Landis and Koch semiquantitative scale (poor agreement ≤0.20, fair agreement 0.21–0.40, moderate agreement 0.41–0.60, good agreement 0.61–0.80, and very good agreement 0.81–1.0) (24).
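As a sketch of how this agreement statistic is obtained, the snippet below computes Cohen's k for two sets of category labels and maps it to the semiquantitative scale cited above. The function names and the example labels are hypothetical and do not come from the study data.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's k for two equal-length sequences of category labels."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(ratings_a)
    # Observed proportion of patients given the same category by both systems.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from the marginal category frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return (observed - expected) / (1 - expected)


def agreement_label(kappa):
    """Map k to the semiquantitative scale used in the text."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"


# Hypothetical labels for eight patients under two systems.
system_a = ["low"] * 4 + ["intermediate"] * 4
system_b = ["low"] * 6 + ["intermediate"] * 2
kappa = cohens_kappa(system_a, system_b)
print(kappa, agreement_label(kappa))
```

In this toy example the two systems agree on 6 of 8 patients while chance alone would give 0.5, so k = (0.75 − 0.5)/(1 − 0.5) = 0.5.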
Diagnostic accuracy was calculated according to Galen and was based on true-positive (TP), true-negative (TN), false-positive (FP), and false-negative (FN) results. The positive predictive value (PPV) was TP/(TP+FP) and the negative predictive value (NPV) was TN/(FN+TN); the 95% confidence interval [95% CI] of all estimates was also evaluated (25,26).
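The predictive values defined above can be sketched in a few lines. The accuracy formula (TP+TN)/total is the standard definition and is assumed here. The text does not state which confidence interval method was used, so the Wilson score interval below is purely an assumption for illustration, and the counts in the example are hypothetical.

```python
import math


def predictive_values(tp, fp, tn, fn):
    """Return PPV, NPV and accuracy from the four cell counts."""
    if tp + fp == 0 or tn + fn == 0:
        raise ValueError("PPV or NPV is undefined when its denominator is zero")
    total = tp + fp + tn + fn
    return {
        "ppv": tp / (tp + fp),
        "npv": tn / (fn + tn),
        "accuracy": (tp + tn) / total,
    }


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%).

    Assumption: used here only as an example; the study's CI method is not stated.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts, not taken from the study tables.
values = predictive_values(tp=8, fp=2, tn=85, fn=5)
ppv_low, ppv_high = wilson_interval(8, 8 + 2)
print(values, (ppv_low, ppv_high))
```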
Statistical analysis was performed using the SPSS statistical software (SPSS version 10.0, SPSS Inc., Chicago, IL). We considered p<0.05 to be statistically significant for all analyses.
Results
Each of the 149 low- and intermediate-risk patients had total thyroidectomy in a specialized center with subsequent RAI remnant ablation following traditional THW. The median follow-up in the whole cohort was 6 years (range 3–12 years; mean 7.7±3.5 years). As can be seen in Table 3, the majority of patients had classic papillary thyroid cancer (PTC) (70%), 86% were female, and 72% were AJCC stage I (27).
AJCC, American Joint Committee on Cancer; NED, no evidence of disease; RA, remnant ablation; SEM, standard error mean.
As expected, while only 46% of patients were classified as low risk in the 2009 ATA-RR system, 75% were classified as low risk by the M-2009-RSS. Conversely, 54% were classified as 2009 ATA-RR intermediate risk, while 25% were classified as intermediate risk by the M-2009-RSS (see Table 3).
For the entire cohort, the best response to initial therapy was excellent in 47%, structural persistent disease in 12% (lymph nodes in all 18 patients), biochemical incomplete response in 15%, and indeterminate response in 26%. At the time of final follow-up, 66% were classified as NED, 4% as having structural persistent disease, 9% as having biochemical persistent disease, and 19% as having an indeterminate response (Table 4). On the comparison between the initial responses to treatment for the respective low-risk categories of 2009 ATA-RR and M-2009-RSS, no statistically significant differences were observed (excellent response 67% vs. 55%, respectively, p=0.18; and structural incomplete response 2.9% vs. 9.8%, respectively, p=0.14). Also, when the respective intermediate-risk categories from 2009 ATA-RR and M-2009-RSS classifications were compared, again no statistically significant differences were observed (excellent response 30% vs. 22%, p=0.47; and structural incomplete response 20% vs. 19%, p=0.91) (Table 5).
2009-ATA RR, 2009 ATA Risk of Recurrence; M-ATA-2009 RSS, Modified 2009 ATA Risk stratification system.
Low-risk 2009-ATA RR versus low-risk M-2009-ATA RSS excellent response (p=0.18) and structural incomplete response (p=0.14). Intermediate RR ATA 2009 versus intermediate M-2009-ATA RSS excellent response (p=0.47) and structural incomplete response (p=0.91).
Agreement among classifications
When the ongoing RR was assessed, we classified 22/149 (14.8%) patients as high ongoing risk of recurrence and 127/149 (85.2%) patients as low ongoing risk of recurrence.
Using the Cohen's k coefficient, the agreement between 2009 ATA-RR and M-2009-RSS was classified as moderate, but fair agreement was found between 2009 ATA-RR or the M-2009-RSS and the ongoing RR stratification (Table 6).
RR, risk of recurrence.
Cohen's coefficient (k) and confidence intervals (CIs) are presented.
Comparison among risk stratification systems (2009 ATA-RR, M-2009-RSS, and ongoing RR) for different levels in predicting final outcome
To assess which risk stratification system had the better predictive value in any of the low or intermediate risk of recurrence, the 2009 ATA-RR, the M-2009-RSS, and the ongoing RR systems were correlated with the final outcome (NED status) (Table 7).
As expected, 84% of the low-risk patients in the 2009 ATA-RR system, 74% in the low-risk M-2009-RSS, and 74% in the low-risk ongoing RR stratification had NED at final follow-up, without any statistically significant differences (p=0.23 for chi-square test). The frequency of patients with NED differed significantly between the ongoing high-risk category and the 2009 ATA-RR and M-2009-RSS intermediate-risk categories (50%, 40%, and 18% for 2009 ATA-RR, M-2009-RSS, and ongoing RR stratification, respectively; p=0.03 for chi-square test).
On the other hand, the rate of patients with structural incomplete response in the low-risk group was similar among the three classifications (p=0.22 for chi-square test), but with borderline significance for the intermediate risk group for the 2009 ATA-RR and M-2009-RSS with respect to the high-risk ongoing RR category: 11%, 13%, and 32%, for 2009 ATA-RR, M-2009-RSS, and ongoing RR stratification, respectively (p=0.06 for chi-square test).
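The low-risk NED comparison in this subsection (84%, 74%, and 74%) can be checked approximately with a Pearson chi-square test on a 2×3 table. The counts below are not taken from Table 7; they are back-calculated from the rounded percentages in the text (46% and 75% of 149 patients, and the 127 low ongoing-risk patients), so they are an approximation and an assumption of this sketch. For 2 degrees of freedom the chi-square upper-tail probability has the closed form exp(−x/2), which avoids any external library.

```python
import math


def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


def p_value_two_dof(stat):
    """Upper-tail probability of a chi-square variable with exactly 2 df."""
    return math.exp(-stat / 2)


# Approximate counts reconstructed from rounded percentages (assumption):
# columns are 2009 ATA-RR low risk, M-2009-RSS low risk, ongoing RR low risk.
ned = [58, 83, 94]
not_ned = [11, 29, 33]
stat, dof = pearson_chi_square([ned, not_ned])
p = p_value_two_dof(stat)
print(round(stat, 2), dof, round(p, 2))
```

With these reconstructed counts the p value comes out close to the reported 0.23; small differences from the published figure would be expected because of rounding.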
PPV, NPV, and accuracy of the three risk stratification systems for different levels in predicting structural persistent disease as final outcome
We evaluated the ability of different risk stratification systems to predict structural persistent disease as final outcome by determining PPV, NPV, and accuracy (Table 8).
There is uncertainty in only one direction; this range is the 97.5% CI.
NPV, negative predictive value; PPV, positive predictive value.
When patients were classified as low risk by the 2009 ATA-RR, by the M-2009-RSS, and by the ongoing RR, the PPV to predict structural persistent disease was 0%, 3.6%, and 1.6%, respectively (p=0.22 for chi-square test) and the NPV was 88.7%, 86.5%, and 68.2%, respectively (p=0.06 for chi-square test). The 2009 ATA-RR accuracy was significantly higher than the M-2009-RSS and ongoing RR accuracies: 48% [39%–56%], 24% [17%–32%], and 11% [7%–18%], respectively [95% CI] (Table 8).
When patients were classified as intermediate RR by the 2009 ATA-RR and by the M-2009-RSS and high risk by the ongoing RR, the PPV to predict structural persistent disease was 11%, 13%, and 32%, respectively (p=0.06 for chi-square test) and the NPV was 100%, 96%, and 98%, respectively (p=0.22 for chi-square test). The ongoing RR accuracy was significantly higher than the 2009 ATA-RR and the M-2009-RSS accuracies: 88% [82%–93%], 76% [68%–82%], and 52% [44%–61%], respectively [95% CI] (Table 8).
Discussion
By using the 2009 ATA RR, the M-2009-RSS, and the ongoing risk of recurrence prognostic systems to risk stratify the same cohort of 149 low- and intermediate-risk DTC patients treated with total thyroidectomy and RAI ablation at a single thyroid cancer specialty center, we have confirmed again the utility of the ATA 2009 system (5,7,11) and for the first time, demonstrated the clinical utility of the M-2009-RSS. The 2009 ATA RR system has already been validated in cohorts of DTC patients in Argentina (4), Brazil (11), Italy (7), and New York (5) confirming its clinical applicability across a wide spectrum of patients and health care systems.
Our data demonstrate that both the 2009 ATA RR and the M-2009-RSS effectively risk stratify patients with regard to a broad spectrum of clinical outcomes, even though the low-risk category in the M-2009-RSS was expanded to include tumors beyond intrathyroidal PTCs (see Fig. 1). As would be expected, the precise estimates for remission, persistent disease, and recurrence might vary between risk categories based on the specific criteria used to define low risk and intermediate risk. Therefore, for clear communication between clinicians and in research reports, it is important to specify which risk is being referred to and which classification system (with its specific definitions) is being used when describing a patient as either intermediate risk or low risk.
With regard to predicting NED status at the end of follow-up, the 2009 ATA RR and the M-2009-RSS demonstrated similar results (84% vs. 74%, respectively). As can be seen in Table 6, the 2009 ATA RR system classified a larger number of patients as intermediate risk (n=80) than the M-2009-RSS (n=37). This is primarily related to the classification of N1 patients, who are all grouped as intermediate risk in the 2009 ATA RR system. It appears that with inclusion of some of the N1 patients in the low-risk category, the use of the M-2009-RSS resulted in slightly lower rates of excellent responses (74%). In this investigation, we confirm the data from a recent review demonstrating that the risk of structural disease recurrence can vary from 4% in patients with fewer than five metastatic lymph nodes to 5% if all involved lymph nodes are <0.2 cm (9). The 0% risk of structural recurrence in the ATA 2009 low-risk category and the 3.6% risk of structural recurrence in the M-2009-RSS low-risk category are consistent with these previous findings.
The ongoing risk of recurrence, proposed initially by Tuttle et al. (5) and validated in 588 DTC patients stratified according to the response to therapy after 2 years of follow-up, was later confirmed by Castagna et al. (7) who showed that the reclassification of DTC patients on the basis of the results observed after initial therapy (total thyroidectomy and RAI ablation), particularly in the intermediate/high-risk patients, was an effective way of classifying patients. This ongoing risk of recurrence was designated as delayed risk of recurrence by these authors, and they concluded that this dynamic staging allowed establishing a better plan to adapt the subsequent follow-up. For example, an excellent response to treatment allows excluding a significant number of patients from unnecessary intensive work-up (5,7). Perhaps the lower accuracy percentages observed in our study might be related to the absence of inclusion of a high risk RR for the 2009 ATA RR.
There are several other limitations in our study; for instance, each of the individual groups has a low number of patients when divided by risk of recurrence, which might impact the statistical results. As such, the results of this study remain to be validated by larger multicenter studies that can pool data in order to minimize confounders. In addition, every patient in our cohort, even those classified as low risk, was treated with surgery and RAI, a common practice in the past in most countries of Latin America (2). Currently, this practice of care may not apply to many institutions where low-risk patients do not receive RAI. Therefore, additional validation studies are required in patients treated without RAI ablation.
In conclusion, both the 2009 ATA RR and the M-2009-RSS appear to similarly risk stratify patients with regard to the main clinical outcomes (structural incomplete response and NED) even though the M-2009-RSS expanded the definition of low risk to include patients beyond those with intrathyroidal classical papillary thyroid cancer. Furthermore, ongoing RR is a better predictor of the structural incomplete response at final follow-up than static risk predictions established at the time of diagnosis.
Footnotes
Author Disclosure Statement
Fabián Pitoia and R. Michael Tuttle have been consultants for Genzyme-Sanofi. The other authors declare no competing financial interests.
