Abstract
Background:
Thyroid cancer is unique for having age as a staging variable. Recently, the commonly used age cut-point of 45 years has been questioned.
Objective:
This study assessed alternate staging systems on the outcome of overall survival, and compared these with current National Thyroid Cancer Treatment Cooperative Study (NTCTCS) staging systems for papillary and follicular thyroid cancer.
Methods:
A total of 4721 patients with differentiated thyroid cancer were assessed. Five potential alternate staging systems were generated at each age cut-point, in five-year increments from 35 to 70 years, and tested for model discrimination (Harrell's C-statistic) and calibration (R²). The best five models for papillary and follicular cancer were further tested with bootstrap resampling and significance testing for discrimination.
Results:
The best five alternate papillary cancer systems had age cut-points of 45–50 years, with the highest-scoring model using 50 years. No significant difference in C-statistic was found between the best alternate and current NTCTCS systems (p = 0.200). The best five alternate follicular cancer systems had age cut-points of 50–55 years, with the highest-scoring model using 50 years. All five best alternate staging systems performed better than the current system (p = 0.003–0.035). There was no significant difference in discrimination between the best alternate system (cut-point age 50 years) and the best system of cut-point age 45 years (p = 0.197).
Conclusions:
No alternate papillary cancer systems assessed were significantly better than the current system. New alternate staging systems for follicular cancer appear to be better than the current NTCTCS system, although they require external validation.
Introduction
Thyroid cancer is unique among malignancies for including age as a staging variable. Early prognostic studies recognized the association of younger age with excellent survival, with some studies suggesting a worse prognosis in patients older than 40 years of age (1–4), or older than 50 years of age (5). In 1983, the second edition of the American Joint Committee on Cancer Manual for Staging Cancer first settled on a dichotomous cut-point of 45 years (6). This found support in a landmark study by Tubiana et al. (7), and the authors' group, the National Thyroid Cancer Treatment Cooperative Study (NTCTCS), adopted the cut-point of age 45 years for their unique staging system prior to registry inception in 1986 (Table 1) (8).
The final stage is the highest of the individual primary tumor size, primary tumor description, and metastases scores. Age is in years.
Several recent studies have suggested that this dogma of an altered prognosis after the age of 45 years should be reevaluated. In a previous analysis, the authors' group noted that most deaths in thyroid cancer patients appear to occur in those diagnosed after the age of 55 years (9). Other studies have variously suggested under-staging of certain patients younger than 45 years of age (10), or that mortality does not rise before an age at diagnosis of 50 years (11) or 55 years (12). In analysis of data from EUROCARE, the relative risk of death increased in the group of women older than 55 years of age (13).
This controversy is a good opportunity to reevaluate the NTCTCS staging system. In addition, if a different age cut-point could be documented in a large registry such as the NTCTCS, then the information might be useful to the American Joint Committee on Cancer (AJCC)/Union for International Cancer Control (UICC) groups working on the eighth editions of their staging systems.
Methods
Registry protocol and data collection
The data collection and analytical methods of the NTCTCS have been described elsewhere (8,9,14–20). Briefly, 11 North American centers contributed patient data, with registration beginning in January 1987 (this data analysis captures patients registered through to 2011). New patients were registered within three months of their initial surgery. Institutional Review Boards (IRB) of contributing centers approved the study, and ongoing oversight of the project occurs through the University of Texas MD Anderson Cancer Center IRB, where the central database is currently maintained.
Management of patients was nonrandomized and was solely at the discretion of their treating physicians on the basis of perceived best practice and clinical need at that period of time at their institution, independent of registry participation. Pre-specified baseline demographic, clinical, histologic, and radiologic data were entered into a PC-based clinical data management system locally (Medlog v2000-2, Incline Village, NV) and transmitted to the central registry database. Clinical status, investigations, and treatments were updated on a yearly basis. Where possible, the causes of death were reviewed and mortality data confirmed through the Social Security Death Index for U.S. patients and the Ontario Registrar General for Canadian patients.
Principles of restaging
The general principles of the NTCTCS staging system reevaluation were:
• Any alteration to the staging system must have been data-driven and internally valid. Any new staging systems devised were to be tested against the current staging system, specifically using statistical methods allowing for significance testing.
• Multiple age cut-points in addition to 45 years were assessed.
• Overall survival was the outcome of interest for primary analysis, to be consistent with the AJCC/UICC systems.
• Any new staging system devised should be easily transferable to Tumor Node Metastasis (TNM) format, for ease of external validation and potential adaption to other data sets.
• Papillary and follicular cancer were assessed separately, consistent with the current system.
• Current NTCTCS staging variables were assessed. The current NTCTCS staging system includes all anatomic/pathologic variables collected by the registry (e.g., there is no distinction between central and lateral cervical lymph node status in the data entry; thus any future NTCTCS staging system can only specify general cervical lymph node status, and maximal tumor diameter is recorded as <1 cm, 1–4 cm, or >4 cm).
• No other nonanatomic variables, such as sex, were considered (as per current AJCC/UICC practice), with the exception of an exploratory analysis incorporating the tall-cell variant of papillary thyroid cancer.
Using these principles, alternate potential staging systems were created for testing against the current system.
Specific approaches to creating alternate staging systems
First, the data set was separated into patients with papillary thyroid cancer and those with follicular thyroid cancer. Cut-points were then created for age from 35 to 70 years in five-year increments. Using Kaplan–Meier estimators, univariate overall 5- and 10-year survivals were generated in papillary cancer patients for each potential prognostic variable currently in the NTCTCS, at all the age cut-points defined above. Independent significance of each prognostic variable was also assessed at all the age cut-points through Cox models. Based on these assessments, five potential alternate staging systems were created at each age cut-point (rather than one per cut-point, because the estimates may be subject to interaction, confounding, and collinearity). The rules for this were: (a) for each potential alternate staging system, there must be four stages (I–IV); and (b) plausibility of stage progression was ensured (e.g., while microscopic extraglandular extension and macroscopic extraglandular extension could have the same stage designation or macroscopic extension could be staged higher, microscopic extraglandular extension could not be staged above macroscopic).
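As an illustration only, the cut-point grid and rules described above can be sketched in a few lines of Python. The function names are hypothetical and not part of the registry software; treating "older" as age at or above the cut-point is an assumption, chosen to match the <50 / ≥50 years split used later in the text.

```python
from typing import Dict, List

# Age cut-points from 35 to 70 years in five-year increments (as in the text).
AGE_CUT_POINTS: List[int] = list(range(35, 71, 5))


def is_older(age_years: int, cut_point: int) -> bool:
    """Dichotomize age; 'older' means age >= cut_point (an assumption)."""
    return age_years >= cut_point


def final_stage(size_score: int, description_score: int, metastases_score: int) -> int:
    """Final stage (1-4) is the highest of the three component scores."""
    return max(size_score, description_score, metastases_score)


def is_plausible(stage_by_feature: Dict[str, int], ordered_features: List[str]) -> bool:
    """Rule (b): a less severe feature is never staged above a more severe one.

    ordered_features lists the features from least to most severe.
    """
    stages = [stage_by_feature[name] for name in ordered_features]
    return all(lower <= higher for lower, higher in zip(stages, stages[1:]))


# Hypothetical candidate: microscopic extension stage II, macroscopic stage III.
candidate = {"microscopic_extension": 2, "macroscopic_extension": 3}
order = ["microscopic_extension", "macroscopic_extension"]
print(AGE_CUT_POINTS, is_plausible(candidate, order))
```

A candidate system that staged microscopic extension above macroscopic extension would fail `is_plausible` and be discarded before any survival modeling.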
Method of testing the potential alternate staging systems
To test the potential alternate staging systems for papillary cancer, Kaplan–Meier plots were generated and Cox proportional hazards models were performed on each potential alternate staging system, with the outcome of overall survival. From the potential alternate systems across all age cut-points, the best five were chosen for detailed analysis based on model discrimination to the overall survival data, specifically through the highest Harrell's C-statistic values (analogous to area under the curve in receiver operating characteristic curves) (21). For these highest-scoring staging systems and the current NTCTCS staging system, confidence intervals for the Harrell's C-statistic were generated using bootstrap resampling (22) with 1000 replicates, and hypothesis testing was performed comparing the alternate systems to the current NTCTCS staging system to identify whether any differences in the Harrell's C-statistic were statistically significant. Model calibration was assessed by calculating R² = 1 – exp([(L0 – Lp) × 2/n]) (23), where L0 is the log partial likelihood of the null model (no covariates), Lp is the log partial likelihood of the fitted model, and n is the number of subjects. The same assessment was performed for follicular cancer.
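The discrimination comparison described above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' code: it uses the stage itself as the risk score, skips pairs with tied follow-up times, and summarizes the paired bootstrap as a percentile interval for the difference in C-statistic. All three choices are assumptions, and the function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def harrell_c(times: Sequence[float], events: Sequence[int], risk: Sequence[float]) -> float:
    """Harrell's C: share of usable pairs in which the patient who died
    earlier had the higher risk score (ties in risk count as one half)."""
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable when i died and had strictly shorter follow-up than j.
            if events[i] and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


def bootstrap_c_difference(times, events, risk_a, risk_b,
                           n_boot: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """95% percentile interval for C(risk_a) - C(risk_b), resampling patients
    with replacement so both systems are scored on the same replicate."""
    rng = random.Random(seed)
    n = len(times)
    diffs: List[float] = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        t = [times[k] for k in idx]
        e = [events[k] for k in idx]
        try:
            diff = (harrell_c(t, e, [risk_a[k] for k in idx])
                    - harrell_c(t, e, [risk_b[k] for k in idx]))
        except ValueError:
            continue  # replicate contained no usable pairs
        diffs.append(diff)
    if not diffs:
        raise ValueError("no valid bootstrap replicates")
    diffs.sort()
    lower = diffs[int(0.025 * len(diffs))]
    upper = diffs[min(len(diffs) - 1, int(0.975 * len(diffs)))]
    return lower, upper
```

If the interval for the difference excludes zero, the two staging systems would be judged to differ in discrimination at roughly the 5% level. The quadratic pair loop is adequate for illustration but would be slow on a registry of several thousand patients with 1000 replicates.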
Results
There were 4721 patients with differentiated thyroid cancer and complete staging information registered through to 2011, representing a total of 31,356 patient-years of follow-up. Of these, 4159 had a diagnosis of papillary thyroid cancer and 562 had follicular thyroid cancer (including Hürthle cell cancers and poorly differentiated cancers without areas of papillary architecture). Table 2 summarizes demographic and clinical tumor characteristics of included patients. There were 277 deaths in the papillary thyroid cancer patients, and 104 deaths in patients with follicular cancer.
Papillary cancer includes 3205 conventional papillary cancers, 710 mixed follicular–papillary histology, 108 occult sclerosing, 132 tall-cell cancers, and 4 columnar cell cancers. Follicular cancer includes 320 well-differentiated follicular cell cancers, 192 Hürthle cell cancers, and 50 poorly differentiated cancers.
Generation and testing of potential alternate staging systems
The analyses of prognostic variable associations with survival (univariate and multivariate) at each age cut-point are detailed in the accompanying Supplementary Appendices (Supplementary Appendix S1 for papillary cancer and Supplementary Appendix S2 for follicular cancer; Supplementary Data are available online at
Papillary cancer comparison of potential alternate systems against the current NTCTCS system
The Harrell's C-statistic and bootstrapped confidence intervals, and R² values for the current NTCTCS staging system and the best five alternate papillary thyroid cancer staging systems are detailed in Table 3. Table 3 also gives the p-values for the differences in C-statistic between the current NTCTCS system and each alternate system.
Model details are provided in Supplementary Appendices S1 and S2.
The best alternate staging system was model B, incorporating a cut-point of 50 years (Table 4). The best alternate staging system downstaged a substantial proportion of patients (2803 stage I patients, 866 stage II, 286 stage III, and 126 stage IV compared with 1911 stage I, 1219 stage II, 806 stage III, and 145 stage IV patients under the current staging system). Although the Harrell's C-statistic was numerically higher than the current NTCTCS system and visually the stages are better separated on Kaplan–Meier plots (Fig. 1A current system; Fig. 1B best alternate system), the difference was not statistically significant. The best alternate systems with cut-points age 50 and 45 years were also not significantly different for Harrell's C-statistic (p = 0.320).
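The extent of downstaging can be made concrete by converting the stage counts quoted above into proportions. The short sketch below does only that arithmetic; the counts come from the text (each set sums to 4081 patients), and the function name is hypothetical.

```python
STAGES = ("I", "II", "III", "IV")

# Papillary cancer stage counts quoted in the text.
current = dict(zip(STAGES, (1911, 1219, 806, 145)))
alternate = dict(zip(STAGES, (2803, 866, 286, 126)))


def stage_shares(counts):
    """Return the proportion of patients assigned to each stage."""
    total = sum(counts.values())
    return {stage: n / total for stage, n in counts.items()}


current_shares = stage_shares(current)
alternate_shares = stage_shares(alternate)
for stage in STAGES:
    print(f"Stage {stage}: current {current_shares[stage]:.1%}, "
          f"alternate {alternate_shares[stage]:.1%}")
```

Under the best alternate system, the share of patients in stage I rises from roughly 47% to roughly 69%, with most of the shift coming out of stages II and III.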

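Kaplan–Meier plots of overall survival for papillary thyroid cancer under (A) the current NTCTCS staging system, (B) the best alternate staging system, and (C) the best alternate staging system incorporating tall-cell variant (Fig. 1).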
Follicular cancer comparison of potential alternate systems against the current NTCTCS system
The Harrell's C-statistic and bootstrapped confidence intervals, and R² values for the current NTCTCS staging system and the best five alternate follicular thyroid cancer staging systems are detailed in Table 3. Table 3 also gives the p-values for the differences in C-statistic between the current NTCTCS system and each alternate system.
All five best alternate follicular thyroid cancer staging systems had statistically significantly better discrimination than the current NTCTCS system. The numerically best alternate staging system had a cut-point of 50 years (model E; Table 5). The best alternate staging system again downstaged a substantial proportion of patients (233 stage I patients, 124 stage II, 129 stage III, and 70 stage IV compared with 144 stage I, 67 stage II, 270 stage III, and 75 stage IV patients under the current staging system). The Kaplan–Meier plots for the current follicular thyroid cancer and best alternate staging systems are shown in Figure 2A and B, respectively. When the best alternate system (cut-point age 50 years) was tested against the best system using a cut-point of age 45 years (an identical staging system except for the age cut-point), there was no significant difference (difference in Harrell's C-statistic = 0.006; p = 0.197). Testing the best alternate system (cut-point age 50 years) against the current papillary thyroid cancer NTCTCS system in follicular thyroid cancer patients (to assess the need for separate ongoing staging systems) suggested that the best alternate system (cut-point age 50 years) was significantly better (difference in Harrell's C-statistic = 0.02; p = 0.048).

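Kaplan–Meier plots of overall survival for follicular thyroid cancer under (A) the current NTCTCS staging system and (B) the best alternate staging system (Fig. 2).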
Exploratory analysis—tall-cell variant of papillary cancer
As an exploratory analysis, the inclusion of the tall-cell variant of papillary thyroid cancer in potential staging systems was also examined (analogous to the way in which the staging system for follicular thyroid cancer includes the poorly differentiated subtype). This analysis was performed after finding that patients recorded as having tall-cell variant had an independently worse prognosis, particularly when diagnosed after 50 years (see Supplementary Appendix S3). The best amended staging system with tall-cell variant incorporated is similar to the staging system shown in Table 4, but tall-cell variant is added under primary tumor description with patients <50 years being stage I and patients ≥50 years being accorded stage III disease. The best alternate staging system in this case was model B, using the cut-point of age 50 years. The associated Kaplan–Meier plot is shown in Figure 1C. As can be seen from this figure, more patients are classified as stage III compared with the best alternate staging system not using tall-cell variant (Fig. 1B). This approach resulted in a marginally improved Harrell's C-statistic over the best alternate papillary thyroid cancer staging systems not incorporating the tall-cell variant (see Table 3), although it was not significantly better than the current NTCTCS papillary cancer staging system.
Discussion
Potential revision of staging system
Papillary thyroid cancer
For papillary thyroid cancer, alternate staging systems were constructed that predicted overall survival well on the NTCTCS data set. The best alternate staging systems had marginally higher numerical scores than the current staging system for discrimination and calibration, although none appeared significantly better. Given the lack of a statistically significant difference at this time, there is currently no plan to alter the NTCTCS papillary thyroid cancer staging system that was first devised in 1986, thereby allowing ease of backward comparability across analyses and publications.
Further use of subtypes of papillary cancer was not pursued in the main analysis because of potential lack of standardization of the histologic classification of papillary thyroid cancer subtypes across centers and eras, and the inability to review pathology centrally and retrospectively across the NTCTCS cohort. This uncertainty about classification of the tall-cell variant is common to many analyses that are nevertheless reported and show a worse prognosis associated with tall-cell variant. An example is an analysis where use of the Surveillance, Epidemiology, and End Results (SEER) database precludes any central review of pathology (24). This decision not to incorporate tall-cell variant does not greatly alter the summary measures of model performance in the present results because subtypes such as the tall-cell variant are rare. In addition, adding a second nonanatomical factor in addition to age could further complicate stage group definitions. However, visually, the Kaplan–Meier plots show clearer separation of stages II and III (Fig. 1C), and this use of tall-cell variant could potentially improve prognostication in larger databases. Moreover, for an individual patient, including important histologic subtypes may improve prediction of a patient's prognosis, and the merits of this approach should be examined in future analyses of stage and proposals for thyroid cancer–specific prognostic factor tools.
There are good reasons for taking a conservative approach to reevaluating cancer staging systems, and only adopting a new alternate system based on rigorous evidence. In the current case, while the new potential alternate papillary thyroid cancer staging systems scored numerically higher on the Harrell's C-statistic and R², none was demonstrably better than the old system, and there appear to be minimal differences in overall model fit for staging systems using either a 45 or 50 years cut-point for papillary thyroid cancer. In addition, the current NTCTCS and best alternatives had excellent discrimination, as evidenced by their Harrell's C-statistic being >0.80 (25). Retaining the current well-performing staging system, especially when faced with a lack of a statistically superior alternative, allows continuity of analysis with previous studies, and easier benchmarking and comparison.
Recent studies have questioned the legitimacy of the age 45 years cut-point for thyroid cancer staging systems (11,12). The best alternate papillary cancer staging systems had dichotomous age cut-points of either 45 or 50 years, with very little objective difference in model performance. This indicates that current thyroid cancer staging systems likely have their age cut-point close to optimized. It is not surprising that the overall model fit is similar at these age cut-points because the cut-points are close together and mortality between ages 45 and 50 years is still low. Thus, most patients would have been staged the same under both systems. In the absence of a definitively better overall new staging system, there might still be an argument for altering future AJCC/UICC staging systems. The present data show that a staging system using older cut-points of 50 or even 55 years can also generate excellent discrimination. Under current staging systems, patients aged just above 45 years may be “up-staged” because of tumor features, when in fact their mortality risk remains low. These patients therefore face potential harms from being “up/over-staged” if clinicians use the staging results to prescribe more aggressive adjuvant therapy (e.g., more aggressive primary surgery or radioiodine ablation therapy decisions based on cancer stage). A change in staging system may therefore be defensible if new staging systems using different age cut-points are at least as good as old systems, yet expose fewer patients to unnecessary treatment. The authors are not aware of any decision analysis studies that have investigated this important question.
Follicular thyroid cancer
In contrast to the papillary thyroid cancer analysis, the new alternate follicular cancer staging systems performed significantly better than the current NTCTCS system. Prior to adopting a revised staging system for follicular thyroid cancer (e.g., model E, age 50; Table 5), the authors hope to perform external validation to confirm these results in an independent data set. Practically speaking, overall model characteristics were similar for cut-points between 45 and 55 years. As discussed above, considerations other than overall model fit may be appropriate in deciding age cut-points for future thyroid cancer staging systems. The current analysis also shows that the best alternate follicular cancer system (model E, age 50) had significantly higher discrimination than the current papillary thyroid cancer NTCTCS system when applied to follicular thyroid cancer patients. Separate staging systems for patients with papillary and follicular thyroid cancers will therefore continue to apply.
Strengths and weaknesses
The robust and data-driven assessment of the staging system strengthened the analysis. A quantitative assessment of the models (Harrell's C-statistic) was also used, which enabled significance testing of differences in prognostic discrimination between the staging systems. The NTCTCS data set is large and mature, providing opportunity to assess staging systems in a cohort of more than 31,000 patient-years.
Quantitative comparison of survival models is a young and evolving field. Future statistical progress may yield new and improved methods in comparing discrimination and calibration for staging systems. At present, the authors are not aware of any available methods to compare significance of differences in model calibration generally, and few methods are available to test calibration in prognostic models with only four risk levels, as in the present case. The available calibration method, R², has drawbacks, including lack of comparability of values between different study populations because of differential censoring (23). As an illustration, the R² values that were calculated were much higher in the follicular cancer models because mortality was substantially higher in the follicular cancer cohort, leading to less-frequent censoring of patients.
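The dependence of this measure on censoring can be illustrated with the formula given in the Methods. The sketch below uses entirely hypothetical log partial likelihood values, and it assumes for illustration that each observed death contributes the same gain in log partial likelihood; it is not derived from the registry data.

```python
import math


def likelihood_r2(loglik_null: float, loglik_model: float, n_subjects: int) -> float:
    """R^2 = 1 - exp((L0 - Lp) * 2 / n), as defined in the Methods."""
    return 1.0 - math.exp((loglik_null - loglik_model) * 2.0 / n_subjects)


# Hypothetical cohorts of equal size; assume each death adds 0.5 to the
# log partial likelihood gain (an illustrative assumption only).
GAIN_PER_DEATH = 0.5
N_SUBJECTS = 500

for label, deaths in (("heavily censored", 25), ("less censored", 100)):
    gain = GAIN_PER_DEATH * deaths
    r2 = likelihood_r2(loglik_null=-800.0, loglik_model=-800.0 + gain,
                       n_subjects=N_SUBJECTS)
    print(f"{label}: deaths={deaths}, R2={r2:.3f}")
```

Because the gain in log partial likelihood is divided by the total number of subjects rather than the number of deaths, a cohort with fewer observed deaths yields a lower value even when the staging system is equally informative per death.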
The decision was made to reanalyze the staging system using a traditional TNM-like system with a dichotomous age cut-point, as opposed to a prognostic scoring system, such as the Metastasis, Age, Completeness of resection, Invasiveness, and Size (MACIS) score (26), or a more complicated TNM-like system with multiple age cut-points. The categorical data format of the registry does not easily lend itself to production of a continuous scoring system such as the MACIS score. It was also reasoned that reassessment of age cut-points in TNM-like differentiated thyroid cancer staging could inform future AJCC/UICC systems. While it is unlikely that sharp demarcation in thyroid cancer biology and thus prognosis exists by age, a dichotomous age cut-point for TNM-like staging systems is favored over a more complicated system with multiple age cut-points for simplicity and usability. However, it is possible that staging systems using multiple cut-points could better predict individual patient prognoses; this was not assessed in the present analysis.
Staging thyroid cancer at disease onset also fails to take into account the initial response to treatment, which has been shown to be important in determining risk of persistent disease and future recurrence (27). However, all anatomic staging is performed at diagnosis, and this characteristic of staging systems is unlikely to change in the near future. Furthermore, anatomic staging and response to treatment criteria offer complementary disease information, and both approaches should continue to be assessed and optimized.
The main analysis was limited to variables currently in the NTCTCS staging systems. However, there may be subtypes of differentiated thyroid cancer associated with poor prognosis, such as the tall-cell variant of papillary thyroid cancer (24,28,29). As mentioned, examination of subtypes of papillary cancer beyond a preliminary analysis was not pursued because of potential lack of standardization of the histologic classification, inability to submit pathology to central review, the rarity of tall-cell variant, and the anticipated lack of support for adding a second nonanatomical factor in addition to age. A similar argument could be made for not adding sex to staging systems, despite some evidence that this is a relevant variable (13). However, for an individual patient, including important histologic subtypes, and considering sex, may assist clinicians in predicting a patient's prognosis, and the merits of this approach should be examined in future analyses of stage and proposals for thyroid cancer–specific prognostic factor tools such as MACIS.
This NTCTCS analysis incorporates patients cared for from 1987 through to 2011 in 11 institutions. There have been significant changes in the presentation and management of thyroid cancer over this period (30,31), which may mean that patients diagnosed today could have a different disease trajectory than historic patients captured by the registry. In addition, developing treatments such as kinase inhibitors used to delay disease progression (32,33) or for re-differentiation therapy (34) may substantially alter prognosis for high-risk patients in the future. The large number of centers contributing to this registry is both a potential weakness and strength. While this would inevitably lead to variations in patient assessment, treatment, and follow-up for the included patients, it is also possible that any resultant variation better reflects outcomes in the wider community, as opposed to a single-institution registry with uniform but potentially nongeneralizable follow-up protocols.
Future analyses and further review of NTCTCS staging system
External validation of these results is encouraged. If other sizable data sets find that alternate age 50 (or 55) years cut-points for papillary and follicular cancer staging systems also perform numerically or significantly better than the current NTCTCS systems, strong consideration will be given to adopting the new systems. Such validation may also be considered by the AJCC/UICC in developing the eighth editions of their staging systems.
It is planned to perform analyses using the new alternate staging systems in relation to prognosis after treatment in order to determine whether the new alternate staging systems better separate prognosis by the type of surgery, presence or absence of radioiodine ablation, or degree of thyrotropin suppression. If this were the case, it would add further value to the new staging systems.
Acknowledgments
The NTCTCS has been supported in part by research grants from Genzyme, a Sanofi Company, and Pfizer, and by the University of Texas M.D. Anderson Cancer Center Support Grant (NCI Grant P30 CA016672). The Cancer Council Queensland (PhD Scholarship) and the National Health and Medical Research Council of Australia (APP1092153) supported D.M.
We, as principal investigators at each NTCTCS institution, thank the physicians and staff members who participated in the management and follow-up of these patients. We acknowledge the substantial contributions of the institutional research staff members who collected and submitted the data. We also appreciate the considerable assistance provided by Jeffrey Cui for the management of NTCTCS's databases. Finally, we acknowledge the efforts of the numerous physicians and scientists whose contributions were critical to either the creation or the maintenance of the registry effort for many years.
Author Disclosure Statement
D.M., J.J., J.B., D.C., P.L., M.S., M.X., H.M., D.L., and H.F. have nothing to declare. K.A. has received research grant support from Genzyme, a Sanofi Company. B.H. has received research funding from Veracyte and Genzyme, a Sanofi Company, and a one-time honorarium. J.M. is an employee of Genzyme, a Sanofi Company, and a shareholder in Sanofi. D.R. has received payments for consulting or honoraria from Genzyme, a Sanofi Company, Novo Nordisk, Bayer/Onyx, and Eisai. D.S. has received payments for consulting or honoraria from Genzyme, a Sanofi Company, Novo Nordisk, Bayer/Onyx, and Eisai. S.S. has received research support from Genzyme, a Sanofi Company, and Pfizer, has consulting relationships with Bayer, Eli Lilly, Eisai, Exelixis, Novo Nordisk, and Veracyte, and has received honoraria from Genzyme and Onyx.
