Abstract
Background:
The choice of surgical treatment for patients with low-/intermediate-risk T1T2N0/Nx well-differentiated thyroid cancer (WDTC), total thyroidectomy (TT) versus thyroid lobectomy (TL), remains controversial. A randomized controlled trial (RCT) would be the gold standard to address this issue. However, this is challenging because survival outcomes are excellent, so a large number of patients and long-term follow-up would be required. As an alternative to an RCT, we used propensity score (PS) matching to determine whether T1T2N0/Nx patients selected to have TL had outcomes equivalent to those of a similar group treated with TT.
Methods:
After institutional review board approval, a database of 6259 patients with WDTC treated with primary surgery at our institution between 1985 and 2016 was analyzed to identify patients with T1T2N0/Nx cancers. Of 3756 patients identified, 943 were managed by TL and 2813 by TT. To control for possible confounders and reduce potential bias, we selected age, sex, histology, 131I therapy, American Thyroid Association risk, and American Joint Committee on Cancer stage as our PS matching criteria. Subsequently, 918 TL patients were successfully matched with 918 TT patients. The Pearson χ2 test or Fisher's exact test was used to compare categorical covariates, and Student's t-test was used to compare continuous variables between the two groups. Disease-specific survival (DSS), overall survival (OS), and recurrence-free survival (RFS) were calculated using the Kaplan–Meier method and compared using the log-rank test.
Results:
After PS matching, there were no significant differences between TL and TT patients for OS (10-year OS: 92.2% vs. 91.3%, p = 0.9668), DSS (10-year DSS: 100% vs. 99.1%, p = 0.1967), or RFS (10-year RFS: 99.5% vs. 98.3%, p = 0.079).
Conclusions:
For low-/intermediate-risk patients with intrathyroidal thyroid cancer <4 cm, patients selected for TL have similar survival outcomes to a comparable group treated by TT.
Introduction
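In 2007, Bilimoria et al. reported poorer survival for patients treated with TL than for those treated with TT (3).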
Several other studies have been published, some of which support the conclusion of the Bilimoria study, whereas others contest it, showing that TL has equivalent outcomes (4–10). There is therefore still no global consensus regarding guidelines on TL versus TT; decisions regarding the extent of thyroidectomy vary according to physician and patient preferences (11). The gold standard methodology to address this question is a randomized controlled trial (RCT). However, patients with low-/intermediate-risk T1T2N0/Nx well-differentiated thyroid cancer (WDTC) have excellent outcomes with long survival times.
To determine if the extent of thyroidectomy has any impact on survival, one would have to randomize several thousand patients and follow these patients for several years. Such a study would be prohibitive due to excessive costs and time. However, statistical methods such as propensity score (PS)-based analyses can be used to reduce the bias created by group imbalance in the absence of a randomization of patients to treatment allocation, one of the main constraints to the validity of observational studies.
The PS is the probability of receiving a specific treatment conditional on a set of observed baseline variables. As originally described by Rosenbaum and Rubin, the PS is a balancing score that will yield a close distribution of baseline characteristics between patients from different treatment groups given certain assumptions hold (12). This is a major advantage, as comparing oncological outcomes from nonequivalent groups is a frequent challenge in retrospective studies.
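As a minimal sketch of this definition, the PS can be estimated by regressing treatment assignment on baseline covariates. The example below fits a small logistic regression in plain Python. The choice of logistic regression, the two covariates, the toy data, and the names `fit_logistic` and `propensity` are illustrative assumptions and are not taken from the study.

```python
import math

def propensity(beta, x):
    """P(treated = 1 | x) under a logistic model; beta[0] is the intercept."""
    eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-eta))

def fit_logistic(rows, treated, lr=0.1, epochs=2000):
    """Fit the logistic model by batch gradient ascent on the log-likelihood."""
    n, p = len(rows), len(rows[0])
    beta = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for x, z in zip(rows, treated):
            resid = z - propensity(beta, x)  # observed minus predicted
            grad[0] += resid
            for j, v in enumerate(x):
                grad[j + 1] += resid * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Made-up (age in years, sex) pairs; 1 = received lobectomy (hypothetical data).
patients = [(30, 0), (35, 1), (40, 0), (45, 1), (50, 0), (55, 1),
            (60, 0), (65, 1), (70, 0), (75, 1), (38, 1), (62, 0)]
received_tl = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0]

# Center and scale age so that gradient ascent behaves well.
rows = [((age - 50) / 10.0, sex) for age, sex in patients]
beta = fit_logistic(rows, received_tl)
scores = [propensity(beta, x) for x in rows]
```

Each entry of `scores` is one patient's estimated probability of having been selected for lobectomy given age and sex; in this toy data set, younger patients receive higher scores.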
The goal of our study was therefore to use PS matching to evaluate whether T1T2N0/Nx patients selected to have TL had equivalent outcomes to a similar group treated with TT.
Materials and Methods
Study cohort
Institutional review board approval was obtained for this retrospective study with a waiver of informed consent. We selected our cohort from our institutional database, which comprised patients with WDTC who had their initial surgery at Memorial Sloan Kettering Cancer Center (MSKCC) between 1986 and 2015 (n = 6259). Patients were followed until December 2019. At MSKCC, there is a synoptic reporting system in place, which encompasses all details of pathology relevant to thyroid cancer. In addition, all thyroid pathology is reported by a dedicated pathologist specializing in thyroid pathology or by a head and neck pathologist who follows the synoptic reporting system.
Patients were reviewed and staged according to the American Joint Committee on Cancer Staging Manual, Eighth edition, with the best response to initial therapy assessed by the American Thyroid Association (ATA) dynamic risk stratification system based on the information collected in the first year after surgery. Patients were classified as low, intermediate, or high risk according to the 2015 ATA guidelines. Patients were classified as intermediate risk if any of the following criteria were identified: microscopic invasion of tumor into the perithyroidal soft tissues; aggressive histology (e.g., tall cell, hobnail variant, columnar cell carcinoma); papillary thyroid carcinoma (PTC) with vascular invasion; or clinical N1 or >5 pathological N1 (all <3 cm).
Patients were considered ATA high risk if gross extrathyroidal extension (ETE), incomplete tumor resection, distant metastasis, pathological N1 with a lymph node >3 cm, or follicular thyroid cancer with >4 foci of vascular invasion was identified. All the others were considered ATA low risk. Exclusion criteria were distant metastasis at presentation (n = 82), pathological positive lymph nodes (n = 1902), pathological T (pT) stage* T3, T4, Tx, or T0 (n = 420), patients classified as ATA high risk (n = 41), surgery other than TT or TL (n = 55), and patients who received contralateral lobe ablation (n = 3). All patients in the study were pathological N0 or pathological Nx (i.e., clinically N0 based on preoperative ultrasound and/or intraoperative palpation). A total of 3756 patients were included in our initial cohort (Fig. 1).

CONSORT flow diagram showing the inclusion and exclusion criteria. *Staging system according to the American Joint Committee on Cancer, Eighth edition. **Risk stratification according to the 2015 American Thyroid Association Guidelines. pT, pathological tumor.
Patient selection for TL or TT
The following criteria were used at our institution to select suitable patients for TL: intrathyroidal tumors less than 4 cm in size, no contralateral lobe nodules, and no suspicious central compartment lymph nodes on either preoperative ultrasound or intraoperative palpation. The TT patients with T1T2N0/Nx tumors treated at our institution comprise low-, intermediate-, and also high-risk ATA patients. We used propensity matching rather than inverse probability of treatment weighting (IPTW) to identify TT patients who had similar characteristics to the TL patients.
Propensity matching and statistical analysis
A full PS-based analysis consists of the following steps: identifying important covariates to obtain the PS; choosing the estimand of interest; modeling with an appropriate technique to derive the PS; selecting the method by which groups will be compared; assessing the balance between groups; and estimating the treatment effect. Balance is usually indicated by the average standardized absolute mean difference (Fig. 2). Values <0.2 are commonly considered acceptable (13). Our study used PS matching to create matched sets of treated and control patients who share a similar PS. Each TL patient was PS matched with an appropriately matched patient managed by TT using the methodology described below. Statistical analysis was performed using R (version 3.6.2; R Foundation for Statistical Computing, Vienna, Austria) and SAS 9.3 (SAS Institute, Cary, NC).

The standardized mean difference in the entire cohort and the matched cohorts. This figure demonstrates that there are balanced covariates between the lobectomy and total thyroidectomy groups in the matched cohort. RAI, radioactive iodine.
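The balance diagnostic described above can be sketched in a few lines. The function below computes the absolute standardized mean difference for a single covariate, using the square root of the average of the two group variances as the denominator. That pooling convention, the function name `standardized_mean_difference`, and the toy ages are assumptions for illustration; the study does not specify the exact formula used.

```python
import math
from statistics import mean, variance

def standardized_mean_difference(treated, control):
    """Absolute difference in group means divided by a pooled standard deviation."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2.0)
    if pooled_sd == 0:
        return 0.0  # no spread in either group: treat the covariate as balanced
    return abs(mean(treated) - mean(control)) / pooled_sd

# Hypothetical ages (years) in two small groups.
tl_ages = [40, 45, 50, 55]
tt_ages = [42, 47, 52, 57]
smd_age = standardized_mean_difference(tl_ages, tt_ages)
```

In practice the same calculation is repeated for every covariate, before and after matching, and each value is compared with the chosen acceptability threshold.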
To select a group of TT patients for comparison with the TL patients while controlling for possible confounders and reducing potential bias, we selected age, sex, histology, 131I therapy, ATA risk, pT stage, and pathological N stage as our PS matching criteria. The R package “MatchIt” was used with nearest neighbor matching at a 1:1 ratio (14). Greedy nearest neighbor matching was used, in which each treated unit is sequentially matched with the k nearest control units (k = 1), that is, the control with the closest PS. The process is repeated until all treated subjects are matched. In the original master data set, 943 patients had TL. After matching, 918 TL patients were successfully matched with 918 TT patients. The Pearson χ2 test or Fisher's exact test was used for the comparison of categorical covariates between the two groups, and Student's t-test was used for comparison of the continuous variable, age. A multivariate analysis of the cohort after propensity matching was performed using Cox proportional hazards regression.
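The greedy 1:1 procedure described above can be illustrated with a short sketch. This is not the MatchIt implementation. The processing order (highest treated PS first), matching without replacement, and the optional `caliper` argument are assumptions made for illustration. The study does not report a caliper; it appears here only to show one way in which a treated unit can remain unmatched.

```python
def greedy_match(treated_ps, control_ps, caliper=None):
    """Greedy 1:1 nearest neighbor matching on the PS, without replacement.

    Returns a list of (treated_index, control_index) pairs.
    """
    available = dict(enumerate(control_ps))  # controls not yet used
    pairs = []
    # Assumed order: treated units with the highest PS are matched first.
    order = sorted(range(len(treated_ps)), key=lambda i: treated_ps[i], reverse=True)
    for i in order:
        if not available:
            break
        # Closest remaining control by absolute PS distance.
        j = min(available, key=lambda k: abs(available[k] - treated_ps[i]))
        if caliper is not None and abs(available[j] - treated_ps[i]) > caliper:
            continue  # no acceptable control: this treated unit stays unmatched
        pairs.append((i, j))
        del available[j]
    return pairs

# Hypothetical propensity scores.
tl_scores = [0.30, 0.60, 0.90]
tt_scores = [0.10, 0.32, 0.58, 0.61, 0.35]

all_matched = greedy_match(tl_scores, tt_scores)
with_caliper = greedy_match(tl_scores, tt_scores, caliper=0.1)
```

Because the procedure is greedy, an early match is never revisited, so the result can depend on the order in which treated units are processed.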
The main outcomes of interest were OS, disease-specific survival (DSS), and RFS in patients who underwent TL and TT. Recurrence was determined by the presence of a suspicious abnormality on imaging and was confirmed with cytological and/or pathological analysis. Local recurrence refers to either tumor in the thyroid bed or newly identified disease in the contralateral lobe when TL was performed. Regional recurrence refers to central compartment lymph node and lateral neck lymph node recurrence combined. The Kaplan–Meier method was used to estimate the outcomes of interest, and a log-rank test was performed to compare the two groups. A p-value of <0.05 was considered statistically significant.
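For readers less familiar with these two methods, the sketch below shows the Kaplan–Meier product-limit estimate and the two-group log-rank chi-square statistic in plain Python. The function names and toy follow-up times are hypothetical, and the analysis in this study was performed in R and SAS rather than with this code.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct event time.

    events[i] is 1 if the outcome occurred at times[i] and 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve

def logrank_chi2(times_a, events_a, times_b, events_b):
    """Log-rank chi-square statistic (1 degree of freedom) for two groups."""
    event_times = sorted({t for t, e in zip(times_a + times_b, events_a + events_b) if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    return observed_minus_expected ** 2 / variance if variance > 0 else 0.0

# Hypothetical follow-up times in months.
curve = kaplan_meier([5, 8, 12, 12, 20], [1, 0, 1, 1, 0])
chi2 = logrank_chi2([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])
```

The chi-square statistic is compared with a chi-square distribution with one degree of freedom to obtain the p-value.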
Results
Characteristics of whole cohort of pT1T2N0/Nx patients
The clinical and pathological characteristics of the 3756 patients with pT1T2N0/Nx cancers are shown in Table 1. The approach to surgical management of such low-risk patients within our institution is to offer TL as an alternative to TT in properly selected low- or intermediate-risk patients. We currently recommend TT for patients with nodules in the contralateral lobe (over 0.5 cm detected clinically or on ultrasonography), clinically significant lymph node metastasis, gross ETE, and evidence of distant metastasis. Due to increased availability of high-definition preoperative ultrasonography, detection of contralateral nodules has improved, resulting in a higher rate of TT procedures in recent years. There were 943 patients managed by TL and 2813 managed by TT.
Clinical and Pathological Characteristics of the Entire Cohort
According to the American Joint Committee on Cancer, Eighth edition.
ATA, American Thyroid Association; ETE, extrathyroidal extension; pT, pathological tumor.
Patients who underwent initial lobectomy and had completion thyroidectomy within 12 months were categorized in the TT group for outcomes analysis (n = 53). As expected, using our selection criteria, the TL group, compared with the TT group, had a higher proportion of follicular carcinoma (4% vs. 1%, p < 0.001) and low-risk ATA patients (77% vs. 69%, p < 0.001). The TL group also had fewer patients with tall-cell variant PTC (5% vs. 8%, p < 0.001), multifocality (26% vs. 51%, p < 0.001), microscopic ETE (9% vs. 16%, p < 0.001), vascular invasion (6% vs. 8%, p < 0.001), and positive margins (2% vs. 4%, p > 0.001). In the patient group that underwent TT, 573 (20%) received postoperative 131I therapy.
Characteristics of matched cohort of pT1T2N0/Nx patients
After matching for age, gender, histology, pT stage, 131I therapy, and ATA risk, our cohort had 1836 patients, 918 patients in each group. Table 2 shows the characteristics of each surgical group. There were no differences in age (mean age in the TT group was 46.6 years and in the TL group 46.2 years; p = 0.5056), gender (p = 1), histology (p = 1), T stage (p = 1), and ATA risk classification (p = 0.2144). Because no patient in the TL group received 131I therapy, no patient in the TT balanced cohort received it either. The mean follow-up for the TL patients was 56.34 months. The mean follow-up for the TT patients was 56.85 months. Figure 2 shows the standardized mean difference between groups after propensity matching illustrating good balance between the two surgical groups.
Patient Demographics Balancing Table
According to the American Joint Committee on Cancer, Eighth edition.
NA, not applicable; SD, standard deviation.
Outcomes analysis
There were no significant differences in OS for TL versus TT (10-year OS: 92.2% vs. 91.3%, p = 0.967) (Fig. 3) or DSS (10-year DSS: 100% vs. 99.1%, p = 0.197) (Fig. 4). There were 67 deaths in the TL group, none related to thyroid cancer. In the TT group, there were 45 deaths, of which only one patient died of thyroid cancer. The median time from surgery to death from other causes was 59.7 months (interquartile range [IQR] 30.5–112.7) for the TL group and 57.3 months (IQR 28.2–89.5) for the TT group. A multivariate analysis of the cohort after propensity matching showed that among the variables age, vascular invasion, microscopic ETE, margins (negative, positive and close), and extent of thyroid surgery, only age (>55 years) was a statistically significant predictor for OS (Hazard ratio = 5.773, p < 0.0001).

Kaplan–Meier plots for overall survival stratified by surgery after propensity matching. There was no statistically significant difference between the two groups.

Kaplan–Meier plots for disease-specific survival stratified by surgery after propensity matching. There was no statistically significant difference between the two groups.
There were no significant differences in RFS for TL versus TT (10-year RFS: 99.5% vs. 98.3%, p = 0.079) (Fig. 5). The slight trend toward significance in RFS, with a higher risk of recurrence in the TT group, only highlights that lobectomy has excellent outcomes in appropriately selected patients. Only 15 patients had a recurrence: 10 (1.09%) in the TT group and 5 (0.54%) in the TL group. The median time from surgery to recurrence was 44.2 months (IQR 14.9–89.8) for the TL group and 44.1 months (IQR 12.3–75.2) for the TT group. Recurrence sites evaluated separately also showed no statistically significant differences between the two groups. There were 2 local recurrences (0.22%), both in the TT group.

Kaplan–Meier plots for recurrence-free survival stratified by surgery after propensity matching. There was no statistically significant difference between the two groups.
The median time from surgery to local recurrence for the TT group was 44.4 months (IQR 12.6–75.7). There were 11 regional recurrences: 5 (0.54%) in the TL group and 6 (0.65%) in the TT group. The median time from primary surgery to regional recurrence was 44.2 months (IQR 14.9–89.8) for the TL group and 44.3 months (IQR 12.5–75.6) for the TT group. For central recurrence only, there was 1 patient in the TL group and 1 patient in the TT group. For lateral recurrence only, there were 4 cases in the TL group and 3 cases in the TT group. Two patients in the TT group had both central and lateral recurrences. There were 2 distant recurrences in the TT group and none in the TL group.
Discussion
Surgical management of patients with intrathyroidal well-differentiated cancers of the thyroid gland <4 cm in size (T1T2N0/Nx) remains a topic of debate. This controversy has largely been generated by Bilimoria et al.'s 2007 article, which reported poorer survival for patients treated with TL (3). Other studies have also reported poorer outcomes for TL (9,15). However, more recent studies have reported no difference in survival between TL and TT (16,17). A study from our own group by Nixon et al. in 2012 analyzed 889 patients and showed no statistical differences between the TT and TL groups when comparing OS, DSS, and RFS calculated using the Kaplan–Meier method and compared using log-rank test (16).
Other retrospective studies by Haigh et al. (4), Matsuzu et al. (18), Barney et al. (19), and Mendelsohn et al. (20) have reported similar outcomes. A follow-up study by Adam et al. (17) evaluated the same National Cancer Database (NCDB) data as Bilimoria et al., by then comprising 61,775 patients with PTC. No difference in OS was reported between thyroidectomy and lobectomy patients after adjusting for patient demographic and clinical factors, including comorbidities, ETE, multifocality, nodal and distant metastases, and radioactive iodine treatment, using a Cox proportional hazards model (17). As a consequence, the latest edition of the ATA guidelines states that TL is an acceptable treatment option for patients with well-differentiated thyroid tumors smaller than 4 cm, no clinically positive lymph nodes, and no gross ETE (21).
The other potential advantage of more conservative surgery is the reported lower incidence of complications associated with TL. A meta-analysis evaluating 50,445 patients showed significantly higher pooled relative risks (RR) of complications for patients who received TT compared with TL, including transient hypoparathyroidism (RR 3.17), definitive hypoparathyroidism (RR 1.69), temporary recurrent laryngeal nerve injury (RR 1.85), and permanent vocal cord palsy (RR 2.58); in addition, TL has virtually no risk of bilateral nerve palsy with subsequent respiratory distress and possible tracheostomy (22). Conversely, TT has the advantage of a lower chance of reoperation and higher sensitivity of postoperative serum thyroglobulin levels for predicting persistence or recurrence (23).
The controversy about the most appropriate surgical treatment for low-/intermediate-risk T1T2N0/Nx thyroid cancer has been further renewed by a recent study by Rajjoub et al. in 2018 (24). In this NCDB study, the authors reported that TL and TT had equivalent outcomes for patients with follicular variant of PTC but that, in patients with conventional PTC with tumors 2–3.9 cm in size, TT had superior OS to TL. The gold standard methodology to address this question is an RCT. However, carrying out such a trial is challenging because of the cost and length of time required, since patients with low-/intermediate-risk thyroid cancer have long survival times, with survival of up to 95% at 10 years (21).
In the Bilimoria et al. report that retrospectively studied 52,173 patients with PTC from the NCDB, the 10-year survival for patients managed by TT was 98.4% compared with 97.1% for patients treated by TL. If we carry out a power analysis with 80% statistical power, 5% type I error, and survival difference of 1.3%, it is estimated that for an RCT a total of 267,158 patients would need to be recruited with a follow-up time of 10 years to show this difference. The large sample size and high cost to perform a multi-institutional RCT with adequate statistical power to evaluate TL versus TT is therefore not feasible.
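The mechanics of such a power calculation can be sketched as follows. The example uses Schoenfeld's approximation for the log-rank test and assumes proportional hazards, 1:1 allocation, a two-sided test, and complete 10-year follow-up with no dropout. None of these assumptions is stated in the text, and the function name `logrank_sample_size` is hypothetical. The result is highly sensitive to these choices, so this sketch is not expected to reproduce the estimate quoted above.

```python
import math
from statistics import NormalDist

def logrank_sample_size(surv_a, surv_b, alpha=0.05, power=0.80):
    """Approximate events and total patients needed to detect surv_a vs surv_b.

    surv_a and surv_b are survival proportions at the end of follow-up.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided type I error
    z_beta = NormalDist().inv_cdf(power)
    # Under proportional hazards, the hazard ratio is the ratio of log-survivals.
    hazard_ratio = math.log(surv_a) / math.log(surv_b)
    events = 4 * (z_alpha + z_beta) ** 2 / math.log(hazard_ratio) ** 2
    # Average probability of observing an event during follow-up.
    p_event = 1 - (surv_a + surv_b) / 2
    return math.ceil(events), math.ceil(events / p_event)

# 10-year survival figures from the Bilimoria et al. report cited above.
events_needed, patients_needed = logrank_sample_size(0.984, 0.971)
```

Because the event rate is so low, the number of patients required is many times the number of events required, which is the core reason an adequately powered trial is so demanding in this population.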
However, there are statistical methods that are available in the absence of an RCT to minimize the impact of bias. For instance, PS-based analyses have gained popularity in recent years. Although some studies have used this strategy with PS to evaluate the impact of extent of thyroid surgery on outcome, they are smaller than our study and had slightly different aims. For example, Lee et al. studied TL versus TT in patients with papillary microcarcinoma. After adjusting baseline characteristics using PS matching for the initial 2014 patients, there were 506 patients in each group TL versus TT. Patients were further divided into two subgroups: patients with tumor ≤0.5 cm versus >0.5 cm.
There were no significant differences in OS and locoregional recurrence between groups (25). Kuba et al. compared the outcomes of 173 patients with node-negative PTC 1–5 cm in size treated with surgery, either TL or TT. After PS matching, there were 33 patients in each group, with equivalent outcomes in both groups and fewer adverse events in the TL group (26). Finally, Song et al. evaluated recurrence rates based on the extent of the surgical procedure, comparing TL versus TT in patients with PTC ≥1 cm and <4 cm. After matching for age, gender, tumor size, ETE, multifocality, and cervical lymph node metastasis, there were 381 patients in each group (TL versus TT). No significant differences in disease-free survival were found (27).
As an alternative to an RCT, we used propensity score matching, drawing on a detailed single-institution database with a long history of thyroid cancer management. Matching and weighting are the most common methods of using the PS. IPTW generates a pseudo-population in which the baseline covariates are similar and estimates the “average treatment effect,” that is, the effect in the whole population as if all patients had received the same treatment. This differs from the “average treatment effect on the treated” (ATT)-based PS method, which examines the effect of the treatment only as it was applied to those patients who were actually treated (i.e., what was the effect of lobectomy compared with TT in patients similar to those who had a lobectomy). Therefore, using the ATT-based PS method in our study allows an outcome comparison between TL and TT in more balanced groups, with a reduction in the bias that is usually created by group imbalance when there is no randomization of patients to treatment allocation.
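The distinction between the two estimands can be made concrete with the standard PS weights used for each. The sketch below treats lobectomy as the treatment; the function names and the example score are illustrative assumptions. Our study used matching rather than weighting, so this only illustrates the concept.

```python
def ate_weight(treated, ps):
    """IPTW weight targeting the average treatment effect in the whole population."""
    return 1.0 / ps if treated else 1.0 / (1.0 - ps)

def att_weight(treated, ps):
    """Weight targeting the average treatment effect on the treated.

    Treated patients keep weight 1; controls are reweighted by the odds of
    treatment so that they resemble the treated group.
    """
    return 1.0 if treated else ps / (1.0 - ps)

# A hypothetical patient with a 25% estimated probability of lobectomy.
ps = 0.25
weights = {
    "ATE, lobectomy": ate_weight(True, ps),
    "ATE, total thyroidectomy": ate_weight(False, ps),
    "ATT, lobectomy": att_weight(True, ps),
    "ATT, total thyroidectomy": att_weight(False, ps),
}
```

Under the ATT weights, a TT patient who was unlikely to have been selected for lobectomy contributes little, which mirrors what matching does when such a patient is simply left unmatched.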
Our study demonstrated that lobectomy alone is appropriate management for selected patients with T1T2N0/Nx low-/intermediate-risk disease. Patients were carefully selected for TL following specific criteria: intrathyroidal tumors <4 cm in size, no contralateral lobe nodules, and no suspicious central compartment lymph nodes on either preoperative ultrasound or intraoperative palpation. This constitutes our 914 lobectomy patients. The TT patients with T1T2N0/Nx tumors treated at our institution comprise low-, intermediate-, and also high-risk ATA patients.
We used propensity matching and not IPTW to identify patients who had similar characteristics to the TL patients. Importantly, our database has detailed histological review, which enables us to match histology in the TL and TT groups. We report no statistically significant differences in OS, DSS, and RFS between the groups. Although prior studies have addressed the same research question with different statistical methods, our study is the first to analyze this issue with such a large validated cohort and PS matching. Compared with an RCT, PS matching does not have the same level of evidence; however, it is the best approach to analyze rare outcomes retrospectively.
Our study is not without its limitations. An important limitation is the generalizability of the results to the community, particularly to institutions where surgeons have less experience due to low surgical volume and ultrasonographers who lack the expertise in identifying all the nuances related to thyroid cancer. In our institution, the preoperative imaging evaluation and surgeries were carried out by experienced sonographers and surgeons, respectively. The results from a tertiary cancer center may not correlate with the results of low-risk cancers treated in a community setting. Other limitations pertain to the pros and cons associated with PS regardless of which method or estimand is selected.
The PS is generated based on available covariates, but this does not eliminate the possible bias created by unmeasured confounders. PS methods do not necessarily provide superior results to multivariate outcome regression models, but they do allow a straightforward assessment of whether the treated and control groups are comparable after applying the PS, and they allow modeling to be separated from outcome analysis. Finally, some may argue that using the PS to approximate randomization can increase imbalance, bias, model dependence, and inefficiency. Nevertheless, using the PS, we can reduce the dimensionality of covariates and demonstrate that the treated group is similar to the control group on measured covariates, which is imperative for rare outcomes.
Conclusions
In conclusion, using a large single-institution database with detailed clinical, pathological, and outcomes information, we have been able to carry out a detailed propensity matching study and corroborate that, for selected low-/intermediate-risk node-negative patients with intrathyroidal thyroid cancer <4 cm, patients selected for TL have outcomes equivalent to those of a similar group treated by TT.
Footnotes
Authors' Contributions
A.Y. had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. Study concept and design: I.G. and A.Y. Acquisition, analysis, or interpretation of data: V.H., D.M., A.Y., S.G.P., and I.G. Drafting of the article: D.M., A.Y., and I.G. Critical revision of the article for important intellectual content: D.M., A.Y., A.R.S., R.M.T., S.G.P., J.P.S., and I.G. Statistical analysis: A.Y., D.M., and I.G. Administrative, technical, or material support: A.Y. and S.G.P. Study supervision: S.G.P., A.R.S., J.P.S., and I.G.
Author Disclosure Statement
The authors have no financial or personal relationships that could potentially influence this work.
Funding Information
This research was funded in part through the National Institutes of Health/National Cancer Institute Cancer Center Support Grant P30 CA008748; New Therapies in Head and Neck Cancer Fund Number 51550-15979.
