Abstract
Background:
Thyroid nodules are a very common, often incidental, finding on physical examination or imaging. Among nodules that undergo fine needle aspiration, cytology is indeterminate in up to 15%. Molecular testing is increasingly being used to help identify which nodules may be at high risk for malignancy and to guide management with regard to clinical follow-up or surgical intervention. Recently there has been an increase in the publication of independent studies assessing the performance of these molecular tests and comparing “real-world” data with the validation studies.
Methods:
This retrospective study identified all thyroid nodules at our institution that had Afirma gene expression classifier (GEC), Afirma genomic sequencing classifier (GSC), or Thyroseq v3 molecular testing from January 2014 to January 2020. We compared measures of test performance among the three tests, and then with the original validation studies and other published institutional data.
Results:
Overall, the benign call rate was highest in the Afirma GSC group (78%) compared with the GEC group (60%) and Thyroseq group (66%). Surgical histopathology revealed malignancy in 6 of 31 resected nodules in the GEC group, 8 of 13 in the GSC group, and 3 of 16 in the Thyroseq v3 group. Based on our data, the GSC specificity (73.7%) and positive predictive value (PPV) (61.5%) were higher than the GEC specificity (60.4%) and PPV (22.2%) as well as Thyroseq v3 specificity (55.2%) and PPV (18.8%).
Conclusions:
From our short-term institutional experience, we found that the GSC classified more cytologically indeterminate nodules as benign compared with the Afirma GEC, and had improved specificity and PPV, which is similar to the validation study and other institutions' reported experiences. We also found that the Thyroseq v3 was similar to the Afirma GEC in terms of specificity and PPV, both of which are much lower than the validation studies.
Introduction
Thyroid nodules are a common finding on physical examination and imaging, with a prevalence of 2–6% on physical examination and 10–35% on ultrasound imaging (1,2). The incidence of thyroid nodules increases with age and radiation exposure and is higher in women.
Although a vast majority of thyroid nodules are benign, thyroid cancer occurs in 5–15% of nodules. The incidence of thyroid cancer has been increasing over the past several years with an estimated 44,280 new cases in 2021 (2). The age-adjusted incidence of thyroid cancer in Massachusetts is 19.9/100,000. The incidence in Hampden County, the county in which our institution is located, is 13.96/100,000 (3,4).
According to the 2015 American Thyroid Association guidelines, ultrasound-guided fine needle aspiration is recommended for certain nodules based on size, sonographic features, and temporal variation (6). The Bethesda System for Reporting Thyroid Cytopathology (TBSRTC), most recently revised in 2017, is used to report whether the thyroid cytological specimen is benign or malignant on fine needle aspiration cytology (FNAC). It is divided into six categories.
A 2018 review of the literature determined that 15.3% of samples fell into the indeterminate category: 12.4% were Bethesda III (atypia of undetermined significance or follicular lesion of undetermined significance) and 2.9% were Bethesda IV (follicular neoplasm, suspicious for follicular neoplasm, or Hurthle cell neoplasm) (7), with an estimated risk of malignancy of 5–15% and 15–30%, respectively (6). In our institution between 2014 and January 2020, 16% of samples were Bethesda III, and 4% were classified as Bethesda IV.
Clinical decision making is challenging in patients with Bethesda III and IV on FNAC. Before the molecular testing era, surgical management was the next step. In a study by Alshaikh et al., 119 patients, with a total of 126 nodules spanning all six Bethesda categories, who had thyroid surgery were evaluated. Their findings showed an overall rate of malignancy of 27.8% (35/126 nodules) for all Bethesda categories, and a rate of malignancy of 28% and 22%, respectively, for Bethesda III and IV (7).
Between December 2011 and January 2013, before the use of molecular testing in our institution, 52% of patients with Bethesda III and IV FNAC underwent surgery, with a malignancy rate of 31% for Bethesda III and 32% for Bethesda IV on histopathology. As such, correct identification of malignant nodules and avoidance of unnecessary repeat FNAC and surgical procedures for benign nodules have emerged as the cornerstone of management.
Molecular testing has more recently established its role in patients with indeterminate cytology (Bethesda categories III and IV), and updated guidelines recommend its use. Molecular testing is classified as predictive testing. One such test is the Afirma gene expression classifier (GEC), introduced in 2011 and since revamped, which analyzes the mRNA expression of 167 genes. The more recent Afirma genomic sequencing classifier (GSC), available since 2017, analyzes >10,000 nuclear and mitochondrial genes and demonstrated improved specificity compared with the GEC (8). The results of this commercial test are classified as benign or suspicious.
ThyroSeq v3, commercially available since late 2017, analyzes 112 genes, providing information regarding a variety of genetic alterations, including point mutations, insertions/deletions, gene fusions, copy number alterations, and abnormal gene expression (9). These results are classified as negative or positive, and provide potential management based on the genetic alteration identified. The results then can be used to guide clinical decision making regarding surgical intervention versus clinical monitoring. Recently there has been an increase in publication of independent studies assessing the performance of these molecular tests and comparing “real-world” data with the validation studies.
The purpose of this study is to review indeterminate FNAC samples that utilized one or more of the three aforementioned molecular tests, determine the outcomes with respect to the genomic testing results, and compare them with histological outcomes for those who underwent surgical resection. This will allow us to compare the clinical performance of the three genomic classifiers used at our institution.
Materials and Methods
We conducted a retrospective diagnostic accuracy study of all the thyroid nodules that had molecular testing with Afirma GEC or GSC or Thyroseq v3 at Baystate Medical Center (Springfield, MA) from January 2014 to January 2020. Cytologically indeterminate nodules with Bethesda category III or IV on initial or repeat fine needle aspiration were included in the study population. We described patient and thyroid nodule characteristics of those that were cytologically indeterminate and had molecular testing.
The choice of commercial molecular test was determined by the institutional subscription in place at the time the FNAC was performed. Patients seen between 2014 and October 2017 had Afirma GEC, those who underwent FNAC between October 2017 and mid-2018 had Thyroseq v3, and those between mid-2018 and the present had Afirma GSC. Currently, our institution solely uses the Afirma GSC. The study was approved by the Baystate Health Institutional Review Board. The study did not receive any financial support or review by any commercial entity.
Tissue samples
Thyroid FNAC specimens were obtained under ultrasound guidance by endocrinologists or radiologists affiliated with Baystate Health. All cytology evaluations were performed at Baystate Health by board-certified cytopathologists using the TBSRTC system. Molecular analyses were performed at either the Veracyte clinical laboratory or the Thyroseq clinical laboratory.
The decision whether to perform surgery or to clinically monitor thyroid nodules was made according to the clinical judgment of the treating physician and patient preference. Histopathologic data were collected on all nodules that were surgically resected to calculate measures of test performance.
Statistical analyses
Continuous measures (patient age) are described using means and standard deviations. Categorical variables are reported using numbers and percentages. Comparisons across molecular tests were conducted using analysis of variance (ANOVA) for continuous measures and Fisher exact test for categorical measures. Statistical comparisons were considered exploratory in nature with p-values <0.20 being suggestive of possible differences across distributions.
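As an illustration of the categorical comparison described above, the following is a minimal sketch of a two-sided Fisher exact test for a 2x2 table, written with the Python standard library only. It is not the authors' Stata code. The function name `fisher_exact_two_sided` is hypothetical, and the example table is an assumption built from counts reported in the Results (92 GEC nodules with 37 suspicious calls; 73 GSC nodules with 16 suspicious calls), treating every non-suspicious result as benign.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    # The small relative tolerance guards against floating-point ties.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# Illustrative table (assumption): benign vs. suspicious calls,
# GEC (55 vs. 37) against GSC (57 vs. 16).
p_value = fisher_exact_two_sided(55, 37, 57, 16)
print(f"two-sided p = {p_value:.4f}")
```

Under the exploratory threshold stated above, a p-value below 0.20 from such a table would be read as suggestive of a difference rather than as a confirmatory result.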
Among subjects with a positive molecular test and surgical resection or subjects with a negative molecular test, we calculated measures of test performance. In particular, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) along with 95% confidence intervals were calculated for each molecular test using surgical histopathology as the gold standard.
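The text does not state which method was used for the 95% confidence intervals. As a hedged illustration only, the sketch below computes a Wilson score interval for a single proportion; the choice of the Wilson method and the function name `wilson_interval` are assumptions, and the intervals in the published table may have been obtained differently (for example, with an exact binomial method). The example uses the GSC PPV counts from the Results (8 malignant among 13 resected suspicious nodules).

```python
from math import sqrt


def wilson_interval(successes, total, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# PPV example: 8 malignant nodules among 13 resected GSC-suspicious nodules.
low, high = wilson_interval(8, 13)
print(f"PPV = {8 / 13:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

With only 13 resected nodules the interval is wide, which is one way to see why the small GSC sample limits how firmly the point estimate can be interpreted.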
Of note, as a majority with negative molecular testing had no surgical resection, we only included those who had at least one stable follow-up ultrasound one year or more after molecular testing as true negatives. We excluded those who did not have follow-up imaging or had interval significant growth with unknown or pending repeat FNAC or surgical evaluation. As our focus is on the performance of the molecular test with regard to the specific nodule that had the FNAC, any incidental malignancies were considered benign on histopathology. In addition, noninvasive follicular thyroid neoplasms with papillary-like nuclear features (NIFTPs) were not classified as malignant. Analysis was done using Stata MP v16.1 (College Station, TX).
Results
The total number of thyroid FNACs at our institution between January 2014 and January 2020 was 2752, with 16% Bethesda III (440) and 4% Bethesda IV (110). A total of 224 thyroid nodules in 214 patients were eligible for inclusion in our study based on molecular testing being performed (Afirma GEC n = 92, Afirma GSC n = 73, Thyroseq v3 n = 59) (Table 1). The total number of nodules with indeterminate cytopathology is larger than the 224 that were eligible since many nodules had >1 FNAC before molecular testing. There were 10 patients who had >1 nodule that had an FNAC and subsequent molecular testing; therefore, the total number of nodules is greater than the total number of patients. The mean age in the GEC group (n = 89) was 56.6 years, that in the GSC group (n = 70) was 56.4 years, and that in the Thyroseq group (n = 55) was 55.3 years. Seventy-one percent of the patients were female in the GEC group, 83% in the GSC group, and 78% in the Thyroseq group.
Table 1. Patient Characteristics and Thyroid Nodule Data
Patient characteristic statistics presented as mean ± SD or n (%). Thyroid nodule data presented as n (%). One-way ANOVA was used to calculate p-value for age. Otherwise Fisher exact test was used to calculate remaining p-values.
Missing two nodule measurements—FNA done by radiology and no measurements were reported.
Three were nondiagnostic (Bethesda I), three were benign (Bethesda II) on first FNA.
Five were benign (Bethesda II), 1 was nondiagnostic (Bethesda I), 1 was suspicious for malignancy (Bethesda V) on first FNA.
One was benign (Bethesda II) on first FNA.
ANOVA, analysis of variance; FNA, fine needle aspiration; GEC, gene expression classifier; GSC, genomic sequencing classifier; SD, standard deviation.
All nodules, with the exception of 3 (in the GSC group), were >1 cm in size. Seventy-six percent of nodules in the GEC group, 66% in the GSC group, and 78% in the Thyroseq group were between 1 and 2.9 cm. All patients were euthyroid.
Nodules were included if they had one indeterminate FNAC (Bethesda III or IV) and underwent molecular testing even if the initial or subsequent repeat FNAC was in Bethesda category I or II (n = 26). Of those who were Bethesda II on repeat FNAC, 5 were suspicious on molecular testing and 2 revealed papillary microcarcinoma on final surgical histopathology. Patient characteristics and thyroid nodule data are summarized in Table 1.
Benign call rates
The numbers of benign/negative and suspicious/positive results obtained with Afirma GEC, GSC, and Thyroseq v3 are given in Table 2. Overall, including all samples, the benign call rate (BCR) was 60% for GEC, 78% for GSC, and 66% for Thyroseq v3. Using only those with benign/negative molecular testing and follow-up imaging, the BCR was 55% for GEC, 51.9% for GSC, and 51.5% for Thyroseq v3.
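The overall BCRs can be reproduced from the group sizes and the suspicious/positive counts reported in this article. The short sketch below does so under the assumption that every result not called suspicious/positive was called benign/negative; the dictionary layout and the function name `benign_call_rate` are illustrative only.

```python
# Total nodules tested and suspicious/positive calls, as reported in the Results.
counts = {
    "Afirma GEC": {"total": 92, "suspicious": 37},
    "Afirma GSC": {"total": 73, "suspicious": 16},
    "Thyroseq v3": {"total": 59, "suspicious": 20},
}


def benign_call_rate(total, suspicious):
    """Fraction of tested nodules with a benign/negative molecular result."""
    return (total - suspicious) / total


bcr = {test: benign_call_rate(**c) for test, c in counts.items()}
for test, rate in bcr.items():
    print(f"{test}: BCR = {rate:.1%}")
```

Rounded to whole percentages, these values agree with the overall BCRs of 60%, 78%, and 66% quoted above.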
Table 2. Comparison of Benign Versus Suspicious/Positive Results
Data presented as n (%). Fisher exact test used to calculate p-values.
Three were nondiagnostic (Bethesda I), three were benign (Bethesda II) on first FNA.
Five were benign (Bethesda II), 1 was nondiagnostic (Bethesda I), 1 was suspicious for malignancy (Bethesda V) on first FNA.
One was benign (Bethesda II) on first FNA.
Surgical outcomes
In the GEC group, 84% (31/37) of the nodules that were suspicious were resected, compared with 81% (13/16) in the GSC group and 80% (16/20) in the Thyroseq v3 positive group. Surgical histopathology revealed malignancy in 6 of 31 in the GEC group, 8 of 13 in the GSC group, and 3 of 16 in the Thyroseq v3 group (Table 3).
Table 3. Malignancy Rate by Test Type and Surgical Histopathology and Measurements of Test Performance
Data presented as n (%).
Thirty-seven total suspicious, six did not have surgery.
Sixteen total suspicious, three did not have surgery.
Twenty total suspicious, four did not have surgery.
Four of the microcarcinomas were incidental. Calculations considered the four incidental papillary microcarcinomas in the GEC group as false positives.
Since not all nodules were surgically resected, only those with benign/negative molecular testing who had at least one stable follow-up ultrasound ≥1 year after molecular testing were counted as true negatives. Those who did not have a follow-up ultrasound or had interval growth with unknown or pending repeat fine needle aspiration cytology were excluded. Unoperated suspicious/positive nodules were excluded from calculations.
p-value = 0.05 Fisher exact test.
CI, 95% confidence interval.
There were six patients in the GEC group with suspicious results who did not have surgery. Two patients deferred surgery; one had a follow-up ultrasound that revealed stable nodule size and characteristics over a six-month period, and the other had carotid artery duplex scans a few years later that commented on the thyroid nodule, which was stable in size over a four-year period. One patient cancelled surgery, one passed away from diffuse B cell lymphoma, and another had surgery planned but was diagnosed with metastatic neuroendocrine carcinoma of the lung, and surgery was cancelled. One patient is still awaiting surgery to date.
There were three nodules (two individuals) in the GSC group that were suspicious but were not resected. One individual did not attend the appointment with the surgeon. The other individual, who had two indeterminate nodules, deferred surgery; however, a recent ultrasound showed interval nodule growth, and evaluation by an endocrine surgeon for thyroidectomy was again recommended.
Of the 3 who had a positive Thyroseq v3 and subsequent malignant surgical histopathology, the estimated risk of malignancy was intermediate for the 2 with NRAS mutations, and intermediate–high for the 1 with detected HRAS mutation.
Of the 13 who had a positive Thyroseq v3 and subsequent benign surgical histopathology, 3 were considered high risk (2 with TERT mutation, 1 with PAX8-PPARG fusion), 8 were intermediate–high risk (2 with NRAS, 4 with THADA-IGF2BP3 fusion, 1 with HRAS, 1 with multiple chromosomal copy alterations), 1 was intermediate risk (NRAS mutation), and 1 was intermediate–low risk (DICER1 mutation).
There were 3 individuals (4 different nodules) in the Thyroseq v3 group who were positive who did not have surgery. Two did not show for their surgery appointments and one patient moved out of state.
Among those with benign molecular testing that were surgically resected (n = 3 from the GEC group, n = 2 from the GSC group, and n = 2 from the Thyroseq v3 group), one from the GEC group and one from the Thyroseq v3 group were malignant on final histopathology. From the GEC group, this was a 2.3 cm nodule in a 34-year-old female that was Bethesda IV on FNAC but GEC benign. Final histopathology revealed grossly encapsulated follicular carcinoma (2.8 cm) with angioinvasion. From the Thyroseq v3 group, this was a 2.7 cm right-sided nodule in a 41-year-old female that was Bethesda III on FNAC and Thyroseq negative. She subsequently sought a second opinion at an outside institution and ultimately had total thyroidectomy there that revealed a 3.3 cm follicular neoplasm.
Table 4 summarizes the final histopathology of the nodules with Afirma suspicious or Thyroseq v3 positive molecular testing who had surgery.
Table 4. Final Histopathology of Nodules with Suspicious/Positive Molecular Testing Who Had Surgery
Five were papillary microcarcinomas, four of which were incidental.
One was papillary microcarcinoma.
One was a papillary microcarcinoma.
Two were papillary microcarcinomas.
NIFTP, noninvasive follicular thyroid neoplasm with papillary-like nuclear features.
Measurements of test performance
Overall, the GSC specificity (73.7%), sensitivity (100%), and PPV (61.5%) were higher than the GEC specificity (60.4%), sensitivity (85.7%), and PPV (22.2%) and Thyroseq v3 specificity (55.2%), sensitivity (75%), and PPV (18.8%) (Table 3). There were 10 total malignancies found in the GEC suspicious cohort on surgical histopathology; however, 4 of them were incidental papillary microcarcinomas not in the nodule that had the molecular testing and, therefore, were considered false positives.
Discussion
Molecular testing has become increasingly common in the management of nodules with indeterminate cytology. Ideally, a molecular test would have both high sensitivity and high specificity, so that it could serve as both a rule-out and a rule-in test, respectively. Afirma and Thyroseq are two commercially available molecular testing platforms.
There have been recent publications from various institutions reporting “real-world” experience with these molecular tests that differed from the original validation studies. Similar to other studies, we compared our results from Afirma GEC and GSC, but we were also able to compare these results with our use of Thyroseq v3. Jug et al. evaluated the performance of Afirma GEC and Thyroseq (7-gene and v2) (10); however, as far as we know, no other authors have compared Afirma GSC with Thyroseq v3 to date. We acknowledge that there is an ongoing randomized clinical trial comparing performance of Afirma GSC and Thyroseq v3 within the UCLA Health System, with anticipated completion date of August 1, 2021 (11).
The Afirma GEC was initially designed as a “rule out” test with high sensitivity and NPV (94%). The Afirma GSC had improved specificity and PPV, while maintaining a high sensitivity and NPV. Validation studies for Afirma GEC reported a sensitivity of 90% and specificity of 52% for thyroid nodules with Bethesda III or IV cytology. The Afirma GSC had improved sensitivity of 91% and specificity of 68% (NPV 96%, PPV 47% at 24% cancer prevalence) (8).
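Because PPV and NPV depend on disease prevalence, it can help to see how the validation figures quoted above fit together. The sketch below applies Bayes' rule to the reported GSC sensitivity (91%), specificity (68%), and cancer prevalence (24%); the function name `predictive_values` is illustrative, and the inputs are the rounded published values, so small rounding differences are possible.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by sensitivity, specificity, and prevalence."""
    tp = sensitivity * prevalence              # true-positive fraction
    fn = (1 - sensitivity) * prevalence        # false-negative fraction
    tn = specificity * (1 - prevalence)        # true-negative fraction
    fp = (1 - specificity) * (1 - prevalence)  # false-positive fraction
    return tp / (tp + fp), tn / (tn + fn)


# Afirma GSC validation figures quoted in the text.
ppv, npv = predictive_values(sensitivity=0.91, specificity=0.68, prevalence=0.24)
print(f"PPV = {ppv:.0%}, NPV = {npv:.0%}")
```

The same function shows why a test with unchanged sensitivity and specificity yields a lower PPV in a cohort with a lower rate of malignancy, which is relevant when comparing institutional cohorts with validation cohorts.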
In the past 1–2 years, there have been a small number of publications on institutional experiences with Afirma GEC and GSC. Harrell et al. published 11 months of clinical outcomes experience from a community endocrine surgical practice with the GSC and compared them with their 6.5-year experience with the GEC. They found that the GSC identified fewer indeterminate cytology nodules as suspicious (38.8%) compared with GEC (58.4%), and the surgery rate fell from 56% in the GEC group to 31% in the GSC group (12). Overall, the study concluded that the GSC had improved specificity (44% compared with 32% for the GEC) without compromising sensitivity.
Similarly, Endo et al. found that GSC had a statistically significantly higher BCR than the GEC (76.2% vs. 48.1%, p < 0.001), with a decrease in surgery rate from 52.5% to 17.6% (p < 0.001). GSC also had improved PPV (60% vs. 33.3%, p = 0.01) and specificity (94.3% vs. 61.4%, p < 0.001) compared with GEC (13). Angell et al. (14) and Wei et al. (15) also found a higher BCR for GSC compared with GEC that was statistically significant, with improvement in PPV. Most recently, San Martin et al. from Cleveland Clinic published a comparison of the GEC and GSC and reported a significant difference in BCR (BCR was 41% for GEC, 67.8% for GSC with p < 0.001) and improved specificity and PPV of the test, resulting in fewer diagnostic surgeries (surgery rates decreased from 47.8% to 34.7%) (16). The reported specificity for the GSC was higher at 94% compared with GEC at 60% (p < 0.001). Similarly, the PPV of GSC was higher at 85.3% compared with 40% for the GEC (p < 0.001) (16). In our institution, we also found a higher BCR with the GSC (78%) than with GEC (60%). Of note, our BCR for GSC was higher than those previously described at other institutions. We also found improved specificity with the GSC, 73.7% compared with 60.4% with GEC, and improved PPV of 61.5% compared with 22.2%.
In the Thyroseq v3 validation study, the sensitivity and specificity were 98% and 82%, respectively, in 175 FNAC samples with known surgical follow-up (9). A subsequent prospective double-blinded multicenter clinical validation study reported a sensitivity of 94%, specificity of 82%, NPV of 97%, and PPV of 66% on 247 Bethesda category III and IV nodules of known final surgical histopathology (disease prevalence of 28%). The BCR was 61% for thyroid FNAC classified as Bethesda category III and IV (17).
A recently published institutional experience using Thyroseq v3 from the University of Pennsylvania reports an NPV of 99.5% for Bethesda III nodules and 95.4% for Bethesda IV nodules, with an overall BCR of 71% (18). One hundred twenty-seven nodules in the cohort underwent surgical resection (96 of the ThyroSeq positive and 31 of the ThyroSeq negative nodules) and, therefore, had surgical histopathology. Fifty-five percent of the 127 were malignant (70/127), which is similar to the original Thyroseq v3 validation study, which had a malignancy rate of 53% out of 175 FNAC samples with known histopathology (9).
If we assess measures of test performance only on those 127 with surgical histopathology, based on their reported numbers (65 true positives, 31 false positives, 26 true negatives, and 5 false negatives) (18), this gives a sensitivity of 92.8% but a specificity of 45.6%, which is considerably lower than the specificity of 81.8% reported in the validation study.
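The recalculation described above can be checked directly from the four reported cell counts. The following is a minimal sketch using the standard definitions of the four measures; the function name `diagnostic_metrics` is illustrative, and the counts are those quoted in the text for the 127 resected nodules.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard 2x2 diagnostic accuracy measures from cell counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Counts restricted to the 127 nodules with surgical histopathology.
metrics = diagnostic_metrics(tp=65, fp=31, tn=26, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Restricting the denominator to resected nodules removes most unoperated test-negative nodules from the true-negative count, which is one reason specificity computed this way can fall well below a figure that counts unoperated negatives.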
Our BCR (66%) was similar to that of the aforementioned study; however, we had a specificity of 55.2% with a PPV of only 18.8%. Of note, NIFTP was classified as malignant in the cohort by Desai et al., whereas it was classified as benign in our cohort.
Upon comparison of our GEC, GSC, and Thyroseq v3 data, the GSC had the highest specificity and PPV at our institution, while the Thyroseq v3 was comparable with GEC with regard to specificity and PPV.
Further long-term follow-up studies are needed to include those nodules from our cohort that are currently pending surgery, to obtain longer clinical follow-up on nodules that had benign molecular testing, and to include more samples overall from the GSC group to better assess test performance. We are unable to explain the difference in performance of the Thyroseq v3 in our institution compared with some of the published data. Rather than reaching a definitive conclusion, our study raises further questions and identifies a need for additional institutional experiences with Thyroseq v3 and for a larger study that performs a real-life comparison of the performance of Afirma GSC and Thyroseq v3. The results of the clinical trial that is currently underway in the UCLA Health System will hopefully help clarify the diagnostic utility of these tests and inform the choice of molecular testing (11).
We do acknowledge some limitations to our study. Overall, we had a small sample size, and due to the more recent use of the Afirma GSC and Thyroseq v3 compared with GEC, there were overall fewer samples for those two categories. In addition, there were individuals with nodules in each group that were suspicious/positive who did not pursue surgery. The addition of the final histopathology of these nodules would likely impact the tests' performances. Due to the small numbers, we were unable to further separate out Hurthle cell dominant nodules. Lastly, and similar to other institutional data, we do not have the final histopathology on all cases, and unoperated GEC, GSC, and Thyroseq v3 benign nodules were assumed to be true negatives if they had at least one follow-up stable ultrasound. Those who did not have follow-up imaging were excluded.
In summary, from our short-term institutional experience, we found that the GSC classified more cytologically indeterminate nodules as benign than the Afirma GEC, sparing more patients from surgery, and had improved specificity and PPV, which is similar to other institutions' reported experiences (10–14). We also found that the Thyroseq v3 was similar to the Afirma GEC in terms of specificity and PPV.
Footnotes
Authors' Contributions
M.G. carried out data entry, methods and results, statistics calculations, and conclusion. K.F. carried out introduction, data entry, and statistics review. I.O. performed data entry, review and editing of the article, statistics calculations, table revisions, article revisions, and submission and supervision.
Acknowledgment
We thank our statistician, Alexander Knee, who took the time to help us with our data.
Author Disclosure Statement
No competing financial interests exist.
Funding Information
No funding was received for this article.
