A Systematic Review of the Methods of Diagnostic Accuracy Studies of the Afirma Gene Expression Classifier

Abstract

Background:

The Afirma® Gene Expression Classifier (GEC) risk stratifies The Bethesda System for the Reporting of Thyroid Cytopathology class III/IV (indeterminate) thyroid nodules (ITNs) as suspicious for malignancy or benign. Several authors have published studies describing the diagnostic accuracy of the GEC. However, the quality of the methods used in these studies has not been rigorously examined.

Summary:

In this study, MEDLINE and EMBASE were searched for studies published between January 1, 2010, and June 30, 2016, examining the sensitivity, specificity, negative predictive value, and positive predictive value of the GEC. The Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool was customized to evaluate the methods of included studies in each of four domains: nodule selection, index test execution, reference standard assignment, and flow and timing. Signaling questions were used to identify sources of potential bias in calculation of diagnostic accuracy, and issues of applicability were assessed. Three panelists applied the QUADAS-2 tool to each included study, and divergence was resolved in conference. In 12 studies evaluated, the most common methodologic flaw was lack of reference standard diagnosis assignment to un-excised GEC-benign ITNs. Exclusion of these ITNs from the analyses resulted in unreliable estimates of specificity and negative predictive value. Other flaws identified included restriction to ITNs that had already been selected for referral for thyroidectomy or lobectomy.

Conclusions:

Future studies should define and assign a “true negative” label to GEC-benign nodules that do not develop malignant signs or symptoms during a pre-specified period of follow-up, and these nodules should be included in calculations of diagnostic accuracy.

Introduction

The Bethesda System for the Reporting of Thyroid Cytopathology (TBSRTC) provides standardized reporting criteria for fine-needle aspiration (FNA) cytopathology from thyroid nodules (1). TBSRTC categories include non-diagnostic (I), benign (II), atypia of undetermined significance/follicular lesion of undetermined significance (AUS/FLUS; III), follicular neoplasm or suspicious for follicular neoplasm (FN/SFN; IV), suspicious for malignancy (V), or malignant (VI). Among FNA specimens deemed TBSRTC category II, risk of malignancy is considered sufficiently low (<5%) to justify clinical surveillance without excision, while that for TBSRTC categories V and VI is considered sufficiently high (60–99%) to warrant lobectomy or total thyroidectomy (1,2). Malignancy risk in TBSRTC categories III and IV (indeterminate thyroid nodules [ITNs]) is between 5% and 30% (1,3). This is higher than the threshold at which clinicians and patients would typically be comfortable observing without excision but lower than a threshold warranting thyroidectomy or lobectomy in all cases, as 27% or more of ITNs do not undergo excision (4). The challenge for providers and patients is accurate risk stratification of ITNs into those that likely represent malignancy and thus should be triaged for surgery, and those that are likely benign and could be safely followed clinically without excision.

The Afirma® Gene Expression Classifier (GEC) measures the expression of 167 gene transcripts, and was developed in ITNs ≥1 cm in diameter undergoing their first evaluation (5). In a prospective, double-blinded, multicenter validation trial, the GEC demonstrated a sensitivity of 90% and a negative predictive value (NPV) of 94% in TBSRTC III/IV nodules (6). Because of the test's high sensitivity and NPV, providers using the GEC to risk stratify ITNs can be confident that a GEC-benign nodule has a risk of malignancy low enough that excision can be safely avoided. Studies have demonstrated a low rate of excision in ITNs that are GEC-benign, with one study demonstrating no difference in operative rates between ITNs that are GEC-benign and TBSRTC category II nodules (7–14).

Since January 2011, the GEC has been available in the United States, and the number of institutions testing some or all ITNs has grown. Several authors have examined the diagnostic accuracy of the GEC by reporting on its sensitivity, specificity, positive predictive value (PPV), and NPV at their institutions (15–26). Few prior authors have rigorously evaluated the methods of these studies to determine the reliability of diagnostic accuracy calculations reported and to determine whether the findings are applicable outside the clinical scenarios described therein (27). The Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool is one method by which reviewers can systematically evaluate the quality of methods of studies evaluating the accuracy of a diagnostic test (28,29). The tool can be used to restrict a systematic review or meta-analysis to those reports that meet a predefined minimum standard for quality. This study sought to evaluate systematically the quality of published studies that purported to report the sensitivity, specificity, NPV, and PPV of the GEC in observational clinical settings using a customized QUADAS-2 tool.

Review

Data sources and search

MEDLINE and EMBASE were searched for studies that reported diagnostic accuracy of the GEC in observational clinical settings published between January 1, 2010, and June 30, 2016. A search strategy previously used to capture diagnostic accuracy studies of the GEC was employed, and titles of retrieved articles for studies that met the inclusion criteria were scanned (27). References of retrieved articles were searched for additional studies that met the inclusion criteria.

Study selection

All studies that reported diagnostic accuracy of the GEC in observational clinical settings were included. Studies reporting enough information to calculate estimates for sensitivity, specificity, NPV, and PPV were included. Studies that evaluated the diagnostic accuracy of the GEC in thyroid nodules other than TBSRTC III or IV were included, as long as they reported results in ITNs separately from other categories of tested thyroid nodules. In some clinical contexts, ultrasonographic features may be used to triage which ITNs undergo GEC testing, with the role for genetic testing being additive in informing decisions regarding excision. However, the studies examined did not explicitly report ultrasonographic features used for selection of ITNs for GEC testing, other than the >1 cm requirement for FNA. Therefore, it was not possible to report the extent to which tested ITNs were selected using ultrasound criteria, or what those criteria were. Studies that only examined thyroidectomy rates for GEC-tested thyroid nodules and did not report on the diagnostic accuracy of the test were excluded. Studies that only reported long-term follow-up of patients with GEC-benign ITNs were also excluded because these studies did not include information regarding GEC-suspicious ITNs, and therefore diagnostic accuracy could not be calculated. One study was excluded because very few ITNs (14 nodules) were tested using the GEC, and three of these nodules came from the same patient (14).

The QUADAS-2 Tool

Multiple studies evaluating the accuracy of a single novel diagnostic test often report markedly heterogeneous results. This variability in diagnostic accuracy calculations stems from underlying variability in the methods employed in these studies, and confounds efforts to aggregate diagnostic accuracy estimates systematically across multiple studies examining the same test. In response, the QUADAS tool was developed to evaluate the methods of reporting the diagnostic accuracy of a test rigorously and systematically (28). QUADAS-2 was developed in response to problems with application of the original QUADAS tool for certain clinical questions, including situations in which the “gold” reference standard diagnosis involves clinical follow-up. In Phases 1–3 of the QUADAS-2 process, reviewers define the clinical question and develop a flow diagram depicting the ideal design for a study evaluating the accuracy of the diagnostic test. In Phase 4, the QUADAS-2 tool is used to assess studies included in the review in each of four domains: patient selection, index test, reference standard, and flow and timing. For each of these domains, reviewers ask one or more signaling questions about the methods of each study. Reviewers customize the QUADAS-2 tool prior to its application to ensure the signaling questions are relevant to the clinical scenario under evaluation.

For this review, the QUADAS-2 tool was customized to define the intended use population of thyroid nodules tested, the recommended execution of the index test (the GEC), and definitions of reference (“gold”) standard diagnoses. In a flow chart, the ideal design for a study examining the performance of the GEC was diagrammed. Customized signaling questions were developed for each of the four QUADAS-2 domains. Three co-authors (Q.D., H.G., and G.R., the “panel”) independently applied the customized QUADAS-2 tool to the studies that met the inclusion criteria. For each signaling question, panelists were required to answer “yes” or “no/unclear.” C.R.T. aggregated the responses to the signaling questions from the independent reviews of each study and identified questions for which there was divergence. “Divergent” answers were those for which there was not three-panelist consensus of either “yes” for a question, or “no/unclear” for a question. Divergent answers among panelists were resolved through teleconference discussions including all three panelists. In some cases, panelists would change their answers for a question on which they previously converged after discussion resulted in clarity regarding the meaning of that question. For all questions, however, the final accepted answers from all three panelists converged (final answer for each either all three “yes” or all three “no/unclear”).
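The divergence rule described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (a question is divergent unless all three panelists give the same answer); the function name, the question identifiers, and the example answers are hypothetical and are not taken from the review's data.

```python
from typing import Dict, List

ALLOWED = ("yes", "no/unclear")


def divergent_questions(answers: Dict[str, List[str]]) -> List[str]:
    """Return the signaling questions on which the panelists did not all agree."""
    divergent = []
    for question, votes in answers.items():
        for vote in votes:
            if vote not in ALLOWED:
                raise ValueError(f"unexpected answer {vote!r} for {question}")
        # Consensus means every panelist gave the same answer.
        if len(set(votes)) > 1:
            divergent.append(question)
    return divergent


# Hypothetical answers from three panelists for one study.
example = {
    "Q1": ["yes", "yes", "yes"],
    "Q2": ["yes", "no/unclear", "yes"],
    "Q3": ["no/unclear", "no/unclear", "no/unclear"],
}
print(divergent_questions(example))
```

In this sketch only the second question would be carried forward to the teleconference for resolution.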

Data analysis and synthesis

For each study evaluated, consensus answers for signaling questions identified whether there were methodologic flaws biasing diagnostic accuracy calculations, and/or whether there were concerns about applicability of the study's findings to clinical scenarios outside of that reported in the study. “No” or “unclear” answers to one or more signaling questions within a domain for a study indicated that methodologic flaws could have resulted in unreliable estimates of test performance, or that findings from the study may not be applicable to the intended use population of thyroid nodules for which the GEC was developed. Finally, the study reports how biases introduced by methodologic flaws in diagnostic accuracy studies of the GEC could impact calculations of sensitivity, specificity, NPV, and PPV (increase, decrease, or unpredictable for each parameter).
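As a rough sketch of this flagging rule, a domain can be marked "H" (high or unclear risk) whenever any consensus answer in that domain is not "yes", and "L" (low risk) otherwise. The function names and the example study below are hypothetical; the number of questions per domain follows the nine customized signaling questions.

```python
from typing import Dict, List


def domain_flag(consensus_answers: List[str]) -> str:
    """Return 'L' only if every signaling question in the domain was answered 'yes'."""
    return "L" if all(answer == "yes" for answer in consensus_answers) else "H"


def study_profile(answers_by_domain: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each QUADAS-2 domain to its flag for a single study."""
    return {domain: domain_flag(answers) for domain, answers in answers_by_domain.items()}


# Hypothetical consensus answers for one study (not data from the review).
hypothetical_study = {
    "patient/nodule selection": ["yes", "yes", "no/unclear"],
    "index test": ["yes"],
    "reference standard": ["no/unclear", "no/unclear"],
    "flow and timing": ["no/unclear", "no/unclear", "yes"],
}
print(study_profile(hypothetical_study))
```

A single "no/unclear" answer is enough to flag a domain, which is why most studies with an incomplete reference standard are flagged in both the reference standard and the flow and timing domains.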

Summary

Twelve studies met the inclusion criteria for the review (15–26). Figure 1 depicts the ideal design of a study examining the diagnostic accuracy of the GEC in an observational setting. The flow chart emphasizes selection of TBSRTC III/IV (indeterminate) nodules ≥1 cm in diameter, undergoing first-line evaluation. The flow chart also allows for assignment of benignity (“true negative”) as the reference standard diagnosis to GEC-benign ITNs that do not undergo thyroidectomy or lobectomy, if follow-up with repeat ultrasounds or repeat FNA does not support a diagnosis of malignancy.

FIG. 1.

Flow chart reflecting ideal design for a study of the diagnostic accuracy of Gene Expression Classifier (GEC) testing.

One to three questions were developed for each QUADAS-2 domain; these nine signaling questions are listed in Table 1. Using the signaling questions, the panel identified no major methodologic flaws in just 1/12 studies evaluated (Supplementary Table S1; Supplementary Data are available online at www.liebertpub.com/thy). Of the remaining 11 studies, 10 had five or more major methodologic flaws identified. For patient/nodule selection (Domain I), a common methodologic flaw identified by the panel was the restriction of the studied population to ITNs that had already been evaluated and selected for thyroidectomy or lobectomy. In many studies, the panel also identified inclusion of thyroid nodules for which the GEC is not indicated, including those <1 cm in diameter or of TBSRTC categories other than III or IV. For execution of the index test (Domain II), the panel noted a lack of explicit description of execution of the GEC in 7/12 studies. Finally, the most common methodologic flaw identified pertained to assignment of reference standard diagnoses (Domains III and IV), with 10/12 studies failing to assign a reference standard diagnosis of any kind to 89% (range 75–100%) of GEC-benign ITNs.

Table 1.

Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) Domain-Specific Signaling Questions Customized for a Review of Studies Examining the Diagnostic Accuracy of GEC Testing

Domain	Signaling questions for the review of Afirma performance studies
Patient/nodule selection	Does the sampled population represent patients undergoing first-line evaluation of a thyroid nodule, who are found to have TBSRTC III/IV (indeterminate) cytology on FNA?
	Do the eligibility criteria for GEC testing, including proportions of patients deemed TBSRTC III/IV on FNA aspirates by cytopathologists, reflect that of most centers?
	Is the study restricted to appropriately included nodules, those ≥1 cm, and with TBSRTC III/IV (indeterminate) cytopathology on first FNA? If other TBSRTC classes are included, are performance calculations by TBSRTC class subgroup reported?
Index test	Is the test executed as recommended (i.e., two needle passes dedicated for Afirma testing)?
Reference standard	Is there an effort to provide a reference standard diagnosis for most patients included in the study, both GEC-benign and GEC-suspicious?
	Are subjects who do not undergo surgery immediately after GEC testing followed clinically?
Flow and timing	Do all patients receive a reference standard diagnosis, from either surgical pathology or clinical follow-up?
	Do the number of patients excluded from the analysis due to lack of reference standard distribute evenly across TP/TN/FP/FN?
	Are no subjects reassigned between TP/TN/FP/FN categories based on additional post-testing clinical knowledge prior to analysis?

GEC, Gene Expression Classifier; TBSRTC, The Bethesda System for the Reporting of Thyroid Cytopathology; FNA, fine-needle aspiration; TP, true positive; TN, true negative; FP, false positive; FN, false negative.

Using the consensus answers to the signaling questions for all four domains, the panel identified whether there was potential bias (Table 2). Answers of “no” or “unclear” to one or more questions within each QUADAS-2 domain flagged a study's findings as biased. Eleven of 12 studies had one or more sources of potential bias. For 10 studies, exclusion of GEC-benign ITNs that did not undergo excision likely resulted in biased calculations due to disproportionate exclusion of “true negatives” from calculations of specificity and NPV. This disproportionality resulted because most of the un-excised (therefore no reference standard) ITNs were GEC-benign (77%; range 50–100%). Other major sources of potential bias included inappropriate patient or nodule selection, with the GEC being applied to populations other than the test's intended use population (TBSRTC III/IV nodules ≥1 cm undergoing the first evaluation for management). The most common inappropriately selected population of ITNs for use of the GEC was nodules that had already been referred for thyroidectomy or lobectomy.
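The effect of excluding un-excised GEC-benign ITNs can be illustrated with a small worked example. All counts below are hypothetical and chosen only to make the arithmetic easy to follow; the second table additionally assumes that every GEC-suspicious nodule is excised and that the few excised GEC-benign nodules include all of the false negatives, which is a simplification rather than a finding of the review.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    tp: int  # GEC-suspicious, malignant on reference standard
    fp: int  # GEC-suspicious, benign on reference standard
    tn: int  # GEC-benign, benign on reference standard
    fn: int  # GEC-benign, malignant on reference standard

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def npv(self) -> float:
        return self.tn / (self.tn + self.fn)

    def ppv(self) -> float:
        return self.tp / (self.tp + self.fp)


# Hypothetical cohort of 200 ITNs in which every nodule has a reference standard.
full_cohort = TwoByTwo(tp=45, fp=75, tn=75, fn=5)

# Same cohort, but only 10 of the 80 GEC-benign nodules are excised
# (assumed here: 5 benign and 5 malignant); 70 true negatives are dropped.
excised_only = TwoByTwo(tp=45, fp=75, tn=5, fn=5)

for label, table in (("full cohort", full_cohort), ("excised only", excised_only)):
    print(
        label,
        f"sens={table.sensitivity():.3f}",
        f"spec={table.specificity():.3f}",
        f"npv={table.npv():.3f}",
        f"ppv={table.ppv():.3f}",
    )
```

Under these assumptions, specificity falls from 75/150 to 5/80 and NPV falls from 75/80 to 5/10, while sensitivity and PPV are unchanged, because only true negatives are removed from the calculation.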

Table 2.

Impact of QUADAS-2 Assessment of Included Studies on Risk of Bias and Applicability Concerns

		Risk of bias				Applicability concerns
Study	Month/Year	Patient selection	Index test	Reference standard	Flow and timing	Patient selection	Index test	Reference standard
Harrell	April/2013	H	H	H	H	H	H	H
Alexander	October/2013	L	L	H	H	L	L	H
Aragon Han	February/2014	H	H	L	L	H	H	L
McIver	April/2014	H	L	H	H	H	L	H
Lastra	May/2014	H	L	H	H	H	L	H
Brauner	July/2015	H	L	H	H	H	L	H
Marti	April/2015	H	H	H	H	H	H	H
Celik	July/2015	H	H	H	H	H	H	H
Yang	August/2015	H	H	H	H	H	H	H
Witt	August/2015	L	L	L	L	L	L	L
Noureldine	November/2015	H	H	H	H	H	H	H
Chaudhary	June/2016	H	H	H	H	H	H	H

L, low risk; H, high or unclear risk.

Answers of “no” or “unclear” to one or more signaling questions in the patient/nodule selection, index test execution, and reference standard domains flagged a study as having potential applicability issues; this was the case for all but one study (see Table 2). For Domain I (patient/nodule selection), restriction to ITNs already referred for thyroidectomy or lobectomy resulted in findings applicable to selected populations with higher malignancy prevalence, and not to most clinical situations in which ITNs are first evaluated and management decisions made. Estimates of malignancy prevalence (including GEC-tested ITNs only, and assuming that there were no additional malignancies in un-excised GEC-benign ITNs) were 21% in TBSRTC III and 31% in TBSRTC IV nodules, but ranged as high as 66% and 50%, respectively, in studies reporting results by TBSRTC subgroups. For Domains III and IV, lack of reference standard diagnosis assignment to the majority of GEC-benign ITNs resulted in applicability concerns for 10/12 studies evaluated. Because of the high sensitivity and NPV of the GEC, most GEC-benign ITNs were not referred for thyroidectomy or lobectomy and thus did not have surgical histopathology available for the assignment of a reference standard diagnosis (6). By restricting analyses to include only ITNs with histopathology, the majority of authors reporting the diagnostic accuracy of the GEC essentially restricted their analysis only to GEC-suspicious ITNs, not only biasing calculations of sensitivity and specificity but also making their findings inapplicable in settings in which both GEC-benign and GEC-suspicious ITNs are managed.

Table 3 describes the predicted impact of methodologic flaws identified using the customized QUADAS-2 tool. For patient/nodule selection (Domain I), inappropriate selection of ITNs for GEC testing changes the underlying prevalence of malignancy and can impact calculated NPV and PPV. For instance, application of the GEC in a cohort of ITNs already evaluated and selected for thyroidectomy or lobectomy may result in concentration of malignancies (increase in prevalence) in the evaluated cohort, which would decrease calculated NPV and increase calculated PPV. Calculation of diagnostic accuracy essentially on “test-positives” (GEC-suspicious ITNs) alone results in reporting falsely low specificity and NPV calculated on a small, non-representative subgroup of GEC-benign ITNs selected for excision. Other methodologic flaws, including testing ITNs <1 cm, testing thyroid nodules of TBSRTC categories other than III and IV, and re-assigning reference standard diagnoses to tested ITNs because of additional clinical information other than the GEC result, can all result in unpredictable impacts on estimates of sensitivity, specificity, NPV, and PPV.
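The dependence of NPV and PPV on prevalence follows from Bayes' rule and can be sketched as below. The sensitivity of 0.90 and the prevalence values of 0.21 and 0.66 are figures mentioned in this review; the specificity of 0.50 is purely an assumed value for illustration, and the function name is hypothetical.

```python
from typing import Tuple


def predictive_values(sensitivity: float, specificity: float, prevalence: float) -> Tuple[float, float]:
    """Return (NPV, PPV) for a test applied at a given prevalence of malignancy."""
    tp = sensitivity * prevalence              # malignant, test suspicious
    fn = (1 - sensitivity) * prevalence        # malignant, test benign
    tn = specificity * (1 - prevalence)        # benign, test benign
    fp = (1 - specificity) * (1 - prevalence)  # benign, test suspicious
    return tn / (tn + fn), tp / (tp + fp)


# Same test characteristics, two different cohorts.
npv_low, ppv_low = predictive_values(0.90, 0.50, prevalence=0.21)
npv_high, ppv_high = predictive_values(0.90, 0.50, prevalence=0.66)

print(f"prevalence 21%: NPV={npv_low:.2f}, PPV={ppv_low:.2f}")
print(f"prevalence 66%: NPV={npv_high:.2f}, PPV={ppv_high:.2f}")
```

With sensitivity and specificity held fixed, moving from the lower-prevalence cohort to the higher-prevalence cohort lowers the calculated NPV and raises the calculated PPV, which is the direction of change shown for the first Domain I flaw in Table 3.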

Table 3.

Expected Impact of Methodologic Flaws on Diagnostic Accuracy Estimates

		Impact on measures of test performance
Domain	Example of flaw triggering “no” or “unclear” to signaling question	Sensitivity	NPV	Specificity	PPV
I	Subjects with indeterminate nodule already referred for surgery by endocrinologist likely to have atypically high underlying prevalence of malignancy (20)	—	↓	—	↑
	Center-specific cytopathology practices triage malignant nodules into TBSRTC classes other than III/IV, which decreases underlying prevalence of malignancy in GEC-tested sample (13)	—	↑	—	↓
	Subjects referred for GEC testing with TBSRTC II or VI nodules on FNA; the GEC has unknown performance in such nodules (17)	Unpredictable	Unpredictable	Unpredictable	Unpredictable
II	Proportion of GEC indeterminates increases because two dedicated needle passes not provided for test (10)	Unpredictable	Unpredictable	Unpredictable	Unpredictable
III	Reference standard only determined for patients who underwent surgery, so many patients do not have reference diagnosis determined (10,11,13–18,20,21)	—	↓	↓	—
	Lack of clinical follow-up for unoperated patients results in many subjects who do not have reference diagnosis determined (10,11,13–18,20,21)	—	↓	↓	—
IV	Because of lack of reference standard diagnosis for many patients, analyses performed on nonrepresentative subset of the study's cohort (10,11,13–18,20,21)	—	↓	↓	—
	GEC-benign (potential true negatives) are disproportionately excluded from the analyses, resulting in calculation of performance on nonrepresentative subgroup of operated, GEC-benign subjects (10,11,13–18,20,21)	—	↓	↓	—
	An untested larger nodule disparate from the GEC tested nodule motivated author to reassign an inappropriately tested subcentimeter nodule from false negative to true negative (10)	Unpredictable	Unpredictable	Unpredictable	Unpredictable

QUADAS-2 domains are: I, patient/nodule selection; II, index test; III, reference standard; and IV, flow and timing.

Conclusions

Using a customized QUADAS-2 tool, this study found serious methodologic flaws in the majority of 12 diagnostic accuracy studies that evaluated the GEC. These methodologic flaws likely biased the calculation of diagnostic accuracy parameters, and raise concerns about the applicability of many authors' findings outside of their own clinical contexts. In a prospective, blinded, multicenter trial, the GEC was validated using surgical histopathology for both GEC-benign and GEC-suspicious ITNs (6). Given the high sensitivity and NPV demonstrated, clinical surveillance of GEC-benign ITNs is now an acceptable standard of care. A better understanding of the diagnostic accuracy of the GEC as it is used in real-world observational settings would ideally include information about GEC-benign ITNs that do not undergo thyroidectomy or lobectomy.

This review updates that of prior authors who have reviewed studies of the diagnostic accuracy of the GEC, from a stop date of August 30, 2015, to a stop date of June 30, 2016 (27). The finding of lack of rigor in the assignment of reference standard diagnoses as a potential source of bias corroborates that of prior authors for five of the six studies included in both reviews (15–19,21). Despite concerns about reference standard-related biases that both reviews identified, those authors chose to proceed with a meta-analysis and produced pooled estimates of sensitivity and specificity using a bivariate normal model for the logit transforms of these parameters. Because GEC-benign ITNs that were not assigned a reference standard diagnosis were excluded from the analysis, their estimate for specificity is likely falsely low.

Heterogeneity in the methods of diagnostic accuracy studies of the GEC has resulted in estimates of test performance that are unreliable and therefore differ from those reported in the original prospective blinded multicenter trial (6). The most common methodologic flaw in the studies reviewed was disproportionate lack of reference standard diagnosis for the GEC-benign ITNs. Figure 1 presents a solution to the lack of histopathology in GEC-benign ITNs, namely the assignment of “true-negative” designation to un-excised GEC-benign ITNs that demonstrate no features consistent with malignancy upon repeated ultrasound or cytopathologic assessment. Another common methodologic flaw identified was the restriction of tested ITNs to those already referred to surgery. Inclusion of GEC-tested ITNs undergoing first-line evaluation at the source clinic (e.g., endocrinology or primary-care clinic) from which ITNs are triaged for surgery would better sample the population to which the GEC is applicable and would result in diagnostic accuracy estimates that are more reliable.

All but one of the studies reviewed did not meet a minimal quality standard in terms of assignment of reference label and appropriate population selection for inclusion in a meta-analysis, so the study did not proceed with generation of pooled sensitivity and specificity estimates. Such an analysis awaits adoption by future investigators of a less flawed, more standardized approach to the question of diagnostic accuracy of the GEC in observational settings in which many GEC-benign ITNs do not undergo excision. It is proposed that GEC-benign ITNs undergoing repeat ultrasounds once six months after the initial evaluation and once again between 12 and 18 months after the first surveillance ultrasound should be considered as true negatives if there are no suspicious changes on these surveillance ultrasounds. GEC-benign ITNs that do develop highly suspicious ultrasonographic features could undergo repeat FNA, with repeat TBSRTC categories of V or above considered false negatives but those with repeat TBSRTC of II, III, or IV remaining as true negatives with ongoing sonographic surveillance every 6–12 months (30). Follow-up of GEC-benign ITNs undergoing surveillance extends to 40 months in some cases. Modifications to this provisional surveillance strategy are expected and would be supported, informed by published updates of these and other followed cohorts of GEC-benign ITNs (7–9). Long-term follow-up data will be important in the development of surveillance strategy because false-negative findings may not become obvious until after long-term follow-up of a slow-growing carcinoma. That said, previous research has demonstrated that TBSRTC II (benign) thyroid nodules that undergo more than three years of follow-up are diagnosed with malignancy no more frequently than those that undergo less than three years of follow-up, and GEC-benign ITNs have demonstrated similar growth to TBSRTC II benign nodules (7,31).
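The proposed assignment of reference labels to un-excised GEC-benign ITNs can be written as a simple decision rule. The sketch below encodes only the provisional criteria stated above; the function and parameter names are hypothetical, and the "unresolved" label for cases the proposal does not address (incomplete surveillance, no repeat FNA, or a non-diagnostic repeat FNA) is an assumption added for illustration.

```python
from typing import Optional


def reference_label(
    surveillance_complete: bool,
    suspicious_ultrasound_change: bool,
    repeat_fna_category: Optional[int] = None,
) -> str:
    """Assign a provisional reference label to an un-excised GEC-benign ITN.

    repeat_fna_category is the TBSRTC category (1-6) of a repeat FNA, if one was done.
    """
    if suspicious_ultrasound_change:
        if repeat_fna_category is None:
            return "unresolved"  # assumption: repeat FNA not yet performed
        if repeat_fna_category >= 5:
            return "false negative"  # repeat TBSRTC V or VI
        if repeat_fna_category in (2, 3, 4):
            return "true negative"  # continues sonographic surveillance
        return "unresolved"  # assumption: category I is not covered by the proposal
    if surveillance_complete:
        return "true negative"  # no suspicious change on both surveillance ultrasounds
    return "unresolved"  # assumption: follow-up window not yet completed


print(reference_label(surveillance_complete=True, suspicious_ultrasound_change=False))
print(reference_label(True, True, repeat_fna_category=5))
print(reference_label(True, True, repeat_fna_category=3))
```

Writing the rule out this way makes explicit which nodules would still lack a reference standard, so that their number can be reported alongside the diagnostic accuracy estimates.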

Malignancies that are captured with a suspicious GEC result vary in their aggressiveness, with some having an indolent or even exceedingly indolent clinical course. The noninvasive follicular thyroid neoplasm with papillary-like features (NIFTP), formerly the noninvasive follicular variant of papillary thyroid carcinoma (NFVPTC), is an example of the latter. In a cohort of 63 ITNs with a suspicious GEC result that underwent excision, 64% of the carcinomas diagnosed were NIFTPs (32). Despite the excellent expected prognosis of these cancers, excision is considered appropriate management, even if a more conservative approach for that surgery is warranted (i.e., lobectomy instead of total thyroidectomy). Therefore, indolent carcinomas, including NIFTPs, should continue to be classified as suspicious by the GEC (i.e., counted as true positives).

The average excision rate for ITNs evaluated without molecular testing across all clinical settings, including primary care, endocrine, and surgical clinics, is not well studied. Studies reporting the malignancy rates for TBSRTC categories suggest that excision rates for untested ITNs could be as low as 55–70%, implying that there is a risk-stratifying process by which providers estimate which ITNs are more likely to be malignant and warrant referral for thyroidectomy or lobectomy, and which are not (4,33). The diagnostic accuracy of this process is not known. In parallel with more rigorous evaluation of the diagnostic accuracy of the GEC, including collection of follow-up information for un-excised GEC-benign ITNs, a better understanding of the sensitivity and specificity of risk stratification of ITNs without molecular knowledge is warranted. In-depth evaluation of other molecular classifiers in ITNs is beyond the scope of this review. However, evaluation of the diagnostic accuracy of such tests as their use expands is warranted and should proceed with the rigor inherent in a QUADAS-2-based review.

In conclusion, a customized QUADAS-2 tool was used to evaluate the quality of studies reporting the diagnostic accuracy of the GEC. In doing so, serious methodologic flaws were found, likely resulting in unreliable, biased estimates that are not applicable in most clinical situations in which the GEC is indicated. Future efforts to examine the diagnostic accuracy of the test should incorporate information about un-excised GEC-benign ITNs, and avoid reporting diagnostic accuracy calculated on a subset consisting mostly of GEC-suspicious ITNs.

Acknowledgments

The authors acknowledge Rebeca Campos Hunter, MPH, for her assistance with the literature search, and Robert Gallop, PhD, for his assistance in preparing the tables.

Author Disclosure Statement

C.R.T. is a consultant for Veracyte, Inc. Q.Y.D., N.L.B., H.G., and G.R. have nothing to declare.

References

1. Cibas, Ali. 2009. The Bethesda System for Reporting Thyroid Cytopathology. Thyroid, 19:1159–1165.

2. Renshaw. 2010. An estimate of risk of malignancy for a benign diagnosis in thyroid fine-needle aspirates. Cancer Cytopathol, 118:190–195.

3. Krauss, Mahon, Fede, Zhang. 2016. Application of the Bethesda classification for thyroid fine-needle aspiration. Arch Pathol Lab Med, 140:1121–1131.

4. Cibas, Baloch, Fellegara, LiVolsi, Raab, Rosai, Diggans, Friedman, Kennedy, Kloos, Lanman, Mandel, Sindy, Steward, Zeiger, Haugen, Alexander. 2013. A prospective assessment defining the limitations of thyroid nodule pathologic evaluation. Ann Intern Med, 159:325–332.

5. Chudova, Wilde, Wang, Rabbee, Egidio, Reynolds, Tom, Pagan, Rigl, Friedman, Wang, Lanman, Zeiger, Kebebew, Rosai, Fellegara, LiVolsi, Kennedy. 2010. Molecular classification of thyroid nodules using high-dimensionality genomic data. J Clin Endocrinol Metab, 95:5296–5304.

6. Alexander, Kennedy, Baloch, Cibas, Chudova, Diggans, Friedman, Kloos, LiVolsi, Mandel, Raab, Rosai, Steward, Walsh, Wilde, Zeiger, Lanman, Haugen. 2012. Preoperative diagnosis of benign thyroid nodules with indeterminate cytology. N Engl J Med, 367:705–715.

7. Angell, Frates, Medici, Liu, Kwong, Cibas, Kim, Marqusee. 2015. Afirma benign thyroid nodules show similar growth to cytologically benign nodules during follow-up. J Clin Endocrinol Metab, 100:E1477–1483.

8. Sipos, Blevins, Chamberlain Shea, Duick, Lakhian, Michael, Thomas, Soasa. 2016. Long-term non-operative rate of thyroid nodules with benign results on the Afirma gene expression classifier. Endocr Pract, 22:666–672.

9. Singer, Hanna, Visaria, Gu, McCoy, Kloos. 2016. Impact of the gene expression classifier on the long-term management of patients with indeterminate thyroid nodules. Curr Med Res Opin, 32:1225–1232.

10. Duick, Klopper, Diggans, Friedman, Kennedy, Lanman, McIver. 2012. The impact of benign gene expression classifier test results on the endocrinologist-patient decision to operate on patients with thyroid nodules with indeterminate fine-needle aspiration cytopathology. Thyroid, 22:996–1001.

11. Dhingra. 2016. Office-based ultrasound-guided FNA with molecular testing for thyroid nodules. Otolaryngol Head Neck Surg, 155:564–569.

12. Zhu, Faquin, Samir. 2015. Relationship between sonographic characteristics and Afirma Gene Expression Classifier results in thyroid nodules with indeterminate fine-needle aspiration cytopathology. AJR Am J Roentgenol, 205:861–865.

13. …, Lam, Rao, Sullivan, Yeh. 2016. Effect of malignancy rates on cost-effectiveness of routine Gene Expression Classifier testing for indeterminate thyroid nodules. Surgery, 159:118–126.

14. Sullivan, Hirschowitz, Fung, Apple. 2014. The impact of atypia/follicular lesion of undetermined significance and repeat fine-needle aspiration: 5 years before and after implementation of the Bethesda system. Cancer Cytopathol, 122:866–872.

15. Harrell, Bimston. 2014. Surgical utility of Afirma: effects of high cancer prevalence and oncocytic cell types in patients with indeterminate thyroid cytology. Endocr Pract, 20:364–369.

16. Alexander, Schorr, Klopper, Kim, Sipos, Nabhan, Parker, Steward, Mandel, Haugen. 2014. Multicenter clinical experience with the Afirma gene expression classifier. J Clin Endocrinol Metab, 99:119–125.

17. Aragon Han, Olsen, Fazeli, Prescott, Pai, Schneider, Tufano, Zeiger. 2014. The impact of molecular testing on the surgical management of patients with thyroid nodules. Ann Surg Oncol, 21:1862–1869.

18. McIver, Castro, Morris, Bernet, Smallridge, Henry, Kosok, Reddy. 2014. An independent study of a gene expression classifier (Afirma) in the evaluation of cytologically indeterminate thyroid nodules. J Clin Endocrinol Metab, 99:4069–4077.

19. Lastra, Pramick, Crammer, LiVolsi, Baloch. 2014. Implications of a suspicious Afirma test result in thyroid fine-needle aspiration cytology: an institutional experience. Cancer Cytopathol, 122:737–744.

20. Brauner, Holmes, Krane, Nishino, Zurakowski, Hennessey, Faquin, Parangi. 2015. Performance of the Afirma gene expression classifier in Hürthle cell thyroid nodules differs from other indeterminate nodules. Thyroid, 25:789–796.

21. Marti, Avadhani, Donatelli, Niyogi, Wang, Wong, Shaha, Ghossein, Lin, Morris LGT, Ho. 2015. Wide inter-institutional variation in performance of a molecular classifier for indeterminate nodules. Ann Surg Oncol, 22:3996–4001.

22. Celik, Whetsell, Nassar. 2015. Afirma GEC and thyroid lesions: an institutional experience. Diagn Cytopathol, 43:966–970.

23. Yang, Sullivan, Zhang, Govind, Levin, Rao, Moatamed. 2016. Has Afirma gene expression classifier testing refined the indeterminate thyroid category in cytology? Cancer Cytopathol, 124:100–109.

24. Witt. 2016. Outcome of thyroid gene expression classifier testing in clinical practice. Laryngoscope, 126:524–527.

25. Noureldine, Olsen, Agrawal, Prescott, Zeiger, Tufano. 2015. Effect of the gene expression classifier on the surgical decision-making process for patients with thyroid nodules. JAMA Otolaryngol Head Neck Surg, 141:1082–1088.

26. Chaudhary, Hou, Shen, Hooda, Li. 2016. Impact of the Afirma gene expression classifier result on the surgical management of thyroid nodules with category III/IV cytology and its correlation with surgical outcome. Acta Cytol, 60:205–210.

27. Santhanam, Khthir, Gress, Elkadry, Olajide, Yaqub, Driscoll. 2016. Gene expression classifier for the diagnosis of indeterminate thyroid nodules: a meta-analysis. Med Oncol, 33:14.

28. Whiting, Rutjes AWS, Westwood, Mallett, Deeks, Reitsma, Leeflang MMG, Sterne JAC, Bossuyt PMM; the QUADAS-2 Group. 2011. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med, 155:529–536.

29. Whiting, Rutjes, Reitsma, Bossuyt, Kleijnen. 2003. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol, 3:25.

30. Gharib, Papini, Garber, Duick, Harrell, Hegedüs, Paschke, Valcavi, Vitti; the AACE/ACE/AME Task Force on Thyroid Nodules. 2016. American Association of Clinical Endocrinologists, American College of Endocrinology, and Associazione Medici Endocrinologi medical guidelines for clinical practice for the diagnosis and management of thyroid nodules–2016 update. Endocr Pract, 22:622–639.

31. Lee, Skelton, Zheng, Schwartz, Perrier, Lee, Bassett, Ahmed, Krishnamurthy, Busaidy, Grubbs. 2013. The biopsy-proven benign thyroid nodule: is long-term follow-up necessary? J Am Coll Surg, 217:81–88.

32. Wong, Angell, Strickland, Alexander, Cibas, Krane, Barletta. 2016. Noninvasive follicular variant of papillary thyroid carcinoma and the Afirma Gene-Expression Classifier. Thyroid, 26:911–915.

33. Bongiovanni, Spitale, Faquin, Mazzucchelli, Baloch. 2012. The Bethesda System for Reporting Thyroid Cytopathology: a meta-analysis. Acta Cytol, 56:333–339.