Abstract
Background:
The Afirma® Gene Expression Classifier (GEC) risk stratifies The Bethesda System for the Reporting of Thyroid Cytopathology class III/IV (indeterminate) thyroid nodules (ITNs) as suspicious for malignancy or benign. Several authors have published studies describing the diagnostic accuracy of the GEC. However, the quality of the methods used in these studies has not been rigorously examined.
Summary:
In this study, MEDLINE and EMBASE were searched for studies published between January 1, 2010, and June 30, 2016, examining the sensitivity, specificity, negative predictive value, and positive predictive value of the GEC. The Quality of Diagnostic Accuracy Studies 2 was customized to evaluate the methods of included studies in each of four domains: nodule selection, index test execution, reference standard assignment, and flow and timing. Signaling questions were used to identify sources of potential bias in calculation of diagnostic accuracy, and issues of applicability were assessed. Three panelists applied the Quality of Diagnostic Accuracy Studies 2 tool to each study included, and divergence was resolved in conference. In 12 studies evaluated, the most common methodologic flaw was lack of reference standard diagnosis assignment to un-excised GEC-benign ITNs. Exclusion of these ITNs from the analyses resulted in unreliable estimates of specificity and negative predictive value. Other flaws identified included restriction to ITNs that had already been selected for referral for thyroidectomy or lobectomy.
Conclusions:
Future studies should define and assign a “true negative” label to GEC-benign nodules that do not develop malignant signs or symptoms during a pre-specified period of follow-up, and these nodules should be included in calculations of diagnostic accuracy.
Introduction
The Afirma® Gene Expression Classifier (GEC) measures the expression of 167 gene transcripts, and was developed in ITNs ≥1 cm in diameter undergoing their first evaluation (5). In a prospective, double-blinded, multicenter validation trial, the GEC demonstrated a sensitivity of 90% and a negative predictive value (NPV) of 94% in TBSRTC III/IV nodules (6). Because of the test's high sensitivity and NPV, providers using the GEC to risk stratify ITNs can be confident that a GEC-benign nodule has a risk of malignancy low enough that excision can be safely avoided. Studies have demonstrated a low rate of excision in ITNs that are GEC-benign, with one study demonstrating no difference in operative rates between ITNs that are GEC-benign and TBSRTC category II nodules (7–14).
Since January 2011, the GEC has been available in the United States, and the number of institutions testing some or all ITNs has grown. Several authors have examined the diagnostic accuracy of the GEC by reporting on its sensitivity, specificity, positive predictive value (PPV), and NPV at their institutions (15–26). Few prior authors have rigorously evaluated the methods of these studies to determine the reliability of diagnostic accuracy calculations reported and to determine whether the findings are applicable outside the clinical scenarios described therein (27). The Quality of Diagnostic Accuracy Studies (QUADAS-2) tool is one method by which reviewers can systematically evaluate the quality of methods of studies evaluating the accuracy of a diagnostic test (28,29). The tool can be used to restrict a systematic review or meta-analysis to those reports that meet a predefined minimum standard for quality. This study sought to evaluate systematically the quality of published studies that purported to report the sensitivity, specificity, NPV, and PPV of the GEC in observational clinical settings using a customized QUADAS-2 tool.
Review
Data sources and search
MEDLINE and EMBASE were searched for studies that reported diagnostic accuracy of the GEC in observational clinical settings published between January 1, 2010, and June 30, 2016. A search strategy previously used to capture diagnostic accuracy studies of the GEC was employed, and titles of retrieved articles for studies that met the inclusion criteria were scanned (27). References of retrieved articles were searched for additional studies that met the inclusion criteria.
Study selection
All studies that reported diagnostic accuracy of the GEC in observational clinical settings were included. Studies reporting enough information to calculate estimates for sensitivity, specificity, NPV, and PPV were included. Studies that evaluated the diagnostic accuracy of the GEC in thyroid nodules other than TBSRTC III or IV were included, as long as they reported results in ITNs separately from other categories of tested thyroid nodules. In some clinical contexts, ultrasonographic features may be used to triage which ITNs undergo GEC testing, with the role for genetic testing being additive in informing decisions regarding excision. However, the studies examined did not explicitly report ultrasonographic features used for selection of ITNs for GEC testing, other than the >1 cm requirement for FNA. Therefore, it was not possible to report the extent to which tested ITNs were selected using ultrasound criteria, or what those criteria were. Studies that only examined thyroidectomy rates for GEC-tested thyroid nodules and did not report on the diagnostic accuracy of the test were excluded. Studies that only reported long-term follow-up of patients with GEC-benign ITNs were also excluded because they did not include information regarding GEC-suspicious ITNs, and therefore diagnostic accuracy could not be calculated. One study was excluded because very few ITNs (14 nodules) were tested using the GEC, three of which came from the same patient (14).
The QUADAS-2 Tool
Multiple studies evaluating the accuracy of a single novel diagnostic test often report markedly heterogeneous results. This variability in diagnostic accuracy calculations stems from underlying variability in the methods employed in these studies, and confounds efforts to aggregate diagnostic accuracy estimates systematically across multiple studies examining the same test. In response, the QUADAS tool was developed to evaluate the methods of reporting the diagnostic accuracy of a test rigorously and systematically (28). QUADAS-2 was developed in response to problems with application of the original QUADAS tool for certain clinical questions, including situations in which the “gold” reference standard diagnosis involves clinical follow-up. In Phases 1–3 of the QUADAS-2 process, reviewers define the clinical question and develop a flow diagram depicting the ideal design for a study evaluating the accuracy of the diagnostic test. In Phase 4, the QUADAS-2 tool is used to assess studies included in the review in each of four domains: patient selection, index test, reference standard, and flow and timing. For each of these domains, reviewers ask one or more signaling questions about the methods of each study. Reviewers customize the QUADAS-2 tool prior to its application to ensure the signaling questions are relevant to the clinical scenario under evaluation.
For this review, the QUADAS-2 tool was customized to define the intended use population of thyroid nodules tested, the recommended execution of the index test (the GEC), and definitions of reference (“gold”) standard diagnoses. In a flow chart, the ideal design for a study examining the performance of the GEC was diagrammed. Customized signaling questions were developed for each of the four QUADAS-2 domains. Three co-authors (Q.D., H.G., and G.R., the “panel”) independently applied the customized QUADAS-2 tool to the studies that met the inclusion criteria. For each signaling question, panelists were required to answer “yes” or “no/unclear.” C.R.T. aggregated the responses to the signaling questions from the independent reviews of each study and identified questions for which there was divergence. “Divergent” answers were those for which there was not three-panelist consensus of either “yes” for a question, or “no/unclear” for a question. Divergent answers among panelists were resolved through teleconference discussions including all three panelists. In some cases, panelists would change their answers for a question on which they previously converged after discussion resulted in clarity regarding the meaning of that question. For all questions, however, the final accepted answers from all three panelists converged (final answer for each either all three “yes” or all three “no/unclear”).
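The divergence rule described above can be illustrated with a minimal Python sketch. It assumes each panelist's answers are recorded as the strings "yes" or "no/unclear"; the function name, the question identifiers, and the example responses are hypothetical and are not taken from the review.

```python
from typing import Dict, List

VALID_ANSWERS = {"yes", "no/unclear"}


def find_divergent_questions(responses: Dict[str, List[str]]) -> List[str]:
    """Return IDs of signaling questions lacking full panel consensus."""
    divergent = []
    for question_id, answers in responses.items():
        unexpected = [a for a in answers if a not in VALID_ANSWERS]
        if unexpected:
            raise ValueError(f"{question_id}: unexpected answers {unexpected}")
        # Consensus means every panelist gave the same answer.
        if len(set(answers)) > 1:
            divergent.append(question_id)
    return divergent


# Hypothetical answers from three panelists for one study.
example = {
    "Q1": ["yes", "yes", "yes"],
    "Q2": ["yes", "no/unclear", "yes"],
    "Q3": ["no/unclear", "no/unclear", "no/unclear"],
}
print(find_divergent_questions(example))  # ['Q2']
```

In this sketch only Q2 would be carried forward to the teleconference, because Q1 and Q3 already show three-panelist agreement.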
Data analysis and synthesis
For each study evaluated, consensus answers for signaling questions identified whether there were methodologic flaws biasing diagnostic accuracy calculations, and/or whether there were concerns about applicability of the study's findings to clinical scenarios outside of that reported in the study. “No” or “unclear” answers to one or more signaling questions within a domain for a study indicated that methodologic flaws could have resulted in unreliable estimates of test performance, or that findings from the study may not be applicable to the intended use population of thyroid nodules for which the GEC was developed. Finally, the study reports how biases introduced by methodologic flaws in diagnostic accuracy studies of the GEC could impact calculations of sensitivity, specificity, NPV, and PPV (increase, decrease, or unpredictable for each parameter).
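As a rough illustration of the flagging rule above, the following Python sketch marks a domain as "H" (high or unclear risk) when any consensus answer in that domain is not "yes", and as "L" (low risk) otherwise. The function names and the example answers are hypothetical; treating an empty domain as low risk is an assumption of this sketch, not a rule stated in the review.

```python
from typing import Dict, List


def domain_risk(consensus_answers: List[str]) -> str:
    """Return 'L' if every consensus answer is 'yes', otherwise 'H'."""
    # Assumption: a domain with no signaling questions is treated as low risk.
    return "L" if all(a == "yes" for a in consensus_answers) else "H"


def study_risk_profile(domains: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each QUADAS-2 domain of one study to 'L' or 'H'."""
    return {name: domain_risk(answers) for name, answers in domains.items()}


# Hypothetical consensus answers for one study, keyed by domain.
study = {
    "I": ["yes", "yes"],
    "II": ["yes"],
    "III": ["no/unclear", "yes"],
    "IV": ["yes", "yes"],
}
print(study_risk_profile(study))
# {'I': 'L', 'II': 'L', 'III': 'H', 'IV': 'L'}
```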
Summary
Twelve studies met the inclusion criteria for the review (15–26). Figure 1 depicts the ideal design of a study examining the diagnostic accuracy of the GEC in an observational setting. The flow chart emphasizes selection of TBSRTC III/IV (indeterminate) nodules ≥1 cm in diameter, undergoing first-line evaluation. The flow chart also allows for assignment of benignity (“true negative”) as the reference standard diagnosis to GEC-benign ITNs that do not undergo thyroidectomy or lobectomy, if follow-up with repeat ultrasounds or repeat FNA does not support a diagnosis of malignancy.

Flow chart reflecting ideal design for a study of the diagnostic accuracy of Gene Expression Classifier (GEC) testing.
One to three questions were developed for each QUADAS-2 domain; these nine signaling questions are listed in Table 1. Using the signaling questions, the panel identified no major methodologic flaws in just 1/12 studies evaluated (Supplementary Table S1; Supplementary Data are available online at
GEC, Gene Expression Classifier; TBSRTC, The Bethesda System for the Reporting of Thyroid Cytopathology; FNA, fine-needle aspiration; TP, true positive; TN, true negative; FP, false positive; FN, false negative.
Using the consensus answers to the signaling questions for all four domains, the panel identified whether there was potential bias (Table 2). Answers of “no” or “unclear” to one or more questions within each QUADAS-2 domain flagged a study's findings as potentially biased. Eleven of 12 studies had one or more sources of potential bias. For 10 studies, exclusion of GEC-benign ITNs that did not undergo excision likely resulted in biased calculations due to disproportionate exclusion of “true negatives” from calculations of specificity and NPV. This disproportion arose because most of the un-excised ITNs (which therefore had no reference standard) were GEC-benign (77%; range 50–100%). Other major sources of potential bias included inappropriate patient or nodule selection, with the GEC being applied to populations other than the test's intended use population (TBSRTC III/IV nodules ≥1 cm undergoing the first evaluation for management). The most common inappropriately selected population of ITNs for use of the GEC was nodules that had already been referred for thyroidectomy or lobectomy.
L, low risk; H, high or unclear risk.
Answers of “no” or “unclear” to one or more signaling questions in the patient/nodule selection, index test execution, and reference standard domains flagged a study as having potential applicability issues; this was the case for all but one study (see Table 2). For Domain I (patient/nodule selection), restriction to ITNs already referred for thyroidectomy or lobectomy resulted in findings applicable to selected populations with higher malignancy prevalence, and not to most clinical situations in which ITNs are first evaluated and management decisions made. Estimates of malignancy prevalence (including GEC-tested ITNs only, and assuming that there were no additional malignancies in un-excised GEC-benign ITNs) were 21% in TBSRTC III and 31% in TBSRTC IV nodules, but ranged as high as 66% and 50%, respectively, in studies reporting results by TBSRTC subgroups. For Domains III and IV, lack of reference standard diagnosis assignment to the majority of GEC-benign ITNs resulted in applicability concerns for 10/12 studies evaluated. Because of the high sensitivity and NPV of the GEC, most GEC-benign ITNs were not referred for thyroidectomy or lobectomy and thus did not have surgical histopathology available for the assignment of a reference standard diagnosis (6). By restricting analyses to include only ITNs with histopathology, the majority of authors reporting the diagnostic accuracy of the GEC essentially restricted their analysis only to GEC-suspicious ITNs, not only biasing calculations of sensitivity and specificity but also making their findings inapplicable in settings in which both GEC-benign and GEC-suspicious ITNs are managed.
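The effect of restricting analysis to ITNs with histopathology can be made concrete with a small worked example in Python. All counts below are hypothetical and chosen only for illustration: a full cohort of 200 tested ITNs is compared with the subset that would remain if every GEC-suspicious nodule were excised but only a few GEC-benign nodules, selected because of other clinical concerns, reached surgery.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Counts of a 2x2 diagnostic accuracy table."""
    tp: int  # GEC-suspicious, malignant
    fp: int  # GEC-suspicious, benign
    fn: int  # GEC-benign, malignant
    tn: int  # GEC-benign, benign

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def ppv(self) -> float:
        return self.tp / (self.tp + self.fp)

    def npv(self) -> float:
        return self.tn / (self.tn + self.fn)


# Hypothetical full cohort in which every nodule has a reference diagnosis.
full = TwoByTwo(tp=45, fp=75, fn=5, tn=75)

# Hypothetical excised-only subset: all GEC-suspicious nodules are kept,
# but only 14 of the 80 GEC-benign nodules (4 malignant, 10 benign) had surgery.
excised_only = TwoByTwo(tp=45, fp=75, fn=4, tn=10)

for label, table in (("full cohort", full), ("excised only", excised_only)):
    print(
        f"{label}: sens={table.sensitivity():.2f} spec={table.specificity():.2f} "
        f"ppv={table.ppv():.2f} npv={table.npv():.2f}"
    )
```

Under these assumed counts, specificity falls from 0.50 to about 0.12 and NPV from about 0.94 to about 0.71 once un-excised GEC-benign nodules are dropped, while PPV is unchanged. The size of the distortion depends entirely on the assumed counts.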
Table 3 describes the predicted impact of methodologic flaws identified using the customized QUADAS-2 tool. For patient/nodule selection (Domain I), inappropriate selection of ITNs for GEC testing changes the underlying prevalence of malignancy and can impact calculated NPV and PPV. For instance, application of the GEC in a cohort of ITNs already evaluated and selected for thyroidectomy or lobectomy may result in concentration of malignancies (increase in prevalence) in the evaluated cohort, which would decrease calculated NPV and increase calculated PPV. Calculation of diagnostic accuracy essentially on “test-positives” (GEC-suspicious ITNs) alone results in reporting falsely low specificity and NPV calculated on a small, non-representative subgroup of GEC-benign ITNs selected for excision. Other methodologic flaws, including testing ITNs <1 cm, testing thyroid nodules of TBSRTC categories other than III and IV, and re-assigning reference standard diagnoses to tested ITNs because of additional clinical information other than the GEC result, can all result in unpredictable impacts on estimates of sensitivity, specificity, NPV, and PPV.
QUADAS-2 domains are: I, patient/nodule selection; II, index test; III, reference standard; and IV, flow and timing.
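The dependence of NPV and PPV on malignancy prevalence noted for Domain I can be sketched with Bayes' rule. In the Python example below, the sensitivity of 0.90 is the value reported for the validation trial, and the prevalences of 0.21 and 0.66 are the overall and highest estimates reported above for TBSRTC III nodules; the specificity of 0.50 is a hypothetical value assumed only for illustration.

```python
from typing import Tuple


def predictive_values(
    sensitivity: float, specificity: float, prevalence: float
) -> Tuple[float, float]:
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


SENSITIVITY = 0.90  # reported for the validation trial
SPECIFICITY = 0.50  # hypothetical value, assumed for illustration only

for prevalence in (0.21, 0.66):
    ppv, npv = predictive_values(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence={prevalence:.2f}: PPV={ppv:.2f}, NPV={npv:.2f}")
```

With test characteristics held fixed, the higher prevalence yields a higher PPV and a lower NPV, which is the direction of change predicted for cohorts already selected for thyroidectomy or lobectomy.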
Conclusions
Using a customized QUADAS-2 tool, this study found serious methodologic flaws in the majority of 12 diagnostic accuracy studies that evaluated the GEC. These methodologic flaws likely biased the calculation of diagnostic accuracy parameters, and raise concerns about the applicability of many authors' findings outside of their own clinical contexts. In a prospective, blinded, multicenter trial, the GEC was validated using surgical histopathology for both GEC-benign and GEC-suspicious ITNs (6). Given the high sensitivity and NPV demonstrated, clinical surveillance of GEC-benign ITNs is now an acceptable standard of care. A better understanding of the diagnostic accuracy of the GEC as it is used in real-world observational settings would ideally include information about GEC-benign ITNs that do not undergo thyroidectomy or lobectomy.
This review updates a prior review of studies of the diagnostic accuracy of the GEC, extending the stop date from August 30, 2015, to June 30, 2016 (27). The finding of lack of rigor in the assignment of reference standard diagnoses as a potential source of bias corroborates that of the prior authors for five of the six studies included in both reviews (15–19,21). Despite concerns about reference standard-related biases that both reviews identified, those authors chose to proceed with a meta-analysis and produced pooled estimates of sensitivity and specificity using a bivariate normal model for the logit transforms of these parameters. Because un-excised GEC-benign ITNs were not assigned a reference standard diagnosis and were thereby excluded from the analysis, their estimate for specificity is likely falsely low.
Heterogeneity in the methods of diagnostic accuracy studies of the GEC has resulted in estimates of test performance that are unreliable and therefore differ from those reported in the original prospective blinded multicenter trial (6). The most common methodologic flaw in the studies reviewed was disproportionate lack of reference standard diagnosis for the GEC-benign ITNs. Figure 1 presents a solution to the lack of histopathology in GEC-benign ITNs, namely the assignment of “true-negative” designation to un-excised GEC-benign ITNs that demonstrate no features consistent with malignancy upon repeated ultrasound or cytopathologic assessment. Another common methodologic flaw identified was the restriction of tested ITNs to those already referred to surgery. Inclusion of GEC-tested ITNs undergoing first-line evaluation at the source clinic (e.g., endocrinology or primary-care clinic) from which ITNs are triaged for surgery would better sample the population to which the GEC is applicable and would result in diagnostic accuracy estimates that are more reliable.
All but one of the studies reviewed did not meet a minimal quality standard in terms of assignment of reference label and appropriate population selection for inclusion in a meta-analysis, so the study did not proceed with generation of pooled sensitivity and specificity estimates. Such an analysis awaits adoption by future investigators of a less flawed, more standardized approach to the question of diagnostic accuracy of the GEC in observational settings in which many GEC-benign ITNs do not undergo excision. It is proposed that GEC-benign ITNs undergoing repeat ultrasounds once six months after the initial evaluation and once again between 12 and 18 months after the first surveillance ultrasound should be considered as true negatives if there are no suspicious changes on these surveillance ultrasounds. GEC-benign ITNs that do develop highly suspicious ultrasonographic features could undergo repeat FNA, with repeat TBSRTC categories of V or above considered false negatives but those with repeat TBSRTC of II, III, or IV remaining as true negatives with ongoing sonographic surveillance every 6–12 months (30). Follow-up of GEC-benign ITNs undergoing surveillance extends to 40 months in some cases. Modifications to this provisional surveillance strategy are expected and would be supported, informed by published updates of these and other followed cohorts of GEC-benign ITNs (7–9). Long-term follow-up data will be important in the development of surveillance strategy because false-negative findings may not become obvious until after long-term follow-up of a slow-growing carcinoma. That said, previous research has demonstrated that TBSRTC II (benign) thyroid nodules that undergo more than three years of follow-up are diagnosed with malignancy no more frequently than those that undergo less than three years of follow-up, and GEC-benign ITNs have demonstrated similar growth to TBSRTC II benign nodules (7,31).
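The provisional labeling rule proposed above for un-excised GEC-benign ITNs could be expressed as a short decision function. The Python sketch below is only one possible reading of that proposal: the function and parameter names are hypothetical, and returning "unassigned" for incomplete surveillance, for a suspicious ultrasound without repeat FNA, or for a repeat TBSRTC category I result is an assumption of this sketch, since the proposal does not address those cases.

```python
from typing import Optional


def label_unexcised_gec_benign(
    completed_both_ultrasounds: bool,
    suspicious_ultrasound_change: bool,
    repeat_tbsrtc: Optional[int] = None,
) -> str:
    """Return 'TN', 'FN', or 'unassigned' for an un-excised GEC-benign ITN."""
    if repeat_tbsrtc is not None and repeat_tbsrtc not in range(1, 7):
        raise ValueError("TBSRTC category must be between 1 and 6")

    if not suspicious_ultrasound_change:
        # True negative only once both surveillance ultrasounds are done.
        return "TN" if completed_both_ultrasounds else "unassigned"

    # Suspicious change: the label depends on the repeat FNA category.
    if repeat_tbsrtc is None:
        return "unassigned"  # assumption: no label without repeat cytology
    if repeat_tbsrtc >= 5:
        return "FN"
    if repeat_tbsrtc in (2, 3, 4):
        return "TN"
    return "unassigned"  # assumption: category I is not covered by the proposal


print(label_unexcised_gec_benign(True, False))      # TN
print(label_unexcised_gec_benign(True, True, 5))    # FN
print(label_unexcised_gec_benign(True, True, 3))    # TN
print(label_unexcised_gec_benign(False, False))     # unassigned
```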
Malignancies that are captured with a suspicious GEC result vary in their aggressiveness, with some having an indolent or even exceedingly indolent clinical course. The noninvasive follicular thyroid neoplasm with papillary-like features (NIFTP), formerly the noninvasive follicular variant of papillary thyroid carcinoma (NFVPTC), is an example of the latter. In a cohort of 63 ITNs with a suspicious GEC result that underwent excision, 64% of the carcinomas diagnosed were NIFTPs (32). Despite the excellent expected prognosis of these cancers, excision is considered appropriate management, even if a more conservative approach for that surgery is warranted (i.e., lobectomy instead of total thyroidectomy). Therefore, indolent carcinomas, including NIFTPs, should continue to be classified as suspicious (i.e., true positives) by the GEC.
The average excision rate for ITNs evaluated without molecular testing across all clinical settings, including primary care, endocrine, and surgical clinics, is not well studied. Studies reporting the malignancy rates for TBSRTC categories suggest that excision rates for untested ITNs could be as low as 55–70%, implying that there is a risk-stratifying process by which providers estimate which ITNs are more likely to be malignant and warrant referral for thyroidectomy or lobectomy, and which are not (4,33). The diagnostic accuracy of this process is not known. In parallel with more rigorous evaluation of the diagnostic accuracy of the GEC, including collection of follow-up information for un-excised GEC-benign ITNs, a better understanding of the sensitivity and specificity of risk stratification of ITNs without molecular knowledge is warranted. In-depth evaluation of other molecular classifiers in ITNs is beyond the scope of this review. However, evaluation of the diagnostic accuracy of such tests as their use expands is warranted and should proceed with the rigor inherent in a QUADAS-2-based review.
In conclusion, a customized QUADAS-2 tool was used to evaluate the quality of studies reporting the diagnostic accuracy of the GEC. In doing so, serious methodologic flaws were found, likely resulting in unreliable, biased estimates that are not applicable in most clinical situations in which the GEC is indicated. Future efforts to examine the diagnostic accuracy of the test should incorporate information about un-excised GEC-benign ITNs, and avoid reporting of diagnostic accuracy calculated on a subset that includes mostly GEC-suspicious ITNs only.
Footnotes
Acknowledgments
The authors acknowledge Rebeca Campos Hunter, MPH, for her assistance with the literature search, and Robert Gallop, PhD, for his assistance in preparing the tables.
Author Disclosure Statement
C.R.T. is a consultant for Veracyte, Inc. Q.Y.D., N.L.B., H.G., and G.R. have nothing to declare.
