Abstract
Background:
Molecular tests for thyroid nodules with indeterminate fine needle aspiration results are increasingly used in clinical practice; however, true diagnostic summaries of these tests are unknown. A systematic review and meta-analysis were completed to (1) evaluate the accuracy of commercially available molecular tests for malignancy in indeterminate thyroid nodules and (2) quantify biases and limitations in studies that validate those tests.
Summary:
PubMed, EMBASE, and Web of Science were systematically searched through July 2021. English language articles that reported original clinical validation attempts of molecular tests for indeterminate thyroid nodules were included if they reported counts of true-negative, true-positive, false-negative, and false-positive results. We performed screening and full-text review, followed by assessment of eight common biases and limitations, extraction of diagnostic and histopathological information, and meta-analysis of clinical validity using a bivariate linear mixed-effects model. Forty-nine studies were included. Meta-analysis of the Afirma Gene Expression Classifier (GEC; n = 38 studies) revealed a sensitivity of 0.92 (confidence interval: 0.90–0.94), specificity of 0.26 (0.20–0.32), negative likelihood ratio (LR−) of 0.32 (0.23–0.44), positive likelihood ratio (LR+) of 1.24 (1.15–1.35), and area under the curve (AUC) of 0.83 (0.74–0.89). Afirma Genomic Sequencing Classifier (GSC; n = 10) had a sensitivity of 0.94 (0.89–0.96), specificity of 0.38 (0.27–0.50), LR− of 0.18 (0.10–0.30), LR+ of 1.52 (1.28–1.87), and AUC of 0.91 (0.62–0.92). ThyroSeq v1 and v2 (n = 10) had a sensitivity of 0.86 (0.82–0.90), specificity of 0.74 (0.59–0.85), LR− of 0.19 (0.13–0.26), LR+ of 3.52 (2.08–5.92), and AUC of 0.86 (0.81–0.90). ThyroSeq v3 (n = 6) had a sensitivity of 0.92 (0.86–0.95), specificity of 0.41 (0.18–0.69), LR− of 0.24 (0.09–0.62), LR+ of 1.67 (1.09–2.98), and AUC of 0.90 (0.63–0.92). Fourteen percent of studies conducted a blinded histopathologic review of excised thyroid nodules, and 8% made the decision to go to surgery blind to molecular test results.
Conclusions:
Meta-analyses reveal a high diagnostic accuracy of molecular tests for thyroid nodule assessment of malignancy risk; however, these studies are subject to several limitations. Limitations and their potential clinical impacts must be addressed and, when feasible, adjusted for using valid statistical methodologies.
Introduction
Diagnostic tests detect and monitor disease and are ubiquitous across fields of medicine. In 2017, the $3.7 billion North American market for molecular diagnostic tests was composed primarily of infectious disease and genetic testing, and the development of advanced cancer diagnostic tests is projected to increase more drastically than any other segment of the market. 1 While diagnostic tests have shown benefits by reducing unnecessary surgeries and hospitalizations 2 –4 and improving the accuracy of decisions made by practitioners, 5,6 they may not always be a helpful component of treatment. Indeed, diagnostic tests can be over-ordered, used inappropriately or without knowledge of their limitations, and may be associated with medical risks and high costs that undermine their utility in guiding medical decision-making. 7
Sensitivity and specificity are key measures of a diagnostic test's validity, and one may be prioritized over the other depending on the clinical scenario. Diagnostic tests aim to minimize risk and cost, which differ between a false-negative diagnosis and a false-positive diagnosis: a false positive may result in unnecessary treatment, whereas a false negative may result in a missed cancer. One measure of validity can therefore be prioritized over another to increase the test's utility, rather than its accuracy. 8 Clinically, the impact of each test must be considered and can be further evaluated by calculating, for example, the negative likelihood ratio (LR−) to evaluate the risk of a false-negative result.
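As a minimal sketch of how these measures relate to the four cells of a diagnostic 2 × 2 table, the following computes sensitivity, specificity, LR+, and LR− from counts. The counts in `example` are hypothetical and are not drawn from any study in this review; the class and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Cells of a diagnostic 2 x 2 table (index test vs. reference standard)."""
    tp: int  # test positive, disease present
    fp: int  # test positive, disease absent
    fn: int  # test negative, disease present
    tn: int  # test negative, disease absent


def sensitivity(c: Counts) -> float:
    # Proportion of diseased cases that the test flags as positive
    return c.tp / (c.tp + c.fn)


def specificity(c: Counts) -> float:
    # Proportion of non-diseased cases that the test flags as negative
    return c.tn / (c.tn + c.fp)


def lr_positive(c: Counts) -> float:
    # LR+ = sensitivity / (1 - specificity)
    return sensitivity(c) / (1.0 - specificity(c))


def lr_negative(c: Counts) -> float:
    # LR- = (1 - sensitivity) / specificity
    return (1.0 - sensitivity(c)) / specificity(c)


# Hypothetical counts, for illustration only
example = Counts(tp=90, fp=60, fn=10, tn=40)
print(sensitivity(example), specificity(example))
print(lr_positive(example), lr_negative(example))
```

A test with high sensitivity but low specificity, as in this hypothetical table, yields a small LR− (useful for ruling out disease) but an LR+ close to 1 (of little use for ruling it in).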
True test performance may be masked by biases within the initial clinical utility validation, in addition to changes in underlying prevalence. In the absence of a prospective, double-blinded study design, prospective–retrospective studies with banked biospecimens, single-arm studies, prospective observational studies, or decision-analytic modeling techniques 9 may be conducted, increasing the associated risks of biases. As a result, the actual sensitivity, specificity, or area under the curve (AUC) of diagnostic tests may differ from reported values, resulting in unnecessary treatment and health care utilization for patients without an actual underlying disease and, for misdiagnosed patients with underlying disease, delayed treatment and reduced treatment efficacy.
Accounting for the consequences of a false-negative test result (i.e., a missed cancer), molecular tests (MTs) for thyroid cancer were developed with a high sensitivity to rule out malignancy by assessing the presence of biomarkers and genetic mutations in nodules with indeterminate fine needle aspiration (FNA) results. 10 Bethesda III and IV, in the six-tiered Bethesda System for Reporting Thyroid Cytology, are considered indeterminate FNA results. 11 The American Thyroid Association recommended the use of molecular tests for indeterminate thyroid nodules in clinical practice in 2015. 12 Gene expression classifiers (GEC) were designed to have high sensitivity and negative predictive value, so that a benign result rules out malignancy and, presumably, avoids unnecessary surgery. 13 Veracyte developed the first widely used molecular test, the Afirma GEC, in 2012. 13 The test was widely adopted after successful clinical trials. Other companies soon followed suit. 14
To determine which biases are present in the reports of diagnostic tests and to what degree, careful reviews of test protocols should be conducted. It is critical that the reported accuracy of these tests be interpreted within the context of potential biases. Previous reviews 15 –17 have yet to comprehensively and systematically identify and analyze these biases across all the diagnostic molecular tests for thyroid carcinoma.
In this study, we characterize and quantify limitations and biases within the studies used to validate molecular tests that evaluate cytologically indeterminate thyroid nodules.
Materials and Methods
This review follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. 18 This review was not preregistered and an associated protocol is not available.
Data sources and searches
A systematic literature search was conducted on the PubMed, Embase, and Web of Science databases by searching for articles reporting attempts to clinically validate commercially available molecular tests for thyroid nodules. The search strategy consisted of querying each database for combinations of the keywords “molecular testing,” “cytomolecular testing,” “pathology,” “genetic testing,” or “genetic expression”; AND “indeterminate,” “cytology,” “fine needle aspiration,” “thyroid nodule,” or “thyroid surgery”; AND “Bethesda category III,” “Bethesda category IV,” “benign thyroid nodule,” “papillary thyroid cancer,” “nonmedullary thyroid carcinoma,” or “thyroid cancer.” Full details of the search queries are presented in Supplementary Appendix Methods SAD. This allowed for a broad sample of articles that contained clinical validation results of molecular tests for indeterminate thyroid nodules.
Date limits were not used, and the search was finalized on July 29, 2021. Reference lists of included articles were examined to ensure the inclusion of all relevant articles.
Study selection
Original articles reporting clinical validity studies of commercially available molecular tests were selected for inclusion if they reported diagnostic results, that is, counts of true-negative (TN), true-positive (TP), false-negative (FN), and false-positive (FP) results. Clinical validity studies assess a diagnostic test's ability to distinguish between patients with and without a disease, 19 and the process of performing one such study is characterized elsewhere. 20 Commercially available molecular tests identified, which are either currently or have been previously used, include Afirma GEC, Afirma Genomic Sequencing Classifier (GSC), ThyroSeq versions 1–3, miRInform or ThyGenX Thyroid Oncogene Panel (miRInform), Quest Diagnostics Genetic Mutation Panel, Rosetta GX Reveal, and ThyraMIR. Articles written in a language other than English and conference abstracts were excluded.
Two researchers (C.D. and V.V.) independently assessed the titles and abstracts of a pilot sample of articles for inclusion, as well as the full-text articles. Where their inclusion assessments differed, a third researcher (C.C.L.) resolved the conflict. Articles with patient samples that overlap with another study validating the same test were excluded; in each case of overlapping patient samples, the study with the larger sample was retained.
Data extraction and quality assessment
For each included study, the following information was extracted independently by one researcher (C.D.): authors and year of publication; true-positive, false-positive, true-negative, and false-negative results; sensitivity and specificity; LR+ and LR−; positive and negative predictive value; patient and nodule sample size; nodule cytological result; institutional prevalence of malignancy, initial FNA results, and type of study (i.e., prospective or retrospective); commercial molecular test(s), country-level location of study, and number of institutions; noninvasive follicular thyroid neoplasm with papillary-like nuclear features (NIFTP) classification (only assessed in articles published after 2015); industry funding receipt; and molecular test results and final histopathology diagnoses.
Risk of bias of the included studies was assessed by one researcher (C.D.) using nine categorical criteria covering the following aspects: patient selection, inconsistent comparison bias, partial verification bias, diagnostic review bias, observer variability, reporting of indeterminate results, and institutional malignancy prevalence reported as a range (Supplementary Appendix Table SA1). Each study's design and execution were assessed, and each criterion was recorded as Met or Not Met, with the option of recording a criterion as partially met, for example, in the case of pooled results from multiple locations with different study protocols.
Articles were given a summary score based on the number of criteria they met out of nine (if the article specified that multiple histopathologists reviewed excised samples) or eight (if only one histopathologist reviewed excised samples). Partial verification bias was assessed by quantifying the proportion of molecular test results verified by a final histopathological diagnosis.
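The scoring rule described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the criterion labels (`criterion_1` and so on) and the key `observer_variability` are hypothetical, and the sketch counts only criteria rated "met", because the text does not specify how partially met criteria contribute to the summary score.

```python
from typing import Dict, Tuple

# Hypothetical key for the criterion that applies only when more than one
# histopathologist reviewed excised samples.
OBSERVER_KEY = "observer_variability"


def summary_score(ratings: Dict[str, str],
                  multiple_histopathologists: bool) -> Tuple[int, int]:
    """Return (criteria met, criteria applicable) for one study."""
    applicable = {
        name: rating
        for name, rating in ratings.items()
        if multiple_histopathologists or name != OBSERVER_KEY
    }
    # Assumption: only "met" counts toward the score
    met = sum(1 for rating in applicable.values() if rating == "met")
    return met, len(applicable)


def verified_proportion(n_with_histopathology: int,
                        n_molecular_results: int) -> float:
    """Share of molecular test results verified by final histopathology."""
    return n_with_histopathology / n_molecular_results


# Hypothetical ratings for a single study (nine criteria in total)
ratings = {f"criterion_{i}": "met" for i in range(1, 5)}
ratings.update({f"criterion_{i}": "not_met" for i in range(5, 9)})
ratings[OBSERVER_KEY] = "met"

print(summary_score(ratings, multiple_histopathologists=True))   # out of nine
print(summary_score(ratings, multiple_histopathologists=False))  # out of eight
print(verified_proportion(40, 100))
```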
Types of biases and limitations
Several biases can occur before, during, and after conducting diagnostic tests and, depending on the initial study design, can be difficult to avoid. The focus of this study is to characterize these biases. We distinguish between “biases,” “limitations,” and “variations”; bias arises from defects in study design and can result in incorrect conclusions; the direction of bias can sometimes be identified 21 and statistically corrected. 20,22 –26 Limitations arise from oversights in study design and cannot typically be corrected for, and variation is a by-product of changing underlying conditions among studies, for example, patient population and clinical protocol. These underlying conditions should be reported to contextualize findings. 27 Primary biases and limitations in study design are detailed in Table 1.
Table 1. Biases and Limitations in Study Design Assessed
Histopathologic diagnosis (generally, the “gold standard” test revealing ground truth).
Molecular diagnostic test (generally, the test being evaluated).
Data synthesis and analysis
Statistical analyses were performed using R version 3.6.2. 28 Publication bias was assessed separately for each set of studies evaluating the same molecular test using Deeks' funnel plot, which displays the inverse of the square root of sample size (number of nodules) as a function of the diagnostic odds ratio (the odds of a positive test in patients with disease relative to the odds of a positive test in patients without disease). 29 Cochran's Q test 30 was used to assess underlying population heterogeneity between studies evaluating the same molecular test. For each molecular test with more than three studies assessing its clinical validity, true-positive, false-positive, true-negative, and false-negative results were extracted (results were assessed per nodule, even if there were multiple nodules tested from a single patient). Studies without information needed to calculate both sensitivity and specificity were excluded from the analysis. Exact confidence intervals (CIs) were computed for sensitivity and specificity. 31,32
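For readers unfamiliar with these quantities, the sketch below computes a diagnostic odds ratio and an exact binomial CI. It assumes that "exact" refers to the Clopper-Pearson interval, which the text does not state explicitly, and it uses only the Python standard library rather than the R routines used in the analysis. The function names are illustrative.

```python
from math import comb
from typing import Callable, Tuple


def diagnostic_odds_ratio(tp: int, fp: int, fn: int, tn: int) -> float:
    # Odds of a positive test with disease (tp/fn) over
    # odds of a positive test without disease (fp/tn)
    return (tp * tn) / (fp * fn)


def binom_cdf(k: int, n: int, p: float) -> float:
    # P(X <= k) for X ~ Binomial(n, p)
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def _root_of_increasing(f: Callable[[float], float]) -> float:
    # Bisection on [0, 1] for a function that increases from negative to positive
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def exact_ci(x: int, n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Clopper-Pearson interval for x successes in n trials."""
    # Lower bound: P(X >= x | p) = alpha / 2
    lower = 0.0 if x == 0 else _root_of_increasing(
        lambda p: (1.0 - binom_cdf(x - 1, n, p)) - alpha / 2.0)
    # Upper bound: P(X <= x | p) = alpha / 2
    upper = 1.0 if x == n else _root_of_increasing(
        lambda p: alpha / 2.0 - binom_cdf(x, n, p))
    return lower, upper


# Hypothetical example: sensitivity estimated from 9 true positives among 10 cancers
print(exact_ci(9, 10))
print(diagnostic_odds_ratio(tp=90, fp=60, fn=10, tn=40))
```

With small numbers of confirmed malignancies per study, such intervals are wide, which is one reason pooled estimates are of interest.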
A bivariate linear mixed-effects model was constructed to jointly model the logit-transformed sensitivity, specificity, and positive and negative LRs across studies. 33 AUC was extracted, and a CI was constructed using a bootstrapping approach, 34 described previously. 35 The clinical validity results of molecular tests with fewer than three validation studies are reported as extracted and were not summarized.
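The per-study inputs to a bivariate model of this kind can be sketched as follows. The sketch shows only the logit transformation and the usual large-sample within-study variances; it does not fit the mixed-effects model itself. The 0.5 continuity correction for zero cells is an assumption made here for illustration and is not described in the text.

```python
from math import log
from typing import Tuple


def logit_with_variance(events: float, non_events: float) -> Tuple[float, float]:
    """Logit of events / (events + non_events) and its approximate variance."""
    if events == 0 or non_events == 0:
        # Assumed continuity correction to avoid division by zero
        events, non_events = events + 0.5, non_events + 0.5
    return log(events / non_events), 1.0 / events + 1.0 / non_events


def study_inputs(tp: int, fp: int, fn: int, tn: int):
    """Return ((logit sens, var), (logit spec, var)) for one study."""
    return logit_with_variance(tp, fn), logit_with_variance(tn, fp)


# Hypothetical counts, for illustration only
(sens_logit, sens_var), (spec_logit, spec_var) = study_inputs(90, 60, 10, 40)
print(sens_logit, sens_var)
print(spec_logit, spec_var)
```

A bivariate model then treats each study's pair of logits as a draw from a common bivariate normal distribution, which allows sensitivity and specificity to be correlated across studies.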
Results
Study selection and inclusion
The literature search yielded 2062 results. After excluding 532 duplicates, the abstracts of 1530 unique records were screened for eligibility, which resulted in 103 articles included for full-text review. The final screening resulted in 49 articles (Fig. 1). See Supplementary Appendix Table SA1 for a list of excluded full-text articles and reasons for their exclusion.

Figure 1. PRISMA flow diagram. PRISMA, Preferred Reporting Items for Systematic Reviews and Meta-Analyses.
Study characteristics are described in Table 2. Thirty-five studies 13,36 –69 evaluated the clinical validity of Afirma GEC; nine 38,44,47,56,63,64,69 –71 evaluated Afirma GSC. One study 72 evaluated both Afirma GEC and GSC, noting an institutional switch in clinical practice without separating the results, so it is included in the analyses of Afirma GEC and Afirma GSC. Nine studies 49,50,54,73 –78 evaluated ThyroSeq v1 or v2 (version 1 and version 2 results were combined in the studies that did evaluate both 49,77,79 ), and five studies 64,71,78,80,81 evaluated ThyroSeq v3 (a significant expansion of previous versions). One study 82 evaluated both ThyroSeq v2 and ThyroSeq v3, noting an institutional switch in clinical practice without separating the results, so it is included in the analyses of ThyroSeq v1 and v2 and ThyroSeq v3. Two 58,72 evaluated Rosetta GX Reveal, two 83,84 evaluated miRInform, one 72 evaluated ThyraMIR, and one 73 evaluated Quest Diagnostics' gene mutation panel (GMP).
Table 2. Study Characteristics
Percent of criteria (defined in Supplementary Appendix Table SA1) fulfilled; this does not include the observer variability criterion for studies with only one histopathologist (N = 15).
Full-quality assessment results, including whether criteria were met, not met, partially met, not reported, or not applicable can be found in Supplementary Appendix Table SA2.
Results separated by institution.
AGEC and Reveal applied to same nodules.
AGEC, Afirma Gene Expression Classifier; AGSC, Afirma Genomic Sequencing Classifier; FNA, fine needle aspiration; GMP, gene mutation panel; miRInform, miRInform or ThyGenX Thyroid Oncogene Panel; MT, molecular test; N, no; NIFTP, noninvasive follicular thyroid neoplasm with papillary-like nuclear features; Reveal, RosettaGx Reveal; TS, ThyroSeq (versions 1 and 2); TS v3, ThyroSeq version 3; Y, yes.
Of the 49 studies included, 14 38,44,47,49 –51,54,56,58,63,69,71,73,78 compared 2 diagnostic tests and 2 64,72 compared 3 diagnostic tests. Thirty-nine studies 36 –38,40 –45,47 –49,51,53 –59,61 –70,72 –74,77,78,81 –84 used a retrospective design, nine studies 13,39,46,50,52,60,71,75,80 used a prospective design, and one study 76 combined retrospective and prospective cohorts. Twenty-five studies 13,36,37,41,42,44 –47,50,52,56,61,63,64,66,68 –71,75,77,78,80,82 performed a single FNA before molecular testing and two 13,70 reported receiving industry sponsorship. The proportion of quality criteria met varies greatly among these studies (Table 2).
Quality assessment
Risk of bias was quantified using categorical criteria as outlined in Supplementary Appendix Table SA2. The risk-of-bias assessment results are displayed in Figure 2, and detailed results can be found in Supplementary Appendix Table SA3 and Supplementary Appendix Figure SA1. Of the 49 studies evaluated, all of the studies enrolled patients consecutively, and 90% of studies evaluated thyroid nodules with a consistent molecular test, or else reported the results of different molecular tests separately. Eighty-four percent enrolled patients regardless of whether they ultimately received surgery (rather than only including patients who ultimately did receive surgery). Seventy-one percent of studies reported their exclusion criteria. Thirty-eight percent of studies that reported multiple histopathologists or multiple institutions (n = 34) addressed observer variability in their histopathology review. Forty-one percent of studies reported nondiagnostic molecular test results, and 31% of studies reported the institutional prevalence of malignancy (i.e., pretest probability of cancer) as a range to reflect uncertainty.

Figure 2. Potential risk of bias addressed across 49 studies. *Only assessed for studies with multiple institutions or histopathologists reviewing final diagnoses; n = 34.
Fourteen percent of the studies conducted histopathologic reviews blind to the molecular test results, and 8% made the decision to go to surgery blind to the molecular test result (Fig. 2). Six percent of studies fulfilled more than 75% of the quality assessment criteria, and 49% of the studies fulfilled more than 50% (Table 2).
Study population characteristics and molecular test results are presented in Supplementary Appendix Tables SA4 and SA5. The institutional context varies greatly across studies; underlying prevalence of malignancy ranges from 4.7% to 67% (mean, 21%), and patients are sent to surgery between 7% and 96% of the time (excluding studies that only included patients who had surgery). Furthermore, the initial nodule cytology breakdown varies across institutions, as does the study reporting; while most studies only tested cytologically indeterminate (Bethesda III and IV) nodules, some studies did perform molecular testing on Bethesda I, II, V, and VI nodules (Supplementary Appendix Table SA4).
Statistical analysis
Meta-analysis results are presented in Figure 3. Cochran's Q statistic revealed significant heterogeneity between studies evaluating Afirma GEC (n = 38); p = 0.006 for sensitivity and p < 0.001 for specificity. A random-effects bivariate model revealed a summary sensitivity of 0.92 (CI: 0.90–0.94), a specificity of 0.26 (CI: 0.20–0.32) (Fig. 3A), an LR− of 0.32 (0.23–0.44), an LR+ of 1.24 (1.15–1.35), and an AUC of 0.83 (0.74–0.89). We were unable to detect heterogeneity between sensitivities of studies evaluating Afirma GSC (n = 10; p = 0.88), but there was significant heterogeneity between specificities (p < 0.001). The sensitivity of Afirma GSC was 0.94 (0.89–0.96), the specificity was 0.38 (0.27–0.50) (Fig. 3B), the LR− was 0.18 (0.10–0.30), the LR+ was 1.52 (1.28–1.87), and the AUC was 0.91 (0.62–0.92).
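As a reminder of what the heterogeneity statistic measures, the sketch below computes Cochran's Q from per-study effect estimates (for example, logit-transformed sensitivities) and their within-study variances, using inverse-variance weights. The inputs are hypothetical. The p-value, obtained by comparing Q with a chi-square distribution on k − 1 degrees of freedom, is omitted because it requires a statistics library.

```python
from typing import Sequence


def cochran_q(effects: Sequence[float], variances: Sequence[float]) -> float:
    """Cochran's Q: weighted squared deviations from the pooled fixed-effect mean."""
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))


# Hypothetical logit sensitivities and variances from three studies
q = cochran_q([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(q, "on", 3 - 1, "degrees of freedom")
```

A Q value that is large relative to its degrees of freedom suggests that between-study differences exceed what sampling error alone would produce.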

Figure 3. Sensitivity and specificity of (A) Afirma GEC, (B) Afirma GSC, (C) ThyroSeq v1 and v2, and (D) ThyroSeq v3.
We were unable to detect heterogeneity between sensitivities of studies evaluating ThyroSeq v1 and v2 (n = 10; p = 0.98), but significant heterogeneity was detected between specificities (p < 0.001). The overall sensitivity of ThyroSeq v1 and v2 was 0.86 (0.82–0.90), and the overall specificity was 0.74 (0.59–0.85) (Fig. 3C). The LR− was 0.19 (0.13–0.26), the LR+ was 3.52 (2.08–5.92), and the AUC was 0.86 (0.81–0.90). We were unable to detect heterogeneity between sensitivities of studies evaluating ThyroSeq v3 (n = 6; p = 0.54), but significant heterogeneity was detected between specificities (p < 0.001). The overall sensitivity of ThyroSeq v3 was 0.92 (0.86–0.95), and the overall specificity was 0.41 (0.18–0.69) (Fig. 3D). The LR− was 0.24 (0.09–0.62) and the LR+ was 1.67 (1.09–2.98), while the AUC was 0.90 (0.63–0.92).
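To illustrate how a negative likelihood ratio translates into a post-test probability of malignancy, the sketch below applies Bayes' rule on the odds scale. The pretest probabilities are illustrative inputs chosen from the range of institutional prevalences reported in this review, and the resulting post-test probabilities are arithmetic illustrations, not results reported by the included studies.

```python
def post_test_probability(pretest_probability: float,
                          likelihood_ratio: float) -> float:
    """Convert a pretest probability and a likelihood ratio to a post-test probability."""
    pretest_odds = pretest_probability / (1.0 - pretest_probability)
    post_odds = pretest_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Pooled LR- estimates quoted in the text
lr_negative = {
    "Afirma GEC": 0.32,
    "Afirma GSC": 0.18,
    "ThyroSeq v1/v2": 0.19,
    "ThyroSeq v3": 0.24,
}

# Illustrative pretest probabilities (low, mean, and high institutional prevalence)
for prevalence in (0.047, 0.21, 0.67):
    for test, lr in lr_negative.items():
        p = post_test_probability(prevalence, lr)
        print(f"prevalence {prevalence:.3f}, {test}: "
              f"risk after a negative result = {p:.3f}")
```

The same LR− leaves a much higher residual risk of malignancy in a high-prevalence setting than in a low-prevalence one, which is why institutional prevalence matters when interpreting a benign molecular result.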
Full results of all four models can be found in Supplementary Appendix Tables SA6–SA9. Results pooled by study design (i.e., prospective or retrospective) for Afirma GEC and ThyroSeq v1 and v2 can be found in Supplementary Appendix Tables SA10–SA14. Stratified results were not calculated for Afirma GSC because there was only one study with a prospective design, 71 and they were not calculated for ThyroSeq v3 because there were only two studies with a prospective design. 71,80
A significant relationship between the inverse of the square root of study sample sizes and the reported diagnostic odds ratios would provide evidence of publication bias. There is insufficient evidence of publication bias (Afirma GEC studies, p = 0.11; Afirma GSC studies, p = 0.19; ThyroSeq v1 and v2 studies, p = 0.78; ThyroSeq v3 studies, p = 0.096) (Supplementary Appendix Fig. SA2).
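The relationship being tested can be sketched as a weighted regression of the log diagnostic odds ratio on the inverse square root of sample size. This is an illustration only: Deeks' original formulation uses an effective sample size, whereas the sketch follows the text in using the number of nodules; the weighting by sample size and the 0.5 continuity correction are assumptions; and the significance test of the slope is omitted. The study counts are hypothetical.

```python
from math import log, sqrt
from typing import List, Sequence, Tuple


def log_dor(tp: float, fp: float, fn: float, tn: float) -> float:
    if 0 in (tp, fp, fn, tn):
        # Assumed continuity correction when any cell is empty
        tp, fp, fn, tn = (v + 0.5 for v in (tp, fp, fn, tn))
    return log((tp * tn) / (fp * fn))


def weighted_line(xs: Sequence[float], ys: Sequence[float],
                  ws: Sequence[float]) -> Tuple[float, float]:
    """Weighted least-squares fit y = intercept + slope * x."""
    sw = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def funnel_regression(studies: List[Tuple[int, int, int, int]]) -> Tuple[float, float]:
    """studies: (tp, fp, fn, tn) per study; returns (intercept, slope)."""
    sizes = [sum(s) for s in studies]
    xs = [1.0 / sqrt(n) for n in sizes]
    ys = [log_dor(*s) for s in studies]
    return weighted_line(xs, ys, sizes)


# Hypothetical studies, for illustration only
print(funnel_regression([(90, 60, 10, 40), (20, 10, 2, 18), (45, 30, 5, 70)]))
```

A slope far from zero would indicate that smaller studies report systematically different diagnostic odds ratios than larger ones.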
Discussion
This is a comprehensive systematic review of clinical validations of molecular tests for thyroid malignancy. We performed a series of meta-analyses and adapted tools to assess bias within the evaluated studies, both to contextualize the results of these meta-analyses and to inform future medical and research practices.
We assessed 49 studies of diagnostic accuracy for indeterminate thyroid nodules. Nearly all of the studies followed recommendations to consecutively enroll and separately evaluate the validity of separate molecular tests, and the majority of studies enrolled both patients who went to surgery and patients who did not. These enrollment practices help ensure that the initial patient sample is representative of patients who will undergo a molecular diagnostic test. Most studies reported patient exclusion criteria. A third of studies with multiple histopathologists addressed potential variability between different practitioners' interpretations of the final diagnosis. Less than half of studies reported nondiagnostic test results, which may introduce bias if the true diagnoses of these excluded nodules differ systematically from those of the included nodules. Omitting these details hinders evaluations of molecular tests' generalizability across settings.
We offer two major considerations that should be taken into account when interpreting accuracy. First, there was variation in the institutional context (i.e., institutional prevalences and practices such as how many FNAs are performed before ordering a molecular test or which cytological result categories are sent for molecular testing [i.e., indeterminate only, indeterminate + Bethesda V, or all cytological results]). There was also variation in study design, that is, prospective versus retrospective data collection; however, we did not find differences in pooled results when stratifying by study design. This is consistent with a prior systematic review of clinical validations of Afirma GEC, which noted that the initial validation study participants had significantly different nodule characteristics from the patients being tested in practice. 85
This variation enriches the body of literature, but must be accounted for when using these studies to inform clinical management. Given the wide range of institutional prevalences of malignancy and protocols in administering molecular tests (e.g., whether to send a sample for testing after one indeterminate FNA result or two), we found that the sensitivity, specificity, and negative LR measures also vary.
Second, we found that in most studies, the majority of benign molecular test results were never verified by the reference standard (histopathologic diagnosis following surgical resection of the index nodule). Along these same lines, very few studies reported making the decision to send a patient to surgery regardless of molecular test results, which is essential to obtain a representative sample of patients who may undergo a molecular test. This is expected, given the ethical concerns around unnecessary surgery, and indeed, several studies report that the implementation of molecular testing in their clinical practices decreased the incidence of thyroid nodule excision. In addition, the time needed to identify a missed cancer exceeds the study period for many of these studies, and thus, long-term follow-up was not consistently reported. However, the lack of true-negative result verification likely inflates sensitivity.
We found that validations of the four most commonly evaluated molecular tests for indeterminate thyroid nodules—Afirma GEC, Afirma GSC, ThyroSeq v1 and v2, and ThyroSeq v3—did not correct for partial verification bias, either statistically or surgically. Because thyroid malignancies tend to advance slowly, 86,87 it could be months to years before a false-negative result is clinically identified. Thus, the true false-negative rate is unknown. This bias was partially addressed by those studies that only select surgery patients; however, this sample has a higher underlying malignancy prevalence than randomly or consecutively selected samples attributable to other clinical factors.
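One commonly described statistical correction for partial verification is the Begg and Greenes adjustment, sketched below for a binary test without covariates. It is shown here only to illustrate the direction of the bias; it is not a method applied in this review or by the included studies, and it assumes that whether a nodule is verified by surgery depends only on the molecular test result. All counts are hypothetical.

```python
from typing import Tuple


def naive_accuracy(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float]:
    """Sensitivity and specificity using verified cases only."""
    return tp / (tp + fn), tn / (tn + fp)


def begg_greenes(n_test_pos: int, n_test_neg: int,
                 tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float]:
    """Corrected (sensitivity, specificity).

    n_test_pos, n_test_neg: all tested nodules by test result, verified or not.
    tp, fp: verified test-positive nodules by final histopathology.
    fn, tn: verified test-negative nodules by final histopathology.
    """
    p_disease_given_pos = tp / (tp + fp)
    p_disease_given_neg = fn / (fn + tn)
    # Expected numbers with and without disease among ALL tested nodules
    diseased_pos = n_test_pos * p_disease_given_pos
    diseased_neg = n_test_neg * p_disease_given_neg
    healthy_pos = n_test_pos - diseased_pos
    healthy_neg = n_test_neg - diseased_neg
    sens = diseased_pos / (diseased_pos + diseased_neg)
    spec = healthy_neg / (healthy_neg + healthy_pos)
    return sens, spec


# Hypothetical cohort: all 100 test-positive nodules resected (40 malignant),
# but only 10 of 100 test-negative nodules resected (1 malignant).
print(naive_accuracy(tp=40, fp=60, fn=1, tn=9))
print(begg_greenes(100, 100, tp=40, fp=60, fn=1, tn=9))
```

In this hypothetical cohort, restricting the analysis to resected nodules overstates sensitivity and understates specificity relative to the corrected estimates, which matches the direction of bias discussed above.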
We also found that the overwhelming majority of studies do not report conducting a blinded histopathologic review, that is, review of a resected specimen without knowledge of the molecular test result. This can expose the diagnostic process to confirmation biases in cases where a diagnostic decision is otherwise on the borderline. These results are consistent with prior systematic reviews of potential biases within studies evaluating Afirma GEC. 21,88 Since the majority of these studies are performed as a retrospective medical records review, it is imperative to use statistical methods to control for underlying biases; however, those methods often require the original patient-level data, which may be personally identifiable. So, these statistical adjustments will be most accurate if they are performed by the original study authors, who do not need to make assumptions about the data.
In addition, we observed a lack of consensus on whether to classify NIFTP findings as benign or malignant; most studies classified NIFTP as malignant, and several evaluated clinical validity in each case. Afirma GEC and ThyroSeq versions 1 and 2 entered the market before NIFTP's reclassification in 2015, so they do not offer any specific diagnostic or management guidance on these nodules. Newer molecular tests, such as Afirma GSC, ThyroSeq v3, or those geared toward microRNA (miRNA) identification, such as ThyraMIR and miRInform, do not advertise the ability to differentiate between NIFTP and benign or malignant nodules. Further study of long-term trajectories of patients with NIFTP will aid in deciding whether to group them more closely with benign or malignant nodules, and future innovations in molecular testing are likely to take this into account.
Our meta-analyses of the four molecular tests with more than three published clinical validations revealed high sensitivities and AUC measures for each test. Our results, in addition to prior systematic reviews of molecular tests for thyroid cancer, 15,88 –90 bolster the conclusions of the majority of studies reviewed, including the industry-sponsored studies, which are that molecular tests for indeterminate thyroid nodules have the potential to aid in clinical decision-making; however, solidifying this finding warrants further investigations. For now, given the high level of biases and limitations in the studies evaluated, these results must be interpreted with caution. Future clinical validations of molecular tests must avoid common pitfalls enumerated in this review to evaluate diagnostic molecular tests in a minimally biased manner.
This study has a few limitations. First, different studies have different underlying characteristics (location, patient demographics, observer variability, indeterminate nodule detection rates, study design, patient selection criteria, and cancer prevalence); we find high between-study variation (Supplementary Appendix Tables SA4–SA7). We are unable to assess differences in, for example, patient selection practices from one institution to another; however, we note that these sources of variation may confound the results and introduce additional bias. Notably, while two studies report having received industry sponsorship, 13,70 they report a blinded study design and their results do not appear to differ from the nonsponsored studies.
There are too few industry-sponsored studies to perform a secondary subanalysis. Second, there were not enough studies assessing Reveal, ThyraMIR, miRInform, or Quest GMP to conduct meta-analyses. However, these tests are not widely used or available; Afirma GSC and ThyroSeq v3 are the primary diagnostic molecular tests currently utilized for thyroid cancer, as their predecessors, Afirma GEC and ThyroSeq v1 and v2, were before them. Future reviews may benefit from more clinical validations of these tests having been published by the time they are conducted. Third, future reviews may also benefit from head-to-head comparisons of molecular tests, that is, comparisons of different tests' performances on the same cytological sample; only one study in our sample had this design. 58 Fourth, we only included studies that reported diagnostic results, that is, counts of true-negative, true-positive, false-negative, and false-positive results, which may have confounded the results.
Fifth, because our primary aim was to assess study quality and because the reporting of the information needed was limited, we did not perform meta-analyses of test performance stratified by cytological (FNA) results. Finally, we combined ThyroSeq versions 1 and 2 because version 1 results were combined with version 2 results in the studies that evaluated both, and for studies that noted an institutional switch in which test was used without separating the results, we included their results in the meta-analyses for both tests rather than excluding them from this study. We note, however, that the diagnostic accuracy results of the studies that combined two tests did not differ from the majority of the studies that examined single tests.
Diagnostic molecular testing is a fast-growing market, and oncological tests are projected to claim the largest share of growth over the next decade. 1 Current molecular tests are marketed as rule-out tests and can impact practice by reducing surgeries, but they can also be prohibitively expensive and return indeterminate results. Evaluating independent experiences with all currently and previously commercially available molecular tests for indeterminate thyroid nodules revealed limitations, sources of bias, and gaps in reporting. These biases and gaps in reporting can be addressed in future studies of current and future diagnostic tests.
This systematic review reveals significant sources of bias, which are common among clinical validation studies of commercially available molecular tests for indeterminate thyroid nodules. Meta-analyses of four commonly evaluated and used tests—Afirma GEC, Afirma GSC, ThyroSeq v1 and v2, and ThyroSeq v3—show high sensitivities and AUC measures, which seemingly underscore the suitability and utility of their current use guidelines as a part of thyroid nodule management practices. However, these results must be interpreted in light of high levels of diagnostic review bias and verification bias, in addition to study design limitations. Although molecular tests offer improvements in accuracy over conventional FNA alone, ideally, future validation studies will have prospective designs that err on the side of overinclusion to ensure accurate perceptions of the value added by molecular testing. Moreover, the role of patient decision-making and decision-making aids in the use of these tools should be considered and evaluated.
Footnotes
Authors' Contributions
C.D.: Conceptualization, methodology, software, validation, formal analysis, investigation, data curation, writing—original draft, writing—review and editing, and visualization; V.V.: conceptualization, methodology, data curation, validation, and writing—review and editing; M.S.J.: writing—review and editing and methodology; A.T.: writing—review and editing; T.W.: writing—review and editing; S.G.: writing—review and editing; N.M.: writing—review and editing, methodology, and software; C.C.L.: conceptualization, methodology, supervision, project administration, writing—review and editing, and funding acquisition.
Acknowledgments
We thank Lisa Philpott and Melissa Lydston at the Massachusetts General Hospital Treadwell Library for constructing the initial search query and performing the academic database search and for providing key feedback on our initial search strategy.
Author Disclosure Statement
The authors have nothing to disclose.
Funding Information
This work was supported by NIH/NCI R37 CA231957 (C.C.L.). The funder had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the article; or decision to submit the article for publication.
Supplementary Material
Supplementary Appendix Methods
Supplementary Appendix Figure SA1
Supplementary Appendix Figure SA2
Supplementary Appendix Table SA1
Supplementary Appendix Table SA2
Supplementary Appendix Table SA3
Supplementary Appendix Table SA4
Supplementary Appendix Table SA5
Supplementary Appendix Table SA6
Supplementary Appendix Table SA7
Supplementary Appendix Table SA8
Supplementary Appendix Table SA9
Supplementary Appendix Table SA10
Supplementary Appendix Table SA11
Supplementary Appendix Table SA12
Supplementary Appendix Table SA13
Supplementary Appendix Table SA14
