Laboratory-Developed Test—SynFRAME: An Approach for Assessing Laboratory-Developed Tests Synthesized from Prior Appraisal Frameworks

Abstract

Since the late 1990s, there has been an unprecedented growth in the development of new molecular and proteomic assays for clinical decision making. Despite the thousands of tests available, a standardized, well-defined, and coherent evaluation framework for these molecular assays is still lacking. We aim to summarize the publicly available appraisal criteria and to develop a succinct and accessible set of criteria that can provide a roadmap for the appraisal of gene-based laboratory developed tests (LDTs). We conducted a systematic literature review of the available molecular diagnostic frameworks in PubMed and CINAHL and identified 91 articles on existing appraisal criteria. We provided a summary of the historical appraisal systems and developed a synthesis of these systems, LDT-SynFRAME, which details the major criteria for evaluating molecular diagnostics in the clinical setting. Our goal with the LDT-SynFRAME system is to promote a well-informed dialog among all the stakeholders responsible for the development, approval, reimbursement, and use of new molecular classifiers.

Introduction

The endeavor to uncover the human genome has resulted in an unprecedented level of investment in the research and development of new molecular and proteomic assays for clinical application (Paik et al., 2004). Experts at the U.S. Centers for Disease Control and Prevention estimate that more than 1200 molecular assays or gene-based laboratory developed tests (LDTs) are currently available and more than 1000 new assays may be in development (Genetic Testing, 2007).

With the promise of each new assay entering clinical practice comes the question of what processes should be followed to appraise its effectiveness, safety, utility, and affordability. The audience for a standardized set of appraisal processes is large and has diverse backgrounds and interests; it includes regulatory officials, health technology appraisal groups, investors, developers of molecular assays, clinical investigators, payers, physicians, and the general public.

To meet this need, several health technology assessment (HTA) groups have adapted their existing appraisal processes, many of which were developed initially for evaluating pharmaceuticals, for the assessment of molecular assays and LDTs (U.S. Preventive Services Task Force Procedure Manual 2008; Wald and Cuckle, 1989; Hayes et al., 1996; Haddow and Palomaki, 2003; McShane et al., 2005; Simon, 2006; Gene Expression Profiling for Managing Breast Cancer Treatment, 2007; Genetic Testing, 2007; Teutsch et al., 2009). Consensus on and standardization of the appropriate terminology, its definitions, and the appropriate evaluation criteria, however, are lacking. This discordance is best highlighted by the inconsistent use of the terms clinical utility and clinical validity by clinicians, researchers, policy makers, and appraisal groups. In addition, confusion remains about the necessary and sufficient steps required to validate a novel assay, as reflected by the ongoing debate on the appropriate market authorization and regulatory oversight required for LDTs (Ntzani and Ioannidis, 2003; Michiels et al., 2005). The primary aims herein are (1) to summarize publicly available appraisal criteria and (2) to develop a succinct and accessible set of criteria that can provide a roadmap for the appraisal of gene-based LDTs.

Methods

We searched PubMed (U.S. National Library of Medicine, Bethesda, MD) and CINAHL (Cumulative Index to Nursing and Allied Health Literature, EBSCO Publishing, Ipswich, MA) from 2000 to the present to identify literature that explains processes used to appraise a molecular assay. We also conducted a supplemental ISI Web of Science® (Thomson Reuters, New York, NY) search of key articles and reviewed guidelines published by government agencies, professional societies, and HTA groups. Search terms included <molecular> <molecular assay*> <molecular test*> <validation> <clinical validation> <analytical validation> <clinical utility> <clinical impact> <cost-eff*> <cost-benefit> <economics>. Inclusion criteria were (1) description of framework(s) or processes for assessing molecular assays, (2) explicit appraisals of molecular assays that included a description of the appraisal process, (3) other guidelines that have been published on appraisals, and (4) English language. We excluded case reports, editorials, and reviews (Fig. 1).

FIG. 1.

Literature search flow diagram.

Two authors (J.H. and J.D.) reviewed and abstracted all articles to identify the criteria deemed important when appraising molecular assays. We organized these criteria into several recognized categories, thereby creating a structured technology appraisal framework for molecular classifiers and gene-based LDTs.

Results

The literature search retrieved 1136 articles, of which 24 were duplicates and 1021 were excluded, leaving 91 articles for review (Fig. 1). Articles that did not meet the inclusion criteria were excluded from the final summary. In addition, evaluation methods from prominent HTA organizations were retrieved and incorporated into our validation methodology. Brief accounts of the history of the appraisal process, as well as its current state, were drawn from the literature.

FIG. 2.

Categories of tests.

Classification of tests

A wide variety of molecular tests are available. Based on canonical medical texts, we created a basic classification system for LDTs, shown in Figure 2. Two categories of tests were identified: first, tests for signs, symptoms, or no known disease; second, tests for known disease. Under the first category, there are tests for susceptibility or risk-factor assessment to initiate intervention and prevent occurrence of disease, such as the BRCA1 gene test for risk of breast cancer (Weitzel et al., 2007). There are also tests that screen for the presence of occult disease, to initiate intervention to cure or avoid progression to more severe health states, such as the Papanicolaou test for diagnosis of precancerous or cancerous cervical lesions (Weitzel et al., 2007).

Under the category of tests that pertain to apparent disease, there are tests that diagnose or determine the cause of apparent symptoms to decide on interventions, for instance, a chest X-ray for a patient with chronic cough (Hricak et al., 2005). Another type of test develops or refines a differential diagnosis, to reduce the list of possible causes of prior clinical or test findings, such as an electrocardiogram for abnormal pulse (Pearson et al., 2002). Tests for the management of known disease fall into five categories. Staging tests evaluate the extent or severity of the disease, and prognostic tests predict natural history and assess the urgency of the problem and the appropriateness of intervention. Tests that predict response to treatment help to inform selection of an intervention, such as the 21-gene recurrence score for early-stage breast cancer (Paik et al., 2006). Another type of test monitors the course of the disease to assess the disease status and the need for intervention. Lastly, tests that assess response to intervention are meant to assess the effectiveness of treatment. Table 1 provides examples of these categories of tests. We aim to describe a unified validation framework for the wide diversity of the aforementioned tests.

Table 1.

Examples of Types of Tests

Category of tests	Example
No known disease
Susceptibility	BRCA1 gene testing for risk of breast cancer
Presence of occult disease	Fecal occult blood test for colon cancer or Pap test for cervical cancer
Known or probable disease
Diagnostic	Chest X-ray, CT, or MRI
Differential diagnostic	EKG for abnormal pulse
	Tumor marker test: β-HCG for diagnosis of placental cancer in women and testicular cancer in men
Staging	Tumor biopsy; GCC colorectal cancer staging test
Prognostic	12-gene colon cancer Recurrence Score assay
Predictive	21-gene breast cancer Recurrence Score; K-ras
Surveillance	Tumor marker test: CEA for colon cancer recurrence
Assess response to treatment	Tumor marker test: calcitonin to assess treatment response of medullary thyroid cancer

β-HCG, beta-human chorionic gonadotropin; CEA, carcinoembryonic antigen; CT, computed tomography; EKG, electrocardiogram; GCC, guanylyl cyclase C; MRI, magnetic resonance imaging; Pap, Papanicolaou.

Appraisal process: a brief history

Table 2 describes the major efforts to create appraisal systems over the last 20 years. These efforts include the Tumor Marker Utility Grading System (TMUGS); the U.S. Preventive Services Task Force; the Blue Cross Blue Shield Technology Evaluation Center (BCBS TEC); the Centers for Disease Control and Prevention Analytical Validity, Clinical Validity, Clinical Utility, Ethics, Legal, Social Implications system (CDC ACCE); the REporting Recommendations for Tumor MARKer Prognostic Studies (REMARK); and others (Wald and Cuckle, 1989; Frame and Carlson, 1975; Hayes et al., 1996; Harris et al., 2001; Haddow and Palomaki, 2003; McShane et al., 2005; Ramsey et al., 2006; Teutsch et al., 2009). These appraisal systems introduced themes in the validation process that are common to the most widely used appraisal systems today. In general, three major categories of appraisal are cited: analytical validity, clinical validity, and clinical utility. Financial, ethical, societal, and legal implications represent a fourth category that is not consistently included by all assessment agencies, with the evaluation criteria by BCBS TEC as one example. These four categories were originally featured in the CDC ACCE framework, a commonly cited evaluation method in genetic testing (Genetic Testing, 2007; Teutsch et al., 2009). Analytical validity is defined as the ability of the test platform, such as microarray or polymerase chain reaction, to accurately and reliably measure the analyte of interest. Clinical validity is most often defined as the test's ability to detect or predict the associated disorder. Clinical utility, unlike analytical validity, has a variety of definitions.
In a 2008 report, the Johns Hopkins University Evidence-based Practice Center, part of the Agency for Healthcare Research and Quality (AHRQ), stated that “the clinical utility of a test tells us whether the test helps discriminate between those who will have more or less benefit from a therapeutic intervention” (Marchionni et al., 2008). More recently, AHRQ described clinical utility more broadly as “the usefulness of test and its value to clinical practice” (Sun et al., 2010). The Academy of Managed Care Pharmacy defines clinical utility as the safety and efficacy profile of a test (The AMCP format, 2010). In addition to the four ACCE criteria, other authoritative sources stress the importance of other themes, such as appropriate introduction of the disease and test, scientific rigor and validity, clarity in presentation of findings, a rigorous chain of evidence, generalizability, and clear and unbiased presentation of the evidence (U.S. Preventive Services Task Force Procedure Manual 2008; Hayes et al., 1996; McShane et al., 2005; Simon, 2006; Teutsch et al., 2009). Figure 3 displays a timeline of the seminal appraisal systems over the past three decades. As a synthesis of these appraisal systems, LDT-SynFRAME includes the components relevant to the validation of health technology: a clear introductory context; well-substantiated analytical validity, clinical validity, and clinical utility; economic and ethical outcomes; and transparent presentation. HTA groups, such as the U.K. National Institute for Health and Clinical Excellence (NICE) in its scoping process, have articulated this important factor. We emphasize the importance of the evidence being presented in a clear, unbiased, and understandable manner.
Finally, this is the first published version of the framework; we anticipate that appraisal processes, and the framework with them, will continue to evolve over time with expert input from a variety of fields.

FIG. 3.

Timeline of key appraisal proposals.

Table 2.

The Appraisal Process: A Brief History

Title	Author	Year	Method of evaluation	Items
Reporting the assessment of screening and diagnostic tests	Wald and Cuckle	1989	9 criteria	29 items: The test, the disorder, prevalence of the disorder, therapeutic intervention, test results, test performance, cost-and-benefit analysis, evaluation of the test, practical problem
The efficacy of diagnostic imaging	Fryback and Thornbury	1991	6 levels	24 items: Technical efficiency, diagnostic accuracy efficacy, diagnostic thinking efficacy, therapeutic efficacy, patient outcome efficacy, societal efficacy
Tumor Marker Utility Grading System: A framework to evaluate clinical utility of tumor markers (TMUGS)	Hayes et al.	1996	6-level utility scale for favorable clinical outcomes; 6-level level-of-evidence scale	6 items: The test, the disease, clinical uses, marker correlation with biologic processes, marker correlation with biologic end points, marker use leading to decision that results in more favorable clinical outcomes
Current methods of the U.S. Preventive Services Task Force: A review of the process	Harris et al.	2001, updated 2008	4-5 levels	4 items: Hierarchy of research design, grading the internal validity of individual studies, evaluating the quality of evidence at three strata, grading of recommendation
The Evaluation of Genomic Applications in Practice and Prevention (EGAPP) initiative: methods of the EGAPP Working Group	Teutsch et al.	2000 to 2004; updated 2009	5 criteria	44 items, including analytic validity: Analytic sensitivity (or the analytic detection rate), analytic specificity, laboratory quality control, and assay robustness; clinical validity: clinical sensitivity (or the clinical detection rate), clinical specificity, prevalence of the specific disorder, positive and negative predictive values; clinical utility; ethical, legal, and social implications
Reporting Recommendations for Tumor Marker Prognostic Studies (REMARK)	Statistics Subcommittee of the NCI-EORTC Working Group on Cancer Diagnostics	2005-2006		20 items
A checklist for evaluating reports of expression profiling for treatment selection	Simon R	2006		16 questions on study validity
Toward evidence-based assessment for coverage and reimbursement of laboratory-based diagnostic and genetic tests	Ramsey et al.	2006		6 items: Technical efficiency, diagnostic accuracy, impact on diagnostic accuracy, impact on therapeutic choice, impact on patient choice, impact on society

Contextualizing the problem: clear introduction

LDT-SynFRAME combines the most pertinent evaluation criteria from the current leading assessment systems to create one framework with wide applicability to novel tests. The detailed framework is presented in Tables 3-6. The evaluation should begin with an Introduction, which adequately describes the purpose of the test; the natural history, prevalence, and current management of the disorders of interest; and the expected clinical, economic, and social outcomes. The introduction should provide a clear picture of the unmet clinical needs, as well as set a structured context for the application of a novel test. A clear, succinct, and easily read introduction informs an outside reader of the clinical, social, and/or economic parameters of the test. Following the introduction are the four major evaluation categories, which are described below.

Table 3.

Laboratory-Developed Test-SynFRAME

Introduction

• The test, the disorder, prevalence/incidence, current management, guidelines, expected clinical, economic, and social outcomes

Analytic validity

• Sensitivity/accuracy, specificity, detection and quantification limits of reactions, efficiency, linearity/reportability range, precision/variability, repeatability, reproducibility, quality control, success rate, traceability, stability, expected values, normalization

Clinical validity

• Sample population: Representativeness, homogeneity of patient characteristics, enrolled in therapeutically relevant clinical trial, sufficiently large

• Clinical meaningfulness: Relevant endpoints assessed, for example, progression and survival, accurately measured endpoints, clear cutoffs for classification, clear treatment implications

• Statistical significance: Predictive accuracy statistically significantly better than chance, adjusted appropriately for confounding, absence of statistical flaws, classifier developed from a separate training set and applied to a different validation set, positive and negative predictive values, pre-specified protocol

Clinical utility

• Reduces uncertainty: Diagnostic, prognostic, or effects of therapy

• Influences decision making

• Associated with improved outcomes (survival, morbidity, quality of life, patient satisfaction)

• Provides benefits beyond established measures

• Generalizable

Economic and social implications

• Financial: to third-party payers, patients, physicians and other providers, employers

• Tradeoffs: for example, cost versus benefits

• Differential effects on groups: for example, disparities

Presentation

• Complete, uniform, unbiased, understandable

Table 4.

Analytical Validity

Test characteristics	Genotype	Gene expression	Protein expression	Varies by type of test	Still evolving toward convergence
Accuracy	+	+	+		y
Sensitivity	NA	+	+	x	y
Specificity	+	+	+
Efficiency	+	+	+
Linearity (dynamic range)
Limit of detection	NA	+	+	x
Limit of quantitation	NA	+	+	x
Precision
Repeatability	+	+	+
Reproducibility	+	+	+
Quality control	+	+	+
Traceability	+	NA	NA	x	y
Assay stability	+	+	+
Sample stability	+	+	+
Detection limit	+	+	+
Expected values	+	NA	NA	x
Normalization	NA	+		x	y
Success rate	+	+	+
Assay cut-off	NA	+	+	x

+, test characteristics relevant to assessing this type of test; NA, test characteristic not applicable to this type of test; x, indicates that definition varies by type of test; y, various academies, colleges, and institutes (see below) still evolving toward convergent standards; blank cells, neither varies by type of test nor is there a relevant difference across tests on standards for this test characteristic.

Compiled from standards published by the National Academy of Clinical Biochemistry (NACB), the College of American Pathologists (CAP), the Association for Molecular Pathology (AMP), the Clinical and Laboratory Standards Institute (CLSI), and the National Institute of Standards and Technology (NIST).

Table 5.

Research Design and Statistical Issues of Clinical Validation Studies

Design	1. Data from well-conducted controlled trial
	2. Prospectively stated hypothesis, analysis techniques, and patient population
	3. Predefined and standardized assay and scoring system
	4. Sample size and power justification
Sample population	5. Representative
	6. Homogeneity of patient characteristics
	7. Enrolled in therapeutically relevant study
	8. Sufficiently large to avoid missing a relevant effect if it truly exists
Clinical meaningfulness	9. Relevant endpoints assessed, for example, progression and survival
	10. Accurately measured endpoints
	11. Clear cutoffs for classification
	12. Clear treatment implications
Statistical significance	13. Predictive accuracy statistically significantly better than chance
	14. Adjusted appropriately for confounding
	15. Absence of statistical flaws
	16. Masking/blinding
	17. Classifier developed from a separate training set and applied to a different validation set
	18. Positive and negative predictive values
	19. Pre-specified protocol

Table 6.

Economic Implications of Diagnostic Assay

Structure	1. Statement of decision problem/objective
	2. Justification of modeling approach
	3. Statement of scope/perspective
	4. Thorough description of all assumptions & strategies/comparators
	5. Use of appropriate model type
	6. Definition of relevant health states
	7. The appropriateness of the cycle length, if analyzed with a Markov model
Data	8. All relevant data sources should be identified and appropriately used
	9. Follow well-established guidelines on literature retrieval and synthesis
	10. Grade the evidence
	11. If primary data are used and analyzed, the analysis should be consistent with well-established statistical methods
	12. Discount both benefits and costs
	13. Examine appropriate patient subgroups
	14. Include half-cycle correction
	15. Extrapolation of data beyond the duration of the available data (e.g., in a clinical trial) may be appropriate depending on whether the interventions under consideration have implications beyond the trial duration
Uncertainty	16. The instability, or uncertainty of the model and its findings under conditions different than the base reference case should be assessed
	17. Examine variations in model structure and input parameters
	18. Should highlight the parameters that could most influence the findings of the analyses
	19. Indicate areas of future research
Consistency	20. Internal consistency
	21. Mathematical programs used for the analyses should be devoid of errors
	22. Changes in model parameters should provide results that are consistent with theory (e.g., increasing the unit cost of a drug under investigation should under most circumstances increase the cost-effectiveness ratio)
	23. Face validity
	24. Amenable to intuitive explanation
	25. Calibration (external consistency or validation)
	26. To the extent that data are available that were not also used to develop the model (e.g., a separate validation dataset that became available after the model was developed)
	27. The analyses should be assessed for their ability to predict the results of the new dataset, called predictive validity
	28. Peer-review by clinicians, analysts, and end-users (e.g., payers and patients)

Analytical validity

Mansfield et al. define analytical validation as “a process by which the measurement performance of a test system is assessed” (Mansfield et al., 2005). Traditionally, components of analytical validity have included accuracy, sensitivity, specificity, efficiency, linearity, precision, quality control, traceability, assay stability, sample stability, detection limit, expected values, normalization, success rate, and clear assay cut-off values. The list of criteria was compiled from the standards of the National Academy of Clinical Biochemistry (NACB), the College of American Pathologists (CAP), the Association for Molecular Pathology (AMP), the Clinical and Laboratory Standards Institute (CLSI), and the National Institute of Standards and Technology (NIST) (Tholen et al., 2003; Tholen et al., 2004; Mansfield et al., 2005; Krouwer et al., 2006; Wolff et al., 2007; College of American Pathologists, 2007). Table 4 shows how these criteria are applied in genotyping, gene expression, and protein expression tests and illustrates that certain elements have yet to reach convergence among different professional societies. Reaching convergence among this diverse range of professional groups on appropriate criteria and methods for measuring diagnostic test performance will require overcoming a number of barriers. The high-complexity test types enabled by new genomic and proteomic technologies do not fit seamlessly into legacy templates of analytical validation. LDT-SynFRAME is the first framework that pinpoints important distinctions when validating single-gene assays, multi-gene expression assays, and protein expression assays. Traditional measures of diagnostic test performance will have to evolve to adapt to these new, high-complexity test types. Furthermore, unique aspects of genetic, genomic, and proteomic tests may mean that a single checklist of performance criteria accommodating all test types is not possible.
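Several of the characteristics listed above reduce to simple summary statistics. The sketch below is illustrative only and is not taken from the cited standards. It computes a coefficient of variation as a repeatability summary and, under one commonly used parametric convention (limit of blank = mean of blank replicates + 1.645 SD; limit of detection = limit of blank + 1.645 SD of a low-level sample), a detection limit. The function names and replicate values are hypothetical, and, as Table 4 indicates, the applicable definition may vary by type of test.

```python
import statistics

def coefficient_of_variation(replicates):
    """Percent CV of replicate measurements (a repeatability summary)."""
    mean = statistics.mean(replicates)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(replicates) / mean * 100.0

def limit_of_blank(blank_replicates):
    """Assumed convention: mean of blanks + 1.645 * SD of blanks."""
    return statistics.mean(blank_replicates) + 1.645 * statistics.stdev(blank_replicates)

def limit_of_detection(blank_replicates, low_level_replicates):
    """Assumed convention: limit of blank + 1.645 * SD of a low-level sample."""
    return limit_of_blank(blank_replicates) + 1.645 * statistics.stdev(low_level_replicates)

# Hypothetical replicate signals (arbitrary units), for illustration only.
same_day_runs = [9.0, 10.0, 11.0]
blanks = [0.0, 1.0, 2.0]
low_sample = [4.0, 5.0, 6.0]

print(f"Repeatability CV: {coefficient_of_variation(same_day_runs):.1f}%")
print(f"Limit of blank: {limit_of_blank(blanks):.3f}")
print(f"Limit of detection: {limit_of_detection(blanks, low_sample):.3f}")
```

A laboratory would substitute the definitions required by the standard that governs its assay; the point of the sketch is only that each characteristic should be reported with an explicit, reproducible calculation.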

Clinical validity

Clinical validity refers to the extent to which a test accurately predicts the risk of an outcome (i.e., calibration), as well as its ability to separate patients with different outcomes into separate risk classes (i.e., discrimination) (Marchionni et al., 2008). In the case of clinical decision making for early stage breast cancer, clinical validity refers to the ability of the test to predict (1) recurrence risk or (2) treatment response.
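The distinction between calibration and discrimination can be made concrete with a small numerical sketch. The code below is an illustration, not a method taken from the cited reports: discrimination is summarized with the concordance (c) statistic, and calibration is summarized by comparing the mean predicted risk with the observed event rate within each risk class. The function names and the toy data are hypothetical.

```python
from collections import defaultdict

def c_statistic(risks, events):
    """Probability that a patient with the event has a higher predicted risk
    than a patient without the event (ties count as one half)."""
    cases = [r for r, e in zip(risks, events) if e == 1]
    controls = [r for r, e in zip(risks, events) if e == 0]
    if not cases or not controls:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for rc in cases:
        for rn in controls:
            if rc > rn:
                score += 1.0
            elif rc == rn:
                score += 0.5
    return score / (len(cases) * len(controls))

def calibration_by_group(risks, events, groups):
    """Map each risk class to (mean predicted risk, observed event rate)."""
    acc = defaultdict(lambda: [0.0, 0, 0])  # sum of risks, events, patients
    for r, e, g in zip(risks, events, groups):
        acc[g][0] += r
        acc[g][1] += e
        acc[g][2] += 1
    return {g: (s / n, k / n) for g, (s, k, n) in acc.items()}

# Hypothetical predicted risks, observed outcomes (1 = event), and risk classes.
risks = [0.05, 0.10, 0.10, 0.30, 0.40, 0.70, 0.80, 0.90]
events = [0, 0, 0, 0, 1, 0, 1, 1]
groups = ["low", "low", "low", "intermediate", "intermediate",
          "high", "high", "high"]

print(f"c-statistic: {c_statistic(risks, events):.3f}")
for group, (predicted, observed) in calibration_by_group(risks, events, groups).items():
    print(f"{group}: predicted {predicted:.2f}, observed {observed:.2f}")
```

A classifier can discriminate well (high c-statistic) and still be poorly calibrated if its predicted risks are systematically too high or too low, which is why the two properties are assessed separately.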

Several research groups have commented on processes they use to evaluate the clinical validity of molecular classifiers (Gene Expression Profiling for Managing Breast Cancer Treatment, 2005; Simon, 2006; Marchionni et al., 2008). Although these groups address specific aspects of clinical validity, we found that they differ in their specific definitions and in how criteria are organized. We grouped the criteria for clinical validity into four categories: design, sample population, clinical meaningfulness, and statistical significance.

For the sample population, experts recommend that clinical validation studies enroll patients who are representative of the target population. Sufficient estimation of treatment benefit is accomplished in the context of randomized controlled trials. This may be done prospectively or retrospectively, provided that all measurements are done concurrently and independently of the outcome. If results of retrospective analyses of randomized clinical trials are consistent in two or more independent studies, then the evidential value of the retrospective study is comparable to that of a concurrent prospective study (Simon et al., 2009).

Simon recommends that the patients' characteristics should be sufficiently homogeneous to assure that they face similar therapeutic options (Simon, 2006). For example, the treatment options for women with node-positive breast cancer are sufficiently different from treatment for women with node-negative disease. A study that includes both stages of cancer would be insufficiently homogeneous to draw meaningful conclusions about the validity (and utility) of the test in real-world settings. Patients should be enrolled in therapeutically meaningful clinical studies to assess if the classifier is valid as a predictor of not only prognosis but also treatment response.

Last, the study must be large enough to draw meaningful statistical inferences and assess generalizability to other populations. A suggested rule of thumb is that the study provides at least 20 patients per class (e.g., 20 responders and 20 nonresponders) (Simon, 2006).

Clinical meaningfulness criteria pertain to validation studies properly examining endpoints that influence decisions in the clinic. Studies should examine endpoints that are clinically relevant; in the case of adjuvant chemotherapy for breast cancer, the endpoints should include those typically assessed in clinical trials of pharmaceuticals, such as progression-free survival and overall survival. Data on the endpoints should be obtained rigorously (e.g., limited missing data) and accurately (e.g., limited measurement error). The Johns Hopkins University Evidence-based Practice Center highlighted the importance of cutoffs being stated and pre-specified for the classifier (e.g., low, intermediate, or high) to help determine decisions based on the test result in the 2008 report on the values of the gene expression assay for breast cancer (Marchionni et al., 2008). Last, the validation study should address whether knowledge of the results has clinical implications.

Clinical utility

Clinical utility refers to the ability of a novel tool to add value and clinical benefit beyond traditional or previously established practices. The BCBS TEC defines this criterion as “a technology that should improve net health outcomes as much as, or more than, established alternatives” (Technology Evaluation Center Criteria, 2008). A tool with meaningful clinical utility should lead to a more favorable outcome than the leading standard of care or a suitable comparator. Furthermore, a clinically useful assay should have substantial and measurable impact on clinical decisions. Sparano and Solin in 2010 identified four measures that quantify and qualify the impact of these tests (Sparano and Solin, 2010):

1. Treatment-sparing: test results indicate that treatment will not be beneficial and patient may be spared.

2. Treatment selection: test results indicate that treatment is likely to be beneficial when clinical features indicate no treatment is necessary.

3. Treatment direction: test results provide definitive treatment direction when clinical features are uncertain.

4. Treatment confirmation: test confirms original treatment recommendation.
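One way to tabulate these four measures in a decision-impact study is sketched below. The mapping from pre-test and post-test recommendations to the four categories is our reading of the definitions above, not code from Sparano and Solin; the function name, the labels "treat", "no_treat", and "uncertain", and the example cohort are all hypothetical.

```python
from collections import Counter

def classify_impact(clinical_rec, test_rec):
    """Classify the effect of a test result on a treatment recommendation.

    clinical_rec: recommendation from clinical features alone
                  ("treat", "no_treat", or "uncertain")
    test_rec:     recommendation after the test result ("treat" or "no_treat")
    """
    if clinical_rec not in ("treat", "no_treat", "uncertain"):
        raise ValueError(f"unknown clinical recommendation: {clinical_rec}")
    if test_rec not in ("treat", "no_treat"):
        raise ValueError(f"unknown test recommendation: {test_rec}")
    if clinical_rec == "uncertain":
        return "treatment direction"
    if clinical_rec == test_rec:
        return "treatment confirmation"
    if clinical_rec == "treat":
        return "treatment-sparing"
    return "treatment selection"

# Hypothetical cohort of (pre-test, post-test) recommendations.
cohort = [("treat", "no_treat"), ("treat", "treat"), ("no_treat", "treat"),
          ("uncertain", "treat"), ("uncertain", "no_treat"), ("no_treat", "no_treat")]
tally = Counter(classify_impact(pre, post) for pre, post in cohort)
for category, count in tally.items():
    print(f"{category}: {count}")
```

Reporting the tally alongside the total number of patients shows how often the test changed, directed, or merely confirmed a decision.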

For example, the value of a predictive tool is to reduce uncertainty, which then alters and/or improves clinical decisions and, finally, results in improved clinical outcomes and/or lower costs. The types of uncertainty targeted by an assay may include susceptibility to disease, probability of a diagnosis or set of diagnoses, severity of disease, risk of disease recurrence, risk of toxicity to therapy, and probability of response to treatment(s).

Besides evidence on the assay's ability to influence clinical decisions and outcomes, HTA groups seek evidence on the generalizability of the assay outside of the research settings. The benefits of the assay must reach the general population at large, in that decision makers, such as physicians and patients, will respond practically to the assay's results, and that use of the assay is associated with favorable outcomes and/or lower costs. Some of these may be hard to establish at the time an assay becomes available. For example, how generalizable an assay is and whether it influences decision making may be relatively uncertain until it has been used outside a research protocol.

Financial, social, legal, and economic implications

HTA groups also have been concerned about a broader set of issues pertaining to the adoption of new technologies, which are relevant to molecular assays. These considerations include the financial implications to different stakeholders (e.g., payers, patients, and society as a whole), the relative tradeoffs between financial costs and clinical benefits, how the new assay may differentially affect different populations (especially if it may create widening of disparities in health care), and nonmedical implications, such as access to life insurance and employment.

Economic validity refers to the completeness, quality, and reliability of the analyses used to assess the economic implications of novel technologies. In 2003, Weinstein et al. outlined the criteria that should be considered when evaluating economic analysis; these are detailed in Table 6 (Weinstein et al., 2003). With regard to structure, the objective of the economic analysis must be clearly stated, as well as the parameters and assumptions of the analysis. Data sources must be transparent, graded for quality, and obtained in a well-established, methodical way. Analysis of the data should be consistent with well-established statistical methods and should examine appropriate patient subgroups. The instability or uncertainty of the model under varying conditions must be tested. Last, the analysis should have internal consistency and face validity, be calibrated, and be subjected to peer review.
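Several of these criteria (discounting of costs and benefits, half-cycle correction, and consistency of results with theory) are easiest to see in a small model. The following is a minimal sketch of a cohort Markov model, not the model of Weinstein et al. or of any published appraisal. The three health states, transition probabilities, costs, utilities, and the 3% discount rate are hypothetical, and discounting each cycle at its end is only one of several conventions.

```python
def run_markov(transition, cost, utility, start, cycles, discount_rate):
    """Cohort Markov model with annual cycles.

    Returns (discounted cost, discounted QALYs) per patient. State occupancy
    in each cycle is the average of the start and end distributions
    (half-cycle correction); values are discounted at the end of the cycle.
    """
    n = len(start)
    dist = list(start)
    total_cost = 0.0
    total_qaly = 0.0
    for t in range(cycles):
        nxt = [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]
        occupancy = [(a + b) / 2.0 for a, b in zip(dist, nxt)]
        factor = 1.0 / (1.0 + discount_rate) ** (t + 1)
        total_cost += factor * sum(o * c for o, c in zip(occupancy, cost))
        total_qaly += factor * sum(o * u for o, u in zip(occupancy, utility))
        dist = nxt
    return total_cost, total_qaly

def icer(cost_new, qaly_new, cost_old, qaly_old):
    """Incremental cost-effectiveness ratio (cost per QALY gained)."""
    delta_qaly = qaly_new - qaly_old
    if delta_qaly == 0:
        raise ValueError("no difference in QALYs; ICER is undefined")
    return (cost_new - cost_old) / delta_qaly

# Hypothetical states: well, recurrence, dead. Each row sums to 1.
usual_care = [[0.85, 0.10, 0.05],
              [0.00, 0.70, 0.30],
              [0.00, 0.00, 1.00]]
test_guided = [[0.90, 0.06, 0.04],
               [0.00, 0.70, 0.30],
               [0.00, 0.00, 1.00]]
state_cost = [1000.0, 20000.0, 0.0]   # cost per cycle in each state
state_utility = [0.90, 0.60, 0.0]     # quality-of-life weight in each state
start = [1.0, 0.0, 0.0]               # whole cohort starts in "well"
assay_cost = 4000.0                   # one-time cost of the assay at baseline

c_old, q_old = run_markov(usual_care, state_cost, state_utility, start, 10, 0.03)
c_new, q_new = run_markov(test_guided, state_cost, state_utility, start, 10, 0.03)
c_new += assay_cost
print(f"Incremental cost per QALY gained: {icer(c_new, q_new, c_old, q_old):,.0f}")
```

Rerunning the sketch with a higher `assay_cost` should raise the ratio, which is the kind of consistency-with-theory check the criteria call for; varying the other inputs one at a time is a simple form of the uncertainty analysis.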

Presentation

Presentation is the final component of LDT-SynFRAME and emphasizes the importance of transparent and informative communication of scientific and research methods that is complete, uniform, unbiased, and understandable. This issue was highlighted more than 30 years ago in a study by Casscells et al. (1978), which showed that “almost all physicians confused the sensitivity of a test with positive predictive value.” In a review of the Casscells et al. (1978) study, Hoffrage et al. (2000) concluded that representation of evidence is critical to understanding and decision making. Evidence, especially statistical data, to support the validity and utility of a test can be conveyed in a number of ways. The accuracy of a diagnostic test can be represented through contingency tables, as numerical values for sensitivity and specificity, or graphically by areas under the curve, boxplots, likelihood ratio nomograms, or decision trees (Whiting et al., 2008). As another example, test predictions can be presented through a variety of graphs: Kaplan-Meier curves of different cohorts, a continuous risk curve, or categorical risk groups. While these are all valid methods of presenting data, not all are equally accessible or comprehensible. Hoffrage et al. (2000) recommended that statistical evidence and data be presented in a manner that is meaningful for the main users. In the field of molecular classifiers, the end users are often physicians and patients; given such an audience, the evidence and test results should be presented in a manner that makes their clinical significance easily understood.
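The confusion between sensitivity and positive predictive value can be shown with a short calculation. The sketch below is illustrative: the sensitivity, specificity, and prevalence are hypothetical values, not figures from the studies cited above, and the function names are our own. It reports the same evidence in two forms, as probabilities and as counts in a notional population of 10,000 people.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (positive predictive value, negative predictive value)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)

def as_counts(sensitivity, specificity, prevalence, population=10000):
    """Restate the same inputs as whole-number counts in a notional population."""
    diseased = round(population * prevalence)
    true_pos = round(diseased * sensitivity)
    false_pos = round((population - diseased) * (1.0 - specificity))
    return diseased, true_pos, false_pos

# Hypothetical test characteristics and disease prevalence.
sens, spec, prev = 0.90, 0.95, 0.01

ppv, npv = predictive_values(sens, spec, prev)
diseased, true_pos, false_pos = as_counts(sens, spec, prev)
print(f"Sensitivity {sens:.0%}, but positive predictive value {ppv:.1%}")
print(f"Of 10000 people, {diseased} have the disease and {true_pos} of them test positive;")
print(f"{false_pos} people without the disease also test positive.")
```

With a rare condition, most positive results come from people without the disease, so a highly sensitive test can still have a low positive predictive value. Presenting the counts alongside the percentages is one way to make that visible to a clinical audience.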

Discussion

The development of molecular assays has opened new possibilities for reducing clinical uncertainty, improving patient outcomes, and lowering costs. The clinical value of a number of molecular classifiers has been recognized by clinical associations, such as the National Comprehensive Cancer Network (NCCN) and the American Society of Clinical Oncology (ASCO), which have incorporated them into their respective guidelines. The rapidly expanding field of molecular medicine and genomic-based practice, however, differs from the traditional diagnostic model; as such, the old appraisal system for pharmaceuticals does not translate into a sufficient framework for appraising molecular classifiers. The growth of genomic medicine calls for a new evaluation and assessment system that is accessible to health technology assessors, managed-care payers, physicians, and patients alike. In response to this unmet need, we have summarized and synthesized the published appraisal criteria into the LDT-SynFRAME framework, which is a comprehensive technology evaluation system.

LDT-SynFRAME synthesizes the components of all available frameworks into one, and several of its most important differentiating aspects bear highlighting. First, some of the prior frameworks refer to tests in general and so provide relatively limited specifics on issues concerning the validation of molecular classifiers. Others have focused on validation of single-gene tests, presumably because the majority of novel tests are of this type. However, there are important distinctions, especially with respect to analytical validity, for tests that contain an array of genes, or genes used in combination with other predictors, including clinical factors or protein-based assays. LDT-SynFRAME makes these distinctions explicit. Second, the definition of Level I evidence for clinical validation studies has evolved, even since publication of prior frameworks. LDT-SynFRAME incorporates these new concepts. Third, clear communication and reporting of results has been identified as essential in other areas of technology assessment, such as the reporting of methods and results of randomized controlled trials and of economic analyses. By synthesizing prior work, some of which was not specifically focused on genetic tests, LDT-SynFRAME uniquely provides a framework for evaluating whether the reporting of methods and results is comprehensive and unbiased.

We found substantial convergence across prior frameworks in some areas, which is reflected in LDT-SynFRAME. For example, all approaches indicate the importance of a clear introductory statement of the purpose of the assessment. Moreover, they all comment on the importance of assessing the clinical validity of the assays and the need to examine the broader ethical, legal, and social issues concerning adoption of such assays. LDT-SynFRAME deliberately differs from other frameworks in the greater emphasis it places on economic analyses; this reflects that the authors were unconstrained by potential conflicts with policy concerns that continue into this decade to discourage the explicit use of such analyses when appraising technologies.

Will the adoption of LDT-SynFRAME eliminate the confusion many audiences have when seeing contradictory conclusions about the value of a novel diagnostic? We believe not. The various appraisal groups operate under different incentives, so it should be unsurprising that they sometimes arrive at different conclusions when evaluating the same evidence. One of the important values of any appraisal framework is to provide an explicit and transparent set of principles whereby the conclusions of decision makers can be examined thoughtfully, and whereby decision makers can fully and systematically substantiate the conclusions they reach by applying the framework appropriately. Developers of new molecular classifiers and their investors will be able to determine more accurately what evidence is needed to warrant positive assessments of their innovations. We believe that LDT-SynFRAME, while comprehensive, is also simple enough to give interested persons who may not perform the actual appraisal a structured perspective from which to assess the thoroughness and validity of appraisals.

Continued and regular updating of a framework for assessing molecular classifiers will be needed. The development of and indications for molecular classifiers are changing rapidly; such robust research activity usually brings attention to bear on new issues unforeseen by prior frameworks. In this case, examples include the developments in analytical validity of multi-gene assays versus single-gene assays and definitions of what constitutes Level I evidence for clinical validity. Areas that appear to deserve more discussion and consensus among experts in the field of assessing molecular classifiers include (1) the role of randomized clinical trials, (2) the principles for the design of decision impact studies, and (3) best principles for reporting methods and results of clinical validation studies. Our contribution is intended to promote a well-informed dialog among innovators, health technology appraisers, physicians, patients, and those responsible for approval and reimbursement of new molecular classifiers, resulting in the highest quality of clinical and policy decision making.

Footnotes

Disclosure Statement

The authors received no funding for the development or preparation of this article. All authors, however, were employees of Cedar Associates LLC at the time of writing.

References

1. The AMCP Format for Formulary Submissions Version 3.0. 2010. Academy of Managed Care Pharmacy, Foundation for Managed Care Pharmacy. www.amcp.org/data/jmcp/1007_121%2019%2009(3).pdf

2. Casscells, Schoenberger, Graboys. 1978. Interpretation by physicians of clinical laboratory results. N Engl J Med, 299:999-1001.

3. College of American Pathologists: molecular pathology checklist. College of American Pathologists 2005 (revised 2007) cap.org/apps/docs/laboratory_accreditation/checklists/molecular_pathology_sep07.pdf

4. Fischbach, Dunning. 2009. A Manual of Laboratory and Diagnostic Tests, 8th ed. Lippincott Williams & Wilkins: Philadelphia, PA.

5. Frame, Carlson. 1975. A critical review of periodic health screening using specific screening criteria. Part 1: selected diseases of respiratory, cardiovascular, and central nervous systems. J Fam Pract, 2:29-36.

6. Fryback, Thornbury. 1991. The efficacy of diagnostic imaging. Med Decis Making, 11:88-94.

7. Gene Expression Profiling for Managing Breast Cancer Treatment. 2005. Blue Cross Blue Shield Technology Evaluation Center. Blue Cross Blue Shield, 20,3.

8. Gene Expression Profiling for Managing Breast Cancer Treatment. 2007. Blue Cross Blue Shield Technology Evaluation Center. Blue Cross Blue Shield, 22,13.

9. Genetic Testing. 2007. ACCE model system for collecting, analyzing and disseminating information on genetic tests. www.cdc.gov/genomics/gtesting/ACCE/fbr/index.htm. 2009 May 4.

10. Haddow, Palomaki. 2003. ACCE: A model process for evaluating data on emerging genetic tests. In: Khoury, Little, Burke, eds. Human Genome Epidemiology: A Scientific Foundation for Using Genetic Information to Improve Health and Prevent Disease. Oxford: Oxford University Press, 217-233.

11. Harris, Helfand, Woolf et al. 2001. Current methods of the US preventive services task force: a review of the process. Am J Prev Med, 20:21-35.

12. Hayes, Bast, Desch et al. 1996. Tumor marker utility grading system: a framework to evaluate clinical utility of tumor markers. J Natl Cancer Inst, 88:1456-1466.

13. Hayes, Ethier, Lippman. 2006. New guidelines for reporting of tumor marker studies in breast cancer research and treatment: REMARK. Breast Cancer Res Treat, 100:237-238.

14. Hoffrage, Lindsey, Hertwig, Gigerenzer. 2000. Communicating statistical information. Science, 290:2261-2262.

15. Hricak, Akin, Bradbury et al. 2005. Advanced Imaging Methods: Functional and Metabolic Imaging, 7th ed. Lippincott Williams & Wilkins: Philadelphia, PA.

16. Krouwer, Cembrowski, Tholen. 2006. Preliminary Evaluation of Quantitative Clinical Laboratory Measurement Procedures; Approved guideline-Third Edition. Volume 26. Clinical and Laboratory Standards Institute (CLSI): Wayne, PA, document EP10-A3 (ISBN 1-56238-622-0).

17. Mansfield, O'Leary, Gutman. 2005. Food and Drug Administration regulation of in vitro diagnostic devices. J Mol Diagn, 7:2-7.

18. Marchionni, Wilson, Marinopoulos et al. 2008. Impact of Gene Profiling Tests on Breast Cancer Outcomes. Evidence Report/Technology Assessment. The Johns Hopkins University Evidence-based Practice Center: Baltimore, MD. www.ahrq.gov/clinic/tp/brcgenetp.htm

19. Mark. 2010. Chapter 3 Decision-Making in Clinical Medicine. In: Fauci, Braunwald, Kasper et al., eds. Harrison's Principles of Internal Medicine, 17th ed. McGraw Hill: New York.

20. McShane, Altman, Sauerbrei et al. 2005. Reporting recommendations for tumor marker prognostic studies. J Clin Oncol, 23:9067-9072.

21. Michiels, Koscielny, Hill. 2005. Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet, 365:488-492.

22. Ntzani, Ioannidis. 2003. Predictive ability of DNA microarrays for cancer outcomes and correlates: an empirical assessment. Lancet, 362:1439-1444.

23. Paik, Shak, Tang et al. 2004. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med, 351:2817-2826.

24. Paik, Tang, Shak et al. 2006. Gene expression and benefit of chemotherapy in women with node-negative, estrogen receptor-positive breast cancer. J Clin Oncol, 24:3726-3734.

25. Pearson, Blair, Daniels et al. 2002. AHA guidelines for primary prevention of cardiovascular disease and stroke: 2002 update: consensus panel guide to comprehensive risk reduction for adult patients without coronary or other atherosclerotic vascular diseases. American heart association science advisory and coordinating committee. Circulation, 106:388-391.

26. Ramsey, Veenstra, Garrison Jr. et al. 2006. Toward evidence-based assessment for coverage and reimbursement of laboratory-based diagnostic and genetic tests. Am J Manag Care, 12:197-202.

27. Simon. 2006. A checklist for evaluating reports of expression profiling for treatment selection. Clin Adv Hematol Oncol, 4:219-224.

28. Simon, Paik, Hayes. 2009. Use of archived specimens in evaluation of prognostic and predictive biomarkers. JNCI, 101:1446-1452.

29. Sparano, Solin. 2010. Defining the clinical utility of gene expression assays in breast cancer: the intersection of science and art in clinical decision making. J Clin Oncol, 28:1625-1627.

30. Sturgeon, Hoffman, Chan et al. 2008. National academy of clinical biochemistry laboratory medicine practice guidelines for use of tumor markers in clinical practice: quality requirements. Clin Chem, 54:e1-e10.

31. Sun, Bruening, Uhl et al. 2010. Quality, Regulation and Clinical Utility of Laboratory-developed Molecular Tests: Technology Assessment Report. ECRI Institute Evidence-based Practice Center: Plymouth Meeting, PA.

32. Technology Evaluation Center Criteria. 2008. www.bcbs.com/blueresources/tec/tec-criteria.html. 2011 January 21.

33. Teutsch, Bradley, Palomaki et al. 2009. The evaluation of genomic applications in practice and prevention (EGAPP) initiative: methods of the EGAPP Working Group. Genet Med, 11:3-14.

34. Tholen, Kallner, Kennedy et al. 2004. Evaluation of Precision Performance of Quantitative Measurement Methods; Approved Guideline-second edition. National Committee of Clinical Laboratory Standards Institute (CLSI): Wayne, PA, document number EP5-A2 (ISBN 1-56238-542-9).

35. Tholen, Kroll, Astles et al. 2003. Evaluation of linearity of quantitative measurement procedures; A statistical approach; approved guideline. Wayne, PA: National Committee of Clinical Laboratory Standards Institute (CLSI), document number EP6-A (ISBN 1-56238-498-8).

36. U.S. Preventive Services Task Force Procedure Manual. 2008. AHRQ Publication No. 08-05118-EF. www.uspreventiveservicestaskforce.org/uspstf08/methods/procmanual.htm. 2008 July.

37. Wald, Cuckle. 1989. Reporting the assessment of screening and diagnostic tests. Br J Obstet Gynaecol, 96:389-396.

38. Weinstein, O'Brien, Hornberger et al. 2003. Principles of good practice for decision analytic modeling in health-care evaluation: report of the ISPOR Task Force on Good Research Practices—Modeling Studies. Value Health, 6:9-17.

39. Weitzel, Lagos, Cullinane et al. 2007. Limited family structure and BRCA gene mutation status in single cases of breast cancer. JAMA, 297:2587-2595.

40. Whiting, Sterne, Westwood et al. 2008. Graphical representation of diagnostic information. BMC Med Res Methodol, 8:1-15.

41. Whiting, Toerien, de Salis et al. 2007. A review identifies and classifies reasons for ordering diagnostic tests. J Clin Epidemiol, 60:981-989.

42. Wolff, Hammond, Schwartz et al. 2007. American Society of Clinical Oncology/College of American Pathologists guideline recommendations for Human Epidermal Growth Factor Receptor 2 testing in breast cancer. Arch Pathol Lab Med, 131:18-43.