Abstract
Double (bi-allelic) mutations in the gene encoding the CCAAT/enhancer-binding protein-alpha (CEBPA) transcription factor have a favorable prognostic impact in acute myeloid leukemia (AML). Double mutations in CEBPA can be detected using various techniques, but the gene is notoriously difficult to sequence because of its high GC content. Here we developed a two-step gene expression classifier for accurate and standardized detection of CEBPA double mutations. The key feature of the two-step classifier is that it first explicitly removes cases with low CEBPA expression, thereby excluding CEBPA hypermethylated cases, which have gene expression profiles similar to those of CEBPA double mutants and would otherwise result in false-positive predictions. In the second step, a 55-gene signature identifies the true CEBPA double-mutation cases. This two-step classifier was tested on a cohort of 505 unselected AML cases, including 26 CEBPA double mutants, 12 CEBPA single mutants, and seven CEBPA promoter hypermethylated cases, on which its performance was estimated by a double-loop cross-validation protocol. The two-step classifier achieves a sensitivity of 96.2% (95% confidence interval [CI] 81.1 to 99.3) and a specificity of 100.0% (95% CI 99.2 to 100.0), with no false-positive detections. This two-step CEBPA double-mutation classifier has been incorporated on a microarray platform that can simultaneously detect other relevant molecular biomarkers, which allows for a standardized comprehensive diagnostic assay. In conclusion, gene expression profiling provides a reliable method for CEBPA double-mutation detection in patients with AML for clinical use.
Introduction
The WHO has defined AML with CEBPA mutations (single and double collectively) as a provisional favorable subtype of AML (Swerdlow et al., 2008). AML cases with a CEBPA mutation are predominantly found in the cytogenetically intermediate risk group, mainly among those with normal karyotypes (Grimwade et al., 1998). However, AMLs carrying CEBPA double mutations, rather than CEBPA single mutations, represent a distinct prognostically favorable group of AML cases, and therefore CEBPA double mutants provide a clinically useful marker for risk stratification of the leukemia (Pabst et al., 2009; Renneville et al., 2009; Wouters et al., 2009; Green et al., 2010; Taskesen et al., 2011). Finally, AML cases have been identified that are CEBPA wild type (wt), yet are hypermethylated (Figueroa et al., 2009). Not surprisingly, these AMLs display gene expression patterns that are very similar to those of CEBPA double mutants, because a shared set of downstream transcripts is low as a result of either the bi-allelic mutations or the hypermethylated promoter status of CEBPA (Figueroa et al., 2009; Taskesen et al., 2011). However, CEBPA double-mutation AMLs and CEBPA hypermethylated AMLs differ with regard to the expression level of CEBPA itself, which is generally higher in the double-mutant AML subtype, but low to absent in the hypermethylated subtype.
Classification of CEBPA double-mutant (dm) cases has previously been demonstrated using gene expression profiling (Wouters et al., 2009); however, the cases misclassified by that approach were the CEBPA hypermethylated AMLs (Figueroa et al., 2009; Taskesen et al., 2011). Herein we have developed a two-step classifier, which explicitly distinguishes the CEBPA double mutants from the CEBPA hypermethylated cases. The accuracy of this two-step classifier was established by means of a double-loop cross validation on a cohort of 505 primary AML cases. The two-step classifier was then incorporated on a custom chip, along with detection capabilities for AML1-ETO, CBFB-MYH11, PML-RARA, other mutations such as NPM1 (Van Vliet et al., Submitted), as well as overexpression of the prognostic genes EVI1 and BAALC (Brand et al., in preparation). Thus, this gene expression assay provides a standardized test for multiple molecular biomarkers relevant for AML at the time of diagnosis.
Materials and Methods
Custom in vitro diagnostic gene expression array
A customized Affymetrix (Santa Clara, CA) GeneChip microarray was designed (AMLprofiler™). This custom chip includes a subset of the probe sets present on the Affymetrix HG-U133 Plus2 platform and a subset of probes designed for specific purposes. Its resulting .CEL file is automatically routed through proprietary software that reports 0 or 1 for CEBPA wt and CEBPA dm, respectively.
Datasets
We used blood or bone marrow specimens from 505 AML patients who had been enrolled in the Dutch-Belgian Hematology-Oncology Cooperative group protocols -04, -04A, -29, and -42 (HOVON-SAKK, the Erasmus University Medical Center, Rotterdam) (Valk et al., 2004; Wouters et al., 2009). The 505 cases used here are the subset of the 524 cases (Wouters et al., 2009) for which the hybridization cocktail was available. The 505 set contains all 26 CEBPA dm and 12 CEBPA single-mutant (sm) cases from the Wouters et al. (2009) cohort (Table 1). Sample processing and quality control were performed as previously described (Wouters et al., 2009). Bi-allelic mutant, single mutant, and wild-type CEBPA annotations were confirmed in all cases by investigation of the entire CEBPA coding region using denaturing high-performance liquid chromatography (dHPLC), analysis of selected regions by agarose gel analysis (van Waalwijk van Doorn-Khosrovani et al., 2003), and/or nucleotide sequencing (Wouters et al., 2009). CEBPA promoter hypermethylation was determined using the HELP assay (HpaII tiny fragment enrichment by ligation-mediated PCR), as previously described (Valk et al., 2004). Table 1 shows an overview of the CEBPA mutation status of all AML cases in the cohort.
Hybridization cocktails were obtained for all cases (Verhaak et al., 2009) and were hybridized onto the AMLprofiler platform. All .CEL files were preprocessed using the MAS5 algorithm (scaling to 1500), and probe set intensities below 30 were truncated to 30. Subsequently, geometric mean centering was applied per probe set relative to a subset of 244 AML cases.
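The truncation and centering steps can be sketched as follows. This is a minimal illustration and not the authors' pipeline: it assumes that MAS5-scaled intensities are already available as a samples-by-probe-sets list, that centering is done in log2 space (where dividing by a geometric mean becomes subtracting a mean of logs), and the function and argument names are hypothetical.

```python
import math

def floor_and_center(intensities, reference_rows, floor=30.0):
    """Truncate low intensities and center each probe set on a reference subset.

    intensities    : list of samples, each a list of MAS5-scaled probe set values
    reference_rows : indices of the samples that define the centering reference
    Returns log2 ratios relative to the per-probe-set geometric mean of the
    reference samples (log2 output is an assumption, not stated in the text).
    """
    # Step 1: truncate values below the floor, then move to log2 space.
    logged = [[math.log2(max(v, floor)) for v in sample] for sample in intensities]

    # Step 2: the mean of logs over the reference samples is the log of their
    # geometric mean, so subtracting it is geometric mean centering.
    n_probes = len(logged[0])
    ref_means = [
        sum(logged[r][p] for r in reference_rows) / len(reference_rows)
        for p in range(n_probes)
    ]
    return [[v - m for v, m in zip(sample, ref_means)] for sample in logged]


# Toy data: 3 samples x 2 probe sets; samples 1 and 2 form the reference subset.
toy = [[10.0, 100.0], [120.0, 400.0], [480.0, 1600.0]]
centered = floor_and_center(toy, reference_rows=[1, 2])
```

In the toy data the value 10 is first raised to 30, so the first sample ends up three log2 units below the reference geometric mean for both probe sets.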
AMLprofiler .CEL files are available at the Gene Expression Omnibus (National Center for Biotechnology Information) under accession number GSE42194.
All patients provided written informed consent in accordance with the Declaration of Helsinki.
Double-loop cross validation
To build diagnostic classifiers from high-throughput data, we used the double-loop cross-validation framework (Wessels et al., 2005). We combined this methodology with forward filtering as the feature selector, the signal-to-noise ratio as the criterion to evaluate individual genes, and ClaNC (Dabney, 2005), a simple classifier that is known to perform well on this type of data. The double-loop cross validation was executed with 100 repeats of 26-fold cross validation in the outer (validation) loop and 10-fold cross validation in the inner loop. Learning curves were constructed for up to 100 genes, using the average of the sensitivity and specificity as the criterion to be optimized (reported as a percentage, where 50% is random classification and 100% is perfect classification). At all points, data splits were stratified with respect to the class prior probabilities. To optimize a classifier toward fewer false-negative (FN) or fewer false-positive (FP) cases, the prior probabilities were adjusted so that the classifier boundary shifts.
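The two scoring quantities named above can be illustrated with a short sketch. The text does not give the formula for the signal-to-noise ratio, so the common definition (difference of class means divided by the sum of class standard deviations) is assumed here; the function names and toy values are hypothetical, and neither ClaNC nor the cross-validation loops themselves are reproduced.

```python
from statistics import mean, stdev

def signal_to_noise(pos_values, neg_values):
    """Assumed definition: (mean_pos - mean_neg) / (sd_pos + sd_neg)."""
    return (mean(pos_values) - mean(neg_values)) / (stdev(pos_values) + stdev(neg_values))

def rank_genes(expr_by_gene, labels):
    """Order genes by absolute signal-to-noise ratio, strongest first.

    expr_by_gene : dict mapping gene name -> list of expression values
    labels       : list of 1 (positive class) / 0 (negative class), same order
    """
    scores = {}
    for gene, values in expr_by_gene.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = abs(signal_to_noise(pos, neg))
    return sorted(scores, key=scores.get, reverse=True)

def mean_sens_spec(y_true, y_pred):
    """Average of sensitivity and specificity, as a percentage."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 100.0 * (sensitivity + specificity) / 2


# Toy example with two genes and four samples (two per class).
labels = [1, 1, 0, 0]
expr = {"geneA": [4.0, 6.0, 1.0, 3.0], "geneB": [5.0, 5.2, 5.1, 4.9]}
ranking = rank_genes(expr, labels)
criterion = mean_sens_spec([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])
```

In a double-loop setup, a ranking like this would be recomputed inside each training split, so that held-out cases never influence which genes are selected.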
Results
Development of a classifier for CEBPA dm
We set out to develop a classifier for detecting CEBPA dm AML cases in a cohort of 505 patients with AML (Table 1). For this, we considered the CEBPA dm cases as the positive group and the remainder (CEBPA wt, sm, or hypermethylated) as the negative group. This yields a six-gene classifier with high performance, as shown in Figure 1 and Table 2. The misclassified cases are predominantly the hypermethylated cases (Fig. 1), which is consistent with their biological background: the downstream gene expression profile resulting from a nonfunctional CEBPA protein due to bi-allelic gene mutations may be equivalent to that resulting from absence of the protein due to the hypermethylated gene status. Although this classifier has a high accuracy, false-positive cases are undesirable from a clinical utility point of view, as these patients might be at risk of undertreatment given the favorable prognosis of CEBPA double mutants. Therefore, we investigated methods to exclude the hypermethylated cases before the application of a classifier.

FIG. 1. CEBPA hypermethylated cases are often false positives (FP) in CEBPA double-mutant classifiers. Scatter plot showing the output of a one-step classifier trained on the 505 cases, as determined by the double-loop cross-validation procedure. Numbers indicate how often a case was misclassified in the 100 repeats (samples that were always correctly classified have no number printed). Color images available online at www.liebertpub.com/gtmb
Sens, sensitivity; spec, specificity; CI, confidence interval; LL, lower limit; UL, upper limit.
Development of a two-step classifier for CEBPA dm
Since AML cases with hypermethylated CEBPA show reduced CEBPA gene expression levels, we propose a two-step classifier. In step 1, the CEBPA expression level is assessed in comparison with a threshold. In step 2, only the selected cases exceeding that threshold will subsequently be input into the classifier (Fig. 2).

FIG. 2. Overview of the two-step classifier for predicting CEBPA dm.
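The control flow of the two steps can be sketched as below. This is an illustrative outline only: the function names, the 0/1 return convention, and the toy stand-in for the step 2 classifier are assumptions, and the real step 2 model is the trained gene signature classifier rather than the placeholder shown.

```python
def two_step_predict(cebpa_expression, signature_values, threshold, step2_classifier):
    """Return 1 for a predicted CEBPA double mutant, 0 otherwise.

    cebpa_expression : CEBPA expression value of the case
    signature_values : expression values of the step 2 signature genes
    threshold        : step 1 cutoff on CEBPA expression
    step2_classifier : callable returning True for a predicted double mutant
    """
    # Step 1: cases that do not exceed the threshold are never called dm,
    # which is how low-expressing hypermethylated cases are excluded.
    if cebpa_expression <= threshold:
        return 0
    # Step 2: only cases exceeding the threshold reach the signature classifier.
    return 1 if step2_classifier(signature_values) else 0


def toy_step2(signature_values):
    # Hypothetical stand-in: calls dm when the mean signature value is positive.
    return sum(signature_values) / len(signature_values) > 0.0


low_expression_call = two_step_predict(2.0, [1.0, 2.0], 5.0, toy_step2)
high_expression_dm = two_step_predict(8.0, [1.0, 2.0], 5.0, toy_step2)
high_expression_not_dm = two_step_predict(8.0, [-1.0, -2.0], 5.0, toy_step2)
```

The first call returns 0 even though the toy signature would have voted for a double mutant, which is the behavior that prevents low-expressing cases from becoming false positives.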
The step 1 threshold was developed as follows. The expression levels for the different CEBPA mutation groups were compared (Fig. 3A). As expected, the hypermethylated CEBPA cases all have notably low CEBPA expression values. The CEBPA wt cases show a wide range of CEBPA expression levels, whereas the CEBPA sm cases have slightly elevated CEBPA expression. The CEBPA dm cases, however, all share high CEBPA expression levels. The aim of the step 1 classifier is to place all hypermethylated cases below the threshold and all CEBPA dm cases above it. The threshold is determined by the intersection of two normal distributions, one fitted to the group of CEBPA dm cases and one to the group of CEBPA hypermethylated cases. At that point, the groups have an equal chance of correct classification (i.e., the overall chance of misclassification is minimal), as indicated by the red line in Figure 3A. A total of 79 cases exceed this threshold, including all 26 CEBPA dm cases, 5 CEBPA sm cases, and 48 CEBPA wt cases (see Table 1).
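Finding the crossing point of two fitted normal densities reduces to solving a quadratic equation, as the sketch below shows. It assumes each group is summarized by its sample mean and standard deviation and that the relevant crossing lies between the two means; the function names and the numeric values are hypothetical and are not taken from the study data.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gaussian_intersection(mu_lo, sd_lo, mu_hi, sd_hi):
    """Return the point between the two means where both densities are equal.

    Equating the log densities gives a*x^2 + b*x + c = 0.
    """
    a = 1 / (2 * sd_lo ** 2) - 1 / (2 * sd_hi ** 2)
    b = mu_hi / sd_hi ** 2 - mu_lo / sd_lo ** 2
    c = (mu_lo ** 2 / (2 * sd_lo ** 2)
         - mu_hi ** 2 / (2 * sd_hi ** 2)
         + math.log(sd_lo / sd_hi))
    if abs(a) < 1e-12:
        # Equal variances: the equation is linear and the root is the midpoint.
        return -c / b
    disc = math.sqrt(b * b - 4 * a * c)
    roots = [(-b + disc) / (2 * a), (-b - disc) / (2 * a)]
    between = [r for r in roots if mu_lo <= r <= mu_hi]
    if not between:
        raise ValueError("no density crossing between the two group means")
    return between[0]


# Hypothetical group summaries: low-expressing group vs. high-expressing group.
threshold = gaussian_intersection(mu_lo=5.0, sd_lo=0.5, mu_hi=9.0, sd_hi=1.0)
midpoint = gaussian_intersection(mu_lo=0.0, sd_lo=1.0, mu_hi=4.0, sd_hi=1.0)
```

With equal standard deviations the crossing is simply the midpoint of the means; with unequal ones it moves toward the narrower group, as in the first call.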

The step 2 classifier was developed using the 79 cases that exceed the threshold of the first step. On these 79 cases, we trained a classifier with no FPs and a minimal number of FNs. This results in a 55-gene classifier with 0 FP and 1 FN (sensitivity 96.2%; specificity 100.0%; Fig. 3B). This classifier has a higher sensitivity and specificity than the classifier trained on all 505 cases (see Table 2). The 55 probe sets used in the classifier are provided in Table 3.
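The reported percentages and confidence intervals can be checked with a few lines of arithmetic. The text does not state how the intervals were computed; the sketch below assumes a Wilson score interval, which reproduces the reported limits when sensitivity is taken over the 26 CEBPA dm cases and specificity over the remaining 505 - 26 = 479 cases. The function name is hypothetical.

```python
import math

def wilson_interval(successes, total, z=1.959964):
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at p = 0 or 1.
    return max(0.0, center - half), min(1.0, center + half)


sensitivity = 25 / 26                    # 1 FN among 26 dm cases
sens_ci = wilson_interval(25, 26)
specificity = 479 / 479                  # 0 FP among 479 non-dm cases
spec_ci = wilson_interval(479, 479)

print(f"sensitivity {100 * sensitivity:.1f}% "
      f"(95% CI {100 * sens_ci[0]:.1f} to {100 * sens_ci[1]:.1f})")
print(f"specificity {100 * specificity:.1f}% "
      f"(95% CI {100 * spec_ci[0]:.1f} to {100 * spec_ci[1]:.1f})")
```

Under these assumptions the limits come out at 81.1 to 99.3 for sensitivity and 99.2 to 100.0 for specificity, matching the values in the abstract.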
Discussion
Given the favorable prognostic value of CEBPA double mutants, accurate detection of CEBPA mutations at the time of AML diagnosis is important. Current methods include dHPLC followed by Sanger sequencing, direct sequencing, melting curves, etc. However, CEBPA is a notoriously difficult gene to sequence because of its high GC content. In addition, because of short sequence reads, most next generation sequencing (NGS) techniques cannot determine whether two co-occurring mutations inactivate both alleles or reside on the same allele. Alternatively, a gene expression classifier for CEBPA dm in AML has been built previously (Wouters et al., 2009), but not one that explicitly prevented CEBPA promoter hypermethylated AML cases from being called as false-positive CEBPA double mutants (Figueroa et al., 2009; Taskesen et al., 2011).
Given the favorable prognosis of CEBPA double mutants, FP cases might be at risk of undertreatment. CEBPA hypermethylated cases have a gene expression pattern that is overall very similar to that of CEBPA double-mutant cases (Figueroa et al., 2009; Taskesen et al., 2011), and they are therefore more likely than other cases to end up as FP. For this reason, we have built a two-step classifier that first eliminates CEBPA hypermethylated cases based on their low CEBPA expression level, which is the only difference that distinguishes them from CEBPA double-mutant cases. Next, a classifier with unequal prior probabilities is employed so that no FPs occur. This two-step classifier provides an accurate method for predicting CEBPA dm status in AML.
Validation of a classifier on an independent cohort is a common way to estimate its performance. However, such gene expression data measured using the custom Affymetrix GeneChip platform is currently unavailable. As an alternative, we employed the double-loop cross-validation procedure, which estimates the classifier's performance using multiple resamplings of the same data (generalization performance, Wessels et al., 2005). Finally, the two-step classifier presented here will also be validated in an ongoing prospective clinical trial.
In conclusion, the two-step classifier provides an accurate means to detect CEBPA dm status. Moreover, the CEBPA dm classifier has been included on the AMLprofiler Affymetrix platform, which also detects the presence of t(8;21), t(15;17), inv(16)/t(16;16), NPM1 mutations, EVI1 overexpression, and BAALC overexpression in a single standardized test. Together, this enables a more efficient analysis at diagnosis, with a single sample workup that is suitable for clinical use.
Acknowledgments
This research was performed within the framework of CTMM, the Center for Translational Molecular Medicine, project BioCHIP grant 03O-102.
Author Disclosure Statement
M.H.V.V. performed research, design of experiments, data analysis, data interpretation, and manuscript writing. P.B., L.D.Q., J.P.M.B., L.C.M.B., and H.V. performed research. B.L. and P.J.M.V. conceptualized the project, and were involved in research, design of the experiments, data interpretation, and manuscript writing. E.H.V.B. performed research, design of experiments, data interpretation, and manuscript writing. M.H.V.V., P.B., L.D.Q., J.P.M.B., L.C.M.B., H.V., B.L., P.J.M.V., and E.H.V.B. have declared ownership interests in Skyline Diagnostics.
