Experiment Specific Expression Patterns

Abstract

The differential analysis of genes between microarrays from several experimental conditions or treatments routinely estimates which genes change significantly between groups. As genes are never regulated individually, observed behavior may be a consequence of changes in other genes. Existing approaches like co-expression analysis aim to resolve such patterns from a wide range of experiments. The knowledge of such a background set of experiments can be used to compute expected gene behavior based on known links. It is particularly interesting to detect previously unseen specific effects in other experiments. Here, a new method to spot genes deviating from expected behavior (PAttern DEviation SCOring—Padesco) is devised. It uses linear regression models learned from a background set to arrive at gene specific prediction accuracy distributions. For a given experiment, it is then decided whether each gene is predicted better or worse than expected. This provides a novel way to estimate the experiment specificity of each gene. We propose a validation procedure to estimate the detection of such specific candidates and show that these can be identified with an average accuracy of about 85%.

1. Introduction

Microarrays are often measurements of two or multiple conditions. A natural way to analyze such data is through differential expression. It results in candidate gene lists containing several hundreds of genes. Detailed biological downstream studies are usually not feasible for all of these genes. Further filtering towards more promising candidates is necessary. Moreover, most candidates are likely indirect targets of initially affected genes or, more generally, they follow a pattern which can be observed similarly in other experiments. Such genes may not be of immediate interest. In return, striking differences to known behavior indicate specificity for a certain experiment and such genes are suited for further analysis. We will now introduce how Padesco (PAttern DEviation SCOring) models common patterns and in which way differences to known patterns are obtained.

Patterns. Hirsch et al. (2010) noted that disease specific effects eventually trigger core biological pathways and thus frequently lead to “a transcriptional signature that is common to a diverse set of human diseases.” Such signatures can be learned and used for experiment specific predictions. Padesco detects how well the behavior of a gene can be derived from other genes. It allows to detect genes which show both differential and unexpected—and thereby interesting—behavior. The target gene patterns we learn are derived through Support Vector Regression (SVR) and basically constitute linear models describing its dependencies to other genes. Since we aim at unexpected changes rather than states, we use fold-changes (not expression values) to describe gene behavior (see Fig. 1).

FIG. 1.

Basic structure of Padesco. A matrix F (Fig. 2) of gene expression fold-changes (fcs) is computed. A leave-one-out cross validation (LOOCV) is performed for each gene and omitting each experiment once resulting in |G|*|E| models. For a new experiment e, g's fcs are predicted with the model for (g,e) and compared to the measured fcs (residual scoring). Based on the LOOCV a deviation score (padscore) is derived. It enables deviation filtering of significant genes.

Deviations. Not all differential genes detected by differential expression analysis are specific. A background set of experiments must be heterogeneous to assess this. Padesco's key idea is that we can decide whether the behavior of a gene can be predicted worse than expected and is thus a specific gene. It is important to note that there may well be genes which are difficult or easy to predict in general. Our scoring scheme is designed to account for this individual prediction complexity by estimating an empirical distribution of deviations in a cross-validation (CV) setting.

Evaluation. Evaluation of differential expression results is difficult. Simulations may show methodological strengths and weaknesses, but biological evaluation is only possible through comparison to published knowledge or downstream experiments. We discuss Padesco's performance both by means of an exhaustive simulation experiment and a detailed discussion of literature supporting genes found to be interesting by our approach. The simulation shows that genes deviating from their common behavior would be neglected due to differential expression analysis alone, since they often show only moderate differential expression.

Scoring. Padesco is trained on a background set of experiments consisting of 1437 microarrays from 25 experiments sharing 4117 genes. A leave-one-out cross-validation (LOOCV) across all experiments is done yielding predictions for a genes fold-change for all pairs of arrays within the omitted experiment. We estimate how well a gene can be predicted by deriving the empirical distribution of its deviations from the measured fold-changes. We then devise a score based on the median absolute deviation to score a gene in an unseen experiment. We assume that a gene in a differential setting is interesting if it exhibits a change in its gene expression. We thus use differentially expressed genes. Furthermore, a gene may be predicted better than expected, which points at stronger presence of a trained gene-gene dependency within this experiment. Although the problem is related we do not focus it in this work. If a gene is predicted worse than expected this suggests changes in a known dependency structure.

1.1. Related work

Padesco is a natural extension to co-expression approaches like Eisen et al. (1998), Rhodes et al. (2002), Lee et al. (2004), or Ucar et al., (2007) as well as residual scoring schemes like Kostka and Spang (2004) and Prieto et al. (2006). Co-expression aims to construct gene sets by clustering, or construction of co-regulated gene sets across many samples. Our trained patterns are similar, yet not identical to these previously derived co-expression patterns. Mentionable differences are the use of fold-changes rather than raw measurements and our predictivity criterion. The idea that a gene can be predicted from other genes and the prediction error may model condition-specific changes is a concept previously applied as residual scoring (Kostka and Spang, 2004; Prieto et al., 2006). To achieve meaningful residual scores, these approaches must be applied on homogeneous training data like a certain disease subtype. These scoring schemes fail on very heterogeneous experiment data. Padesco aims to bridge the gap between mere co-regulation detection and residual scoring. For heterogeneous training sets, a background sensitive view on differential experiment results is provided. The most common framework for both residual scoring and differential analysis is provided by BioConductor (Gentleman et al., 2004). In general, identified sets of differentially expressed genes maximize sample discrimination. Such markers, or gene signatures, are promising targets for further analysis. Veer et al. (2002), for instance, used gene expression profiling to identify transcription signatures for breast cancer prognosis classification. Gene expression profiles have been used by Rhodes et al. (2002) to reveal pathway dysregulation in prostate cancer. Comprehensive resources like GEO (Gene Expression Omnibus [Edgar et al., 2002]) enable the analysis of co-expression across many experiments. The comparison can for instance be quantified by measures of reproducibility in-between experiments (Garrett-Mayer et al., 2008). Padesco's linear models are loosely related to those used for imputation of missing values (IMV). Here, either data from single (Hastie et al., 1999) or multiple experiments (Hu et al., 2006) is used. To our knowledge fold-change predictions have not been applied so far. SVR has been previously used by Kim et al. (2005) for IMV.

2. Methods

2.1. Data sets

Padesco is trained on a set of experiments compiled by Lee et al. (2004). It consists of 3924 microarrays from 60 human data sets. These sets comprise 62.2 million expression measurements. They consist of at least 10 and up to 255 samples. Genes are filtered for a minimum amount of variance across samples. We restrict the data set to Affymetrix array platforms (HG-U95A, HG-U95Av2, HU6800, HuGeneFl, HG-U133A, and HG-U133comb). There is no direct necessity for a comparability of absolute expression levels since our method works on fold-changes. We restrict to a subset G of 4117 genes which occur in more than 50% of all experiments and contain at least 75% valid (i.e., non-missing) measurements. Twenty-five experiments with a total of p = 1437 microarrays fulfill this constraint. We obtain a matrix of fold-changes F by sampling p pairs of arrays (Fig. 2) from the space of all possible pairs within experiments (see Section 2.5).

FIG. 2.

Schematic view of the fold-change (fc) matrix F. Rows: fcs of array pairs. Columns: genes; fcs \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$f_1 \ldots f_p$$ \end{document} are computed among pairs from the same experiments. Training: linear models are derived from fcs in the training set (gray shades) with training labels of gene l (red). Prediction is done using the fcs in the test-set (blue areas) for the test labels of gene l (yellow). They result in predicted fcs which can be compared to measured fcs (residual scoring). Deviation: based on a leave-one-out validation each gene is assigned a background distribution of deviations. A median absolute deviation based score (padscore) is derived to estimate whether a prediction is better or worse than expected.

For 13 selected experiments, we combine arrays into sample groups (e.g., tumor vs. normal samples). This is required for differential expression analysis (see Section 2.3) that compare the expression of genes between different sample groups. Some experiments, for instance time course measurements, cannot be subdivided into sample groups but are used as background experiments. Overall, 62 sample groups have been analyzed. Comparisons are performed as one sample group versus the others from the same experiment (i.e., one comparison is performed for experiments with n = 2 sample groups and n comparisons otherwise). We thus conducted 59 comparisons in total.

2.2. Basic protocol

Padesco uses a two step approach for the selection of candidate genes from expression measurements. Prior to the application of Padesco, expression patterns must be trained on a background set bs of experiments (see Section 2.1). We apply SVR to train one model for each gene using bs (see Sections 2.4 and 2.5). Predicted labels are within-experiment fold-changes of this gene. Training features are the fold-changes of all other genes. For a new experiment (not contained in bs), Padesco selects genes by two consecutive filter steps. First, the measured genes are analyzed for differential expression (see Section 2.3) based on Wilcoxon's rank sum test (Wilcoxon, 1945). In general, any differential expression approach can be applied as Padesco does not rely on a particular method. The novel second filter step is based on an analysis of the trained regression models. Here we discard genes that conform to the patterns learned from the background experiments. We analyze the gene prediction errors using residual scoring (see Section 2.7) by comparing its predicted against the observed fold-changes. We then assess its pattern conformance in terms of the distribution of its residuals across the LOOCV as described below. We show our basic work flow in Figure 1.

To arrive at a background distribution of errors for each gene, we perform a leave-one-out cross-validation omitting each experiment once. Each fold induces |G| models. The prediction performance can thus be evaluated independently for each gene in each experiment. The background-training set for an (experiment e, gene g)-pair contains all but this experiment using the gene as dependent variable. Once trained, the application of Padesco involves only one prediction per gene using the model trained on all known experiments.

2.3. Differential analysis

We apply a common differential microarray analysis setup as primary filter step. Different experimental sample groups are compared by the Wilcoxon rank sum statistic. For a given gene, a p-value is computed for each sample group comparison as measure of significance of the differential expression. P-values are corrected for multiple testing using the procedure of Benjamini and Hochberg (1995). In this primary filter step, we assume genes as differentially expressed if they exhibit significance-level α of 0.01.

2.4. Support vector regression

Padesco is based on the training of predictive regression models using SVR (Vapnik, 1998; Smola and Schölkopf, 1998) (see Section 2.5). We focus Support Vector Machines (SVMs), which have acquired general acceptance for microarray applications and have been used for a wide range of tasks (Vapnik, 1998; Schölkopf and Smola, 2001), in particular, array and tissue classification (Furey et al., 2000; Chen et al., 2007), They have been shown to yield very competitive results for applications in molecular biology (Schölkopf et al., 2004). In their most common formulation, SVMs work on a set D of real-valued vectors with associated labels \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$L , D = \{ \langle x , l (x) \rangle \mid x \in {\mathbb R}^n , l (x) \in L \}$$ \end{document} . For SVR, the labels are real numbers ( \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$L = {\mathbb R}$$ \end{document} ). The ε-SVR estimates a function \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$f : {\mathbb R}^n \rightarrow L$$ \end{document} such that at most ε deviation (|f (x) − l(x)| < ε) from the correct values of all training data is allowed. ν-SVR is an extension of ε-SVR which estimates an optimal value for ε. The ν parameter controls the number of support vectors and points outside of the ε-insensitive tube. A parameter C controls how strong deviations from the optimal model are penalized. In our experiments we apply the libSVM implementation of ν-SVR (Chang and Lin, 2001). A linear kernel is used in our experiments. We examined the linear kernel as well as the radial basis kernel. Final parameters have been chosen as (C,ν) = (100, 0.2) using an exhaustive grid search combined with leave-one-experiment out (LOO) for hyper-parameters \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$C \in \{ 10^{i} \mid i = - 2 , - 1 , \ldots , 4 \}$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\nu \in \{ 0.2 , \ldots , 0.8 \}$$ \end{document} . This initial screening has shown that on average the linear kernel provides similar performance to the radial basis kernel (screening \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\gamma \in \{ 10^{i} \mid i = - 4 , - 3 , \ldots , 1 \}$$ \end{document} ) and a stable prediction performance across a wide range of parameters in our setting rendering an additional CV unnecessary in our setting. We will now show how to use SVR to train models which are capable of predicting fold-changes from other genes.

2.5. Model building

We derive SVR models to predict gene expression fold-changes. A model to predict the fold-change of a gene is trained on fold-changes of the other (n −1) genes in a CV setting skipping an experiment at a time. Here, the linear kernel enables the immediate training on data sets containing missing values. We additionally examined an orthogonal coding scheme (Wang et al., 2006)—0-imputation and average-imputation—but neither increased the average performance of the method. Since imputation methods may introduce additional errors, we omit instances carrying missing values (mis) as a label from both prediction and training. Raw data is represented as a matrix D of expression values \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$d_{ij} \in {\mathbb R} \cup \{ mis \}$$ \end{document} with p rows (arrays) and n columns (genes). We denote the set of genes \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$G = \{ g_1 , \ldots , g_n \}$$ \end{document} and the set of arrays \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$A = \{ a_1 , \ldots , a_p \}$$ \end{document} . Each array \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$a \in A$$ \end{document} belongs to an experiment \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$e (a) \in E , E = \{ e_1 , \ldots , e_k \}$$ \end{document} where each experiment e divides into sample groups \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$S_e = \{ s^e_1 , \ldots , s^e_{\mid S_e \mid} \}$$ \end{document} . Thus, each array maps to one condition \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$s_e ( a ) \in S_e$$ \end{document} . A single measurement is real-valued and written as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$m^g_{a}$$ \end{document} for gene \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$g \in G$$ \end{document} in array \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$a \in A$$ \end{document} . Padesco is trained on a set of fold-change vectors derived from arrays \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$a , b \in A , a \neq b$$ \end{document} from the same experiment e(a) = e(b) but different conditions, i.e., s_e(a) ≠ s_e(b). The fold-change for gene g between array a and array b is computed as follows: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} f^g_{a , b} = \begin{cases} mis , \ {\rm if} \ (m^g_a = mis) \ {\rm or} \ (m^g_b = mis) \\ log_2 \left( \frac {m^g_a + c } {m^g_b + c} \right) \ {\rm otherwise}. \end{cases} \tag {1} \end{align*} \end{document}

By adding a constant c, small expression values up to an estimated noise level will not lead to extreme fold-changes. The value of c was screened similar to SVR hyper-parameters during the initial parameter screening and thus set to 50. In a LOOCV, models are trained for the prediction of fold-changes of a gene g in an experiment e. For target gene g with associated label l(g), the training is done using pairwise fold-change vectors of arrays within the same experiment. The induced fold-changes \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$f^g_e$$ \end{document} constitute the target values. The experiment e to be predicted is omitted completely during training, i.e., all its sample groups. Note that this setting avoids over-fitting more rigorously. It enables the estimation of prediction performance. The LOOCV results in |E| = 25 models for each gene and |G|*|E| models overall. Note that the factor |E| is only relevant for initial training purposes. The prediction requires the calculation of the prediction value for |G| genes using existing models only. For our validation, we obtain a model for g in each experiment e.

2.6. Fold-change prediction

We now simulate the real-world setting: the experiment e has never been seen during training, and we like to do a prediction for a gene of interest g. First, we calculate the fold-changes for all genes among all t pairs in this particular experiment. This corresponds to a vector \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$f_i^h$$ \end{document} for each \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$h \in G$$ \end{document} and each experiment pair \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$i = 1 \ldots t$$ \end{document} . We are especially interested in the values \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\rm\bf f}^g = \{ f^g_i \mid i = 1 \ldots t \}$$ \end{document} holding the measured fold-changes for g and predict them using the model for g. This yields \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\rm\bf f}^{\prime g} = \{ f^{\prime g}_i \mid i = 1 \ldots t \}$$ \end{document} constituting the vector of all predicted fold-changes for g in experiment e.

2.7. Scoring performance deviation

We now have a measured fold-change vector f^g and a predicted fold-change vector f^′g for an experiment e for a gene g. We compare them using the uncentered Pearson correlation ρ_g,e. Given the two n-dimensional vectors f^g and f^′g it is calculated as: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} \rho_{g , e} = \frac {\big( \sum_{i = 1}^{n} f^g_i \cdot f^{\prime g}_i\big)} {\sqrt {\sum_{i = 1}^{n} {f^g_i}^2 \sum_{i = 1}^{n} {f^{\prime g}_i}^2}} \tag {2} \end{align*} \end{document}

We also compute a discretized version of f^g. We set all fold-changes above a threshold t_f to 1 and all below to 0. Thereby, we can compute an area under curve (AUC) value by varying t_f for f^′g and leave the threshold for f^g fixed at 2.

We now know how well a gene performs in a single experiment. Yet, we cannot say whether this is more or less than we expected. To arrive at a deviation of known patterns we compute the empirical distribution D^g of all \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\{ \rho_{g , e} \mid e \in E \}$$ \end{document} . The deviation for an experiment e and gene g is expressed in units of median absolute deviations (padscore) with respect to this distribution. Given the median med^g of D^g and the corresponding median absolute deviation mad^g the padscore is given by: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} (med^g - \rho_{g , e}) / mad^g \tag{3} \end{align*} \end{document}

In order to be retained by Padesco genes must simultaneously satisfy the significance level of differential expression in a specific sample group given by a certain α-level as well as a minimum padscore. These genes are referred to as specific. Remaining unspecific genes are significantly regulated, but could be predicted by the SVR models.

2.8. Permutation test for the evaluation of Padesco

Gold standards on the experiment-specific expression of genes are not available. We use a permutation test to simulate genes deviating from their common behavior. By copying (spiking) the expression values from a significant gene g⁺ to an insignificant gene g^- within an experiment e, we force genes to violate their common behavior and trivially become significant. We have two choices in parameters here. First, we can choose a z-score level t_z for significance. Second, we can choose a padscore level t_p to select interesting genes. The following permutation test selects these thresholds and estimates the associated performance for spike-in controls. We sample from the significant genes, and spike into the insignificant genes. We then recompute the padscore for the previously insignificant gene. This process is repeated s times where s is the number of significant genes. SPIKE denotes the set of spiked genes. After all repeats have been computed, we obtain sensitivity = tp/(tp + fn) and precision = tp/(tp + fp) and repeat the evaluation for all possible thresholds t_z and t_p in the experiment. A gene g's recomputed padscore p determines whether it is tp, fn or fp based on these thresholds (Table 1).

Table 1.

Classification Assignment for the Evaluation

Type	Abbreviation	Condition
true positive	tp	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$g \in SPIKE \land p \ge t_p$$ \end{document}
false positive	fp	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$g \not \in SPIKE \land p \ge t_p$$ \end{document}
false negative	fn	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$g \in SPIKE \land p < t_p$$ \end{document}

A threshold on z-score (t_z) and padscore (t_p) is chosen. After spiking (see Section 2.8) a gene g's recomputed padscore p determines its type.

3. Results

3.1. Evaluation of expression fold-change predictions

We use the uncentered correlation to measure how well the regression models can predict expression fold-changes of gene g in experiment e. The prediction performance is significantly better than random for the majority of genes: the uncentered correlation achieved an average value of ρ_u = 0.7 in our experiment. 91% of the predictions exhibit a ρ_u >0. After discretization we achieve an AUC of 81% on 15.7% cases with a fold-change of more than two. We argue that the remaining specific candidates that cannot be predicted well are particularly interesting because they exhibit an experiment-specific expression that could not be learned from the training data.

3.2. Prediction of fold-changes for individual genes

Gene signatures, i.e. sets of genes that are differentially regulated between different cellular states are frequently published along with microarray experiments. Such signatures are expected to, for example, yield diagnostic markers that could help to differentiate between healthy and sick individuals. Here, we examine a gene signature that has been compiled by Armstrong et al. (2002) to distinguish a particular chromosomal translocation involving the MLL (mixed-lineage leukemia) gene from the regular ALL (acute lymphoblastic leukemia) genotype. The MLL translocation is significant as it frequently leads to an early relapse after chemotherapy. In Figure 3, we compare our predictions against the experimentally determined expression fold-changes for this gene signature. For all genes, the direction of differential expression can be correctly derived from our predictions, although the values of the measured fold-changes are not reproduced exactly. Similar results have been obtained for other published signatures.

FIG. 3.

Scatter plot of predicted versus measured fold-changes. Microarray studies frequently derive gene signatures, i.e. sets of differentially expressed genes discriminating between experimental conditions. Gene signatures can be predicted well by our SVR models, as shown here for a gene signature distinguishing ALL and MLL leukemic genotypes published along with the data set of armstrong-mll ([Armstrong et al., 2002]). For every gene, our SVR models predict expression changes between ALL and MLL correctly, although the precise values of the measured fold-changes are not reproduced exactly. A gene is depicted as a single point, i.e. as average of all fold-changes of the given gene (see Section 2.5) over the array-pairs comparing the conditions ALL and MLL.

3.3. Permutation test based evaluation

Padesco filters genes based on a standard differential expression z-score (Wilcoxon test, see Section 2.3) and a novel padscore (see Section 2.7) indicating experiment-specific expression. This second score indicates whether genes can be predicted less well than expected from the training (background) set of experiments. They are selected due to their padscore since they do not conform to their trained patterns. Cutoffs on the two scores are required for the selection of specific candidate genes that exhibit differential as well as experiment specific expression. The permutation test introduced in Section 2.8 generates artificial pattern deviations through spiked genes. As shown in Figure 4, at a padscore cutoff of 2.0 specific candidates are accurately detected (85% precision). The precision increases for differentially expressed genes (z >3). Based on the evaluation we picked a padscore threshold of 2.0 and a z-score threshold of 3.0 to receive a moderate number of candidates exhibiting a high precision. Thereby, we selected some 250 specific candidates.

FIG. 4.

Precision (percent, colormap) versus number of detected genes (log base 2, contour). As gold standards are not available, we estimate the performance of specific candidate gene detection by a permutation test. This test evaluates how well known spike-in controls can be recovered by Padesco for arbitrary z-score (differential expression) and padscore (experiment specific expression) thresholds. Specific candidate genes can be reliably identified (85% precision) using a padscore above 1.5 even if they exhibit only moderate levels of differential expression (z-score <4). By combined z-score and padscore thresholds candidate gene lists can efficiently be reduced for follow-up studies. Analyzed here are 59 condition comparisons from 13 experiments. At the chosen padscore (2.0) and z-score (3.0) thresholds, some 250 specific candidates (contour) are detected by Padesco.

3.4. Specific candidates

In this section, we discuss sample results in two experiments that examine prostate cancer (Singh et al., 2002; Welsh et al., 2001) and one study on leukemia (Armstrong et al., 2002). As a further example, we provide results from a Toxoplasma gondii infection study (see Fig. 5) by Chaussabel et al. (2003). Padesco does not aim to whole relevant pathways but allows to focus on a small subset of interesting genes for further analysis. Vogelstein and Kinzler (2004) discuss key pathways which are likely to be disturbed to promote cancer in almost any cell type. Development of cancer is strongly coupled to the perturbation of one or more such pathways. The interleukin 2 pathway has been shown to be deregulated in many cancer types and was also described to be involved in prostate cancer. IL2RB dimerization with the α-subunit leads to a higher affinity towards interleukin-2. IL-2 treatment was previously shown to lead to reduced prostate tumor growth in rats (Hautmann et al., 1999; Moody et al., 1994). Another cancer therapy using IL-2 has been developed by Otter et al. (2008).

FIG. 5.

Common regulation and experiment deviation. The heat map shows a cluster of genes which are usually correlated (pairwise Pearson's Correlation above 0.8, “yoon-p53 (Yoon)” [Yoon et al., 2002] is given as reference). Here, AUH with a padscore of 5.09 is detected as interesting in “chaussabel-parasite (Cha)” [Chaussabel et al., 2003].

Eicosanoids are known to interact with immune messengers like interleukins. In the first experiment by Welsh et al. (2001), tumor samples were compared against normal and HUVEC (Human Umbilical Vein Endothelial Cells) samples, where 27 genes have been detected as differentially expressed. For an initial screening of functional significance we apply a gene ontology over-representation analysis (DAVID [Dennis et al., 2003; Huang et al., 2009]) on the differentially expressed genes (i.e., without padscore filter). As for Welsh et al. (2001), a screening for significance (Benjamini-Hochberg corrected scores, α = 0.01) shows no significant enrichment, yet 3 genes (IPR, EPR3R, and CYT450J) are found to belong to the eicosanoid metabolism (p = 0.08). With padscore filter, Interleukin 2 receptor β (IL2RB) is the only gene found to be interesting in this experiment. Padesco reported an unusual expression of IL2RB in the tumor samples, which could explain the decoupling of eicosanoid pathway members from IL2RB regulation.

The second examined experiment on prostate cancer is described in Singh et al. (2002). Here, 60 differentially expressed genes were identified. DAVID analysis shows no significant over-representation. After Padesco-filtering, 4 genes remain that we discuss below. HCK (padscore = 7.2), an src related tyrosin kinase, is most interesting in terms of the padscore. Smith et al. (2001) describes its association with gpl130 and the formation of a complex with IL-6R which promotes high affinity binding of IL-6. In prostate cancer, IL-6 is a key protein. It has been suggested to contribute to prostate cancer progression towards an androgen-independent state. We observe MGC17330 (PIK3IP1, padscore = 2.6) to be the second padscore relevant gene. It is a negative regulator of PI3K. Src kinases are upstream mediators for the PI3K signaling pathway with important roles in proliferation, migration and survival. It is described to be a tumor suppressor in heptacellular carcinomas (Fabregat, 2009). It shows only a weak positive fold-change in this experiment which may explain why it fails as a suppressor here. Mutations within the PI3K pathway have been described by Vogelstein and Kinzler (2004) to be involved in a number of tumor types. The third gene found is ATP2B1 (PMCA1, padscore = 2.6). It is a Ca²⁺ ATPase subunit. An unusual reduction in gene expression can be observed in our data. This reduction is discussed by Roderick and Cook (2008). Ca²⁺ pumps are likely to provide good therapeutic targets for anticancer drug development as suggested by Monteith et al. (2007) and Roderick and Cook (2008). They emphasize the role of Ca²⁺ (intracellular calcium) in the life-and-death decisions of the cell such that disturbed control of Ca²⁺ may lead to an inappropriate cell fate. The fourth candidate—C2orf3 (padscore = 2.03)—has also been identified of an approach of Hong et al. (2008). They analyzed three prostate cancer sets, but since more than a hundred genes are identified and C2orf3 is no top-ranking gene, this candidate has not been subject to further discussion. The transcription repressor binds GC-rich sequences of the epidermal growth factor receptor, beta-actin and calcium-dependent protease promoters.

Slightly below the padscore threshold STK38 (padscore = 1.84) has been described to exhibit cancer specific alternative splicing variants (Lee et al., 2003). In Rozanov et al. (2008), it is suggested as a part of the downstream network of MT1-MMP, a key regulator linked to tumorgenesis and metastasis. Similarly, CPD (padscore = 1.47) is a metallo carboxypeptidase family enzyme. It is described to have shown lower levels of gene expression in colon carcinomas. We observe a similar downregulation in our data. The well-known PSMA (or PSA) also exposes carboxypeptidase activity associated with increased invasiveness of prostate cancer (Ghosh et al., 2005). Similarily, GTF2B/TFIIB (padscore = 1.46), the general transcription factor 2B, plays a major role in the transcription of eukaryotic genes. Minucci and Pelicci (2006) suggested Histone deacetylases (HDACs) as promising targets in cancer therapy, partly due to their DNA binding capabilities. GTF2B/TFIIB exhibits auto-acetyltransferase function regulated by acetylation, while acetylation also impairs activities of enzymes involved in DNA metabolism and repair. DNA repair is a key mechanism which has to be bypassed to allow for tumor development due to Vogelstein and Kinzler, (2004). With padscore of 1.11, KLF6 is predicted only slightly worse than usually, yet with a much smaller offset than HCK. The Krueppel-like factor 6 is a well-known tumor suppressor gene.

The third data set by Armstrong et al. (2002) compares ALL to MLL. We can recover most specific genes originally published (some are filtered due to our criteria among all integrated experiments). An interesting exeption is LGALS1, which shows considerable expression in both the MLL and the AML samples, but not in ALL. Using a clustering procedure, Armstrong et al. (2002) describe it to be MLL specific. Our top candidate here does not meet the padscore = 2 requirement, yet RhoA (padscore = 1.99) is involved in cytokinesis and has been described as both pathogen target as well as regulator of oncogenensis (Burridge and Wennerberg, 2004).

Ordonez-Moran et al. (2008) associates RhoA with colon carcinoma (not in our background set). A second gene, ETS1 (padscore = 1.86) belongs to the familiy of ETS transcription factors and has been shown to regulate telomerase activity at gene transcription level. It constitutes a well-known oncogene with potential effects on telomer stability. In Yeoh et al. (2002), BCR-ABL samples 2 genes exhibit an extreme padscore of 4.8 and 5.9: PHLDA2 and PSMA6. While PSMA6 has recently been suggested as a new prognostic marker in acute monocytic leukemia (Chen et al., 2009), PHLDA2 (pleckstrin homolog-like domain family member 2) is located in a region considered to be an important tumor supressor gene region. The top gene among the identified candidates in the prostate cancer subtype of Butte et al. (2000) is SLC30A3 (padscore = 6.81), a zinc transfer transporter from the solute carrier family (prostate tissue in general shows an increased zinc content—around 10-fold higher than other tissues). SLC30A3 has been suggested to be a member of the apoptotic pathway in Ackland et al. (2007). Usually, SLC30A3 lowers intracellular zinc concentrations by mediating zinc efflux. Mouse TRAMP models suggest that both too high and too low zinc uptake have drastic effects on prostate tumor sizes thus this candidate's dysregulation could be crucial for tumor tissue. Prasad et al. (2010) even argue that an optimal zinc level is crucial as a protective instance against cancer development. DOOST (padscore = 8.53) is involved in the metabolism of glycoproteins and has been suggested to mediate processes associated with cell-adhesion or invasion (Hasegawa et al., 2002) and has been frequently described in the context of gastric cancer which was not yet included in our background set of experiments. Chaussabel et al. (2003) analyze diverse parasite infections on human macrophages and dendritic cells. We find CCR1 to be the most prominent gene (padscore = 9.29) in the Toxoplasma infection subgroup. Mice lacking this chemokine receptor CCR1 have shown dramatically increased mortality after Toxomplasma gondii infection (Khan et al., 2001).

4. Discussion

Genes are not regulated individually (Lee et al., 2004; Ucar et al., 2007). Frequently, patterns of co- or anti-regulation can be observed, such that the up-regulation of a gene A is a good hint that another gene B will also be up-regulated while a third gene C will rather be down-regulated. The disruption of such patterns pinpoints genes with experiment-specific expressions. We call such genes specific candidates in contrast to the remaining unspecific candidates that exhibit only generic expression patterns. Padesco detects specific candidates by analyzing fold-change based co-expression patterns with SVR models trained on a background set of microarray experiments. After training, we select specific candidate genes via a two stage filter. The first filter step is a routine analysis of differential expression (significant genes). A novel second filter selects genes that show deviations from generic expression patterns predictable by linear models (interesting genes).

In order to avoid the predictions of false specific candidates, Padesco depends on a good prediction performance of the underlying SVR models. The prediction performance can be evaluated rigorously, as the prediction target experiment is excluded from training in a LOOCV setting, where all conditions of particular experiments are left out. This not only excludes condition specific but also experiment specific biases. We examined 4117 genes across 25 experiments consisting of 1437 individual microarrays. Predictions by Padesco are better than expected by chance in 91% of the cases. It has been suggested that gene signatures rarely help to identify the involved biological processes or the causal regulatory mechanisms (Segal et al., 2004). Hirsch et al. (2010) further argued that gene signatures frequently do not represent specific attributes of the measured biological conditions. We analyzed gene signatures published together with the corresponding microarray experiments. These signatures were selected by the authors of the corresponding studies to discriminate between experimental conditions (sample groups). We found that expression changes for genes in signatures are predicted well by our SVR models trained on other, unrelated experiments. An example is the signature distinguishing ALL and MLL (Armstrong et al., 2002). Although the ALL/AML signature certainly provides discriminating marker genes, it does not capture experiment specific expression patterns according to Padesco.

The extent of differential expression alone does not indicate experiment specific involvement of genes. Based on the prediction performance, we identified specific candidates genes that exhibit experiment specific expression (i.e., expression changes that cannot be explained—predicted—by our models). This analysis is related to co-expression studies and complements differential expression analysis. It enables us to focus on concise candidate lists for follow-up studies that consist of experiment-specific candidates only. We screened for filter thresholds and estimated Padesco's performance from permutation tests, as comprehensive gold standards for the experiment specific expression of genes are not available. This newly devised simulation approach suggests that specific candidates are identified reliably by Padesco (>85% precision at padscore >1.5) even if they show only marginal levels of differential expression. On the other hand, more than 90% of the genes selected by differential expression alone exhibit only generic expression patterns and could thus be excluded from further studies. Specific candidates are likely to represent characteristic features of the corresponding experimental conditions.

We evaluated Padesco selected genes for two data sets on prostate cancer. Besides interesting new candidates, we found several genes with a known involvement in the disease. Some of them, such as IL-2RB, have already been reported as promising drug targets. We demonstrated that such examples are more difficult to detect by differential expression analysis alone. Instead, differential expression tends to pick up genes that act similarly in other, biologically unrelated experiments. Thus, in combination with differential expression analysis, Padesco is a promising protocol for the detection and analysis of particularly distinctive features of microarray experiments.

References

Ackland

M.L.

, Zou

, Freestone

et al. 2007. Diesel exhaust particulate matter induces multinucleate cells and zinc transporter-dependent apoptosis in human airway cells. Immunol. Cell Biol., 85:617–622.

Armstrong

S.A.

, Staunton

J.E.

, Silverman

L.B.

et al. 2002. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat. Genet., 30:41–47.

Benjamini

, Hochberg

1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B Methodol., 57:289–300.

Burridge

, Wennerberg

2004. Rho and Rac take center stage. Cell, 116:167–179.

Butte

A.J.

, Tamayo

, Slonim

et al. 2000. Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc. Natl. Acad. Sci. USA, 97:12182–12186.

Chang

C.-C.

, Lin

C.-J.

2001. LIBSVM: a library for support vector machines. www.csie.ntu.edu.tw/∼cjlin/libsvm. 2011 August 15.

Chaussabel

, Semnani

R.T.

, McDowell

M.A.

et al. 2003. Unique gene expression profiles of human macrophages and dendritic cells to phylogenetically distinct parasites. Blood, 102:672–681.

Chen

, Wang

, Zhang

et al. 2009. Expression of genes psma6 and slc25a4 in patients with acute monocytic leukemia. J. Exp. Hematol. Chin. Assoc. Pathophysiol., 17:1168.

Chen

, Li

, Wei

2007. A multiple kernel support vector machine scheme for feature selection and rule extraction from gene expression data of cancer tissue. Artif Intell Med., 41:161–175.

10.

Dennis

, Sherman

B.T.

, Hosack

D.A.

et al. 2003. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol., 4:P3.

11.

Edgar

, Domrachev

, Lash

A.E.

2002. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleid Acids Res., 30:207–210.

12.

Eisen

, Spellman

, Brown

et al. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95:14863–14868.

13.

Fabregat

2009. Dysregulation of apoptosis in hepatocellular carcinoma cells. World J. Gastroenterol., 15:513.

14.

Furey

T.S.

, Cristianini

, Duffy

et al. 2000. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16:906–914.

15.

Garrett-Mayer

, Parmigiani

, Zhong

et al. 2008. Cross-study validation and combined analysis of gene expression microarray data. Biostatistics, 9:333–354.

16.

Gentleman

, Carey

, Bates

et al. 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol., 5:R80.

17.

Ghosh

, Wang

, Klein

et al. 2005. Novel role of prostate-specific membrane antigen in suppressing prostate cancer invasiveness. Cancer Res., 65:727–731.

18.

Hasegawa

, Furukawa

, Li

et al. 2002. Genome-wide analysis of gene expression in intestinal-type gastric cancers using a complementary DNA microarray representing 23,040 genes. Cancer Res., 62:7012.

19.

Hastie

, Tibshirani

, Sherlock

et al. 1999. Imputing missing data for gene expression arrays [Technical report] Department of Statistics, Stanford University: Palo Alto, CA.

20.

Hautmann

S.H.

, Huland

1999. Local intratumor immunotherapy of prostate cancer with interleukin-2 reduces tumor growth. Anticancer Res., 19:2661–2663.

21.

Hirsch

H.A.

, Iliopoulos

, Joshi

et al. 2010. A transcriptional signature and common gene networks link cancer with lipid metabolism and diverse human diseases. Cancer Cell, 17:348–361.

22.

Hong

, Lee

, Hong

et al. 2008. Extraction of Informative Genes from Integrated Microarray Data. Springer: Berlin.

23.

, Li

, Waterman

et al. 2006. Integrative missing value estimation for microarray data. BMC Bioinform., 7:449.

24.

Huang

D.W.

, Sherman

B.T.

, Lempicki

R.A.

2009. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protocols, 4:44–57.

25.

Khan

, Murphy

, Casciotti

et al. 2001. Mice lacking the chemokine receptor CCR1 show increased susceptibility to Toxoplasma gondii infection. J. Immunol., 166:1930.

26.

Kim

, Golub

G.H.

, Park

2005. Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics, 21:187–198.

27.

Kostka

, Spang

2004. Finding disease specific alterations in the co-expression of genes. Bioinformatics, 20,Suppl 1:i194–i199.

28.

Lee

, Atanelov

, Modrek

et al. 2003. ASAP: the alternative splicing annotation project. Nucleic Acids Res., 31:101.

29.

Lee

H.K.

, Hsu

A.K.

, Sajdak

et al. 2004. Co-expression analysis of human genes across many microarray data sets. Genome Res., 14:1085–1094.

30.

Minucci

, Pelicci

2006. Histone deacetylase inhibitors and the promise of epigenetic (and more) treatments for cancer. Nat. Rev. Cancer, 6:38–51.

31.

Monteith

, McAndrew

, Faddy

et al. 2007. Calcium and cancer: targeting Ca²⁺ transport. Nat. Rev. Cancer, 7:519–530.

32.

Moody

D.B.

, Robinson

J.C.

, Ewing

C.M.

et al. 1994. Interleukin-2 transfected prostate cancer cells generate a local antitumor effect in vivo. Prostate, 24:244–251.

33.

Ordonez-Moran

, Larriba

, Palmer

et al. 2008. RhoA-ROCK and p38MAPK-MSK1 mediate vitamin D effects on gene expression, phenotype, and Wnt pathway in colon cancer cells. J. Cell Biol., 183:697.

34.

Otter

W.D.

, Jacobs

J.J.L.

, Battermann

J.J.

et al. 2008. Local therapy of cancer with free IL-2. Cancer Immunol., 57:931–950.

35.

Prasad

A.S.

, Mukhtar

, Beck

F.W.J.

et al. 2010. Dietary zinc and prostate cancer in the TRAMP mouse model. J. Medicinal Food, 13:70–76.

36.

Prieto

, Rivas

M.J.

, Sánchez

J.M.

et al. 2006. Algorithm to find gene expression profiles of deregulation and identify families of disease-altered genes. Bioinformatics, 22:1103–1110.

37.

Rhodes

D.R.

, Barrette

T.R.

, Rubin

M.A.

et al. 2002. Meta-analysis of microarrays: interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Res., 62:4427–4433.

38.

Roderick

H.L.

, Cook

S.J.

2008. Ca²⁺ signalling checkpoints in cancer: remodelling Ca²⁺ for cancer cell proliferation and survival. Nat. Rev. Cancer, 8:361–375.

39.

Rozanov

, Savinov

, Williams

et al. 2008. Molecular signature of MT1-MMP: transactivation of the downstream universal gene network in cancer. Cancer Res., 68:4086.

40.

Schölkopf

, Smola

A.J.

2001. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (Adaptive Computation and Machine Learning) The MIT Press: Cambridge, MA.

41.

Schölkopf

, Tsuda

, Vert

2004. Kernel Methods in Computational Biology (Computational Molecular Biology) The MIT Press: Cambridge, MA.

42.

Segal

, Friedman

, Koller

et al. 2004. A module map showing conditional activity of expression modules in cancer. Nat. Genet., 36:1090–1098.

43.

Singh

, Febbo

P.G.

, Ross

et al. 2002. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1:203–209.

44.

Smith

, Hobisch

, Lin

et al. 2001. Interleukin-6 and prostate cancer progression. Cytokine and Growth Factor Rev., 12:33–40.

45.

Smola

, Schölkopf

1998. A tutorial on support vector regression [Technical Report NC2-TR-1998-030] NeuroCOLT2.

46.

Ucar

, Neuhaus

, Ross-MacDonald

et al. 2007. Construction of a reference gene association network from multiple profiling data: application to data analysis. Bioinformatics, 23:2716–2724.

47.

Vapnik

V.N.

1998. Statistical Learning Theory. Wiley-Interscience: New York.

48.

Veer

V.L.

, Dai

, van de Vijver

M.J.

et al. 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415:530–536.

49.

Vogelstein

, Kinzler

K.W.

2004. Cancer genes and the pathways they control. Nat. Med., 10:789–799.

50.

Wang

, Li

, Jiang

et al. 2006. Missing value estimation for DNA microarray gene expression data by Support Vector Regression imputation and orthogonal coding scheme. BMC Bioinform., 7:32.

51.

Welsh

J.B.

, Sapinoso

L.M.

, Su

A.I.

et al. 2001. Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer. Cancer Res., 61:5974–5978.

52.

Wilcoxon

1945. Individual comparisons by ranking methods. Biometrics Bull., 1:80–83.

53.

Yeoh

E.J.

, Ross

M.E.

, Shurtleff

S.A.

et al. 2002. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1:133–143.

54.

Yoon

, Liyanarachchi

, Wright

F.A.

et al. 2002. Gene expression profiling of isogenic cells with different TP53 gene dosage reveals numerous genes that are affected by TP53 dosage and identifies CSPG2 as a direct target of p53. Proc. Natl. Acad. Sci. USA, 99:15632–15637.