Abstract
Abstract
Microarray technology is widely recognized as one of the most important tools when it comes to understanding genetic expression in biological processes. In light of the thousands of gene expression level measurements (including measurements across a number of conditions), identifying differentially expressed genes necessarily implies data mining or large-scale multiple testing procedures. To date, advances with regard to this field have been multivariate-descriptive or inferential-univariate in nature and therefore have important limitations regarding the biological validity of detected genes. In the present article, we present a new multivariate inferential method designed to detect active differentially expressed genes in gene expression data. The proposed method estimates false discovery rates using artificial components. Our method excels when applied to the most common gene expression data structures, providing new insights into differentially expressed genes. The method described herein was programmed in an R-Bioconductor package called acde that has been available since 2015.
1. Introduction
Today, gene expression technologies including, but not limited to, microarray analysis represent essential tools for understanding genetic expression in biological processes (Yuan and Kendziorski, 2006). Gene expression studies represent a sound approach, to tackling genomic problems (Simon et al., 2003). Research into these technologies, dating back to the 1990s, has gained traction in response to the overwhelming amount of data made available over the past few decades. New statistical methods for the detection of differentially expressed genes were born out of challenges stemming from an explosion in data quantity with difficult structure (high variability, few replicates, etc.). As microarrays contain measurements of thousands of genes' expression levels across several conditions, statistical analysis of a microarray experiment necessarily involve data mining or large-scale multiple testing procedures to limit false positives, that is, genes that are detected as differentially expressed but are not really differentially expressed.
Most univariate methods have concentrated on this aspect without taking into account the multivariate aspect of gene expression data, conducing to the detection of inactive genes, because expression levels are only compared with the same gene in another condition in a gene-by-gene approach. The overall variability, due to different activity levels of all genes, is often ignored in these univariate approaches. Inferential univariate approaches employ parametric models; for example, Analysis of Variance (ANOVA) (Kerr et al., 2000) and Hidden Markov Chains (Yuan and Kendziorski, 2006), or in nonparametric multiple testing procedures controlling the family-wise error rate (Dudoit et al., 2002; Benjamini and Yekutieli, 2001) or the false discovery rate (FDR; Storey and Tibshirani, 2001; Tusher et al., 2001; Storey, 2002, 2003; Taylor et al., 2005; Vélez et al., 2014). Regardless, these methods fail to account for the data's multivariate structure, which would take into account the above-mentioned gene activity levels.
Some advances in multivariate methods for gene differential expression have appeared but are essentially descriptive (Alizadeh et al., 2000; Ross et al., 2000; Ospina and López-Kleine, 2013). Xiong et al. (2014) developed a method for testing gene differential expression that represents the expression profile of a gene by a functional curve based on a functional principal components (FPC) space and tests by comparing FPC coefficients between two groups of samples. Yet, this method suffers one primary drawback: it is based on functional statistics and thus requires a staggering number of replicates. To the best of our knowledge, no new multivariate methods have been published for differential gene expression.
Consequently, multivariate-descriptive and univariate-inferential methods are best understood as two parallel paths whose integration would provide significant benefits in terms of identifying differentially expressed genes in microarray experiments, taking into account overall variability and including a measure of confidence such as p value or FDR.
To bring these two aspects together and avoid the subsequent challenges associated with employing each approach independently, we have laid out an innovative combination of a gene-by-gene multiple testing procedure and a multivariate descriptive approach. The resulting multivariate-inferential method proves suitable for microarray data. However, in light of the fact that the methods described herein do not require any assumptions related to variable distributions, they can be applied to other gene expression data, such as RNA sequencing. An exact interpretation of overall and differential expression is achieved, and an FDR estimation is provided based on a previous methodology (Storey and Tibshirani, 2001).
The strategy we propose initially computes artificial components inspired by the principal components (PCs) of a principal component analysis (PCA) representing the extent to which each gene is differentially expressed. Next, FDRs based on these components, as functions of certain thresholds, are estimated using bootstraps. Then, bootstrap confidence bounds for FDRs are established. We illustrate our method using a publicly available microarray data set comparing healthy and pathogen-inoculated tomato, and, for purposes of methodology assessment, we compare results with those previously obtained by Cai et al. (2013) on the same data set. Moreover, we validate the utility of our method on 26 further publicly available data sets on neurodegenerative diseases. Our method has been implemented in the Bioconductor acde package and is available since 2015.
2. Results
Inertia ratios for the tomato data set showed 60 hours after inoculation (hai) to be the time point at which differential expression was most salient and some signs of differential expression at 36 hai, as reported earlier (Restrepo et al., 2005), when the same data were analyzed using the univariate method significance analysis of microarray (SAM; Tusher et al., 2001). Data sets at 0 and 12 hai have virtually no information regarding differential expression. The acde was applied to the time point at 60 hai only. Gene distribution is consistent with the expected behavior assuming Biological Scenario 2 holds (see Section 4), in which most of the genes have low expression levels. Indeed, most genes are located near the origin, and only a small proportion are located toward the far right of the plot, implying that a small proportion of the genes were actually expressing themselves at the time of sampling. Note that no genes are located to the far left of the plane (Fig. 2).
To identify differentially expressed genes in the PI data set, we estimated the FDR and its 95% confidence upper bound per Algorithms 1 and 2. We used

Estimated false discovery rate for the tomato microarray data set 60 hai. hai, hours after inoculation.
2.1. Methodology comparison
Cai et al. (2013) applied SAM and a two-factor (cultivar and time point) ANOVA to identify differentially expressed genes between IL6-2 tomato plants forming a part of the PI data set and, on the other, a different near isogenic tomato line (M82). Although the authors' principal objectives diverged from our own, results obtained on IL6-2 allow us to compare the results (Table 1; Figs. 2 and 3).

Projection of genes on the artificial components and differentially expressed genes in the PI microarray data set 60 hai detected by acde. See also Table 1.

Comparison of Differentially Expressed Genes Between acde and Previous Reported Results
While our method found 2 upregulated and 53 downregulated genes unidentified by Cai et al. (2013), they still identified a much larger number of differentially expressed genes. This is a consequence of row standardization as required by ANOVA and SAM and their corresponding univariate point of view. Indeed, when row standardization is performed, the inherent scale of genetic expression in the microarray is lost for further analysis.
To see this, note that a very large number of the genes identified by Cai et al. (2013) lie very close to the origin of the artificial plane in Figure 3, which means that their overall expression levels are very close to the average overall expression level in the microarray experiment. If Biological Scenario 2 holds, this is an important error because genes with no expression (those near the origin) are being identified as differentially expressed. Thus, the value of a multivariate point of view and the reason why our method should be preferred when Biological Scenario 2 is more likely to hold.
The results obtained on the 26 data sets on neurodegenerative diseases show that, in a general way, lower FDRs were obtained with acde, and they were much easier to tune when active differentially expressed genes are searched (Supplementary Algorithm 1).
3. Discussion
The present article outlines a multivariate approach for the identification of differentially expressed genes in gene expression data. Despite relying on a general probabilistic model, our methodology's applicability is rooted in the key biological and technical assumptions for gene expression data summarized in Biological Scenario 2. If such assumptions are correct, as is generally the case for microarray data, a multivariate approach is needed to obviate the (mis)identification of unexpressed genes as differentially expressed. Surprisingly, though, no multivariate inferential approach suitable for Biological Scenario 2 has been previously proposed.
Our methodology departs from Storey and Tibshirani's (2001) work on FDR estimation and the construction of two artificial components. These artificial components resemble the data's PCs, which generally reflect gene expression level in the first PC and differences between conditions in the second PC as already pointed out by the Kim et al. (2007) and their proposed parametric covariance matrix data transformation. Nevertheless, the here proposed artificial components offer an exact interpretation in terms of overall and differential genetic expression because they are constructed artificially directly computing mean expression differences. Storey and Tibshirani's (2001) research grants insight into both the validity of Biological Scenario 2 and the behavior of differential expression processes.
Here, we have suggested additional assessments that establish greater statistical confidence with respect to the results obtained via our methodology, such as BCa upper confidence bound for FDR. In sum, these complements represent the final pieces of an integral strategy aimed at identifying differentially expressed genes in microarray data.
At 60 hai, our analysis of the PI data set found 32 defense-related genes identified as upregulated and 94 primary metabolic function-related genes identified as downregulated. Comparing the above-mentioned findings with previous results (Cai et al., 2013), we observe that traditional methods place large number of differentially expressed genes close to the origin of the artificial plane. This led us to deduce two important problems in methods based on univariate statistics: data regarding overall gene activity in the inherent genetic expression scale were lost and genes with true zero expression levels were possibly wrongly categorized as differentially expressed.
On balance, it is safe to say that univariate-oriented methods identify more genes as differentially expressed, but FDRs are generally higher, a problem associated with differences in the biological assumptions underpinning this type of method and corresponding implied definitions of differential expression. Therefore, an examination of the numbers does not paint any method as more powerful in terms of a multiple testing procedure. That being said, when a study looks to perform an intervention on differentially expressed genes, our methodology demonstrates greater value insofar as it prevents intervention on unexpressed genes. The detection of active differentially expressed genes is also important when the procedure is used as a filter for the construction of a network. Genes with very low expression are not desired on a network.
4. Methods
4.1. Artificial components
Let
where zj is the j-th column of
Usually, in a PCA of microarray data, the first PC will mainly explain gene expression and the second one will mainly explain differential expression between conditions. However, to perform multiple tests regarding genetic differential expression, we need new components that exactly capture the genes' overall and differential expression levels. We call these components artificial because they do not arise naturally from solving a maximization problem (as do the PCs in a PCA). Instead, they are constructed deliberately to capture specific features of the data and thus have an exact interpretation. When Biological Scenario holds, they almost overlap with the first two PCs of a PCA because the first component tends to explain variability due to differences in gene expression between genes (mean) and the second one explains differences between conditions (mean differences). Their construction is as follows: For
For p1 treatment and p2 control replicates with
Now,
However,
Under Biological Scenario 2, a gene can have large positive or large negative values of
4.2. A word on scale
As a rule, methods that control the FDR in microarray data (Benjamini and Yuketeli, 2001; Storey and Tibshirani 2001; Tusher et al., 2001) use test statistics of the form
where c0 is a convenient constant (usually 0) or monotone functions of s as p values. However, when dividing by the standard error in Equation (4), we lose the inherent genetic expression scale that lies within the data, and genes with very low expression levels will be detected as differentially expressed. Nevertheless, the most common scenario is that only a very small fraction of the genes have high activities (Biological Scenario 2) as defined below.
If Scenario 1 holds, there is no relevant information in the differences between the scales of the rows in the data, and row standardization is in order. If, on the contrary, Scenario 2 seems more appropriate, the information contained in the differences between the scales of the rows is relevant for it allows to asses which genes had actual positive expression levels when the sample was taken.
Likewise, if Scenario 2 holds, the data for the genes that were not expressing themselves when the sample was taken are only the results of external sources of variation. Those genes, having true zero expression levels in both treatment and control replicates, cannot be classified as being differentially expressed. However, because n is very large and there might be systematic sources of variation (changes in printing tips, background intensity, or other aspects of microarray technology), it is very likely that a considerable number of those genes with no expression will be identified as differentially expressed when using statistics of the form in Equation (4).
We strongly believe that the first condition of Biological Scenario 2 is more reasonable in the context of microarray experiments. Additionally, there are several methods for normalization of gene expression data that remove systematic sources of variation other than control/treatment effects, without performing any kind of row standardization (Dudoit et al., 2002; Simon et al., 2003). These methods can be applied beforehand to assure the technical part of Biological Scenario 2. To conclude, row standardization should not be performed, to avoid identifying genes with no expression as differentially expressed, because this standardization eliminates natural gene differences making all gene expression levels very similar. If a scenario in between seems more likely, Scenario 2 should be favored.
4.3. Detection of differentially expressed genes with acde
4.3.1. Estimation
Our methodology aims at identifying differentially expressed genes in gene expression data while controlling the FDR at a desired level (Storey and Tibshirani, 2001; Storey et al., 2004). For controlling the FDR, researchers can turn to one of the two distinct approaches to multiple hypothesis testing: while one establishes a desired FDR level and estimates a rejection region, the other proceeds in reverse order (establishes a rejection region before estimating the FDR). Storey et al. (2004) showed that these two approaches are asymptotically equivalent. More specifically, they define:
. Then, rejecting all null hypothesis with
proves tantamount to establishing FDR control at level
The following observations regarding Algorithm 1 may be made:
FDR is only estimated for t at the values at which the number of rejected hypotheses is actually affected. More specifically, let To streamline computation, we set The estimation of Computation of the genes' Q-values (Storey, 2002, 2003) may provide useful insights.
is considered conservatively biases, that is,
(Storey and Tibshirani, 2001).
can be computed via bootstrap methods.
in step 2 may be carried out by using permutation or bootstrap estimates of the statistics' null distribution (Dudoit and Van Der Laan, 2007). Although permutation methods are more frequently used in the reference literature (Li and Tibshirani, 2013), we have opted for bootstrap estimates of the null distribution because they over greater ease of interpretation (Dudoit et al., 2008, p. 65). Whether permutation or bootstraps are used, results should prove very similar for large B (Efron and Tibshirani, 1993).
We applied Storey and Tibshirani's methodology as explained above, but using a new proposed multivariate statistic that is well suited for gene expression data. It is based on the large sample estimator for the FDR,
, presented for multiple testing procedures, but it uses
For more details, see Supplementary Algorithm 1.
4.3.2. Further assessments
As B, n, and p grow,
approaches from above both the FDR and the realized proportion of false positives among all rejected null hypothesis. In practice, however, because B, n, and p are finite, the control achieved using
is only approximate and some additional assessments are needed. Storey and Tibshirani (2001) suggested the use of a bootstrap percentile upper confidence bound for the FDR to provide a somewhat more precise notion of the actual control achieved but concluded that percentile upper bounds were not appropriate as they underestimated the actual confidence upper bound. We overcome this limitation by computing a BCa upper confidence bound (Efron and Tibshirani, 1993) for the FDR as shown in Algorithm 2. We find plots of
and the FDRs confidence upper bounds versus t to be very informative as to the actual FDR control achieved.
For more details, see Supplementary Algorithm 2.
To illustrate our method, we will analyze the data set from the Tomato Expression Database web site (http://ted.bti.cornell.edu/), experiment E022 (Restrepo et al., 2005; PI data set). Throughout the experiment, eight tomato plants (line IL6-2) in field conditions were inoculated with Phytophthora infestans and eight control plants were mock-inoculated with sterile water. Leaf tissue samples from each replicate were taken at 12 hours before and 12, 36, and 60 hai. We refer to 12 hours before inoculation as the 0 hai time point. RNA was extracted from each sample and then hybridized on a complementary DNA microarray (for more details of the experimental design and conditions of the study, see Cai et al., 2013). Expression levels were obtained for 13,440 genes.
To further validate the method, 26 data sets on neurodegenerative disease are composed of 4 data sets of amyotrophic lateral sclerosis, 3 Alzheimer's disease data sets, 7 Parkinson's disease data sets, 8 multiple sclerosis data sets, and four schizophrenia data sets. For details on accession numbers, see Supplementary File.
Availability
The here presented method is available as a Bioconductor package (acde): https://www.bioconductor.org/packages/3.3/bioc/html/acde.html, https://www.bioconductor.org/packages/devel/bioc/html/acde.html
Footnotes
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
