Differential Expression Analysis in RNA-seq Data Using a Geometric Approach

Abstract

Although differential gene expression (DGE) profiling in RNA-seq is used by many researchers, new packages and pipelines are continuously being presented as a result of ongoing investigation. In this work, a geometric approach based on Supervised Variational Relevance Learning (Suvrel) was compared with DEpackages (edgeR, DESeq, baySeq, PoissonSeq, and limma) in DGE profiling. The Suvrel method seeks to determine the relevance of characteristics (e.g., genes or transcripts) based on intraclass and interclass distances. The comparison was performed using technical and biological replicates. For technical replicates, we used receiver operating characteristic (ROC) analysis, while for biological replicates, we used robustness analysis. From the ROC analysis, we found that the geometric approach performed better than the DEpackages. In particular, for a reduced list of differentially expressed genes (DEG), this method had a remarkable advantage in ranking the most DEG (with a specificity ranging from 1 to 0.8). From the robustness analysis associated with biological replicates, we found that the geometric approach has comparable performance to the DEpackages. We conclude that the geometric approach had a slightly better overall performance than the other methods. Moreover, it is a simple method that does not make any assumption about the distribution associated with the RNA-seq data set. From this perspective, the relevance of this study was to show that a simple method can provide as good performance as more complex methods.

1. Introduction

RNA sequencing (RNA-seq) is a technique used in many genome-wide transcriptome studies, which employ a high-throughput next-generation approach characterized by qualitative measurements, whole-genome coverage, and single-nucleotide resolution (Wang et al., 2009; Sun and Zhu, 2012; Seyednasrollah et al., 2013; Dou et al., 2015). RNA-seq analysis commonly starts from a library of cDNA fragments associated with a population of mRNAs, which are sequenced with or without amplification, generating millions of short reads. Subsequently, the reads are computationally mapped to a reference genome or transcriptome, and summarized read counts are produced. The result of these steps is a count matrix, which is generated by counting the number of reads aligned to each genomic feature of interest (Li et al., 2011; Soneson and Delorenzi, 2013; Finotello and Di Camillo, 2014). Different studies can be performed from the summarized counts, such as single-nucleotide polymorphism discovery, identification of gene isoforms, post-transcriptional base modifications, and translocation events (Wang et al., 2009; Dou et al., 2015). However, one of the main objectives of RNA-seq studies is to identify differentially expressed genes (DEG) in distinct sample groups (Rapaport et al., 2013; Soneson and Delorenzi, 2013).

When differential gene expression (DGE) analysis is performed, some biases must be taken into account. A normalization step is therefore necessary to eliminate or reduce these effects (Bullard et al., 2010; Soneson and Delorenzi, 2013). Each method used for DGE analysis employs a particular type of normalization. The normalization method used in this study is reads per million (RPM). In this procedure, the read counts in a sample are divided by a scale factor, defined as the sum of the reads of the sample divided by 1 million.
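
The RPM procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rpm_normalize`, the genes-by-samples layout, and the toy counts are hypothetical and are not taken from the scripts used in this study.

```python
import numpy as np

def rpm_normalize(counts):
    """Scale a genes x samples count matrix to reads per million (RPM)."""
    counts = np.asarray(counts, dtype=float)
    # Scale factor per sample: total reads in that sample divided by one million.
    scale = counts.sum(axis=0) / 1e6
    # Divide every column (sample) by its own scale factor.
    return counts / scale

# Toy matrix: 3 genes (rows) x 2 samples (columns); values are illustrative only.
toy_counts = np.array([[10, 40],
                       [30, 40],
                       [60, 120]])
rpm = rpm_normalize(toy_counts)
```

After this scaling, every sample sums to one million, so samples sequenced at different depths become directly comparable.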

The normalization method implemented in the edgeR package (Robinson et al., 2010) is known as the Trimmed Mean of M-values (TMM). This method uses scale factors that are calculated after removing the genes with large differences in expression and high average read counts (Robinson and Oshlack, 2010). The normalization method implemented in the DESeq package is based on a scaling factor for a given sample, which is calculated as the median, over genes, of the ratio of each gene's read count to its geometric mean across all samples (Anders and Huber, 2010). In the default normalization method implemented in the baySeq package (Hardcastle and Kelly, 2010), proposed by Bullard et al. (2010), the read counts are scaled by a factor that adjusts the count distributions using the third quartile. The normalization factors associated with PoissonSeq normalization (Li et al., 2011) are calculated using a goodness-of-fit estimate (Rapaport et al., 2013). The normalization method implemented in the limma package (Smyth, 2004) transforms the read counts into the appropriate log form for linear modeling using a LOWESS regression (Law et al., 2014). After normalization, DGE analysis can be carried out.
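
To make two of these scaling factors concrete, the sketch below computes a median-of-ratios factor (in the spirit of the DESeq description) and a third-quartile factor (in the spirit of the Bullard et al. description) with NumPy. It is not the code of either package: the function names, the genes-by-samples layout, and the choices of which genes to exclude are assumptions made for illustration.

```python
import numpy as np

def median_of_ratios_factors(counts):
    """Per-sample factor: median over genes of count / geometric mean across samples."""
    counts = np.asarray(counts, dtype=float)
    # Assumption of this sketch: use only genes with nonzero counts in every
    # sample, so that the logarithm is finite.
    expressed = np.all(counts > 0, axis=1)
    log_counts = np.log(counts[expressed])
    # Log of each gene's geometric mean across samples.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Median over genes of the log ratio, computed separately for each sample.
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

def third_quartile_factors(counts):
    """Per-sample third quartile of the counts."""
    counts = np.asarray(counts, dtype=float)
    # Assumption of this sketch: ignore genes with zero reads in all samples.
    detected = counts.sum(axis=1) > 0
    return np.percentile(counts[detected], 75, axis=0)

# Toy matrix in which sample 2 was sequenced exactly twice as deeply as sample 1.
toy_counts = np.array([[1, 2],
                       [4, 8],
                       [10, 20]])
mor = median_of_ratios_factors(toy_counts)
q3 = third_quartile_factors(toy_counts)
```

In this toy case both factors differ by a factor of two between the samples, which is the depth difference built into the example.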

Although DGE analysis is widely used, the field is not completely consolidated, and new methods and pipelines are continuously being presented (Garber et al., 2011; Seyednasrollah et al., 2013; Soneson and Delorenzi, 2013; Finotello and Di Camillo, 2014; Dou et al., 2015). In this study, we examine the use of a geometric approach based on Suvrel ideas in DGE profiling. The Suvrel method is a variational procedure that determines metric tensors to define distance-based similarity in pattern classification inspired by relevance learning. The variational method is applied to a cost function that penalizes large intraclass distances and favors small interclass distances. The method assigns different levels of relevance to features associated with experiments and brought advantages in classification procedure and data representation associated with microarray and Proton NMR data (Boareto et al., 2015) and enzyme classification (Boareto et al., 2012). Based on these ideas, in this study, we introduce a method for determining the relevance of each genomic feature (e.g., gene, exon, or transcript) based on interclass and intraclass distances.

The article is organized as follows: Section 2 discusses the datasets and presents the approach used in this study. In Section 3, the respective performance of DEpackages (edgeR, DESeq, PoissonSeq, baySeq, and limma) and the geometric approach is compared. To achieve a comprehensive comparison, the analysis was segmented into two parts: in the first, summarized read counts from technical replicate samples associated with Sequencing Quality Control Consortium (SEQC), which represent an idealized scenario of minimal variation (Rapaport et al., 2013), were used, and, in the second part, biological replicate samples associated with the Montgomery et al. (2010) and the Pickrell et al. (2010) studies were used.

2. Methods

2.1. Datasets

The first part of the analysis was produced using two datasets. The first dataset is a count matrix provided by Rapaport et al. (2013) and contains summarized read counts from two sources that are part of the SEQC study, distributed in two experimental conditions: A and B. The samples in Group A contain Stratagene Universal Human Reference RNA (UHRR) with 2% by volume of External RNA Controls Consortium (ERCC) mix 1, and Group B contains Ambion's Human Brain Reference RNA (HBRR) with 2% by volume of ERCC mix 2. The ERCC mix consists of spike-in synthetic oligonucleotides mixed at four mixing ratios: 1/2, 2/3, 1, and 4. For future reference, these data are named the SEQC count matrix. The performance comparison between the DEpackages and the geometric approach is performed using ERCC mixing ratios and TaqMan data, the latter containing real-time quantitative reverse transcription polymerase chain reaction (qRT-PCR) measurements associated with a set of roughly 1000 genes from replicated samples of human whole-body reference RNA and HBRR.

The second dataset is more recent and is associated with the study presented by Everaert et al. (2017). The authors performed a benchmarking study using RNA-seq reads from two replicates of UHRR and two replicates of HBRR reference samples (MAQCA and MAQCB, respectively), and expression measurements for 18,080 protein-coding genes obtained using PrimePCR assays from the SEQC study (Everaert et al., 2017). The count matrix was produced using samples aligned and quantified by the Tophat and HTseq tools (GSM2202397, GSM2202398, GSM2202399, and GSM2202400). One of the results of the study of Everaert et al. is that the choice of the alignment and quantification strategy has little impact on the measured gene expression intensities. For future reference, these data are named the Celine Everaert dataset.

The second part of the study was conducted using a combination of summarized read counts based on the Montgomery et al. (2010) and Pickrell et al. (2010) studies, obtained from the tweeDEseqCountData R package (Esnaola et al., 2013). The count matrix associated with the Montgomery study summarizes lymphoblastoid cell lines from 60 unrelated Caucasian individuals of European descent. The Pickrell count matrix summarizes lymphoblastoid cell lines from 69 unrelated Nigerian individuals. For future reference, these data are named the Pickrell-Montgomery (PM) count matrix. The R scripts produced in this study were based on the scripts provided by Rapaport et al. (2013) and will be available online (Tambonis, 2017) and in the Additional Files section. More information about the files used is available in Rapaport et al. (2013).

2.2. Differential gene expression profiling using the geometric approach

In this study, we analyze DGE profiling using a geometric approach based on the Suvrel ideas presented by Boareto et al. (2015). The Suvrel method is a variational procedure that determines metric tensors $g_{\mu\nu}$ (where $\mu$ and $\nu$ are a pair of features, e.g., genes or transcripts) to define distance-based similarity in pattern classification inspired by relevance learning. For the sake of clarity, we will refer to genes as a generic term for genes, transcripts, or exons. The variational method is applied to a cost function that penalizes large intraclass distances and favors small interclass distances.
The authors show that if the analysis is restricted to the simplest case (i.e., a diagonal metric tensor), the relevance of each gene can be obtained by
\begin{align*} \omega_\nu = \frac{e_\nu}{\sqrt{\sum_{\mu} e_\mu^2}} \tag{1} \end{align*}

where $e_\nu$, named individual feature cost, depends on a summation over features and our analyses will be based on this equation.

The summarized read counts from an experiment i are represented by the vector $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{iG})$ in the space of genes $\mathbb{R}^G$, where each gene $\nu$ represents a dimension (with $\nu = 1, \ldots, G$).
Thus, the distance between two experiments i and j considering a particular gene $\nu$ is
\begin{align*} d_{ij}^\nu = \vert x_{i\nu} - x_{j\nu} \vert. \tag{2} \end{align*}

The distance of a gene calculated as in Equation (2) is similar to the calculation of fold change. In this study, we define the individual feature cost $e_\nu$ as follows:
\begin{align*} e_\nu = \left( \frac{1}{n_{i,j \in F}} \sum_{i,j \in F} d_{ij}^\nu \right)^2 - \left( \frac{1}{n_{i,j \in C}} \sum_{i,j \in C} d_{ij}^\nu \right)^2, \tag{3} \end{align*}

which penalizes genes that increase intraclass distances, while favoring those that increase interclass distances. The label C denotes pairs of experiments in the same experimental condition and F denotes pairs of experiments in different experimental conditions; $n_{i,j \in C}$ denotes the total number of pairs of experiments i and j in the same experimental condition and $n_{i,j \in F}$ denotes the total number of pairs of experiments i and j in different experimental conditions.
Therefore, the relevance $\omega_\nu$ of each gene $\nu$ used in the analysis is given by Equation (1), using $e_\nu$ defined by Equation (3), where $d_{ij}^\nu$ is calculated using Equation (2), which depends only on the summarized read counts, which should be normalized beforehand.
Negative values of $\omega_\nu$ are associated with genes that are not differentially expressed; in this case, the proposed method assigns a value of 0 to $\omega_\nu$. Positive values of $\omega_\nu$ are associated with differentially expressed genes. Roca et al. (2017) presented a normalization method based on class comparisons. In this study, we have used RPM normalization, but the Reads Per Kilobase per Million mapped reads (RPKM) method can also be used, since the method does not depend on the scale.
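
The calculation defined by Equations (1) to (3) can be sketched as follows. This is a minimal illustration and not the R implementation used in this study: the function name `gene_relevance`, the samples-by-genes layout, and the toy data are hypothetical. The sketch also assumes that Equation (2) is read as a per-gene absolute difference, that each pair of experiments is counted once, and that there is at least one pair within a condition and one pair across conditions.

```python
import numpy as np
from itertools import combinations

def gene_relevance(x, labels):
    """Relevance of each gene following Equations (1)-(3).

    x      : samples x genes matrix of normalized counts
    labels : experimental condition of each sample
    """
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    n_genes = x.shape[1]
    sum_same = np.zeros(n_genes)  # summed distances over pairs in C
    sum_diff = np.zeros(n_genes)  # summed distances over pairs in F
    n_same = n_diff = 0
    for i, j in combinations(range(x.shape[0]), 2):
        d = np.abs(x[i] - x[j])   # Equation (2): one distance per gene
        if labels[i] == labels[j]:
            sum_same += d
            n_same += 1
        else:
            sum_diff += d
            n_diff += 1
    # Equation (3): squared mean interclass distance minus squared mean
    # intraclass distance, gene by gene.
    e = (sum_diff / n_diff) ** 2 - (sum_same / n_same) ** 2
    # Equation (1): normalize the feature costs to unit Euclidean length.
    omega = e / np.sqrt(np.sum(e ** 2))
    # Negative relevances are set to 0, as described in the text.
    return np.where(omega > 0, omega, 0.0)

# Toy data: 4 samples (2 per condition) x 2 genes; values are illustrative only.
toy_x = np.array([[10, 5],
                  [12, 9],
                  [30, 6],
                  [32, 8]])
toy_labels = ["A", "A", "B", "B"]
omega = gene_relevance(toy_x, toy_labels)
```

In the toy data, the first gene separates the two conditions and receives a relevance close to 1, whereas the second gene varies more within conditions than between them and is set to 0. Genes can then be ranked by decreasing relevance.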

Alternatively, the distance between two experiments i and j can be taken as the quadratic distance
\begin{align*} d_{ij}^\nu = (x_{i\nu} - x_{j\nu})^2. \tag{4} \end{align*}

The analysis using Equation (4) is presented in the Supplementary Material, and the results are qualitatively equivalent to those obtained using Equation (2).

3. Results

3.1. Differential expression analysis

3.1.1. Technical replicates

The comparison between the DEpackages discussed in Section 1 and the geometric approach using technical replicates associated with the SEQC count matrix discussed in Section 2.1 was conducted through receiver operating characteristic (ROC) analysis. In the first step, ROC analysis was performed using the ERCC data, defining the spike-ins with a mixing ratio of 1:1 (log ratio = 0) as the true negative set and those with mixing ratios of 1:2, 2:3, and 4:1 as the true positive set. The Area Under the Curve (AUC) values indicate a comparable performance among the DEpackages, as reported by Rapaport et al. (2013), and a slight performance advantage for the geometric approach (Fig. 1A). However, the method performs better when a specific interval of specificity (1–0.9 and 1–0.8) is analyzed (Fig. 1B, C). In DGE profiling, researchers are interested in an optimized list of differentially expressed genes. Thus, the method has the useful characteristic of performing well for the most relevant genes.
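
For readers unfamiliar with partial AUC values, the sketch below shows one way to compute a ROC curve and the area restricted to a specificity interval such as 1–0.9. It is an illustration with NumPy only, not the procedure of the scripts used in this study; the function names and the toy scores are hypothetical, and higher scores are assumed to indicate stronger evidence of differential expression.

```python
import numpy as np

def roc_curve_points(scores, is_positive):
    """False and true positive rates as the score threshold is lowered."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    order = np.argsort(-scores, kind="stable")
    scores, is_positive = scores[order], is_positive[order]
    # Evaluate only where the score changes, so tied scores enter together.
    last_of_group = np.r_[np.nonzero(np.diff(scores))[0], scores.size - 1]
    tp = np.cumsum(is_positive)[last_of_group]
    fp = np.cumsum(~is_positive)[last_of_group]
    tpr = np.r_[0.0, tp / is_positive.sum()]
    fpr = np.r_[0.0, fp / (~is_positive).sum()]
    return fpr, tpr

def trapezoid_area(x, y):
    """Area under a piecewise linear curve."""
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

def partial_auc(fpr, tpr, min_specificity):
    """Area under the ROC curve for specificity between min_specificity and 1."""
    max_fpr = 1.0 - min_specificity
    inside = fpr <= max_fpr
    x, y = fpr[inside], tpr[inside]
    if x[-1] < max_fpr:
        # Interpolate linearly between the two points that straddle max_fpr.
        k = x.size  # index of the first point beyond the interval
        slope = (tpr[k] - tpr[k - 1]) / (fpr[k] - fpr[k - 1])
        x = np.r_[x, max_fpr]
        y = np.r_[y, tpr[k - 1] + slope * (max_fpr - fpr[k - 1])]
    return trapezoid_area(x, y)

# Toy example: six spike-ins, three true positives; values are illustrative only.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_truth = [1, 1, 0, 1, 0, 0]
fpr, tpr = roc_curve_points(toy_scores, toy_truth)
full_auc = partial_auc(fpr, tpr, min_specificity=0.0)
half_auc = partial_auc(fpr, tpr, min_specificity=0.5)
```

Restricting the area to high specificity emphasizes the top of the ranked list, which is the region of interest when only a short list of candidate genes will be followed up.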

FIG. 1.

ROC analysis of ERCC spike-in data. (A) ROC analysis using ERCC spike-in controls and the SEQC count matrix to compare the respective performance of DEpackages and the geometric approach. The ERCC control oligonucleotides are divided into four groups with different mixing ratios between samples (1:1, 4:1, 1:2, and 2:3), where the 1:1 mix consists of the oligonucleotide true negatives and the rest are the oligonucleotide true positives. (B) AUC partial values in the specific interval of 1–0.9 of specificity. (C) AUC partial values in the specific interval of 1–0.8 of specificity. AUC, area under the curve; ERCC, external RNA controls consortium; ROC, receiver operating characteristic; SEQC, sequencing quality control consortium.

The second ROC analysis was performed using TaqMan data, defining the true negative set as genes with $\log_2$ expression change values lower than a cutoff ranging from 0.5 to 2.0 and the true positive set as genes with values higher than the cutoff (Rapaport et al., 2013). With the exception of the limma package, the AUC values indicate a comparable performance among the DEpackages, as reported by Rapaport et al. (2013), but the geometric approach has a slight performance advantage over the other approaches (Fig. 2A). Moreover, the method also has an overall advantage at specific intervals of specificity (Fig. 2B).

FIG. 2.

ROC analysis of TaqMan data. (A) ROC analysis using the SEQC count matrix and TaqMan data to compare the respective performances among the DEpackages and the geometric approach. The ROC curve was performed defining the true negative set using $\log_2$ expression change values lower than a cutoff of 0.5 to 2.0 and the true positive set using change values higher than a cutoff of 2.0. (B) AUC partial values in the interval of 1–0.8 of specificity. (C) AUC partial values in the interval of 1–0.9 of specificity.

The third ROC analysis was performed using the Celine Everaert dataset. RT-qPCR expression data for 18,080 protein-coding genes were used to define the true negative set as genes with $\log_2$ Cq (quantification cycle) change values lower than a cutoff ranging from 0.025 to 0.1 and the true positive set as genes with values higher than the cutoff. We followed the same approach as above, and the results obtained with this dataset were qualitatively the same as those presented so far (Fig. 3). The genes inferred as differentially expressed by the DEpackages (adjusted p-value or false discovery rate lower than 0.05) and by the geometric approach were analyzed over the same range of $\log_2$ Cq change values to calculate the sensitivity, specificity, positive predictive value, accuracy, and F₁ measure (Machine Learning, 1998; Zhang and Zhang, 2009) (Supplementary Material). These measures show that the performance of the methods varies. In particular, the geometric approach has high sensitivity values and low specificity values. Sensitivity measures the proportion of true positives that are detected, and the differential expression profile is used to infer DEG. Therefore, it is expected that the methods have high sensitivity values.
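
The five measures named above follow directly from the counts of true and false positives and negatives. The sketch below uses their standard definitions; the function name, the input format (two boolean sequences, one entry per gene), and the toy calls are hypothetical, and the sketch assumes that none of the denominators is zero.

```python
def classification_measures(called_de, truly_de):
    """Sensitivity, specificity, PPV, accuracy, and F1 from per-gene booleans."""
    pairs = list(zip(called_de, truly_de))
    tp = sum(1 for called, truth in pairs if called and truth)
    fp = sum(1 for called, truth in pairs if called and not truth)
    fn = sum(1 for called, truth in pairs if not called and truth)
    tn = sum(1 for called, truth in pairs if not called and not truth)
    sensitivity = tp / (tp + fn)   # fraction of true DEG that are detected
    specificity = tn / (tn + fp)   # fraction of non-DEG correctly left out
    ppv = tp / (tp + fp)           # fraction of calls that are true DEG
    accuracy = (tp + tn) / len(pairs)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "accuracy": accuracy,
        "f1": f1,
    }

# Toy example with ten genes; values are illustrative only.
toy_called = [True, True, True, True, False, False, False, False, False, False]
toy_truth = [True, True, False, False, True, False, False, False, False, False]
measures = classification_measures(toy_called, toy_truth)
```

A method that calls many genes tends to score high on sensitivity and low on specificity, which is the pattern reported above for the geometric approach.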

FIG. 3.

ROC analysis using the Celine Everaert dataset (Everaert et al., 2017). (A) ROC analysis using the count matrix associated with two replicates of MAQCA and two replicates of MAQCB reference samples and 18,080 protein-coding genes from SEQC to compare the respective performances among the DEpackages and the geometric approach. The ROC curve was performed defining the true negative set using $\log_2$ Cq change values lower than 0.1 to 0.5 and the true positive set using values higher than the cutoffs. (B) AUC partial values in the interval of 1–0.8 of specificity. (C) AUC partial values in the interval of 1–0.9 of specificity.

3.1.2. Biological replicates

To assess the robustness of a given method, we used the natural similarity measure proposed by Ein-Dor et al. (2006), which consists of the fraction of genes shared by two lists of DEG obtained from different samples using that method. We compared 100 randomly chosen lists over a series of intervals of the most DEG and found that the geometric approach has performance comparable to that of the other methods (Fig. 4, 100 most DEG). Further gene selections are presented in the Supplementary Material.
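
The overlap measure can be sketched as follows. The function names and toy gene lists are hypothetical, and averaging over all pairs of lists is an assumption of this sketch, since the text does not specify how the pairs were formed. Each list is taken to be the top-ranked genes returned by one method on one random subset of samples.

```python
from itertools import combinations

def top_genes(scores, n_top):
    """Names of the n_top genes with the highest score (higher = more DE)."""
    return set(sorted(scores, key=scores.get, reverse=True)[:n_top])

def overlap_fraction(list_a, list_b):
    """Fraction of genes shared by two equally sized gene lists."""
    return len(set(list_a) & set(list_b)) / len(set(list_a))

def mean_pairwise_overlap(gene_lists):
    """Average overlap over all pairs of lists (assumed pairing scheme)."""
    pairs = list(combinations(gene_lists, 2))
    return sum(overlap_fraction(a, b) for a, b in pairs) / len(pairs)

# Toy example: three lists of four genes each; names are illustrative only.
toy_lists = [
    ["a", "b", "c", "d"],
    ["a", "b", "e", "f"],
    ["a", "b", "c", "h"],
]
robustness = mean_pairwise_overlap(toy_lists)
selected = top_genes({"g1": 0.9, "g2": 0.1, "g3": 0.5}, n_top=2)
```

An overlap close to 1 means that the method returns nearly the same gene list regardless of which samples are drawn, whereas an overlap close to 0 means that the list depends strongly on the particular samples.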

FIG. 4.

Robustness analysis. Average of the fraction of genes shared by two lists of DEG (overlap) as a function of sample size using the PM count matrix. Each list of DEG is composed of the 100 lowest p-value genes chosen according to the different methods edgeR, DESeq, limma, and PoissonSeq, and 100 most relevant genes from the geometric approach. The analysis of other intervals associated with the most DEG is presented in the Supplementary Data. The average value of the overlap between the lists is calculated over 100 lists chosen randomly. The baySeq package was removed from analysis due to the high computational cost. DEG, differentially expressed genes.

4. Discussion and Conclusion

The objective of this study was to compare the performance of the geometric approach and DEpackages intended for DGE profiling of RNA-seq data. To do this, we divided the analyses into technical and biological sample replicates. In both cases, we conclude that the geometric approach has comparable performance to the other methods. In particular, we found that the method has an advantage in the ranking of the most DEG (specificity of 1 to 0.8). The DEpackages analyzed in this study are aimed at DGE profiling, and they make assumptions about the distribution associated with read counts in their statistical tests. The two most commonly used distributions are the Poisson and Negative Binomial (Robinson and Smyth, 2008; Anders and Huber, 2010; Hardcastle and Kelly, 2010; Auer and Doerge, 2011; Di et al., 2011; Soneson and Delorenzi, 2013). On the other hand, the geometric approach is a simple method that does not require any assumption about the distribution associated with the RNA-seq data set, which enhances its robustness. In addition to comparable performance, we also found that the method has advantages in the representation of RNA-seq data, as shown in Figure 5.

FIG. 5.

Multidimensional scaling comparison. Dimensional representation of the PM count matrix using Multidimensional Scaling. (A) Using the same relevance assigned to all genes. (B) Relevance assigned by the geometric approach. The symbols for the Montgomery and Pickrell studies are in green and blue, respectively. The effective distance between the two groups is given by the mean distance of each condition, normalized by the square root of the variance of each condition.

Despite the positive results, it is possible that problems associated with the use of fold change in ranking also occur with the geometric approach (Feng et al., 2012), although results generated through fold change are more reproducible and biologically relevant (Dembélé and Kastner, 2014). One disadvantage of this method is that it does not provide statistical significance. Nevertheless, it can be employed in association with other methods (such as DEpackages) so that it may help to increase the overall DGE performance. From this perspective, the relevance of this study was to show that this simple method can provide, at least, as good performance as more complex methods do, and may yield additional insights into DGE.

Acknowledgments

T.T. was funded by Higher Education Personnel Improvement Coordination (CAPES). V.B.P.L. was supported by the National Council for Scientific and Technological Development (CNPq) and São Paulo Research Foundation (FAPESP) Grant 2014/06862-7 and 2016/19766-1.

Author Disclosure Statement

The authors declare that no competing financial interests exist.

References

1. Machine Learning. 1998. 30, 271–274. Kluwer Academic Publishers, https://doi.org/10.1023/A:1017181826899

2. Anders, and Huber. 2010. Differential expression analysis for sequence count data. Genome Biol. 11, R106.

3. Auer, P.L., and Doerge, R.W. 2011. A two-stage Poisson model for testing RNA-seq data. Stat. Appl. Genet. Mol. Biol. 10, Article 26.

4. Boareto, Cesar, Leite, V.B., et al. 2015. Supervised variational relevance learning, an analytic geometric feature selection with applications to omic datasets. IEEE/ACM Trans. Comput. Biol. Bioinform. 12, 705–711.

5. Boareto, Yamagishi, M.E., Caticha, et al. 2012. Relationship between global structural parameters and enzyme commission hierarchy: Implications for function prediction. Comput. Biol. Chem. 40, 15–19.

6. Bullard, J.H., Purdom, Hansen, K.D., et al. 2010. Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments. BMC Bioinformatics 11, 94.

7. Dembélé, and Kastner. 2014. Fold change rank ordering statistics: A new method for detecting differentially expressed genes. BMC Bioinformatics 15, 14.

8. Di, Schafer, D.W., Cumbie, J.S., et al. 2011. The NBP negative binomial model for assessing differential gene expression from RNA-seq. Stat. Appl. Genet. Mol. Biol. 10, 1–28.

9. Dou, Guo, Yuan, et al. 2015. Differential expression analysis in RNA-seq by a naive Bayes classifier with local normalization. BioMed Res. Int. 2015, 789516.

10. Ein-Dor, Zuk, and Domany. 2006. Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc. Natl. Acad. Sci. U.S.A. 103, 5923–5928.

11. Esnaola, Puig, Gonzalez, et al. 2013. A flexible count data model to fit the wide diversity of expression profiles arising from extensively replicated RNA-seq experiments. BMC Bioinformatics 14, 1.

12. Everaert, Luypaert, Maag, J.L., et al. 2017. Benchmarking of RNA-sequencing analysis workflows using whole-transcriptome RT-qPCR expression data. Sci. Rep. 7, 1559.

13. Feng, Meyer, C.A., Wang, et al. 2012. Gfold: A generalized fold change for ranking differentially expressed genes from RNA-seq data. Bioinformatics 28, 2782–2788.

14. Finotello, and Di Camillo. 2014. Measuring differential gene expression with RNA-seq: Challenges and strategies for data analysis. Brief. Funct. Genomics 14, 130–142.

15. Garber, Grabherr, M.G., Guttman, et al. 2011. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat. Methods 8, 469–477.

16. Hardcastle, T.J., and Kelly, K.A. 2010. baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 11, 422.

17. Law, C.W., Chen, Shi, et al. 2014. Voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15, R29.

18. Li, Witten, D.M., Johnstone, I.M., et al. 2011. Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics 13, 523–538.

19. Montgomery, S.B., Sammeth, Gutierrez-Arcelus, et al. 2010. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature 464, 773–777.

20. Pickrell, J.K., Marioni, J.C., Pai, A.A., et al. 2010. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464, 768–772.

21. Rapaport, Khanin, Liang, et al. 2013. Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol. 14, R95.

22. Robinson, M.D., McCarthy, D.J., and Smyth, G.K. 2010. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140.

23. Robinson, M.D., and Oshlack. 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, R25.

24. Robinson, M.D., and Smyth, G.K. 2008. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9, 321–332.

25. Roca, C.P., Gomes, S.I., Amorim, M.J., et al. 2017. Variation-preserving normalization unveils blind spots in gene expression profiling. Sci. Rep. 7, 42460.

26. Seyednasrollah, Laiho, and Elo, L.L. 2013. Comparison of software packages for detecting differential expression in RNA-seq studies. Brief. Bioinform. 16, 59–70.

27. Smyth. 2004. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3, Article 3.

28. Soneson, and Delorenzi. 2013. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 14, 91.

29. Sun, and Zhu. 2012. Systematic comparison of RNA-seq normalization methods using measurement error models. Bioinformatics 28, 2584–2591.

30. Tambonis. 2017. Differential expression analysis in RNA-seq data using a geometric approach. Available at: https://github.com/tambonis/GA_RNA_Seq. Last viewed on May 6, 2018.

31. Wang, Gerstein, and Snyder. 2009. RNA-seq: A revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57–63.

32. Zhang, and Zhang. 2009. F-Measure, pages 1147–1147. Springer US, Boston, MA.
