Differential Expression Analysis for RNA-Seq: An Overview of Statistical Methods and Computational Software

Abstract

Deep sequencing has recently emerged as a powerful alternative to microarrays for the high-throughput profiling of gene expression. To account for the discrete nature of RNA sequencing data, new statistical methods and computational tools have been developed for the analysis of differential expression to identify genes that are relevant to a disease such as cancer. It is therefore timely to provide an overview of these analysis methods and tools. For readers with a statistical background, we also review the parameter estimation algorithms and hypothesis testing strategies used in these methods.

Keywords

RNA sequencing, differential expression analysis, overview, statistical methods, software

Introduction

In the past decade, deep sequencing has emerged as a powerful alternative to microarrays for the high-throughput profiling of gene expression. Compared with microarrays, RNA sequencing (RNA-seq) possesses a number of technological advantages such as a wider dynamic range and freedom from predesigned probes.^1–3 It also comes with a unique data feature: discrete sequencing reads. To account for this feature, statistical methodologies and computational algorithms have been developed based on various distributional assumptions such as Poisson, negative binomial, beta binomial, (full or empirical) Bayesian, and nonparametric.^4–16

For researchers who are new to the analysis of RNA-seq data, in this paper we provide an introductory overview of the methods and software available for the differential expression analysis (DEA) of RNA-seq data when the analysis goal is to identify genes that are relevant to a disease such as cancer.^1,17,18 In addition, for those who are interested in the statistical aspects of these methods, we also provide an overview of their parameter estimation algorithms and hypothesis testing strategies. The overview of these statistical aspects is a unique contribution of our paper to the review literature on RNA-seq DEA methods.^3,18–23 Readers who are interested in a performance comparison of RNA-seq DEA methods can refer to the large body of such papers in the literature.^20–23

The rest of the paper is organized as follows. In the Notation and Normalization Methods section, we introduce the unified notation used for the methods reviewed in our paper and touch on the normalization methods typically used to preprocess RNA-seq data before DEA. In the Statistical Modeling of RNA-seq Data section, we review the statistical models for RNA-seq DEA, categorized by distributional assumptions such as Poisson,^4–6 negative binomial,^7–10 beta binomial,^11,12 Bayesian,^13,14 and nonparametric.^15,16 All reviewed methods work directly with gene-level count data for DEA and have available R packages. For interested readers with advanced statistical knowledge, the parameter estimation algorithm for each method is presented separately in a text box, and the typical statistical testing frameworks that have been proposed for RNA-seq DEA are reviewed in the Statistical Testing section. Finally, computational tools implemented for the reviewed methods are summarized in Table 2. We note that the methods reviewed in this paper are not an exhaustive collection of the methods available in the literature. Rather, we reviewed the most commonly used categories of modeling assumptions and included a few representative methods for each category, to help researchers who are new to the field get oriented and started in the still evolving literature on this topic.

Table 1.

List of sequencing depth normalization methods and reference papers.

METHODS	RELEVANT REFERENCES
RPKM	Mortazavi et al.²⁶
Upper-quartile, Median	Bullard et al.¹⁸
TMM	Robinson et al.⁷
DESeq	Anders and Huber⁸
Quantile	Bolstad et al.³²

Table 2.

RNA-seq count data DEA statistical methods and software.

MODEL	SOFTWARE	REFERENCES	DATA TYPE	TESTING STRATEGY	NOTES	LIMITATIONS	DATA USED
Poisson	DEGseq	Wang et al.⁴	RNA-seq data	Fisher's exact test Likelihood ratio test	Support raw read counts or normalized gene expression values, identify DE of exons or transcripts	Ignore biological variation	Marioni RNA-seq data
	Myrna	Langmead et al.⁵	RNA-seq data	Likelihood ratio test Parallelized permutation test	Handle dataset with over 1 billion rows, computationally efficient	Ignore biological variation, signal loss due to junction or repetitive reads, inconvenient cloud data transfer	HapMap expression data
	PoissonSeq	Li et al.⁶	RNA-seq data Tag-seq data	Score test	Accommodate multiple covariate types, computationally efficient	Transformation power depends only on gene expression, libraries are totally exchangeable	Marioni RNA-seq data Tag-seq data
Negative binomial	edgeR	Robinson et al.⁷	SAGE data RNA-seq data	Exact test Likelihood ratio test	Separate biological from technical variations	Limited to pairwise comparison	SAGE data Fly RNA-seq data
	DESeq	Anders and Huber⁸	Tag-seq data RNA-seq data ChIP-seq data	Exact test Likelihood ratio test	Extend edgeR by allowing more general, data-driven relationship of mean and variance	Limited to pairwise comparison	Neural stem cell Tag-seq data Yeast RNA-seq data HapMap ChIP-seq data Fly RNA-seq data
	DESeq2	Love et al.⁹	Tag-seq data RNA-seq data ChIP-seq data	Wald test Likelihood ratio test	Improve upon DESeq for better gene ranking, allow hypothesis tests above and below threshold	Limited to pairwise comparison	Fly RNA-seq data Mouse striatum RNA-seq data
	NBPSeq	Di et al.¹⁰	RNA-seq data	Adapted exact test	Introduce an additional parameter to allow the dispersion to depend on the mean	Assume all library sizes are equal	Arabidopsis RNA-seq data
Beta binomial	BBSeq	Zhou et al.¹²	RNA-seq data	Wald test Likelihood ratio test	Handle outlier detection automatically	Sensitive to outliers of shrinkage or penalization methods	HapMap RNA-seq data
Bayesian and Empirical Bayesian	ShrinkSeq	Van de Wiel et al.¹³	RNA-seq data CAGE data	Evaluating posterior probability for inference	Provide joint shrinkage of multiple parameters, allow for random effects, address multiplicity problems	Computationally intensive but allow parallelization	HapMap RNA-seq data CAGE data
	baySeq	Hardcastle and Kelly¹⁴	Small RNAs data	Evaluating posterior probability for inference	Involve multiple comparison, accommodate different sample size	Computationally intensive but allow parallelization	Trans-acting small RNAs
Nonparametric	SAMseq	Li and Tibshirani¹⁵	RNA-seq data Tag-seq data miRNA-seq data	Wilcoxon test	Robust to outliers, remove Experimental effect, simplify test for feature effect, accommodate quantitative, survival and multiple group comparison	Overestimate FDR in some cases, relative low power for data with small sample size	Marioni RNA-seq data t'Hoen Tag-seq data Witten miRNA-seq data
	NOIseq	Tarazona et al.¹⁶	RNA-seq data	Wilcoxon test	Robust and maintain a high true-positive rate	Not easy to identify true differential expression at a low count range, limited to pair-wise comparison	Marioni RNA-seq data

Notation and Normalization Methods

Notation

RNA-seq data for G genes and N samples can be described by a G × N matrix Y. Each entry y_gi (g = 1, …, G; i = 1, …, N) is a nonnegative integer representing the number of sequencing reads mapped to gene g in sample i. For succinctness, we also use the notation “·” for summations, eg, $y_{g \cdot} = \sum_{i = 1}^{N} y_{g i}$ and $y_{\cdot i} = \sum_{g = 1}^{G} y_{g i}$.

We use X to represent an N x P design matrix, where P is the number of covariates. For instance, x_ip can be an indicator variable of disease status, taking a value of 0 for a normal sample and a value of 1 for a tumor sample. When comparing K groups of samples, C_k represents the collection of indices of the samples in group k (k = 1, …, K), that is, C_k = {i: x_i = k}. Each sample can only belong to one group.

Normalization Methods

Similar to microarray data, RNA-seq data are also prone to nonbiological effects due to the experimental process. Consequently, these effects need to be adjusted before any further data analysis.²⁴ One major source of nonbiological effects is sequencing depth, which can be adjusted by rescaling the sequencing counts with factors that mimic sequencing depth.²⁵ Reads per kilobase per million reads (RPKM) is a simple adjustment that considers gene counts standardized by the gene length and the total number of reads in each library as expression values.^17,26 More sophisticated adjustment factors, including trimmed mean of M-values (TMM),²⁷ DESeq size factor,²⁸ and quantile-based normalizations such as upper quartile normalization,¹⁸ are given in Table 1. Other sources of nonbiological effects for RNA-seq include gene length and GC-content,^21,29 whose effects are typically assumed to be consistent across samples for a given gene and hence cancel out in the analysis of differential expression. Interested readers can look up available normalization methods adjusting for gene length and GC-content in the publications such as Risso et al.²⁹, Benjamini and Speed,³⁰ and Hansen et al.³¹
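As a minimal sketch of the depth adjustments described above, the following Python snippet computes RPKM values and per-sample upper-quartile factors for a toy count matrix. The function names, the toy counts, and the gene lengths are hypothetical and chosen only for illustration; they are not taken from any of the cited packages.

```python
from statistics import quantiles


def rpkm(counts, gene_lengths_bp):
    """Return RPKM values for counts[g][i] (gene g, sample i)."""
    n_samples = len(counts[0])
    # Library size of each sample: total reads summed over genes.
    lib_sizes = [sum(row[i] for row in counts) for i in range(n_samples)]
    return [
        [row[i] * 1e9 / (length * lib_sizes[i]) for i in range(n_samples)]
        for row, length in zip(counts, gene_lengths_bp)
    ]


def upper_quartile_factors(counts):
    """Per-sample 75th percentile of counts, skipping genes with no reads."""
    expressed = [row for row in counts if sum(row) > 0]
    n_samples = len(counts[0])
    return [
        quantiles([row[i] for row in expressed], n=4)[2]
        for i in range(n_samples)
    ]


# Hypothetical toy data: 4 genes, 2 samples; sample 2 is sequenced twice as deep.
toy_counts = [[10, 20], [30, 60], [60, 120], [0, 0]]
toy_lengths = [1000, 2000, 500, 1500]
toy_rpkm = rpkm(toy_counts, toy_lengths)
toy_uq = upper_quartile_factors(toy_counts)
```

In this toy example, the RPKM values of each gene agree across the two samples, and the upper-quartile factor of the second sample is twice that of the first, which is the behavior expected when the samples differ only in sequencing depth.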

Statistical Modeling of RNA-Seq Data

Poisson

Overview

Models for read counts originated from the idea that each read is sampled independently from a pool of reads and hence the number of reads for a given gene follows a binomial distribution, which can be approximated by a Poisson distribution. Based on the Poisson model assumption for repeated sequencings of a sample, Marioni et al.¹⁷ proposed to use a log-linear model to model the mean difference between two samples and adopted the classical likelihood ratio test for calculating the P-values. Based on the same Poisson assumption, Bullard et al.¹⁸ proposed to use two other test statistics, exact test statistics and score test statistics, in the generalized linear model (GLM) framework. Li et al.⁶ proposed a method called PoissonSeq, which adapts a two-step procedure for fitting a Poisson model. The method first estimates sequencing depths using a Poisson goodness-of-fit statistic and then calculates a score statistic based on a log-linear model. In addition, Wang et al.⁴ developed an R package, DEGseq, to identify differentially expressed (DE) genes with an MA-plot-based approach. Langmead et al.⁵ incorporated cloud computing in their method called Myrna.

Modeling

In a Poisson model, one assumes that Y_gi, the number of reads mapped to gene g in sample i, follows a Poisson distribution, Y_gi ∼ Poisson(μ_gi), where μ_gi is the rate parameter for gene g in sample i and equals both the mean and the variance of the read counts. The probability mass function is:

f (y_{g i} | μ_{g i}) = P (Y_{g i} = y_{g i} | μ_{g i}) = \frac{μ_{g i}^{y_{g i}} \exp (- μ_{g i})}{y_{g i}!}

(3.1.1)

with E(Y_gi) = μ_gi and Var(Y_gi) = μ_gi. The association of μ_gi with the sample group can be described by a log-linear model as follows:

\log (μ_{g i}) = \log d_{i} + \log β_{g} + \sum_{k = 1}^{K} γ_{g k} I (i \in C_{k}),

(3.1.2)

where d_i represents the sequencing depth of sample i, with $\sum_{i = 1}^{N} d_{i} = 1$ assumed without loss of generality, β_g is the expression level of gene g, and γ_gk is the association of gene g with sample group k. For hypothesis testing, γ_g1 = … = γ_gK = 0 indicates that the expression of gene g is not associated with the sample group. In a two-group comparison with a single coefficient γ_g, γ_g = 0 means that gene g is not DE between the two sample groups.
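To illustrate how the log-linear model in Equation (3.1.2) leads to a test, the following Python sketch computes a likelihood ratio statistic for one gene in a two-group comparison, assuming that the sequencing depths d_i are known. The function name and the toy inputs are hypothetical; the P-value uses the chi-square distribution with 1 degree of freedom, so the sketch does not cover more than two groups.

```python
import math


def poisson_lrt_two_group(counts, depths, groups):
    """Likelihood ratio test of gamma_g = 0 for one gene under a Poisson model.

    counts[i], depths[i], and groups[i] (0 or 1) describe sample i.
    Returns (statistic, p_value) with a chi-square(1) reference distribution.
    """
    total_y = sum(counts)
    total_d = sum(depths)
    stat = 0.0
    for k in (0, 1):
        y_k = sum(y for y, c in zip(counts, groups) if c == k)
        d_k = sum(d for d, c in zip(depths, groups) if c == k)
        if y_k > 0:
            # Group-specific rate (y_k / d_k) versus pooled rate under the null.
            stat += 2.0 * y_k * math.log((y_k / d_k) / (total_y / total_d))
    stat = max(stat, 0.0)  # guard against tiny negative rounding errors
    # Survival function of chi-square(1): P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical examples: equal rates after depth adjustment, then a clear shift.
stat_null, p_null = poisson_lrt_two_group([10, 20], [1.0, 2.0], [0, 1])
stat_de, p_de = poisson_lrt_two_group([10, 12, 40, 45], [1, 1, 1, 1], [0, 0, 1, 1])
```

Only the relative depths matter here, because rescaling all d_i by a constant changes the group-specific and pooled rates by the same factor.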

Algorithm Overview 1: Li et al.'s⁶ PoissonSeq

Li et al.⁶ proposed PoissonSeq, which sets up the hypotheses as follows. Under the null hypothesis that gene g is not associated with the covariates,

\log μ_{g i} = \log d_{i} + \log β_{g},

(3.1.a)

where d_i is the sequencing depth of sample i and β_g is the expression level of gene g. The fitted value under Equation (3.1.a) is denoted by $N_{g i}^{(0)}$ in later equations:

N_{g i}^{(0)} = \exp (\log ({\hat{d}}_{i}) + \log ({\hat{β}}_{g}))

Under the alternative hypothesis that gene g is associated with the covariate $x_{i}^{*}$,

\log μ_{g i} = \log d_{i} + \log β_{g} + γ_{g} x_{i}^{*}

(3.1.b)

where $x_{i}^{*}$ would be $I (i \in C_{k})$ when comparing two or multiple sample groups. The authors suggested using maximum likelihood to estimate β_g, which gives ${\hat{β}}_{g} = y_{g \cdot}$. However, instead of using the maximum likelihood estimate of the sequencing depth of sample i, the authors sought a set of genes, denoted by S, that are not DE and used it to estimate the sequencing depth:

{\hat{d}}_{i} = \frac{\sum_{g \in S} y_{g i}}{\sum_{g \in S} y_{g \cdot}} .

(3.1.c)

They then estimated which genes belong to S by a Poisson goodness-of-fit statistic, ie,

G O F_{g} = \sum_{i = 1}^{N} \frac{{(y_{g i} - {\hat{d}}_{i} y_{g .})}^{2}}{{\hat{d}}_{i} y_{g .}}

(3.1.d)

S is set to be the genes whose GOF_g values lie between the ε and 1 − ε quantiles of all GOF_g values. Li and others used ε = 0.25 in their study.⁶

The objective is to test H₀:

γ_{g 1} = \cdots = γ_{g K} = 0,

and score statistics were proposed to perform the testing. For a two-group or multiple-group covariate, the score statistic for gene g is

\sum_{k = 1}^{K} \frac{{[\sum_{i \in C_{k}} (y_{g i} - N_{g i}^{(0)})]}^{2}}{\sum_{i \in C_{k}} N_{g i}^{(0)}} \sim χ^{2} (K - 1) .

(3.1.e)
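The two steps of this text box can be sketched in Python as follows: depth estimation from a trimmed gene set via the goodness-of-fit statistic in Equation (3.1.d), followed by the score statistic in Equation (3.1.e). This is a simplified illustration, not the PoissonSeq package itself; the function names, the fixed number of iterations, and the toy counts are assumptions made for the example.

```python
def _depths(counts, subset):
    """Depth estimate of Equation (3.1.c) using the genes in `subset`."""
    n_samples = len(counts[0])
    denom = sum(sum(counts[g]) for g in subset)
    return [sum(counts[g][i] for g in subset) / denom for i in range(n_samples)]


def estimate_depths(counts, eps=0.25, n_iter=5):
    """Alternate between depth estimation and GOF-based trimming of S."""
    gene_totals = [sum(row) for row in counts]
    expressed = [g for g, tot in enumerate(gene_totals) if tot > 0]
    subset = expressed
    for _ in range(n_iter):  # fixed iteration count is an assumption
        depths = _depths(counts, subset)
        gof = {
            g: sum(
                (counts[g][i] - d * gene_totals[g]) ** 2 / (d * gene_totals[g])
                for i, d in enumerate(depths)
            )
            for g in expressed
        }
        ranked = sorted(expressed, key=gof.get)
        lo, hi = int(eps * len(ranked)), int((1 - eps) * len(ranked))
        subset = ranked[lo:hi] or ranked  # keep the middle GOF quantiles
    return _depths(counts, subset)


def score_statistic(gene_counts, depths, groups):
    """Score statistic of Equation (3.1.e) for one gene."""
    total = sum(gene_counts)
    fitted = [d * total for d in depths]  # N_gi^(0) under the null
    stat = 0.0
    for k in sorted(set(groups)):
        resid = sum(y - n for y, n, c in zip(gene_counts, fitted, groups) if c == k)
        expect = sum(n for n, c in zip(fitted, groups) if c == k)
        stat += resid ** 2 / expect
    return stat


# Hypothetical toy data: 7 non-DE genes with depth ratio 1:2:1:2, plus 1 DE gene.
toy = [[b, 2 * b, b, 2 * b] for b in range(10, 80, 10)] + [[5, 10, 50, 100]]
toy_groups = [0, 0, 1, 1]
toy_depths = estimate_depths(toy)
score_de = score_statistic(toy[-1], toy_depths, toy_groups)
score_null = score_statistic(toy[0], toy_depths, toy_groups)
```

In the toy data, the DE gene has the largest goodness-of-fit value and is trimmed from S, so the estimated depths recover the 1:2:1:2 ratio and the score statistic is large for the DE gene only.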

With accumulating empirical data (especially data available for groups of multiple biological samples), researchers began to observe that, within a group, the between-sample variance of sequencing reads for a gene often exceeds the mean.^17,23,33 This excess variation, which cannot be explained by the Poisson model, is called overdispersion. Extensions of the classic Poisson model have been proposed in order to accommodate such overdispersion, including the two-stage Poisson models³⁴ and the generalized Poisson model.³⁵

Negative Binomial

Overview

A class of models based on the negative binomial distribution assumption has been developed in order to accommodate the overdispersion among biological replicate data.^8,9,33,36 Robinson and Smyth³³ used the conditional maximum likelihood (CML) to estimate the dispersion parameter (a measure of the excess variance that a Poisson model does not capture) when assuming a common dispersion parameter across genes. They compared the CML method with alternative estimation methods based on pseudolikelihood, quasi-likelihood, and conditional inference.^37–39 In a follow-up paper,³⁶ they also extended the model to allow for gene-specific dispersion parameters and proposed to estimate the dispersion parameters by maximizing a weighted conditional likelihood with empirical Bayesian approximation. Details of their method, edgeR, can be found in Robinson and Smyth.^33,36 edgeRun is based on the same model as edgeR but uses an unconditional exact test to achieve more power at the price of longer computational time.⁴⁰ Anders and Huber⁸ proposed a method called DESeq, also under the negative binomial assumption. They advocated the use of a robust estimate of normalization factors for the estimation of the dispersion parameter and a local regression to obtain a smooth function for each group on the graphs of expected proportions vs sample variances. DESeq2 was developed by Love et al.⁹ as a successor of DESeq. It employs a number of new modeling features, such as the use of a shrunken fold change and a shrunken dispersion estimation method, to further improve the model performance. Di and others¹⁰ proposed a method, NBPSeq, using a negative binomial power distribution instead of a regular negative binomial distribution. They assumed that $E (Y_{g i}) = μ_{g i}$ and $Var (Y_{g i}) = μ_{g i} (1 + ϕ μ_{g i}^{α - 1})$, where ϕ is common across genes while α helps to accommodate the overdispersion. ϕ and α are estimated by maximizing the conditional log-likelihood,⁴¹ conditional on the total gene counts for each gene g. An exact test modified for the negative binomial power distribution is used for hypothesis testing. More details can be found in the study by Di et al.¹⁰

Modeling

The model setup for negative binomial is to assume Y_gi ∼ negative binomial $(μ_{g i}, ϕ_{g})$. The dispersion parameter, ϕ_g, accounts for the sample-to-sample variability and is usually assumed to be common across samples. There are various estimation methods for this model assumption. More specifically, the negative binomial probability mass function is written as

\begin{aligned} f (y_{g i} | μ_{g i}, ϕ_{g}) &= P (Y_{g i} = y_{g i} | μ_{g i}, ϕ_{g}) \\ &= \frac{Γ (y_{g i} + ϕ_{g}^{- 1})}{Γ (ϕ_{g}^{- 1}) Γ (y_{g i} + 1)} {(\frac{1}{1 + μ_{g i} ϕ_{g}})}^{ϕ_{g}^{- 1}} {(\frac{μ_{g i}}{ϕ_{g}^{- 1} + μ_{g i}})}^{y_{g i}}, \end{aligned}

(3.2.1)

where $E (Y_{g i}) = μ_{g i}$ and $Var (Y_{g i}) = μ_{g i} + ϕ_{g} μ_{g i}^{2}$. Hypothesis testing is set up as H₀: no difference either between the expected normalized expression of gene g in groups or between the proportion of reads that are gene g in groups.
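As a numerical check of the mean and variance stated for Equation (3.2.1), the following Python sketch evaluates the negative binomial probability mass function on the log scale and sums it over a long range of counts. The function name and the parameter values (μ = 5, ϕ = 0.5) are hypothetical and chosen only for illustration.

```python
import math


def nb_logpmf(y, mu, phi):
    """Log of the negative binomial pmf in the (mu, phi) parameterization."""
    r = 1.0 / phi  # inverse dispersion
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(1.0 / (1.0 + mu * phi))
        + y * math.log(mu / (r + mu))
    )


mu, phi = 5.0, 0.5  # hypothetical values
probs = [math.exp(nb_logpmf(y, mu, phi)) for y in range(500)]
total_prob = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
variance = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
```

With these values, the probabilities sum to 1, the mean is μ = 5, and the variance is μ + ϕμ² = 17.5, which exceeds the Poisson variance of 5 by the overdispersion term.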

Algorithm Overview 2: Overdispersion

The negative binomial distribution can be derived as a gamma–Poisson mixture model (subscripts g and i are omitted for brevity), under the assumption that technical replicates follow a Poisson distribution and the Poisson rates of biological replicates follow a gamma distribution, with the latter accommodating the overdispersion observed in empirical data.

\begin{aligned} y | μ &\sim Poisson (μ), \quad μ \sim Gamma (α, β) \\ P (y | μ) &= \frac{μ^{y} \exp (- μ)}{y!} \\ f (μ) &= {(Γ (α) β^{α})}^{- 1} μ^{α - 1} \exp (- μ / β) \end{aligned}

Then,

\begin{aligned} P (y) &= \int_{0}^{\infty} P (y | μ) f (μ) d μ \\ &= {(y! Γ (α) β^{α})}^{- 1} \int_{0}^{\infty} μ^{(y + α) - 1} \exp (- μ (1 + 1 / β)) d μ \\ &= \frac{Γ (y + α) β^{y}}{y! Γ (α) {(1 + β)}^{y + α}} \end{aligned}

Substituting back $μ_{g i}$, $y_{g i}$, $α = ϕ_{g}^{- 1}$, and $β = μ_{g i} ϕ_{g}$ shows that the gamma–Poisson mixture is a negative binomial distribution; see Equation (3.2.1).
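The equivalence can also be checked numerically. The Python sketch below compares three quantities: the closed form of the mixture, the negative binomial probability mass function of Equation (3.2.1) with α = ϕ⁻¹ and β = μϕ, and a direct midpoint-rule integration of the Poisson probability against the gamma density. The function names, the integration range and step count, and the values μ = 5 and ϕ = 0.5 are assumptions made for the example.

```python
import math


def mixture_pmf(y, alpha, beta):
    """Closed form of the gamma-Poisson mixture derived above."""
    return math.exp(
        math.lgamma(y + alpha) + y * math.log(beta)
        - math.lgamma(y + 1) - math.lgamma(alpha)
        - (y + alpha) * math.log(1.0 + beta)
    )


def nb_pmf(y, mu, phi):
    """Negative binomial pmf of Equation (3.2.1)."""
    r = 1.0 / phi
    return math.exp(
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        - r * math.log(1.0 + mu * phi)
        + y * math.log(mu / (r + mu))
    )


def mixture_pmf_numeric(y, alpha, beta, upper=100.0, steps=20000):
    """Midpoint-rule integral of Poisson(y | mu) times the gamma density."""
    h = upper / steps
    total = 0.0
    for j in range(steps):
        m = (j + 0.5) * h
        log_term = (
            y * math.log(m) - m - math.lgamma(y + 1)          # Poisson part
            + (alpha - 1.0) * math.log(m) - m / beta            # gamma kernel
            - math.lgamma(alpha) - alpha * math.log(beta)       # gamma constant
        )
        total += math.exp(log_term)
    return total * h


mu, phi = 5.0, 0.5           # hypothetical values
alpha, beta = 1.0 / phi, mu * phi
checks = [
    (mixture_pmf(y, alpha, beta), nb_pmf(y, mu, phi), mixture_pmf_numeric(y, alpha, beta))
    for y in (0, 3, 10)
]
```

For each count in the example, the closed form and the negative binomial probability agree to rounding error, and the numerical integral agrees to the accuracy of the integration rule.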

Algorithm Overview 3: Robinson and Smyth's^33,36 edgeR

In edgeR, $μ_{g i} = m_{i} λ_{g k (i)}$, where m_i is the ith library size, $λ_{g k (i)} = \sum_{i \in C_{k}} λ_{g i}$ represents the proportion of the total reads that is gene g in group k, and $λ_{g i}$ is the proportion of the total reads that is gene g in sample i.

Under the assumption of gene-wise (or tag-wise in the original paper) dispersion, ϕ_g is estimated by maximizing a weighted conditional log-likelihood, WL(ϕ_g):

W L (ϕ_{g}) = l_{g} (ϕ_{g}) + α l_{C} (ϕ_{g})

(3.2.a)

where α is the weight given to the common likelihood, l_C; the maximizer of WL(ϕ_g) is denoted by ${\hat{ϕ}}_{g}^{W L}$. The weight α is chosen such that ${\hat{ϕ}}_{g}^{W L}$ coincides with an empirical Bayesian solution, ${\hat{ϕ}}_{g}^{B}$, the Bayesian posterior mean estimator of ϕ_g under the hierarchical model ${\hat{ϕ}}_{g} | ϕ_{g} \sim N (ϕ_{g}, τ_{g}^{2})$ and $ϕ_{g} \sim N (ϕ_{0}, τ_{0}^{2})$ for g = 1, …, G. This approximation is used because a direct estimate of ϕ_g is difficult owing to the lack of a conjugate prior for ϕ in the negative binomial model. Details are given in the study by Robinson and Smyth.³³

In the study by Robinson and Smyth,³⁶ the overdispersion parameter is assumed to be common across all genes (ie, ϕ_g = ϕ). To estimate the shared dispersion parameter with and without equal library size, the authors proposed to use the CML and quantile-adjusted CML (qCML) as follows.

In the special case where m_i = m for i ∈ C_k, with C_k = {i: k(i) = k}, Y_gi ∼ negative binomial $(μ_{g i} = m λ_{g k}, ϕ)$ in group k, so the Y_gi's are identically distributed, and the maximum likelihood estimator (MLE) ${\hat{λ}}_{g k (i)}$ becomes $\frac{\sum_{i \in C_{k}} y_{g i}}{\sum_{i \in C_{k}} m_{i}}$ in group k. A CML function for the dispersion ϕ given the group totals $z_{k} = \sum_{i = 1}^{n_{k}} y_{k i}$ was proposed, where n_k is the number of samples in group k and the gene index is suppressed in y_ki and z_k. The function is as follows:

\begin{aligned} l_{C} (ϕ) = \sum_{g = 1}^{G} l_{g} (ϕ) = \sum_{g = 1}^{G} \sum_{k = 1}^{K} \Big[ & \sum_{i = 1}^{n_{k}} \log Γ (y_{k i} + ϕ^{- 1}) + \log Γ (n_{k} ϕ^{- 1}) \\ & - \log Γ (z_{k} + n_{k} ϕ^{- 1}) - n_{k} \log Γ (ϕ^{- 1}) \Big] \end{aligned}

(3.2.b)
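The conditional log-likelihood in Equation (3.2.b) can be evaluated directly with log-gamma functions. The Python sketch below does so and maximizes it over a grid of candidate dispersions, assuming equal library sizes within each group. The grid search, the function names, and the toy counts are assumptions for illustration; edgeR itself uses a different numerical optimizer.

```python
import math


def conditional_loglik(groups, phi):
    """l_g(phi) of Equation (3.2.b); `groups` holds one list of counts per group."""
    r = 1.0 / phi
    total = 0.0
    for ys in groups:
        n_k, z_k = len(ys), sum(ys)
        total += sum(math.lgamma(y + r) for y in ys)
        total += math.lgamma(n_k * r) - math.lgamma(z_k + n_k * r)
        total -= n_k * math.lgamma(r)
    return total


def common_dispersion(genes, grid=None):
    """Grid maximizer of the common conditional log-likelihood l_C(phi)."""
    if grid is None:
        grid = [10 ** (e / 10) for e in range(-30, 11)]  # 0.001 to 10
    return max(grid, key=lambda phi: sum(conditional_loglik(g, phi) for g in genes))


# Hypothetical toy genes, each with two groups of three replicates.
flat_genes = [[[10, 10, 10], [20, 20, 20]]]    # no replicate variability
noisy_genes = [[[1, 50, 5], [100, 3, 20]]]     # strong replicate variability
phi_flat = common_dispersion(flat_genes)
phi_noisy = common_dispersion(noisy_genes)
```

The estimate for the replicate-stable gene sits at the lower end of the grid, whereas the gene with highly variable replicates yields a larger dispersion estimate.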

In the case of different m_i in group k, the MLE of $λ_{g k (i)}$ depends on ϕ (ie, maximum likelihood estimation of the two parameters proceeds jointly). As a result, an approximate approach called qCML was proposed to equate the library sizes. The quantile-adjusted pseudodata are intended to allow one to use the common likelihood $l_{C} (ϕ)$ to obtain an accurate estimate of ϕ. Specifically, let $m^{*} = {(\prod_{i = 1}^{N} m_{i})}^{\frac{1}{N}}$ be the geometric mean of the library sizes. Then, the observed data are adjusted as if they were all sampled from identically distributed negative binomial $(m^{*} λ, ϕ)$ distributions.

Hypothesis testing is set up as $H_{0} : λ_{g 1} = λ_{g 2}$; in other words, there is no difference in the proportion of reads from gene g between group 1 and group 2.

Algorithm Overview 4: Anders and Huber's⁸ DESeq

The read count y_gi is modeled by a GLM of negative binomial distribution with a log link:

\log (λ_{g i}) = \sum_{p = 1}^{P} x_{i p} β_{g p}

(3.2.c)

The mean μ_gi is the proportion of reads for gene g in sample i, $λ_{g k (i)}$, scaled by a normalization factor, m_i. The variance $σ_{g i}^{2}$ is $μ_{g i} + m_{i}^{2} v_{g k (i)},$ where $v_{g k (i)}$ is a per-gene raw variance, modeled as a smooth function of λ_g and k. The use of the smooth function can help stabilize the variance estimates, especially when the number of samples is small. For the estimation of the normalization factor (which is referred to as the size factor by Anders and Huber), m_i, for each sample, the authors noted that highly DE genes can strongly influence the total count, and so the median of the ratios of counts is used for robustness:

{\hat{m}}_{i} = \underset{g}{median} \frac{y_{g i}}{{(\prod_{v = 1}^{N} y_{g v})}^{1 / N}}

(3.2.d)

Since $λ_{g k (i)}$ is proportional to the expected value of the unknown proportion from gene g in group k, it is estimated by the average of counts from all samples in group k with a common scale.

{\hat{λ}}_{g k (i)} = \frac{1}{M_{k}} \sum_{i : k (i) = k} \frac{y_{g i}}{{\hat{m}}_{i}},

(3.2.e)

where M_k is the total number of replicates for group k. The sample variances with the common scale are calculated as:

w_{g k} = \frac{1}{M_{k} - 1} \sum_{i : k (i) = k} {(\frac{y_{g i}}{{\hat{m}}_{i}} - {\hat{λ}}_{g k (i)})}^{2}

(3.2.f)

z_{g k} = \frac{{\hat{λ}}_{g k (i)}}{M_{k}} \sum_{i : k (i) = k} \frac{1}{{\hat{m}}_{i}}

(3.2.g)

When M_k is sufficiently large, $w_{g k} - z_{g k}$ can be used as an unbiased estimator of the raw variance v_gk. When M_k is small, local regression for a smooth function w_k(λ) on the graph of $({\hat{λ}}_{g k (i)}, w_{g k})$ was suggested so that $w_{k} ({\hat{λ}}_{g k (i)}) - z_{g k}$ would be the estimate for the raw variance. More details are in the study by Anders and Huber.⁸
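The estimators in Equations (3.2.d) to (3.2.g) can be sketched in Python as follows. This is an illustration rather than the DESeq implementation: the function names and toy counts are hypothetical, genes with a zero count in any sample are skipped when computing the geometric means (an assumption made here so that the ratios are defined), and the local regression step is omitted.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors, Equation (3.2.d)."""
    n_samples = len(counts[0])
    usable = [row for row in counts if all(y > 0 for y in row)]
    # Geometric mean of each usable gene across samples, via logs.
    geo_means = [math.exp(sum(math.log(y) for y in row) / n_samples) for row in usable]
    return [
        median(row[i] / gm for row, gm in zip(usable, geo_means))
        for i in range(n_samples)
    ]


def group_moments(gene_counts, factors):
    """Return (lambda_hat, w, z) of Equations (3.2.e) to (3.2.g) for one group."""
    m_k = len(gene_counts)
    scaled = [y / m for y, m in zip(gene_counts, factors)]
    lam = sum(scaled) / m_k
    w = sum((q - lam) ** 2 for q in scaled) / (m_k - 1)
    z = lam / m_k * sum(1.0 / m for m in factors)
    return lam, w, z


# Hypothetical toy data: sample 2 has exactly twice the depth of sample 1.
toy_counts = [[10, 20], [30, 60], [50, 100]]
toy_factors = size_factors(toy_counts)
lam_hat, w_hat, z_hat = group_moments(toy_counts[0], toy_factors)
```

In this toy example, the size factors are 1/√2 and √2. Because the scaled counts of the first gene are identical across the two samples, w is zero and w − z is negative, which illustrates why a smoothed estimate is useful when the number of replicates is small.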

Algorithm Overview 5: Love et al.'s⁹ DESeq2

DESeq2 allows the normalization factors to be gene specific (m_gi), rather than being fixed across genes (m_i). The estimation of m_gi is implemented in their new R packages.⁹

When modeling dispersion parameters, a large variation in estimates usually arises because of small sample sizes. DESeq2 pools information across genes with similar average expression for the estimation of dispersions. To do this, one first estimates the dispersion of each gene separately with maximum likelihood. Then, one identifies a location parameter for the distribution of the estimates by fitting a smooth curve that depends on the average normalized expression, before finally shrinking the gene-specific dispersions toward the fitted curve using an empirical Bayesian approach. The authors stated that this procedure is superior to that of DESeq.

In order to avoid identifying differential expression in genes with small average expression, fold change estimates are shrunken toward 0 for genes with insufficient information by employing an empirical Bayesian shrinkage. The procedure is as follows: (1) obtain the maximum likelihood estimates for the log fold changes from the GLM fit, then (2) fit a normal distribution with mean 0 to the estimates, and (3) use that as the prior for a second GLM fit. The maximum a posteriori estimate and its standard error are the products of this procedure and are used for the calculation of Wald statistics for DEA.
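The effect of this three-step procedure can be illustrated with a simplified Python sketch. It is not the DESeq2 implementation: the second GLM fit is replaced here by a normal approximation to the likelihood of each log fold change, and the prior variance is estimated by a method-of-moments rule. Both simplifications, as well as the function name and the toy inputs, are assumptions made for the example.

```python
import math


def shrink_log_fold_changes(mle, se):
    """Shrink log fold changes toward 0 under a zero-mean normal prior.

    mle[g] and se[g] are the maximum likelihood estimate and standard error
    for gene g. Returns a list of (posterior_mean, posterior_sd, wald) tuples.
    """
    n = len(mle)
    # Step 2 (assumed rule): prior variance = spread of MLEs minus sampling noise.
    prior_var = max(sum(b * b for b in mle) / n - sum(s * s for s in se) / n, 1e-8)
    results = []
    for b, s in zip(mle, se):
        # Step 3 (normal approximation): conjugate normal-normal update.
        weight = prior_var / (prior_var + s * s)
        post_mean = weight * b
        post_sd = math.sqrt(weight) * s
        results.append((post_mean, post_sd, post_mean / post_sd))
    return results


# Hypothetical inputs: genes 1 and 2 share the same MLE but differ in precision.
toy_mle = [2.0, 2.0, -1.0, 0.5]
toy_se = [0.2, 1.5, 0.3, 0.4]
toy_shrunk = shrink_log_fold_changes(toy_mle, toy_se)
```

The second gene, whose estimate is imprecise, is shrunken much more strongly toward 0 than the first, although both have the same maximum likelihood estimate.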

DESeq2 computes a threshold, η, to filter genes based on their average normalized expressions. The threshold is calculated for maximizing the number of genes with a user-defined false discovery rate. The authors claimed that this filtering step effectively controls the power of detecting DE genes. The null hypothesis becomes $| β_{g p} | \leq η$ where $β_{g p}$ is the shrunken log fold change.

Finally, the method provides a way to diagnose outliers using the Cook's distance from the GLM within each gene, C_d. Samples are flagged when C_d exceeds the 99% quantile of an F distribution with P and N − P degrees of freedom, where P is the number of parameters and N is the number of samples. When a large number of replicates is available, influential data can be removed without removing the whole gene; however, when there is a small number of replicates, the entire gene with influential points should be removed from the analysis to preclude bias. More details on DESeq2's features can be found in the study by Love et al.⁹ In conclusion, DESeq2 is recommended by its authors as an improved solution to perform differential analysis because it adopts many competitive features.

Beta Binomial

Overview

A beta-binomial model is another alternative distribution to accommodate overdispersion.^11,12,42 The beta-binomial distribution has been used in the study by Baggerly et al.¹¹ to account for both between-library and within-library variations. The authors assumed that the true proportion of gene g within a library i, $θ_{g i},$ is library-specific and follows a beta distribution, $θ_{g i}$ ∼ Beta(α, β), and that the count Y_gi given θ_gi follows binomial (m_i, θ_gi). Zhou et al.¹² proposed a method, BBSeq, which also assumes a beta-binomial distribution and models the proportions of gene g within each sample with a logistic regression. To estimate overdispersion parameters, BBSeq either treats the parameter as free and maximizes the likelihood directly, or estimates the parameter through modeling the mean-overdispersion relationship.

Modeling

In a beta-binomial model, y_gi, the count of gene g in sample i, is converted to a proportion, θ_gi, where $θ_{g i} = \frac{y_{g i}}{\sum_{g = 1}^{G} y_{g i}}$. The model is constructed as:

\operatorname{logit} (E (θ_{g \cdot})) = \log (\frac{E (θ_{g \cdot})}{1 - E (θ_{g \cdot})}) = X β_{g},

(3.3.1)

where β_g is a vector of the regression coefficients for sample covariates and is the parameter for hypothesis testing, and θ_g· is a vector consisting of the proportions of gene g for samples 1 through N. With the beta-binomial distribution, we are no longer working with a log link but with a logit link. θ_gi follows a beta distribution with E(θ_gi) = logit⁻¹(Xβ_g) and var(θ_gi) = $ϕ_{g} E (θ_{g i}) (1 - E (θ_{g i}))$, where ϕ_g is the dispersion parameter. The hypothesis test is constructed as $H_{0} : β_{g C_{1}} = \dots = β_{g C_{K}}$, where $β_{g C_{k}}$ denotes the coefficient of the indicator variable that equals 1 for samples in group k and 0 otherwise.
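To make the role of the dispersion parameter concrete, the following Python sketch evaluates a beta-binomial probability mass function parameterized by the mean proportion π = E(θ) and the dispersion ϕ, so that var(θ) = ϕπ(1 − π). The mapping from (π, ϕ) to the two beta shape parameters is the standard one implied by that variance formula; the function name, the use of a library size m, and the numerical values are assumptions made for the example.

```python
import math


def beta_binomial_pmf(y, m, pi, phi):
    """P(Y = y) for Y | theta ~ binomial(m, theta), theta ~ Beta(a, b).

    pi is E(theta) and phi (0 < phi < 1) satisfies var(theta) = phi * pi * (1 - pi).
    """
    scale = 1.0 / phi - 1.0          # equals a + b
    a, b = pi * scale, (1.0 - pi) * scale

    def log_beta(u, v):
        return math.lgamma(u) + math.lgamma(v) - math.lgamma(u + v)

    log_choose = math.lgamma(m + 1) - math.lgamma(y + 1) - math.lgamma(m - y + 1)
    return math.exp(log_choose + log_beta(y + a, m - y + b) - log_beta(a, b))


m, pi, phi = 20, 0.3, 0.1            # hypothetical values
probs = [beta_binomial_pmf(y, m, pi, phi) for y in range(m + 1)]
bb_total = sum(probs)
bb_mean = sum(y * p for y, p in enumerate(probs))
bb_var = sum((y - bb_mean) ** 2 * p for y, p in enumerate(probs))
```

With these values, the mean is mπ = 6 and the variance is mπ(1 − π)[1 + (m − 1)ϕ] = 12.18, compared with the binomial variance of 4.2, so ϕ directly controls the amount of overdispersion.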

Bayesian and Empirical Bayesian

Overview

RNA-seq DEA can be modeled in a Bayesian framework using various parametric and nonparametric priors. Van de Wiel et al.¹³ proposed a Bayesian method, ShrinkSeq, which either assumes an informative prior for the overdispersion such as the Dirac–Gaussian prior or estimates one with the empirical Bayesian approach. An empirical Bayesian approach differs from a fully Bayesian approach in that it borrows information from the data to elicit priors for overdispersion parameters. For estimating posteriors, Van de Wiel and others¹³ adopted integrated nested Laplace approximations, a method that only considers marginal posteriors, but added a direct maximization of the marginal likelihood to allow information sharing from joint posteriors. They further suggested that the use of informative priors for shrinkage, as in ShrinkSeq, can ensure stability and accommodate multiplicity correction. They also suggested that shrinkage should be applied not only to overdispersion parameters but also to the regression coefficient parameters. baySeq, proposed by Hardcastle and Kelly,¹⁴ constructs the data with tuples grouping genes together based on the study of interest. The distribution of a tuple shares the parameters of some prior distribution so that one can consider many hypotheses for testing beyond two-group comparison. The method assumes a negative binomial distribution for the data. baySeq first estimates the empirical distribution on the set of parameters for null and alternative models with the quasi-likelihood approach. Then, it estimates the prior probabilities starting from a prior followed by an iterative process updating the priors until convergence. The authors suggested using a log posterior probability ratio of DE for DEA and noted that the posterior probability of DE for each individual model can be conveniently summed up for hypothesis testing.

Modeling

A Bayesian GLM for RNA-seq can be set as:

Y_{g i} \overset{d}{=} F_{μ_{g i}, γ_{g}},

(3.4.1)

where γ_g is a vector of parameters not in the regression. The model is flexible in that F can be negative binomial or another distribution. Suppose F is a negative binomial distribution; then y_gi ∼ Poisson$(μ_{g i})$ and $μ_{g i}$ follows a gamma distribution, $μ_{g i}$ ∼ Gamma($e^{η_{g i}}, γ_{g}$), where $η_{g i}$ and γ_g are hyperparameters and $η_{g i} = X β_{g} = β_{g 0} + \sum_{p = 1}^{P} β_{g p} x_{i p}$. Here $x_{i p}$ is the value of the pth covariate for sample i, such as the group indicator with coefficient $β_{g 1}$ in a two-group comparison. With g(·) as a link function, $μ_{g i} = g^{- 1} (η_{g i})$. The conditional posterior distribution for β is proportional to the product of its prior and the likelihood:

P (β | γ_{g}, y_{g i}) \propto P (β) \prod_{i} \frac{{\exp (X β)}^{y_{g i}}}{{(1 + \exp (X β))}^{y_{g i} + γ_{g}}}

(3.4.2)

Each parameter has its respective informative prior and one has to specify priors conditional on the model of interest as well as the prior itself to reach the posterior probability. For testing, a null hypothesis of β_g ≤ prior under the null is used.

Algorithm Overview 6: Van de Wiel et al.'s¹³ ShrinkSeq

ShrinkSeq assumes that α is an unknown hyperparameter from the collection of all unknown hyperparameter vectors, A. It uses direct maximization of the marginal likelihood for the estimation of A; this method is a modified version of INLA.⁴³ The procedure for finding α is shown below and is said to be analogous to the EM algorithm:

1. Initiate l = 0 and α⁽⁰⁾_b for b = 1, …, B.

2. Use INLA to estimate the posteriors $π_{A^{(l)}} (θ_{g} | Y_{g})$.

3. Obtain $α_{b}^{(l + 1)}$ for b = 1, …, B with ML'.

4. Iterate from step 2 until convergence.

Notes: let B be the number of informative priors and α^(l)_b be the bth element of A^(l) at iteration l; let $π_{A^{(l)}} (θ_{g} | Y_{g})$ be the posterior of θ_g conditional on data Y_g with A^(l) as the current estimate of A. ML' is $α^{M L', (l + 1)} = \operatorname{argmax}_{α} \sum_{s = 1}^{S} \log (π_{α} (z_{s, A^{(l)}}))$, where the sum is the prior log-likelihood evaluated at $z_{A^{(l)}}$, a large independent sample of size S from $π_{A^{(l)}}^{EmpBayes} (θ)$; ML' has the same mechanism as maximum likelihood.

Dirac–Gaussian and Gaussian–Dirac–Gaussian mixture priors:

π (β) = p_{0} δ_{0} + (1 - p_{0}) N (β; 0, τ^{2}),

(3.4.a)

π (β) = p_{- 1} N (β; μ_{- 1}, τ_{- 1}^{2}) + p_{0} δ_{0} + p_{1} N (β; μ_{1}, τ_{1}^{2}),

(3.4.b)

The subscripts of p, ie, −1, 0, and 1, indicate the locations. For example, the Dirac mass at 0 is denoted as δ₀. The p's are probabilities, so p₋₁, p₀, and p₁ sum to 1, ie, $p_{0} = 1 - p_{- 1} - p_{1}$; in addition, $μ_{- 1} < 0$ and $μ_{1} > 0$. Priors with positive mass at zero were intentionally selected because they reflect the non-DE condition. For more details on priors, please refer to the study by Van de Wiel et al.¹³

Algorithm Overview 7: Hardcastle and Kelly's¹⁴ baySeq

The tuple system in baySeq is as follows. Let a model be denoted as M. E refers to the collection of sets of samples that defines the model, {E₁, …, E_l}. κ represents the set of parameters of model M, ie, {θ₁, …, θ_l}. Let q = 1, …, l index the underlying distributions of the model. For example, samples in groups 1, 2, and 3 (C₁, C₂, C₃) may be arranged so that groups 1 and 2 are equivalently distributed and group 3 stands alone: $M = \{\{A_{i \in C_{1}}, A_{i \in C_{2}}\}, \{A_{i \in C_{3}}\}\}$, where A denotes a sample. D_t is the data in tuple t, $D_{t} = \{\{y_{1 t}, …, y_{i t}, …, y_{n_{t} t}\}, \{m_{1}, …, m_{i}, …, m_{n_{t}}\}\}$, where y_it is the count in tuple t for sample i and m_i is the library size. The posterior probability of a model given the data is:

P (M | D_{t}) = \frac{P (D_{t} | M) P (M)}{P (D_{t})}

(3.4.c)

P (D_{t} | M) = \int P (D_{t} | κ, M) P (κ | M) d κ

(3.4.d)

Suppose that a sample A_i is in the set E_q and that its count at a particular tuple t, y_it, follows a negative binomial distribution with parameters $(μ_{i t}, ϕ_{q})$, where $θ_{q} = (λ_{q}, ϕ_{q})$. The mean count μ_it is the product of the library size scaling factor, m_i, and the proportion of reads in set E_q, λ_q. We have:

\begin{array}{l} P (D_{t} | κ, M) = P (y_{i t} | m_{i}, θ_{q}) \\ = \frac{Γ (y_{i t} + ϕ_{q}^{- 1})}{Γ (ϕ_{q}^{- 1}) Γ (y_{i t} + 1)} {(\frac{1}{1 + μ_{i t} ϕ_{q}})}^{ϕ_{q}^{- 1}} {(\frac{μ_{i t}}{ϕ_{q}^{- 1} + μ_{i t}})}^{y_{i t}} \end{array}

(3.4.e)
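As a numerical illustration of Equation (3.4.e), the sketch below evaluates the negative binomial probability of a single count, with the mean built as μ_it = m_i λ_q. It is not baySeq code: the function name `nb_logpmf` and the example values of m_i, λ_q, and ϕ_q are assumptions chosen only for illustration.

```python
import math

def nb_logpmf(y, mean, dispersion):
    """Log of the negative binomial probability in Equation (3.4.e)."""
    inv_disp = 1.0 / dispersion                       # phi_q^{-1}
    return (math.lgamma(y + inv_disp)
            - math.lgamma(inv_disp)
            - math.lgamma(y + 1)
            + inv_disp * math.log(1.0 / (1.0 + mean * dispersion))
            + y * math.log(mean / (inv_disp + mean)))

# Hypothetical values: library size m_i, read proportion lambda_q, dispersion phi_q.
library_size = 1_000_000
read_proportion = 2e-5
dispersion = 0.2
mean_count = library_size * read_proportion           # mu_it = m_i * lambda_q

# Sanity check: the probabilities over a wide range of counts sum to about one.
total_probability = sum(math.exp(nb_logpmf(y, mean_count, dispersion))
                        for y in range(2000))
```

Working on the log scale with `math.lgamma` avoids overflow in the gamma functions when counts are large.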

baySeq first estimates the empirical distribution on the set of parameters for the null and alternative models through sampling from a negative binomial distribution and a quasi-likelihood approach.³⁸ Then, it estimates the prior probabilities of the models, starting from an initial prior and iteratively updating the priors until convergence. For detailed steps, please refer to Hardcastle and Kelly.¹⁴ Hypothesis testing can be easily expressed with the tuple system; for instance, in a two-group case,

\begin{array}{l} H_{0} (\text{non-DE}) : \{A_{i \in C_{1}}, A_{i \in C_{2}}\} \\ H_{1} (\text{DE}) : \{A_{i \in C_{1}}\} \text{ and } \{A_{i \in C_{2}}\} \end{array}

Nonparametric

Overview

In this section, we discuss two nonparametric methods for RNA-seq DEA: SAMseq by Li and Tibshirani¹⁵ and NOISeq by Tarazona et al.¹⁶ In SAMseq, Li and Tibshirani¹⁵ calculated a modified two-sample Wilcoxon statistic using the ranked counts for a two-group comparison.⁴⁴ The authors proposed two resampling strategies for producing equal sequencing depths across samples, downsampling and Poisson sampling, and suggested that ties be broken by adding a small random number during resampling. NOISeq by Tarazona et al.¹⁶ first uses pseudocounts corrected by the library size under two conditions (K = 2) to calculate the log ratio (L) and the absolute value of the difference (D). A test statistic is then derived from L and D under a null hypothesis of no differential expression; in other words, under the null, L and D are indistinguishable from noise whose distribution is estimated from either the real or simulated data.

Modeling

The two nonparametric methods discussed here are explained separately in the text boxes, as each has a unique model setup.

Algorithm Overview 8: Li and Tibshirani's¹⁵ SAMseq

To use SAMseq, one ranks the counts of gene g across samples and denotes the ordered counts as y'_g1 … y'_gN. If needed, a resampling strategy may be used to fulfill the Wilcoxon test's requirement of equal sequencing depths across samples.

In the case of a sufficient minimal sequencing depth, the authors proposed a downsampling strategy in which one first identifies the smallest sequencing depth, m_min = min(m₁, …, m_N), keeps the list of counts of that sample, and resamples the lists of counts of all other samples to the sequencing depth m_min. Every read is retained with success probability m_min/m_i and discarded with the complementary probability, ie, the resampled count is

{y^{'}}_{g i} \sim \text{Binomial} (y_{g i}, \frac{m_{\min}}{m_{i}}) .

(3.5.a)

In the case of an insufficient minimal sequencing depth, Li and Tibshirani¹⁵ introduced a Poisson sampling strategy, which employs the geometric mean of the sequencing depths of all samples:

{y^{'}}_{g i} \sim \text{Poisson} (\frac{\bar{m}}{m_{i}} y_{g i}),

(3.5.b)

where y'_gi is the resampled count and $\bar{m} = {(\prod_{i = 1}^{N} m_{i})}^{1 / N}$. Small random numbers are introduced into the resampling process to break ties, and the resampling is repeated multiple times to ensure stability. Poisson sampling is generally preferred based on the simulation.¹⁵ In cases where m_i is unknown, one could use normalization methods to estimate it. Differential expression of gene g is identified based on a comparison of the ranks of gene g between the two sample groups.
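The two resampling strategies can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SAMseq implementation: the sample names, the toy counts, and the use of each sample's total count as its depth m_i are hypothetical, the helper names are invented here, and the tie-breaking noise and repeated resampling are omitted.

```python
import math
import random

def downsample(counts, depth, min_depth, rng):
    """Binomial thinning (3.5.a): keep each read with probability min_depth / depth."""
    keep_prob = min_depth / depth
    return [sum(rng.random() < keep_prob for _ in range(y)) for y in counts]

def geometric_mean(depths):
    """Geometric mean of the sequencing depths."""
    return math.exp(sum(math.log(m) for m in depths) / len(depths))

def poisson_draw(rate, rng):
    """Knuth's multiplication method; adequate only for small rates."""
    threshold = math.exp(-rate)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def poisson_resample(counts, depth, mean_depth, rng):
    """Poisson sampling (3.5.b): draw each count with mean (mean_depth / depth) * y."""
    return [poisson_draw(mean_depth / depth * y, rng) for y in counts]

rng = random.Random(0)
# Hypothetical counts of four genes in two samples.
samples = {"s1": [12, 0, 7, 30], "s2": [25, 3, 10, 61]}
# Assumption: the depth of a sample is its total count.
depths = {name: sum(counts) for name, counts in samples.items()}
min_depth = min(depths.values())
mean_depth = geometric_mean(list(depths.values()))

thinned = {name: downsample(counts, depths[name], min_depth, rng)
           for name, counts in samples.items()}
poissoned = {name: poisson_resample(counts, depths[name], mean_depth, rng)
             for name, counts in samples.items()}
```

The sample with the smallest depth is left unchanged by the downsampling because its retention probability is exactly 1.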

Algorithm Overview 9: Tarazona et al.'s¹⁶ NOISeq

In NOISeq, for each $y_{g k (i)}$, the count of gene g in sample i from group k, the library size used for correction, m_k(i), is the sum of counts over all genes for the ith sample replicate in condition k. Let m_k(i) be simplified as m_i. One works with pseudocounts (after normalization) formulated as ${\tilde{y}}_{g k (i)} = y_{g k (i)} \times 10^{6} / m_{i}$.

With the pseudocounts, the log ratio (L) and the absolute value of the difference (D) are calculated. The pseudocounts are first summarized over the samples in each group, ${\tilde{y}}_{g C_{k}} = \sum_{i \in C_{k}} {\tilde{y}}_{g i}$, and then $L_{g} = \log_{2} (\frac{{\tilde{y}}_{g C_{1}}}{{\tilde{y}}_{g C_{2}}})$ and $D_{g} = | {\tilde{y}}_{g C_{1}} - {\tilde{y}}_{g C_{2}} |$, where C₁ and C₂ denote groups 1 and 2, respectively. Zero counts are replaced by 0.5 or by mid(0, normalized minimum expression) when calculating L_g. Samples with only zeros are dropped.

Null hypothesis: in the absence of DE, the L and D values are no different from noise. The probability distributions of the noise random variables L^* and D^* are estimated from either real or simulated data. One then obtains the probability of DE as:

\begin{matrix} P (D E_{g} = 1 | {\tilde{y}}_{g C_{1}} - {\tilde{y}}_{g C_{2}}) \\ = P (D E_{g} = 1 | L_{g} = l_{g}, D_{g} = d_{g}) \\ = P (| L * | < | l_{g} |, D * < d_{g}) \end{matrix}

(3.5.c)

DE_g equals 1 when gene g is DE. Note that the log ratio is taken in absolute value because a change in either direction indicates DE. See the study by Tarazona et al.¹⁶ for more details.
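The calculation of L_g, D_g, and the probability in Equation (3.5.c) can be sketched as follows. This is a minimal illustration rather than the NOISeq implementation: the function names, the toy counts, and the small list of noise pairs (L^*, D^*) are hypothetical, and the probability is approximated by the empirical fraction of noise pairs that satisfy both inequalities.

```python
import math

def group_expression(counts, library_sizes):
    """Sum over replicates of the library-size-corrected pseudocounts."""
    return sum(y * 1e6 / m for y, m in zip(counts, library_sizes))

def log_ratio_and_difference(expr1, expr2, zero_replacement=0.5):
    """Return (L_g, D_g); zeros are replaced only when computing the log ratio."""
    e1 = expr1 if expr1 > 0 else zero_replacement
    e2 = expr2 if expr2 > 0 else zero_replacement
    return math.log2(e1 / e2), abs(expr1 - expr2)

def prob_de(l_g, d_g, noise):
    """Fraction of noise pairs (L*, D*) with |L*| < |l_g| and D* < d_g."""
    hits = sum(1 for l_star, d_star in noise
               if abs(l_star) < abs(l_g) and d_star < d_g)
    return hits / len(noise)

# Hypothetical noise pairs, e.g. obtained by comparing replicates of one condition.
noise = [(0.1, 2.0), (-0.3, 5.0), (0.8, 20.0), (-1.5, 40.0)]

# Hypothetical gene with two replicates per group and library sizes of one million.
expr1 = group_expression([40, 40], [1_000_000, 1_000_000])
expr2 = group_expression([10, 10], [1_000_000, 1_000_000])
l_g, d_g = log_ratio_and_difference(expr1, expr2)
probability_de = prob_de(l_g, d_g, noise)
```

A gene is more likely to be called DE when both its log ratio and its absolute difference exceed most of the noise values, which is why the two conditions are combined.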

Statistical Testing

After performing parameter estimation for a statistical model, the significance of differential expression can be assessed by comparing the expression of gene g among K groups. Assume that $λ_{g k (i)}$ is the expression level of gene g in sample i belonging to sample group k, and that ϕ_g is the dispersion parameter. DE tests are proposed below for the null hypothesis (H₀):

λ_{g 1} = … = λ_{g K} .

In the parametric regime, one can employ the classic log-likelihood ratio test:

L R_{g} = \frac{2 (l_{g} (\hat{λ}, Y_{g}) - l_{g} ({\hat{λ}}^{0}, Y_{g}))}{{\hat{ϕ}}_{g}} \sim F_{K - 1, N - K}

(4.1)

In the absence of overdispersion,

L R_{g} = 2 (l_{g} (\hat{λ}, Y_{g}) - l_{g} ({\hat{λ}}^{0}, Y_{g})) \sim χ^{2} (K - 1)

(4.2)

where l_g denotes the log-likelihood function for the gth gene, and $l_{g} (\hat{λ}, Y_{g})$ and $l_{g} ({\hat{λ}}^{0}, Y_{g})$ denote the log-likelihood evaluated at the MLEs of the biological and experimental effects under the full model and the null model, respectively.
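A small worked instance of Equation (4.2) may help. The sketch below assumes a Poisson model without overdispersion, two groups (K = 2), and equal library sizes, so that the MLE of each rate is a plain mean; these simplifications, the function names, and the toy counts are assumptions made for illustration only.

```python
import math

def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def poisson_loglik(counts, rate):
    """Poisson log-likelihood of the counts at a common rate."""
    return sum(_xlogy(y, rate) - rate - math.lgamma(y + 1) for y in counts)

def poisson_lrt(group1, group2):
    """Return (LR statistic, P-value) for H0: equal rates in the two groups."""
    rate1 = sum(group1) / len(group1)        # MLE under the full model
    rate2 = sum(group2) / len(group2)
    pooled = group1 + group2
    rate0 = sum(pooled) / len(pooled)        # MLE under the null model
    loglik_full = poisson_loglik(group1, rate1) + poisson_loglik(group2, rate2)
    loglik_null = poisson_loglik(pooled, rate0)
    lr = max(2.0 * (loglik_full - loglik_null), 0.0)
    # With K - 1 = 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return lr, math.erfc(math.sqrt(lr / 2.0))

# Hypothetical counts of one gene in two groups of three replicates.
lr_stat, p_value = poisson_lrt([5, 6, 4], [50, 55, 60])
```

For more than two groups the same statistic applies with K - 1 degrees of freedom, but the P-value then requires a general chi-square tail function.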

An exact test for the negative binomial distribution, analogous to Fisher's exact test, is used by methods such as edgeR and DESeq. By conditioning on the total sum, one can calculate the probability of observing counts as extreme as or more extreme than those actually obtained, resulting in an exact P-value. Note that a sum of gene counts from all replicates in one group that is either too large or too small indicates differential expression, so a two-sided test is used.
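The conditioning argument can be made concrete with a short sketch. It is not the edgeR or DESeq code: it assumes equal library sizes and a known dispersion shared by both groups, so that the sum of n replicate counts is again negative binomial with mean nμ and dispersion ϕ/n. The function names and toy counts are hypothetical.

```python
import math

def nb_logpmf(y, mean, dispersion):
    """Negative binomial log-probability with the given mean and dispersion."""
    size = 1.0 / dispersion
    return (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
            + size * math.log(size / (size + mean))
            + y * math.log(mean / (size + mean)))

def nb_exact_test(group1, group2, dispersion):
    """Two-sided exact P-value, conditioning on the total count of both groups."""
    n1, n2 = len(group1), len(group2)
    s1, total = sum(group1), sum(group1) + sum(group2)
    if total == 0:
        return 1.0
    mean = total / (n1 + n2)                 # pooled mean under H0

    def joint_logp(a):
        # Probability that group 1 sums to a and group 2 to total - a.
        return (nb_logpmf(a, n1 * mean, dispersion / n1)
                + nb_logpmf(total - a, n2 * mean, dispersion / n2))

    logps = [joint_logp(a) for a in range(total + 1)]
    observed = logps[s1]
    shift = max(logps)
    probs = [math.exp(lp - shift) for lp in logps]
    # Sum the probabilities of all splits no more likely than the observed one.
    extreme = sum(p for p, lp in zip(probs, logps) if lp <= observed + 1e-9)
    return extreme / sum(probs)

# Hypothetical counts and dispersion.
p_different = nb_exact_test([1, 2], [40, 50], dispersion=0.1)
p_balanced = nb_exact_test([10, 10], [10, 10], dispersion=0.1)
```

Summing over every split that is no more probable than the observed one is what makes the test two-sided: unusually small and unusually large group sums both count as extreme.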

A score statistic is used by PoissonSeq to test the significance of the association between the expression of gene g and the groups. In the context of gene counts with unknown dispersion parameters, a score test is as follows:

S_{g} = \sum_{k = 1}^{K} \sum_{i \in C_{k}} \frac{w_{g} {(y_{g i} - {\hat{μ}}_{g i})}^{2}}{ϕ_{g} v ({\hat{μ}}_{g i})} \sim F_{K - 1, N - K}

(4.3)

where w_g is a known weight, ${\hat{μ}}_{g i}$ is estimated by MLE under the null hypothesis, and $v ({\hat{μ}}_{g i})$ is the variance function of $μ_{g i}$.

The Wilcoxon statistic, a rank-transformed version of the t-statistic, is used by the nonparametric method SAMseq:

W_{g} = \sum_{i \in C_{k}} r_{g i} - r_{0},

(4.4)

where r_gi is the rank of y_gi across samples and $r_{0} = (\sum I_{(i \in C_{k})}) (N + 1) / 2$ (r₀ is used to make E(W_g) = 0). W_g > 0 indicates that gene g is overexpressed in group k.
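Equation (4.4) is simple enough to compute by hand, as the following sketch shows. The function names and toy counts are hypothetical, and tied counts share their average rank here, which is an assumption of this example; SAMseq instead breaks ties with small random numbers during resampling.

```python
def average_ranks(values):
    """Ranks 1..N across samples; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1       # average of ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def centered_rank_sum(counts, in_group):
    """W_g: rank sum of the samples in group k minus its null expectation r_0."""
    ranks = average_ranks(counts)
    group_size = sum(in_group)
    r0 = group_size * (len(counts) + 1) / 2
    return sum(r for r, member in zip(ranks, in_group) if member) - r0

# Hypothetical counts of one gene in six samples; the last three form group k.
counts = [5, 8, 3, 20, 25, 30]
in_group = [False, False, False, True, True, True]
w_g = centered_rank_sum(counts, in_group)
```

Here the three samples of group k hold ranks 4, 5, and 6, so the rank sum is 15 and r₀ = 3 × 7 / 2 = 10.5, giving W_g = 4.5.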

Under a Bayesian or empirical Bayesian framework, methods like baySeq use the posterior probability of each model per gene to identify differential expression; for the null model,

P (M_{H_{0}} | Y_{g}) = \frac{P (Y_{g} | M_{H_{0}}) P (M_{H_{0}})}{P (Y_{g})},

(4.5)

where M denotes a model. The ratio of the posterior probability of DE to that of non-DE is often used.
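To make Equation (4.5) concrete, the sketch below applies Bayes' rule to two candidate models of one gene. The log marginal likelihoods and prior probabilities are hypothetical numbers, not output from baySeq, and the function name is invented for this example.

```python
import math

def posterior_model_probs(log_marginals, priors):
    """P(M | Y_g) = P(Y_g | M) P(M) / P(Y_g), computed stably on the log scale."""
    log_joint = {model: log_marginals[model] + math.log(priors[model])
                 for model in log_marginals}
    shift = max(log_joint.values())
    weights = {model: math.exp(value - shift)
               for model, value in log_joint.items()}
    evidence = sum(weights.values())         # proportional to P(Y_g)
    return {model: weight / evidence for model, weight in weights.items()}

# Hypothetical log P(Y_g | M) and prior P(M) for the two models of one gene.
log_marginals = {"non-DE": -120.0, "DE": -116.0}
priors = {"non-DE": 0.9, "DE": 0.1}

posterior = posterior_model_probs(log_marginals, priors)
posterior_odds_de = posterior["DE"] / posterior["non-DE"]
```

In this toy case the data favor the DE model by a factor of exp(4), which outweighs the 9:1 prior odds against it and yields posterior odds of about 6 in favor of DE.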

The choice of a testing strategy is a decision that often depends on the chosen method and other factors such as sample size. With a small sample size, the large-sample approximations based on the Wald test, score test, and likelihood ratio test are questionable and an exact test is usually preferred.³⁶ We summarize testing strategies that are plausible for each method in Table 2.

Finally, almost all the methods mentioned in this paper use standard approaches for multiple hypothesis correction to control false discoveries.^45,46 PoissonSeq is an exception: it builds its own estimate of the false discovery rate (FDR) from a permutation test. The permutation test calculates a score statistic per gene, S_g, for H_0g vs H_ag, and recalculates it each time the outcome is permuted. For B permutations, the same procedure is applied to calculate the null statistics $S_{g}^{0 b}$ for b = 1, …, B. The permutation P-value is:

p_{g} = \frac{\sum_{b = 1}^{B} \sum_{i = 1}^{N} I \{S_{g}^{0 b} > S_{g}\} + 1}{N \times B + 1}

(4.6)

For Bayesian methods, since posterior probabilities are computed, the Bayesian FDR or local FDR is conveniently used. The local false discovery rate (lFDR_g) is simply the posterior probability $π_{0 g}$:

l F D R_{g} = P (M_{H_{0}} | Y_{g}) = P_{0} / (P_{0} + P_{1}),

(4.7)

where

P_{0} = \int_{- \infty}^{△} P (Y_{g} | λ_{g} = λ) π (λ) d λ,

and

P_{1} = \int_{△}^{\infty} P (Y_{g} | λ_{g} = λ) π (λ) d λ,

δ denotes prior. Bayesian false discovery rate (BFDR) is calculated as:

B F D R (t) = \frac{\sum_{g = 1}^{G} l F D R_{g} \times I \{π_{0 g} < t\}}{\sum_{g = 1}^{G} I \{π_{0 g} < t\}} .

(4.8)

Note that $I \{π_{0 g} < t\} = I \{π_{1 g} \geq t\}$ for small t of interest.
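Equation (4.8) amounts to averaging the local FDR over the genes that pass the threshold. A minimal sketch, with a hypothetical function name and made-up posterior null probabilities:

```python
def bayesian_fdr(local_fdr, threshold):
    """BFDR(t): mean local FDR among genes whose null posterior is below t."""
    selected = [value for value in local_fdr if value < threshold]
    if not selected:
        return None                          # no gene is called DE at this threshold
    return sum(selected) / len(selected)

# Hypothetical posterior null probabilities (lFDR_g = pi_0g) for five genes.
local_fdr = [0.01, 0.03, 0.2, 0.6, 0.9]
bfdr = bayesian_fdr(local_fdr, threshold=0.1)
```

With t = 0.1 two genes are selected, and the expected proportion of false discoveries among them is (0.01 + 0.03) / 2 = 0.02.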

Conclusion

RNA-seq data analysis is a relatively new and rapidly growing research area, and the statistical models used for sequencing data have been evolving. The Poisson distribution, which was proposed first, has become obsolete because it fails to accommodate the overdispersion commonly observed in RNA-seq data. In a parametric framework, the negative binomial distribution is the most common assumption for modeling the marginal distribution, because it accommodates both technical and biological variation.^8,9,33,36 Other available methods that account for overdispersion include the generalized Poisson distribution,³⁵ negative binomial power distribution,¹⁰ and beta-binomial distribution,^11,12 as well as nonparametric models^15,16 and Bayesian methods.^13,14 Table 2 summarizes all the methods reviewed in this paper.

Readers who are interested in the performance evaluation and comparison of the available methods can refer to the original papers as well as the body of literature on this issue. For instance, in the study by Seyednasrollah et al.²², DESeq was recommended as one of the most robust methods, and caution was advised when dealing with a small number of replicates regardless of which method is used. Similarly, Soneson and Delorenzi²¹ advise caution when interpreting results drawn from a small number of replicates and show that SAMseq surpasses many of the other methods reviewed. In the study by Rapaport et al.²³, DESeq, edgeR, and baySeq, which all assume a negative binomial model, have better specificity, sensitivity, and control of false positive errors than methods that are not based on the negative binomial model. As the technology continues to improve and empirical data accumulate, more compelling statistical models for RNA-seq data can be expected.

Footnotes

Author Contributions

Conceived and designed the experiments: HCH, YN, LXQ. Reviewed the literature: HCH, YN, LXQ. Wrote the first draft of the manuscript: HCH, YN. Contributed to the writing of the manuscript: HCH, YN, LXQ. Agree with manuscript results and conclusions: HCH, YN, LXQ. Jointly developed the structure and arguments for the paper: HCH, YN, LXQ. Made critical revisions and approved final version: HCH, YN, LXQ. All authors reviewed and approved of the final manuscript.

References

1. Wang, Gerstein, Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009; 10: 57–63.

2. Malone J.H., Oliver. Microarrays, deep sequencing and the true measure of the transcriptome. BMC Biol. 2011; 9: 34.

3. Dillies M.A., Rau, Aubert; French StatOmique Consortium. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform. 2012; 14(6): 671–83.

4. Wang, Feng, Wang, Zhang. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics. 2010; 26: 136–8.

5. Langmead, Hansen K.D., Leek J.T. Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol. 2010; 11: R83.

6. Li, Witten D.M., Johnstone I.M., Tibshirani. Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics. 2012; 13: 523–38.

7. Robinson M.D., McCarthy D.J., Smyth G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26: 139–40.

8. Anders, Huber. Differential expression analysis for sequence count data. Genome Biol. 2010; 11: R106.

9. Love M.I., Huber, Anders. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014; 15(12): 550.

10. Di, Schafer D.W., Cumbie J.S., Chang J.H. The NBP negative binomial models for assessing differential gene expression from RNA-seq. Stat Appl Genet Mol Biol. 2011; 10: 1.

11. Baggerly K.A., Deng, Morris J.S., Aldaz C.M. Differential expression in SAGE: accounting for normal between-library variation. Bioinformatics. 2003; 19(12): 1477–83.

12. Zhou Y.H., Xia, Wright F.A. A powerful and flexible approach to the analysis of RNA sequence count data. Bioinformatics. 2011; 27: 2672–8.

13. Van de Wiel M.A., Leday G.G., Pardo, Rue, Van de Vaart A.W., Van Wieringen W.N. Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors. Biostatistics. 2013; 14: 113–28.

14. Hardcastle T.J., Kelly K.A. baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics. 2010; 11: 422.

15. Li, Tibshirani. Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-seq data. Stat Methods Med Res. 2011; 22(5): 519–36.

16. Tarazona, Garcia-Alcalde, Ferrer, Dopazo, Conesa. Differential expression in RNA-seq: a matter of depth. Genome Res. 2011; 21: 2213–23.

17. Marioni J.C., Mason C.E., Mane S.M., Stephens, Gilad. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008; 18: 1509–17.

18. Bullard J.H., Purdom, Hansen K.D., Dudoit. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010; 11: 94.

19. Oshlack, Robinson M.D., Young M.D. From RNA-Seq reads to differential expression results. Genome Biol. 2010; 11: 220.

20. Kvam V.M., Liu, Si. A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data. Am J Bot. 2012; 99: 248–56.

21. Soneson, Delorenzi. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics. 2013; 14: 91.

22. Seyednasrollah, Laiho, Elo. Comparison of software packages for detecting differential expression in RNA-seq studies. Brief Bioinform. 2015; 16(1): 59–70.

23. Rapaport, Khanin, Liang. Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol. 2013; 14: R95.

24. Leek J.T., Scharpf R.B., Bravo H.C. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet. 2010; 11: 733–9.

25. Chen, McCarthy, Robinson, Smyth G.K. edgeR: differential expression analysis of digital gene expression data. User's Guide; http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf 2015.

26. Mortazavi, Williams B.A., McCue, Schaeffer, Wold. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008; 5(7): 621–8.

27. Robinson M.D., Oshlack. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010; 11: R25.

28. Anders, McCarthy, Chen. Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nat Protoc. 2013; 8: 1765–86.

29. Risso, Schwartz, Sherlock, Dudoit. GC-content normalization for RNA-Seq data. BMC Bioinformatics. 2011; 12: 480.

30. Benjamini, Speed. Estimation and correction for GC-content bias in high throughput sequencing. Nucleic Acids Res. 2011; 40(10): e72.

31. Hansen K.D., Irizarry R.A., Wu. Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics. 2012; 13(2): 204–16.

32. Bolstad B.M., Irizarry R.A., Astrand, Speed T.P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003; 19: 185–93.

33. Robinson M.D., Smyth G.K. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007; 23: 2881–7.

34. Auer P.L., Doerge R.W. A two-stage Poisson model for testing RNA-seq data. Stat Appl Genet Mol Biol. 2011; 10: 26.

35. Srivastava, Chen. A two-parameter generalized Poisson model to improve the analysis of RNA-seq data. Nucleic Acids Res. 2010; 38: e170.

36. Robinson M.D., Smyth G.K. Small-sample estimation of negative binomial dispersion with applications to SAGE data. Biostatistics. 2008; 9: 321–32.

37. Smyth G.K. Pearson's goodness of fit statistic as a score test statistic. In: Goldstein D.R., ed. Science and Statistics: A Festschrift for Terry Speed. Hayward, CA: Institute of Mathematical Statistics; 2003: 115–26. [IMS Lecture Notes Monograph Series 40].

38. Nelder J.A., Lee. Likelihood, quasi-likelihood and pseudolikelihood: some comparisons. J Roy Stat Soc B. 1992; 54: 273–84.

39. Cox D.R., Reid. Parameter orthogonality and approximate conditional inference. J Roy Stat Soc B. 1987; 49(1): 1–39.

40. Dimont, Shi, Kirchner, Hide. edgeRun: an R package for sensitive, functionally relevant differential expression discovery using an unconditional exact test. Bioinformatics. 2015; 31(15): 2589–90.

41. Reid. The roles of conditioning on inference. Stat Sci. 1995; 10(2): 138–57.

42. Zhang, Zhou, Velculescu V.E. Gene expression profiles in normal and cancer cells. Science. 1997; 276(5316): 1268–72.

43. Rue, Martino, Chopin. Approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations. J Roy Stat Soc B. 2009; 71: 319–92.

44. Wilcoxon. Individual comparisons by ranking methods. Biometric Bulletin. 1945; 1(6): 80–3.

45. Benjamini, Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B. 1995; 57(1): 289–300. MR 1325392.

46. Storey J.D. The positive false discovery rate: a Bayesian interpretation and the q-value. Ann Stat. 2002; 31(6): 2013–35.
