Abstract
Deep sequencing has recently emerged as a powerful alternative to microarrays for the high-throughput profiling of gene expression. In order to account for the discrete nature of RNA sequencing data, new statistical methods and computational tools have been developed for the analysis of differential expression to identify genes that are relevant to a disease such as cancer. In this paper, it is thus timely to provide an overview of these analysis methods and tools. For readers with statistical background, we also review the parameter estimation algorithms and hypothesis testing strategies used in these methods.
Introduction
In the past decade, deep sequencing has emerged as a powerful alternative to microarrays for the high-throughput profiling of gene expression. Comparing with microarrays, RNA sequencing (RNA-seq) possesses a number of technological advantages such as a wider dynamic range and the freedom from predesigned probes.1–3 It also comes with a unique data feature as discrete sequencing reads. In order to account for this unique data feature, statistical methodologies and computational algorithms have been developed based on various data distributional assumptions such as Poisson, negative binomial, beta binomial, (full or empirical) Bayesian, and nonparametric.4–16,
For researchers who are new to the analysis of RNA-seq data, in this paper we provide an introductory overview of the methods and software available for the differential expression analysis (DEA) of RNA-seq data when the analysis goal is to identify genes that are relevant to a disease such as cancer.1,17,18 In addition, for those who are interested in the statistical aspects of these methods, we also provide an overview of their parameter estimation algorithms and hypothesis testing strategies. The overview of these statistical aspects in our paper provides a unique contribution to the review literature on RNA-seq DEA methods.3,18–23 For readers who are interested in a performance comparison of RNA-seq DEA methods, they can refer to a large body of such papers in the literature.20–23
The rest of the paper is organized as follows. In the Notation and Normalization Methods section, we introduce the unified notations used for the methods reviewed in our paper and touch on the normalization methods typically used to preprocess RNA-seq data before DEA. In the Statistical Modeling of RNA-seq Data section, we review the statistical modeling RNA-seq DEA categorized by the distributional assumptions such as Poisson,4–6 negative binomial,7–10 beta binomial,11,12 Bayesian,13,14 and nonparametric.15,16 All reviewed methods directly work with gene-level count data for DEA and have available R packages. For interested readers with advanced statistical knowledge, the parameter estimation algorithm for each method is presented separately in a text box, and the typical statistical testing frameworks that have been proposed for RNA-seq DEA are reviewed in the Statistical Testing section. Finally, computational tools implemented for the reviewed methods are summarized in Table 2. We note that the methods reviewed in this paper are not an exhaustive collection of available methods in the literature. Rather, we reviewed a list of most commonly used categories of modeling assumptions and included a few representative methods for each category, to help researchers who are new to the field orientated and started in the still evolving literature on this topic.
List of sequencing depth normalization methods and reference papers.
RNA-seq count data DEA statistical methods and software.
Notation and Normalization Methods
Notation
RNA-seq data for G genes and N samples can be described by a G x N matrix
We use
Normalization Methods
Similar to microarray data, RNA-seq data are also prone to nonbiological effects due to the experimental process. Consequently, these effects need to be adjusted before any further data analysis. 24 One major source of nonbiological effects is sequencing depth, which can be adjusted by rescaling the sequencing counts with factors that mimic sequencing depth. 25 Reads per kilobase per million reads (RPKM) is a simple adjustment that considers gene counts standardized by the gene length and the total number of reads in each library as expression values.17,26 More sophisticated adjustment factors, including trimmed mean of M-values (TMM), 27 DESeq size factor, 28 and quantile-based normalizations such as upper quartile normalization, 18 are given in Table 1. Other sources of nonbiological effects for RNA-seq include gene length and GC-content,21,29 whose effects are typically assumed to be consistent across samples for a given gene and hence cancel out in the analysis of differential expression. Interested readers can look up available normalization methods adjusting for gene length and GC-content in the publications such as Risso et al. 29 , Benjamini and Speed, 30 and Hansen et al. 31
Statistical Modeling of RNA-Seq Data
Poisson
Overview
Models for read counts originated from the idea that each read is sampled independently from a pool of reads and hence the number of reads for a given gene follows a binomial distribution, which can be approximated by a Poisson distribution. Based on the Poisson model assumption for repeated sequencings of a sample, Marioni et al. 17 proposed to use a log-linear model to model the mean difference between two samples and adopted the classical likelihood ratio test for calculating the P-values. Based on the same Poisson assumption, Bullard et al. 18 proposed to use two other test statistics, exact test statistics and score test statistics, in the generalized linear model (GLM) framework. Li et al. 6 proposed a method called PoissonSeq, which adapts a two-step procedure for fitting a Poisson model. The method first estimates sequencing depths using a Poisson goodness-of-fit statistic and then calculates a score statistic based on a log-linear model. In addition, Wang et al. 4 developed an R package, DEGseq, to identify differentially expressed (DE) genes with an MA-plot-based approach. Langmead et al. 5 incorporated cloud computing in their method called Myrna.
Modeling
In a Poisson model, one assumes that Ygi,, the number of reads mapped to gene g in sample i, follows a Poisson distribution, ygi ∼ Poisson(μgii). μ
gi
is the rate parameter for gene g in sample i, which equals both the mean and the variance of the read counts. The probability mass function is:
Algorithm Overview 1: Li et al.'s 6 PoissonSeq
Li and others proposed PoissonSeq that assumes the hypotheses as follows. Under the null hypothesis where genes and covariates are not relevant,
Under the alternative hypothesis where genes and covariates,
They then estimated which genes belong to S by a Poisson goodness-of-fit statistic, ie,
S is set to be the genes whose GOF g values are in the(∊, 1 - ∊) quantile of all GOF g values. Li and others used ∊ = 0.25 in their study. 6
The objective is to test H0:
With accumulating empirical data (especially “with the data available for groups of multiple biological samples), researchers began to observe that in a group, the between-sample variation of sequencing reads for a gene often exceeds the mean.17,23,33 This excessive variation that cannot be explained by the Poisson model is called overdispersion. Extensions of the classic Poisson model have been proposed in order to accommodate such overdispersion, including the two-stage Poisson models 34 and the generalized Poisson model. 35
Negative Binomial
Overview
A class of models based on the negative binomial distribution assumption has been developed in order to accommodate the overdispersion among biological replicate data.8,9,33,36 Robinson and Smyth
33
used the conditional maximum likelihood (CML) to estimate the dispersion parameter-a measure of the excessive variance that a Poisson model does not incorporate-when assuming a common dispersion parameter across genes. They compared the CML method with alternative estimation methods based on pseudolikelihood, quasi-likelihood, and conditional inference.37–39 In a follow-up paper,
36
they also extended the model to allow for gene-specific dispersion parameters and proposed to estimate the dispersion parameters by maximizing a weighted conditional likelihood with empirical Bayesian approximation. Details of their method, edgeR, can be found in Robinson and Smyth.33,36 edgeRun is based on the same model as edgeR but it uses an unconditional exact test to achieve more power while paying the price of computational time.
40
Anders and Huber
8
proposed a method called DESeq also under the negative binomial assumption. They advocated the use of a robust estimate of normalization factors for the estimation of dispersion parameter and a local regression to obtain smooth function for each group on the graphs of expected proportions vs sample variances. DESeq2 was developed in the study by Love et al.
9
as a successor of DESeq. It employs a number of new modeling features, such as the use of a shrunken fold change and a shrunken dispersion estimation method, to further improve the model performance. Di and others
10
proposed a method, NBPSeq, using a negative binomial power distribution instead of a regular negative binomial distribution. They hypothesized that
Modeling
The model setup for negative binomial is to assume ygi ∼ negative binomial
Algorithm Overview 2: Overdispersion
Negative binomial can be derived as a gamma-Poisson mixture model (subscripts g's and i's are omitted for brevity), under the assumption that technical replicates follow a Poisson distribution, and biological replicates follow a gamma distribution, with the latter accommodating the overdispersion observed in empirical data.
Then,
One substitutes back
Algorithm Overview 3: Robinson and Smyth's33,36 edgeR
In edgeR,
Under the assumption of gene-wise (or tag-wise in the original paper) dispersion, ϕ
g
is estimated by maximizing a weighted conditional log-likelihood, WL(ϕ
g
):
In the study by Robinson and Smyth, 36 the overdispersion parameter is assumed to be common across all genes (ie, ϕ g = ϕ). To estimate the shared dispersion parameter with and without equal library size, the authors proposed to use the CML and quantile-adjusted CML (qCML) as follows.
In a special case where mi = m for i ∈ Ck where Ck = {i: k(i) = k}, ∼ negative binomial
In the case of different mi in group k, the MLE of
Hypothesis testing is set up as
Algorithm Overview 4: Anders and Huber's 8 DESeq
The read count ygi is modeled by a GLM of negative binomial distribution with a log link:
The mean μ
gi
is the proportion of reads for gene g in sample i,
Since
In the case of a sufficiently large number of Mk, one can see
Algorithm Overview 5: Love et al.'s 9 DESeq2
DESeq2 allows the normalization factors to be gene specific (mgi), rather than being fixed across genes (mi). The estimation of mgi is implemented in their new R packages. 9
When modeling dispersion parameters, a large variation in estimates usually arises because of small sample sizes. DESeq2 proposed to pool genes with similar average expression together for the estimation of dispersions. To do this, one first separately estimates dispersion with maximum likelihood. Then, one identifies a location parameter for the distribution of the estimates by fitting a smooth curve dependent on average normalized expressions, before finally shrinking gene-specific dispersions to the fitted curve using an empirical Bayesian approach. The authors stated that this procedure is more superior than DESeq.
In order to avoid identifying differential expressions in genes of small average expression, fold change estimation is shrunken toward 0 for genes with insufficient information by employing an empirical Bayesian shrinkage. The procedure is as follows: (1) obtain the maximum likelihood estimates for the log fold changes from the GLM fit, then (2) fit a normal distribution with mean 0 to the estimates, and (3) use that as the prior for a second GLM fit. The maximum a posterior and the standard error for each estimate are the products of this procedure and will be used for the calculation of Wald statistics for DEA.
DESeq2 computes a threshold, η, to filter genes based on their average normalized expressions. The threshold is calculated for maximizing the number of genes with a user-defined false discovery rate. The authors claimed that this filtering step effectively controls the power of detecting DE genes. The null hypothesis becomes
Finally, the method provides a way to diagnose outliers using the Cook's distance from the GLM within each gene,Cd Samples are flagged with Cd 99% quantile of an F distribution with degrees of freedom as the number of parameter, P, and the difference in the number of samples and the number of parameter, N - P. When there is a large number of replicates available, influential data can be removed without removing the whole gene; however, when there is a small number of replicates, the entire gene with influential points should be removed from the analysis to preclude bias. More details on DESeq2's features can be found in the study by Love et al. 9 In conclusion, DESeq2 is recommended by its authors as an improved solution to perform differential analysis because it adopts many competitive features.
Beta Binomial
Overview
A beta-binomial model is another alternative distribution to accommodate overdispersion.11,12,42 The beta-binomial distribution has been used in the study by Baggerly et al.
11
to account for both between-library and within-library variations. The authors assumed that the true proportion of gene g within a library i,
Modeling
In a beta-binomial model, yg is converted from the count of gene g in sample i, to proportion, θ
gi
where
Bayesian and Empirical Bayesian
Overview
RNA-seq DEA can be modeled in Bayesian framework using various parametric and nonparametric priors. Van de Wiel et al. 13 proposed a Bayesian method, ShrinkSeq, which either assumes an informative prior for the overdispersion such as the Dirac–Gaussian prior or estimates one with the empirical Bayesian approach. An empirical Bayesian approach differs from a fully Bayesian approach in that it borrows information from data to elicit priors for overdispersion parameters. For estimating posteriors, Van de Wiel and others 13 adapted the use of integrated nested Laplace approximations, a method that only considers marginal posteriors, but adds a direct maximization of marginal likelihood to allow information sharing from joint posteriors. They further suggested that the use of informative priors for shrinkage, as in ShrinkSeq, can ensure stability and accommodate multiplicity correction. They also suggested that shrinkage should be applied not only to overdispersion parameters but also to the regression coefficient parameters. baySeq, proposed by Hardcastle and Kelly, 14 constructs the data with tuples grouping genes together based on the study of interest. The distribution of a tuple shares the parameters of some prior distribution so that one can consider many hypotheses for testing beyond two group comparison. The method assumes a negative binomial distribution from the data. baySeq first estimates the empirical distribution on the set of parameters for null and alternative models with the quasi-likelihood approach. Then, it estimates the prior probabilities starting from a prior followed by an iterative process updating the priors until convergence. The authors suggested using a log posterior probability ratio of DE for DEA and noted that the posterior probability of DE for each individual model can be conveniently summed up for hypothesis testing.
Modeling
A Bayesian GLM for RNA-seq can be set as:
Each parameter has its respective informative prior and one has to specify priors conditional on the model of interest as well as the prior itself to reach the posterior probability. For testing, a null hypothesis of β g ≤ prior under the null is used.
Algorithm Overview 6: Van de Wiel et al.'s 13 ShrinkSeq
ShrinkSeq assumes that α is the unknown hyperparameter from a collection of all unknown hyperparameter vectors A. It uses a direct maximization of the marginal likelihood method for the estimation of A; this method is a modified version of INLA. 43 The procedure of finding a is shown below and is said to be analogous to the EM algorithm:
Initiate l = 0 and
Use INLA to estimate posteriors
Obtain
Iterate from step 2 until convergence.
Notes: let b be the number of informative priors and
Dirac–Gaussian and Gaussian–Dirac–Gaussian mixture priors:
The subscripts of p, ie, ™1, 0, and 1, indicate the locations. For example, Dirac mass on 0 is denoted as δ0. Considering the p as probability where p-1,p0, and p1 sum up to 1, then
Algorithm Overview 7: Hardcastle and Kelly's 14 baySeq
The tuple system in baySeq is as follows. Let a model be denoted as M. E refers to a set of models described by the data, {E1 … El}. κ represents the set of parameters for each model, M, ie, {θ1 … θ
l
}. Let q be the index of each underlying distribution for model 1, …, l. An example would be that samples in groups 1, 2, and 3 (C1, C2, C3) are grouped together in a way that groups 1 and 2 are equivalently distributed and group 3 stands alone: M =
Suppose that a sample Ai is in the set Eq where the count of this sample at a particular tuple t is yit, which follows a negative binomial
Nonparametric
Overview
In this section, we discuss two nonparametric methods for RNA-seq DEA by Li and Tibshirani 15 and Tarazona et al. 16 In SAMseq, Li and Tibshirani 15 calculated a modified two-sample Wilcoxon statistic using the ranked counts for two-group comparison. 44 The authors proposed two resampling strategies for producing equal sequencing depths of the samples: downsampling and Poisson sampling, and also suggested that ties can be broken by inserting a small random number in resampling. NOISeq by Tarazona et al. 16 first used pseudo-counts corrected by the library size under two conditions (K=2) to calculate log-ratio (M) and absolute value of difference (D). Then, a test statistic is derived from M and D with a null hypothesis of no differential expression; in other words, M and D are no different than random variables either estimated from the real or simulated data.
Modeling
The two nonparametric methods discussed here are explained separately in the test boxes, as they each has a unique model setup.
Algorithm Overview 8: Li and Tibshirani's 15 SAMseq
To use SAMseq, one ranks the counts of gene g across samples and denotes the ordered counts as y'g1 … y' gN . If needed, resampling strategy may be used to fulfill the requirement of equal sequencing depths of samples in Wilcoxon test.
In the case of a sufficient minimal sequencing depth, the authors proposed a downsampling strategy where one first identifies the smallest sequencing depth, denoted as mmin, where mmin = min(m1, …, mN) and keeps this list f counts while resampling lists of counts for all other samples with the sequencing depth, mmin. Every count is randomly sampled with a success probability of mmin/mi and failure probability of its complement, ie, the resampled count is
In the case of an insufficient minimal sequencing depth, Li and Tibshirani
15
introduced Poisson sampling strategy, wherein they employed the geometric mean of the sequencing depths for all samples:
Algorithm Overview 9: Tarazona et al.'s 16 NOISeq
In NOISeq, for each
With the pseudocounts, the log ratio (L) and the absolute value of difference (D) are calculated.
Null hypothesis: L and D values are no different than noise if no DE. Probability distribution for random variables L* and D* are either estimated from real data or simulated data and are used for the noises. One then obtains the probability of DE as:
DE g equals 1 when gene g is DE. Note that log ratio is in absolute term because either direction indicates DE. See the study by Tarazona et al. 16 , for more details.
Statistical Testing
After performing parameter estimation for a statistical model, significance of differential expression can be assessed comparing the expression of gene g among K groups. Assume that
In parametric regime, one can employ classic log-likelihood ratio test.
In absence of overdispersion,
An exact test for negative binomial, analogous to the Fisher's exact test, is used by methods, such as edgeR and DESeq. By conditioning on the total sum, one can calculate the probability of observing counts as extreme or more extreme than what is really obtained, resulting in an exact P-value. Note that a sum of gene counts from all replicates in each group that is either too large or too small indicates a differential expression, so a two-sided test is used.
A score statistic is used by PoissonSeq, which tests for the significance of the association of gene g with expression of groups. In the context of gene count with unknown dispersion parameters, a score test is as follows:
where wg is a known weight,
Wilcoxon statistic is a rank-transformed version of t-statistics, used by the nonparametric method, SAMseq:
Under a Bayesian or empirical Bayesian framework, methods like baySeq use posterior likelihood of the DE model per gene to identify differential expression:
The choice of a testing strategy is a decision that often depends on the chosen method and other factors such as sample size. With a small sample size, the large-sample approximations based on the Wald test, score test, and likelihood ratio test are questionable and an exact test is usually preferred. 36 We summarize testing strategies that are plausible for each method in Table 2.
Finally, almost all the methods we mentioned in this paper use standard approaches for multiple hypothesis correction to control false discoveries.45,46 PoissonSeq is an exception that builds its own estimation of false discovery rate (FDR) from a permutation test. Permutation test calculates a score test per gene, Sg, for H0g vs Hag, each time when the outcome is permuted. For B permutations, the same procedure is applied to calculate null statistics
For Bayesian methods, since posterior probabilities are computed, Bayesian FDR or local FDR are conveniently used. Local false discovery rate (1FDRg) is simply the posterior probability
Note that
Conclusion
RNA-seq data analysis is a relatively new and rapidly growing research area. The statistical model used for sequencing data has been evolving. The first proposed Poisson distribution has become obsolete because it fails to accommodate commonly-observed overdispersion in RNA-seq data. In a parametric framework, the negative binomial distribution is the most common assumption for modeling the marginal distribution due to the technical and biological variations.8,9,33,36 Other available methods that account for overdispersions include the generalized Poisson distribution, 35 negative binomial power distribution, 10 and beta-binomial distribution,11,12 as well as nonparametric models15,16 and Bayesian methods.13,14 Table 2 summarizes all the reviewed methods in this paper.
For readers who are interested in the performance evaluation and method comparison of the available methods, they can refer to the original paper as well as the body of literature on this issue. For instance, in the study by Seyednasrollah et al. 22 , DESeq has been recommended as one of the most robust methods and caution is advised when dealing with a small number of replicates regardless of which method is being used. Similarly, Soneson and Delorenzi 21 also advise caution when interpreting results drawn from a small number of replicates and show that SAMseq surpasses many other reviewed methods. In the study by Rapaport et al. 23 , DESeq, edgeR, and baySeq, which all assume a negative binomial model, have better specificity, sensitivity, and control of false positive errors than other nonnegative binomial models. As the technology continues to improve and the empirical data accumulate, more compelling statistical modeling for RNA-seq data can be expected.
Footnotes
Author Contributions
Conceived and designed the experiments: HCH, YN, LXQ. Reviewed the literature: HCH, YN, LXQ. Wrote the first draft of the manuscript: HCH, YN. Contributed to the writing of the manuscript: HCH, YN, LXQ. Agree with manuscript results and conclusions: HCH, YN, LXQ. Jointly developed the structure and arguments for the paper: HCH, YN, LXQ. Made critical revisions and approved final version: HCH, YN, LXQ. All authors reviewed and approved of the final manuscript.
