A Multivariate Negative-Binomial Model with Random Effects for Differential Gene-Expression Analysis of Correlated mRNA Sequencing Data

Abstract

Experimental designs such as matched-pair or longitudinal studies yield mRNA sequencing (mRNA-Seq) counts that are correlated across samples. Most approaches to the analysis of correlated mRNA-Seq data are restricted to a specific design and/or to balanced data (with the same number of samples in each group). We propose a model that is applicable to correlated mRNA-Seq data of different types: paired, clustered, longitudinal, or others. Any combination of explanatory variables, as well as unbalanced data, can be handled within the proposed modeling framework. The model assumes that the exon counts of a particular gene in an individual sample jointly follow a multivariate negative-binomial distribution. Additional correlation between exon counts from, for example, different samples within the same pair or cluster is taken into account by including a cluster-level normally distributed random effect in the model. An interesting feature of the model is that it provides explicit expressions for the marginal correlation between exon counts at different levels. The performance of the model is evaluated in a simulation study and in an analysis of two real-life data sets: a paired mRNA-Seq experiment for 24 patients with clear-cell renal-cell carcinoma and a longitudinal mRNA-Seq experiment for 29 patients with Lyme disease.

1. Introduction

mRNA sequencing (mRNA-Seq) is a powerful and versatile high-throughput technique to study gene expression and transcript expression. The output of an mRNA-Seq experiment is typically a set of overdispersed counts. A large number of methods have been developed to conduct an analysis of differential gene-expression based on mRNA-Seq counts. Soneson and Delorenzi (2013), Rapaport et al. (2013), Seyednasrollah et al. (2015), and Conesa et al. (2016) provide detailed overviews and comparisons of these methods.

Matched-pair experiments or longitudinal studies may yield mRNA-Seq counts that are correlated across samples. A number of methods have been proposed to analyze data from such experiments. For instance, Pham and Jimenez (2012), Hardcastle and Kelly (2013), and Chung et al. (2013) have developed methods for matched-pair data. In contrast, for longitudinal studies and/or other types of clustered experiments, several methods have been introduced (Spies and Ciaudo, 2015), including PLNseq (Zhang et al., 2015).

In this article, we propose a hierarchical model for differential gene-expression analysis of correlated mRNA-Seq data based upon exon counts. Exons are basic units in transcription. Within a gene, expression of individual exons varies due to, among other things, differences in exon lengths and alternative isoform regulation. Conventional methods for differential gene-expression analysis use summarized gene-level counts and ignore exon-expression variability. Some methods assume that differential expression of a single exon of a gene necessarily implies differential expression of the gene (Anders et al., 2012). We propose to acknowledge the variation in exon expression in inference about gene expression by using a multivariate distribution for the exon-expression levels.

The model we propose includes two types of random effects that account for the within-sample correlation (individual random effects) and between-sample correlation (cluster random effects) of exon counts. The individual effects follow a gamma distribution, whereas the cluster effects follow a normal distribution. Consequently, conditionally on the cluster effect, exon counts for the same gene in a particular sample follow a multivariate negative-binomial (MVNB) distribution (Fabio et al., 2012). Essentially, the proposed model falls in the framework developed by Molenberghs et al. (2010). An important advantage of the model is that it can be applied to data coming from various designs that may yield correlated mRNA-Seq counts, including matched pairs, clustered sampling, and longitudinal studies. Interestingly, it allows computing conditional and marginal correlation coefficients, which may offer insight into the correlation structure of the data.

2. Methodology

We consider a per-gene analysis. Thus, in what follows, we drop the index indicating the gene.

Assume that a gene consists of J exons. Exon counts are collected for a set of samples that may be correlated: we observe N clusters, with N_c samples in cluster c. Denote by y_cs = (y_cs₁,…, y_csJ)′ the vector of exon counts for a particular gene in sample s (s = 1,…, N_c) coming from cluster c (c = 1,…, N). Let n_j denote the length of the j-th exon and L_cs be the effective library size of sample s in cluster c (Robinson and Oshlack, 2010). In addition, let x_cs = (1, x_cs₁,…, x_csp)′ be the vector of covariates describing a sample.

To account for the within-sample correlation between the exon counts, as well as for the correlation between samples from the same cluster, we define the following hierarchical models: $y_{csj} | γ_{cs}, b_{c} \sim \mathrm{Poisson} \{n_{j} L_{cs} γ_{cs} \exp (x'_{cs} β + b_{c})\},$ (1) $γ_{cs} \sim \mathrm{Gamma} (ϕ, 1 ∕ ϕ),$ (2) $b_{c} \sim \mathrm{Normal} (0, σ^{2}),$ (3)

where β is the (p + 1)-dimensional vector of (unknown) coefficients (including the intercept) corresponding to x _cs, σ² is the variance of the cluster effects, and ϕ is the overdispersion parameter (1/ϕ is the variance of individual effects γ_cs).
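As an illustration, the hierarchy in Equations (1)–(3) can be sampled directly. The following is a minimal sketch rather than the authors' implementation: the function name `simulate_exon_counts`, the exon lengths, and the common library size are hypothetical choices made only for this example, and the gamma distribution is parameterized by shape ϕ and scale 1/ϕ so that γ_cs has mean 1 and variance 1/ϕ.

```python
import numpy as np

def simulate_exon_counts(exon_lengths, lib_size, X, beta, phi, sigma, rng):
    """Draw exon counts from the hierarchy in Equations (1)-(3).

    X has shape (n_clusters, samples_per_cluster, p + 1) and includes the
    intercept column; lib_size is a single library size shared by all samples
    (an assumption of this sketch). Returns integer counts of shape
    (n_clusters, samples_per_cluster, J).
    """
    exon_lengths = np.asarray(exon_lengths, dtype=float)
    n_clusters, n_samples, _ = X.shape
    # Equation (3): one normal random effect per cluster.
    b = rng.normal(0.0, sigma, size=n_clusters)
    # Equation (2): one gamma effect per sample, mean 1 and variance 1/phi.
    gamma = rng.gamma(shape=phi, scale=1.0 / phi, size=(n_clusters, n_samples))
    # Linear predictor x'beta + b_c for every sample.
    eta = X @ beta + b[:, None]
    # Equation (1): Poisson mean n_j * L_cs * gamma_cs * exp(eta_cs).
    mean = exon_lengths[None, None, :] * lib_size * (gamma * np.exp(eta))[:, :, None]
    return rng.poisson(mean)

# Matched-pair design: column 0 is the intercept, column 1 flags the
# "experimental" sample of each pair. Exon lengths and library size are made up.
rng = np.random.default_rng(0)
n_pairs = 12
X = np.zeros((n_pairs, 2, 2))
X[:, :, 0] = 1.0
X[:, 1, 1] = 1.0
counts = simulate_exon_counts([1.0, 2.0, 3.0], 10.0, X,
                              np.array([0.5, 0.12]), phi=54.6, sigma=0.2, rng=rng)
```

Because all exons of a sample share the same γ_cs, and all samples of a cluster share the same b_c, the simulated counts are correlated both within and between samples.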

The model implies that, conditionally on b_c, y_cs is distributed according to an MVNB distribution (Fabio et al., 2012) with the probability mass function $P (y_{cs} | b_{c}) = \frac{Γ (ϕ + \sum_{j = 1}^{J} y_{csj})}{Γ (ϕ) \prod_{j = 1}^{J} (y_{csj}!)} Q_{cs}^{- ϕ} \prod_{j = 1}^{J} {(\frac{μ_{csj | b}}{ϕ Q_{cs}})}^{y_{csj}},$ (4)

where $Q_{cs} = 1 + \sum_{j} μ_{csj | b} ∕ ϕ$ and $μ_{csj | b} = E (y_{csj} | b_{c}) = n_{j} L_{cs} \exp (x'_{cs} β + b_{c}) \equiv K_{csj} \exp (b_{c}) .$ (5)

Moreover, the variance of y_csj is $\mathrm{Var} (y_{csj} | b_{c}) = μ_{csj | b} (1 + μ_{csj | b} ∕ ϕ)$ (6)

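and, for two exon counts from the same sample (j ≠ k), the shared gamma effect γ_cs yields the covariance $\mathrm{Cov} (y_{csj}, y_{csk} | b_{c}) = μ_{csj | b} μ_{csk | b} ∕ ϕ$, so that, combined with Equation (6), $\mathrm{Corr} (y_{csj}, y_{csk} | b_{c}) = \sqrt{\frac{μ_{csj | b} μ_{csk | b}}{(ϕ + μ_{csj | b}) (ϕ + μ_{csk | b})}} .$ (7)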
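For numerical work, Equation (4) is most conveniently evaluated on the log scale. The sketch below is one possible way to do this with the standard library only; the function name `mvnb_logpmf` and the example counts and means are hypothetical and are not taken from the authors' code.

```python
import math

def mvnb_logpmf(y, mu, phi):
    """Log of Equation (4) for one sample.

    y   : exon counts y_cs1, ..., y_csJ (non-negative integers)
    mu  : conditional means mu_csj|b from Equation (5), all strictly positive
    phi : overdispersion parameter
    """
    total = sum(y)
    q = 1.0 + sum(mu) / phi  # Q_cs
    logp = math.lgamma(phi + total) - math.lgamma(phi)
    logp -= sum(math.lgamma(yj + 1) for yj in y)  # log(y_csj!)
    logp -= phi * math.log(q)
    logp += sum(yj * math.log(mj / (phi * q)) for yj, mj in zip(y, mu))
    return logp

# Hypothetical counts and conditional means for a three-exon gene.
log_p = mvnb_logpmf([3, 0, 5], [2.0, 1.0, 4.0], phi=7.4)
```

With a single exon (J = 1) the function reduces to the logarithm of the ordinary negative-binomial probability mass function with mean μ and variance μ(1 + μ/ϕ).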

The marginal likelihood of models (1)–(3) is given by $L (β, ϕ, σ) = \prod_{c = 1}^{N} \int_{- \infty}^{\infty} \prod_{s = 1}^{N_{c}} P (y_{cs} | b_{c}) f (b_{c}) \, d b_{c},$ (8)

with P(y_cs|b_c) given in Equation (4) and f(b_c) denoting the density of the mean-zero normal distribution with variance σ². Note that, for brevity, we have omitted the dependence of the right-hand side of Equation (8) on the parameters β, ϕ, and σ.

Maximization of Equation (8) yields estimates of the parameters. The integral in Equation (8) can be computed by adaptive Gauss–Hermite quadrature; in our study we used 10 quadrature points. The variance–covariance matrix of the estimated parameters is obtained as the inverse of the negative Hessian of the logarithm of Equation (8).
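The integral over b_c can be illustrated with a plain (non-adaptive) Gauss–Hermite rule. This is a simplified sketch and not the adaptive scheme used in the study: the helper `gauss_hermite_expectation` is a hypothetical name, and it approximates E[g(b)] for b ~ Normal(0, σ²). In Equation (8), g(b) would be the product over samples of P(y_cs | b) for one cluster.

```python
import math
import numpy as np
from numpy.polynomial.hermite import hermgauss

def gauss_hermite_expectation(g, sigma, n_points=10):
    """Approximate E[g(b)] for b ~ Normal(0, sigma^2) by Gauss-Hermite quadrature."""
    nodes, weights = hermgauss(n_points)
    # The substitution b = sqrt(2) * sigma * x turns the normal density into
    # the Hermite weight exp(-x^2), leaving a constant factor 1 / sqrt(pi).
    b = math.sqrt(2.0) * sigma * nodes
    values = np.array([g(bi) for bi in b])
    return float(np.sum(weights * values) / math.sqrt(math.pi))

# Sanity check on a function with a known answer: E[exp(b)] = exp(sigma^2 / 2).
approx = gauss_hermite_expectation(math.exp, sigma=0.4)
exact = math.exp(0.4 ** 2 / 2.0)
```

An adaptive rule additionally recenters and rescales the nodes around the mode of the integrand for each cluster, which matters when the conditional likelihood is sharply peaked away from zero.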

For the hierarchical models (1)–(3), it is possible to derive the marginal moments (Molenberghs et al., 2010). The mean and variance of y_csj are given by, respectively, $E (y_{csj}) = K_{csj} e^{σ^{2} ∕ 2},$ (9) $\mathrm{Var} (y_{csj}) = K_{csj} e^{σ^{2} ∕ 2} + K_{csj}^{2} e^{2 σ^{2}} (1 ∕ ϕ + 1 - e^{- σ^{2}}),$ (10)

with K_csj defined in Equation (5). The marginal covariance is given by $\mathrm{Cov} (y_{csj}, y_{ctk}) = \begin{cases} K_{csj} K_{csk} e^{2 σ^{2}} (1 ∕ ϕ + 1 - e^{- σ^{2}}) & \text{if } s = t \text{ and } j \neq k, \\ K_{csj} K_{ctk} e^{2 σ^{2}} (1 - e^{- σ^{2}}) & \text{if } s \neq t . \end{cases}$ (11)

Thus, correlation between two exon counts from the same sample (s = t) is positive and stronger than for the same exons from different samples (s ≠ t) in the same cluster. Given that the correlation is a function of K_csj, it differs for different pairs of exon counts, even for the same sample, unless the exons have the same length. For a fixed pair of exons, the correlation differs for different samples, unless the library sizes and sample-specific covariates are the same.
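The marginal correlations follow from dividing the covariances in Equation (11) by the square root of the product of the variances in Equation (10). A minimal sketch is given below; the function name `marginal_correlations` and the values of K used in the example are hypothetical.

```python
import math

def marginal_correlations(k_j, k_k, phi, sigma):
    """Marginal correlations implied by Equations (9)-(11).

    k_j and k_k are the quantities K from Equation (5) for the two exon counts.
    Returns (rho_same_sample, rho_different_samples).
    """
    s2 = sigma ** 2

    def variance(k):
        # Equation (10)
        return (k * math.exp(s2 / 2.0)
                + k ** 2 * math.exp(2.0 * s2) * (1.0 / phi + 1.0 - math.exp(-s2)))

    scale = k_j * k_k * math.exp(2.0 * s2)
    cov_same = scale * (1.0 / phi + 1.0 - math.exp(-s2))  # s = t, j != k
    cov_diff = scale * (1.0 - math.exp(-s2))              # s != t
    denom = math.sqrt(variance(k_j) * variance(k_k))
    return cov_same / denom, cov_diff / denom

# Hypothetical K values; phi and sigma as in several simulation scenarios.
rho_ss, rho_st = marginal_correlations(50.0, 60.0, phi=54.6, sigma=0.2)
```

The same-sample correlation always exceeds the between-sample one because its covariance carries the extra term 1/ϕ contributed by the shared gamma effect.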

3. Data

To investigate the performance of our model, we conducted simulation studies. We also applied the model to two real-life mRNA-Seq data sets.

We considered settings of a matched-pair design and of a longitudinal experiment. For each setting, we generated 10,000 data sets with exon counts for a gene that consists of three exons. Data were generated by using models (1)–(3). We also generated data from a conditional-independence model where, conditionally on the normally distributed random effect, exon counts followed independent negative-binomial distributions with overdispersion parameter ϕ.
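The conditional-independence alternative can be sketched by giving every exon its own gamma effect instead of one effect shared by the whole sample. The code below is an illustrative assumption about how such data could be generated, not the authors' simulation code; `simulate_independent_nb` is a hypothetical name, and `mu` stands for the conditional means of Equation (5), already including exp(b_c).

```python
import numpy as np

def simulate_independent_nb(mu, phi, rng):
    """Independent negative-binomial counts with means mu and overdispersion phi.

    Each entry of mu receives its own gamma effect (mean 1, variance 1/phi),
    so the counts are independent given mu, with variance mu * (1 + mu / phi).
    """
    mu = np.asarray(mu, dtype=float)
    gamma = rng.gamma(shape=phi, scale=1.0 / phi, size=mu.shape)
    return rng.poisson(mu * gamma)

# Hypothetical conditional means for one sample of a three-exon gene.
rng = np.random.default_rng(0)
y = simulate_independent_nb([5.0, 8.0, 12.0], phi=54.6, rng=rng)
```

Replacing the per-exon gamma draws by a single draw per sample recovers the MVNB model, which is the only difference between the two data-generating mechanisms.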

3.1. A simulated matched-pair mRNA sequencing experiment

We assumed that samples within each pair originated from two different biological conditions, “control” and “experimental,” say. Thus, x _cs = (1, x_cs₁)′, where x_cs₁ is the binary indicator of the “experimental” condition, and β = (β₀, β₁)′. We assumed that β₀ = 0.5. The same library size was assumed for all samples.

The considered scenarios are listed in Table 1. We assumed β₁ = 0, 0.12, or 1, and combined it with N = 12, 24, or 36 to study the Type-I error probability and power as a function of the sample size. The value of ϕ was set to 54.6, 7.4, or 2.7 to investigate the effect of the within-sample correlation between exon counts, equal to 0.51, 0.87, and 0.95, respectively. The value of σ was set to 0.2 or 0.4 to evaluate the effect of increasing marginal between-sample correlation.

Table 1.
Parameters of the Simulations (10,000 Replicates Each) of a Matched-Pair mRNA Sequencing Experiment for One Gene Consisting of Three Exons

Scenario	N	σ (ln σ)	ϕ (ln ϕ)	β₁	ρ|b	ρ^*|b	ρ_{s, s}	ρ_{s, t}
No differential expression (β₁ = 0)
(1)	12	0.2 (−1.6)	54.6 (4)	0	0.51	0.51	0.76	0.52
(2)	24	0.2 (−1.6)	54.6 (4)	0	0.51	0.51	0.76	0.52
(3)	36	0.2 (−1.6)	54.6 (4)	0	0.51	0.51	0.76	0.52
Small fold change (β₁ = 0.12)
(4)	12	0.2 (−1.6)	54.6 (4)	0.12	0.51	0.54	0.76	0.52
(5)	24	0.2 (−1.6)	54.6 (4)	0.12	0.51	0.54	0.76	0.52
Large fold change (β₁ = 1)
(6)	12	0.2 (−1.6)	54.6 (4)	1	0.51	0.73	0.76	0.53
(7)	24	0.2 (−1.6)	54.6 (4)	1	0.51	0.73	0.76	0.53
Increased marginal within-cluster correlation (σ = 0.4)
(8)	12	0.4 (−0.9)	54.6 (4)	0	0.51	0.51	0.91	0.81
(9)	12	0.4 (−0.9)	54.6 (4)	0.12	0.51	0.54	0.91	0.81
Increased conditional within-sample correlation (ϕ = 7.4)
(10)	12	0.2 (−1.6)	7.4 (2)	0	0.87	0.87	0.90	0.20
(11)	12	0.2 (−1.6)	7.4 (2)	1	0.87	0.95	0.90	0.21
(12)	24	0.2 (−1.6)	7.4 (2)	1	0.87	0.95	0.90	0.21
(13)	36	0.2 (−1.6)	7.4 (2)	1	0.87	0.95	0.90	0.21
Increased within-cluster (σ = 0.4) and within-sample (ϕ = 7.4) correlations
(14)	12	0.4 (−0.9)	7.4 (2)	1	0.87	0.95	0.95	0.50
Increased conditional within-sample correlation (ϕ = 2.7)
(15)	12	0.2 (−1.6)	2.7 (1)	1	0.95	0.98	0.96	0.09
Conditional-independence model (ρ|b = 0)
(16)	12	0.2 (−1.6)	54.6 (4)	1	0	0
(17)	24	0.2 (−1.6)	54.6 (4)	1	0	0

N is the number of pairs. σ (ln σ), ϕ (ln ϕ), and β₁ refer to models (1)–(3). In all scenarios β₀ = 0.5. ρ|b and ρ^*|b are the conditional correlations between exon 1 and exon 3 for control and experimental samples, respectively, for a pair with the random effect equal to zero (b_c = 0). ρ_{s, s} and ρ_{s, t} are the marginal correlations between exon 1 and exon 3 within the same control sample and for different samples within the same pair, respectively. All correlations were calculated from the true parameter values.

3.2. A simulated longitudinal mRNA sequencing experiment

We assumed that samples from the same individual were obtained at three different time points. Thus, x _cs = (1, x_cs₁, x_cs₂)′, where x_cs₁ and x_cs₂ are the binary indicators of the second and third measurement occasion, respectively, and β = (β₀, β₁, β₂)′. We assumed that β₀ = 0.5.

Table 2 presents the considered simulation scenarios. We were particularly interested in the Type-I-error probability and power of the Wald test for two null hypotheses, $H_{0}^{1}$ and $H_{0}^{2}$, stated in Equations (12) and (13).

Table 2.
Parameters of the Simulations (10,000 Replicates Each) of a Longitudinal mRNA Sequencing Experiment with One Gene Consisting of Three Exons

Scenario	N	σ (ln σ)	ϕ (ln ϕ)	β₁	β₂	ρ|b	ρ^*|b	ρ_{s, s}	ρ_{s, t}
$H_{0}^{1}$ : β₁ = β₂ = 0
(1)	8	0.2 (−1.6)	54.6 (4)	0	0	0.51	0.51	0.76	0.52
(2)	16	0.2 (−1.6)	54.6 (4)	0	0	0.51	0.51	0.76	0.52
$H_{0}^{2}$ : β₁ = β₂ ≠ 0
(3)	8	0.2 (−1.6)	54.6 (4)	0.15	0.15	0.51	0.55	0.76	0.52
(4)	16	0.2 (−1.6)	54.6 (4)	0.15	0.15	0.51	0.55	0.76	0.52
β₁ ≠ β₂
(5)	8	0.2 (−1.6)	54.6 (4)	0.15	0.25	0.51	0.57	0.76	0.52
(6)	16	0.2 (−1.6)	54.6 (4)	0.15	0.25	0.51	0.57	0.76	0.52
(7)	8	0.2 (−1.6)	54.6 (4)	1	1.5	0.51	0.81	0.76	0.54
Increased conditional within-sample correlation (ϕ = 7.4)
(8)	8	0.2 (−1.6)	7.4 (2)	0	0	0.87	0.87	0.90	0.20
(9)	8	0.2 (−1.6)	7.4 (2)	0.15	0.25	0.87	0.90	0.90	0.20
Increased marginal within-cluster correlation (σ = 0.4)
(10)	8	0.4 (−0.9)	54.6 (4)	0	0	0.51	0.51	0.91	0.81
(11)	8	0.4 (−0.9)	54.6 (4)	0.15	0.25	0.51	0.57	0.91	0.82

N is the number of clusters. σ (ln σ), ϕ (ln ϕ), β₁, and β₂ refer to models (1)–(3). In all scenarios β₀ = 0.5. ρ|b and ρ^*|b are the conditional correlations between exon 1 and exon 3 for an individual with the random effect equal to zero (b_c = 0) at the first and third time points, respectively. ρ_{s, s} is the marginal correlation between exon 1 and exon 3 within the same sample at the first time point. ρ_{s, t} is the marginal correlation between exon 1 from a sample collected at the first time point and exon 3 from a sample collected at the third time point. All correlations were calculated from the true parameter values.

$H_{0}^{1} : β_{1} = β_{2} \equiv 0,$ (12) $H_{0}^{2} : β_{1} = β_{2} .$ (13)

Rejection of $H_{0}^{1}$ implies that gene expression changes with time. Rejection of $H_{0}^{2}$ means that a change occurs between the second and third measurement occasions.
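Both hypotheses are linear restrictions on β and can be tested with a standard Wald statistic. The sketch below shows one common construction and is not necessarily the exact form used in the study: `wald_tests` is a hypothetical helper, and `cov_hat` is assumed to be the 3 × 3 block of the estimated variance–covariance matrix that corresponds to (β₀, β₁, β₂). The chi-square tail probabilities for 2 and 1 degrees of freedom have closed forms, so no statistics library is needed.

```python
import math
import numpy as np

def wald_tests(beta_hat, cov_hat):
    """Wald tests of H0^1 (beta1 = beta2 = 0) and H0^2 (beta1 = beta2).

    beta_hat = (beta0, beta1, beta2); cov_hat is its 3 x 3 covariance matrix.
    Returns (p_value_H01, p_value_H02).
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    cov_hat = np.asarray(cov_hat, dtype=float)
    # H0^1: two restrictions, reference distribution chi-square with 2 df.
    C1 = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
    d1 = C1 @ beta_hat
    w1 = float(d1 @ np.linalg.solve(C1 @ cov_hat @ C1.T, d1))
    p1 = math.exp(-w1 / 2.0)             # chi-square(2) upper tail
    # H0^2: one restriction (beta1 - beta2 = 0), chi-square with 1 df.
    c2 = np.array([0.0, 1.0, -1.0])
    w2 = float((c2 @ beta_hat) ** 2 / (c2 @ cov_hat @ c2))
    p2 = math.erfc(math.sqrt(w2 / 2.0))  # chi-square(1) upper tail
    return p1, p2

# Hypothetical estimates: equal, non-zero time effects.
p_h01, p_h02 = wald_tests([0.5, 1.0, 1.0], 0.01 * np.eye(3))
```

In this made-up example the first hypothesis is clearly rejected, whereas the second is not, because the two time effects coincide.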

3.3. Renal-cell carcinoma matched-pair experiment

Metastases in clear cell renal-cell carcinoma (RCC) are associated with poor treatment outcomes (Capitanio and Montorsi, 2016). Recent drug discovery strategies in metastatic RCC have been directed to specific targets in a few biological pathways (Capitanio and Montorsi, 2016). It is, therefore, important to identify genes associated with differences in metastatic status.

The data set contained preprocessed data from an mRNA-Seq experiment for 12 patients with a metastatic disease (at diagnosis) matched on the tumor stage, size, grade, and necrosis (SSIGN) score (Frank et al., 2002) with 12 nonmetastatic (at diagnosis) patients. The (unpublished) data were obtained from the Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN.

For each patient, expression of 22,334 genes was quantified, yielding measurements for the total of 234,575 exons. There were 24,158 exons with zero counts across all the samples. The number of exons varied between 1 and 468 (mean 10.5, median 7) per gene.

3.4. Lyme disease longitudinal study

Lyme disease is a tick-borne infection. Some patients report lingering or recurring symptoms lasting long after antibiotic treatment (Bouquet et al., 2016). Pathogenetic molecular mechanisms behind persistent post-Lyme symptoms are not well understood (Strle et al., 2014; Bouquet et al., 2016). Identifying the genes associated with the dynamics of Lyme disease might elucidate these mechanisms.

Bouquet et al. (2016) conducted a study in which mRNA-Seq data were collected from 29 patients with tick-borne Lyme disease and from 13 healthy controls. We downloaded these freely available data from (https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP049605).

In our analysis, we only use the patient samples. From each patient, blood samples were taken three times: before the antibiotic treatment, immediately after completion of the treatment, and 6 months after treatment completion.

Eighty-seven libraries were sequenced as 100-bp paired-end runs on a HiSeq 2500 (Illumina). Three of them were discarded because of insufficient read counts and transcriptome coverage. We analyzed only single-end reads to limit the number of unaligned reads. The preprocessing steps included removal of the 5′ and 3′ adapters, and trimming low-quality ends from the reads using cutadapt (Martin, 2011). Processed reads shorter than 50 bp were discarded.

Single-end processed reads were mapped to the human genome (GRCh38) with Bowtie (Langmead et al., 2009), allowing for up to two mismatches and reporting the best mapping location for each alignment. The read counting was performed by using the R function summarizeOverlaps (Lawrence et al., 2013) according to the exon annotation (GRCh38.82). As a result, 195,785 exons with nonzero sum of counts across all samples were included in the analysis. Finally, we used 19,808 protein-coding genes from the annotation file to group exons into genes.

4. Results

4.1. A simulated matched-pair mRNA sequencing experiment

The results presented in Table 3 for scenarios (1)–(3) indicate that, when β₁ = 0, both β₀ and β₁ are estimated with a negligible bias. The model-based standard errors (SEs) slightly overestimate, on average, the empirical SEs. For β₁, the estimated coverage of the 95% confidence interval (CI) is slightly <95%. Given that the simulation SE is equal to about $\sqrt{0.05 \times 0.95} ∕ 100 = 0.002,$ the difference is statistically significant, which indicates a small inflation of the Type-I-error probability. Note that we did not investigate N > 36, as N = 36 seems to be already a considerable number of pairs for a matched-pair mRNA-Seq experiment.
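The simulation SE quoted above is the binomial standard error of a proportion estimated from 10,000 replicates. The short sketch below reproduces the arithmetic and expresses an observed coverage as a z-score relative to the nominal level; the helper name `coverage_z_score` is hypothetical.

```python
import math

def coverage_z_score(observed, nominal=0.95, n_replicates=10_000):
    """Return (z, se) for an observed coverage against the nominal level."""
    se = math.sqrt(nominal * (1.0 - nominal) / n_replicates)
    return (observed - nominal) / se, se

# Coverage of the 95% CI for beta_1 reported in Table 3 for N = 24 and N = 36.
z, se = coverage_z_score(0.943)
```

An observed coverage of 0.943 lies roughly three simulation SEs below 0.95, which is why the shortfall is described as statistically significant even though it is small in absolute terms.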

Table 3.
A Simulated Matched-Pair mRNA Sequencing Experiment for Scenarios (1)–(3)

	True value	Mean estimate	Relative bias	SE_model	SE_emp	95% CI coverage
N = 12 (scenario 1)
ln σ	−1.609	−1.803	0.120	21.201	0.699	0.991
ln ϕ	4.000	4.260	0.065	0.617	0.632	0.989
β₀	0.500	0.499	−0.002	0.074	0.072	0.942
β₁	0.000	0.001	NA	0.062	0.060	0.935
N = 24 (scenario 2)
ln σ	−1.609	−1.674	0.040	0.245	0.237	0.979
ln ϕ	4.000	4.108	0.027	0.367	0.371	0.954
β₀	0.500	0.499	−0.001	0.052	0.051	0.948
β₁	0.000	0.000	NA	0.043	0.043	0.943
N = 36 (scenario 3)
ln σ	−1.609	−1.650	0.025	0.164	0.165	0.969
ln ϕ	4.000	4.074	0.018	0.292	0.296	0.952
β₀	0.500	0.499	−0.001	0.042	0.041	0.949
β₁	0.000	0.000	NA	0.035	0.035	0.943

No differential expression setting (see Table 1). SE_model and SE_emp are the mean model-based and empirical standard error estimates, respectively.

CI, confidence interval; NA, not available.

Table 4 shows the results for scenarios (4)–(7) in which β₁ ≠ 0. Similarly to the case of no differential expression, β₀ and β₁ are estimated with almost no bias. The coverage of the 95% CI of β₁ is slightly <95%. The power for testing β₁ = 0 increases with N: for β₁ = 0.12 it is equal to ∼0.5 for N = 12 and 0.8 for N = 24. For β₁ = 1, the power is essentially equal to 1 even for N = 12.

Table 4.

A Simulated Matched-Pair mRNA Sequencing Experiment for Scenarios (4)–(7)

	True value	Mean estimate	Relative bias	SE_model	SE_emp	95% CI coverage	Power
N = 12 (scenario 4)
ln σ	−1.609	−1.803	0.120	3.626	0.704	0.993
ln ϕ	4.000	4.243	0.061	0.579	0.592	0.983
β₀	0.500	0.500	0.000	0.074	0.072	0.941
β₁	0.120	0.120	0.001	0.062	0.060	0.938	0.495
N = 24 (scenario 5)
ln σ	−1.609	−1.673	0.040	0.363	0.262	0.979
ln ϕ	4.000	4.115	0.029	0.364	0.370	0.954
β₀	0.500	0.499	−0.002	0.052	0.051	0.946
β₁	0.120	0.120	0.000	0.043	0.043	0.939	0.794
N = 12 (scenario 6)
ln σ	−1.609	−1.776	0.103	1.909	0.571	0.991
ln ϕ	4.000	4.227	0.057	0.538	0.554	0.958
β₀	0.500	0.499	−0.002	0.074	0.072	0.948
β₁	1.000	1.001	0.001	0.060	0.059	0.938	1.000
N = 24 (scenario 7)
ln σ	−1.609	−1.668	0.036	0.207	0.210	0.979
ln ϕ	4.000	4.106	0.027	0.345	0.348	0.948
β₀	0.500	0.501	0.002	0.052	0.051	0.947
β₁	1.000	0.999	−0.001	0.042	0.042	0.943	1.000

Differential expression setting (see Table 1). SE_model and SE_emp are the mean model-based and empirical standard error estimates, respectively.

The results for scenarios (8)–(15) are presented in Supplementary Table S1. They lead to conclusions very similar to those already presented.

Table 5 presents results for scenarios (16)–(17), that is, for the conditional-independence model. As compared with scenarios (6) and (7), there is essentially no difference in the bias or precision of estimation of β₀ and β₁. Thus, the model-based estimates of these parameters seem to be robust to this type of misspecification of the variance–covariance structure.

Table 5.

A Simulated Matched-Pair mRNA Sequencing Experiment for Scenarios (16) and (17)

	True value	Mean estimate	Relative bias	SE_model	SE_emp	95% CI coverage
N = 12 (scenario 16)
ln σ	−1.609	−1.723	0.071	0.373	0.319	0.974
ln ϕ	4.000	5.025	0.256	0.738	0.757	0.737
β₀	0.500	0.499	−0.002	0.068	0.067	0.936
β₁	1.000	1.000	0.000	0.044	0.044	0.932
N = 24 (scenario 17)
ln σ	−1.609	−1.662	0.033	0.177	0.178	0.957
ln ϕ	4.000	4.858	0.215	0.398	0.399	0.411
β₀	0.500	0.500	0.000	0.047	0.047	0.945
β₁	1.000	1.000	0.000	0.031	0.031	0.940

A conditional-independence model (see Table 1). SE_model and SE_emp are the mean model-based and empirical standard error estimates, respectively.

4.2. A simulated longitudinal mRNA sequencing study

Complete results related to the simulation study are available in Supplementary File S3.xlsx. In what follows, we summarize the main points.

For scenario (1) (see Table 2), the estimated Type-I-error probability of the Wald tests for $H_{0}^{1}$ and $H_{0}^{2}$ was equal to 6.9% and 6.1%, respectively. Doubling the number of clusters to 16 in scenario (2) reduced the probability to 5.9% and 5.1%, respectively. These results suggest a slight inflation of the Type-I-error probability for testing $H_{0}^{1}$ .

In scenario (3), the estimated Type-I-error probability of testing $H_{0}^{2}$ was equal to 6.0%. The power of testing $H_{0}^{1}$ was equal to 51%. Doubling the number of clusters in scenario (4) reduced the Type-I-error probability to 5.4% and increased the power to 82%.

Overall, the power of testing the hypotheses increased as the number of clusters increased. For example, for scenario (5) with N = 8 clusters, the estimated power was equal to 0.83 and 0.28 for $H_{0}^{1}$ and $H_{0}^{2}$ , respectively. With N = 16 in scenario (6), the power increased to 0.99 and 0.48, respectively.

Table 6 presents more detailed results for scenarios (1) and (2). Similarly to the matched-pair case (see Table 3), under the null hypothesis, the fixed effects (β₀, β₁, and β₂) are estimated with a negligible bias. The coverage of the 95% CIs is higher than the nominal level, but decreases when the number of clusters increases.

Table 6.
A Simulated Longitudinal mRNA Sequencing Experiment for Scenarios (1) and (2)

	True value	Mean estimate	Relative bias	SE_model	SE_emp	95% CI coverage
N = 8 (scenario 1)
ln σ	−1.609	−1.858	0.154	3.886	0.759	0.998
ln ϕ	4.000	4.271	0.068	0.585	0.535	0.995
β₀	0.500	0.499	−0.002	0.102	0.088	0.961
β₁	0.000	0.000	NA	0.086	0.074	0.964
β₂	0.000	0.000	NA	0.086	0.074	0.964
N = 16 (scenario 2)
ln σ	−1.609	−1.697	0.054	0.327	0.272	0.977
ln ϕ	4.000	4.122	0.030	0.338	0.325	0.960
β₀	0.500	0.499	−0.001	0.066	0.061	0.956
β₁	0.000	0.000	NA	0.056	0.052	0.955
β₂	0.000	0.000	NA	0.056	0.052	0.958

A gene without differential expression (see Table 2). SE_model and SE_emp are the mean model-based and empirical standard error estimates, respectively.

The results for scenarios (3)–(11) (see Supplementary Table S2) also confirm the adequate performance of the model-based estimates of β₀, β₁, and β₂.

4.3. RCC matched-pair experiment

We applied our model and PLNseq (Zhang et al., 2015) to the RCC data set. In the analyses, the Benjamini–Hochberg (BH) (Benjamini and Hochberg, 1995) procedure was applied to correct for multiple testing.
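For reference, the BH adjustment can be written in a few lines. This is a generic sketch of the standard step-up procedure, not code from either software package; the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

# Hypothetical per-gene p-values.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A gene is then declared differentially expressed when its adjusted p-value falls below the chosen false-discovery-rate level.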

To produce gene counts for PLNseq, we summed the relevant exon counts. The current implementation of PLNseq requires gene counts to be ≥50 in each condition. As a consequence, 21% of 22,334 genes had to be excluded from the analysis. Among the remaining 17,528 genes, 133 were found to be statistically significantly differentially expressed between the nonmetastatic and metastatic samples.

In contrast to PLNseq, our model does not require any minimum value of an exon count to include the count in the analysis. However, for 6.5% (1448) of the genes, we could not obtain model-based estimates due to nonconvergence (convergence criteria are described in Section A.4.3 in Supplementary Materials). Only 8 of the remaining 21,786 genes were found to be statistically significantly differentially expressed. None of these eight genes was identified by PLNseq.

4.4. Tick-borne Lyme disease longitudinal study

We applied our model and PLNseq, both combined with the BH multiple-testing procedure, to the Lyme data set.

Only 8760 genes could be analyzed by PLNseq, because the remaining 11,048 (56%) genes had a count <50 in any of the conditions. Moreover, as PLNseq cannot handle unbalanced data, three patients with only two samples had to also be discarded.

The current implementation of PLNseq can only test null hypothesis $H_{0}^{1}$ , Equation (12). For 2456 genes (28%), the null hypothesis was rejected.

An advantage of our model is that it can handle unbalanced data. Hence, we could analyze the data for all the Lyme disease samples.

Our model did not converge for 1680 (8.5%) genes. Among 18,128 genes for which no convergence issues were noted, $H_{0}^{1}$ was rejected for 4096 (23%) genes. Among those, 528 were also identified by PLNseq. In addition, for 976 out of the 4096 genes, $H_{0}^{2}$ was rejected, that is, there was a difference in gene expression between the second and the third measurement. In particular, expression of 552 genes increased at the third occasion as compared with the second occasion, whereas expression of 424 genes decreased.

5. Conclusion

We have presented a model for analyzing differential gene expression in correlated mRNA-Seq data based on exon-level counts. The model accounts for the within- and between-sample correlation between the exon counts for a particular gene. It can be applied to various designs yielding correlated mRNA-Seq data (matched pairs, longitudinal studies, and clustered sampling).

Simulation studies show that the model is able to correctly estimate differential expression, even when there is no within-sample correlation.

Performance of our model has been compared with PLNseq using two real-life experiments. In contrast to PLNseq, our method can handle unbalanced data, and a substantially larger number of genes can be tested. In addition, our model allows testing various hypotheses related to the factors that might influence gene-expression levels.

It is possible to use our model for the analysis of differential gene expression in correlated mRNA-Seq data with gene counts as input. The user could run our model in the same way as if there were just one exon per gene.

One noticeable drawback of our model is that it assumes that, after adjusting for the exon length and library size, all exons within the same gene have the same level of expression. However, exon expression can vary due to alternative splicing. In our model, alternative splicing may be only partially accounted for by using the overdispersion parameter ϕ. An extension of the model that would explicitly adjust for multiple isoforms of a gene is a topic for further research.

Our method is computationally intensive. It required roughly 5–6 hours per 1000 genes on one core for the real-life data sets described in this study. The study was carried out on either Intel Xeon E5-2670 v3 or Intel Xeon E5-2697 v3 hardware.

The proposed model has been implemented in Python; the code is available at (https://sourceforge.net/projects/dgeee/).

Acknowledgments

We thank Wacław Andrzej Sokalski for enabling this collaboration. We also thank Jeanette E. Eckel Passow and Daniel J. Serie for providing RCC matched-pair experiment data and for helpful discussions. This study was supported by the Wrocław Centre for Networking and Supercomputing (Grant No. 255). D.P., T.B., and D.K. were supported by the Medical University of Białystok. D.P. was supported by the Polish National Science Centre (2014/15/B/ST6/05082) and Foundation for Polish Science (TEAM to D.P.). D.K. and K.G. were supported by BOF bilateral scientific cooperation grants R-6244 and R-7699.

Author Disclosure Statement

The authors declare they have no conflicting financial interests.

References

Anders, Reyes, and Huber. 2012. Detecting differential usage of exons from RNA-seq data. Genome Res, 22:2008–2017.

Benjamini and Hochberg. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc B (Methodol), 57:289–300.

Bouquet, Soloski M.J., Swei, et al. 2016. Longitudinal transcriptome analysis reveals a sustained differential gene expression signature in patients treated for acute lyme disease. MBio, 7:e00100–e00116.

Capitanio and Montorsi. 2016. Renal cancer. Lancet, 387:894–906.

Chung L.M., Ferguson J.P., Zheng, et al. 2013. Differential expression analysis for paired RNA-Seq data. BMC Bioinformatics, 14:110.

Conesa, Madrigal, Tarazona, et al. 2016. A survey of best practices for RNA-seq data analysis. Genome Biol, 17:13.

Fabio L.C., Paula G.A., and de Castro. 2012. A Poisson mixed model with nonnormal random effect distribution. Comput Stat Data Anal, 56:1499–1510.

Frank, Blute M.L., Cheville J.C., et al. 2002. An outcome prediction model for patients with clear cell renal cell carcinoma treated with radical nephrectomy based on tumor stage, size, grade and necrosis: The SSIGN score. J Urol, 168:2395–2400.

Hardcastle T.J., and Kelly K.A. 2013. Empirical Bayesian analysis of paired high-throughput sequencing data with a beta-binomial distribution. BMC Bioinformatics, 14:135.

Langmead, Trapnell, Pop, et al. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol, 10:R25.

Lawrence, Huber, Pagès, et al. 2013. Software for computing and annotating genomic ranges. PLoS Comput Biol, 9:e1003118.

Martin. 2011. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17:10.

Molenberghs, Verbeke, Demétrio C.G.B., et al. 2010. A family of generalized linear models for repeated measures with normal and conjugate random effects. Stat Sci, 25:325–347.

Pham T.V., and Jimenez C.R. 2012. An accurate paired sample test for count data. Bioinformatics, 28:i596–i602.

Rapaport, Khanin, Liang, et al. 2013. Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol, 14:R95.

Robinson M.D., and Oshlack. 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol, 11:R25.

Seyednasrollah, Laiho, and Elo L.L. 2015. Comparison of software packages for detecting differential expression in RNA-seq studies. Brief Bioinform, 16:59–70.

Soneson and Delorenzi. 2013. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics, 14:91.

Spies and Ciaudo. 2015. Dynamics in transcriptomics: Advancements in RNA-seq time course and downstream analysis. Comput Struct Biotechnol J, 13:469–477.

Strle, Stupica, Drouin E.E., et al. 2014. Elevated levels of IL-23 in a subset of patients with post-lyme disease symptoms following erythema migrans. Clin Infect Dis, 58:372–380.

Zhang, Xu, Jiang, et al. 2015. PLNseq: A multivariate Poisson lognormal distribution for high-throughput matched RNA-sequencing read count data. Stat Med, 34:1577–1589.
